Topic Modeling on my Watched Movies

Dimitris Effrosynidis
Published in Analytics Vidhya
5 min read · Mar 19, 2021


I rate every movie I watch on IMDb, and the website lets you download a .csv file with all your ratings. This file contains only basic information about each movie. To perform topic modeling, I need the plots and/or summaries of the movies, so I will grab them from Wikipedia and use them to enrich the IMDb dataset. Then I will run LDA (Latent Dirichlet Allocation) on the combined plot and summary text of the movies to find 6 topics.

I will keep the article light on code. The full code is available here.

The purpose of this article is to:

  • Use Wikipedia to grab movies and, more specifically, their summaries and plots.
  • Merge the IMDb data with the Wikipedia data.
  • Build, evaluate, and visualize an LDA model.
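
To give a feel for what building an LDA model involves, here is a toy collapsed Gibbs sampler written with only the Python standard library. It is a sketch for illustration, not the code linked from this article: the function name fit_lda, its hyperparameters (alpha, beta, n_iter), and the tiny made-up documents are all assumptions. For real data, a library implementation such as gensim or scikit-learn is the more sensible choice.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics=6, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                # Resample the topic and add the token back.
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Made-up toy "plots", already tokenized; real plots need proper cleaning.
docs = [
    "heist crew bank vault heist".split(),
    "bank robbery crew getaway".split(),
    "spaceship alien planet crew".split(),
    "alien invasion planet spaceship".split(),
]
doc_topic, topic_word = fit_lda(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

Each row of doc_topic counts how many tokens of a document were assigned to each topic, and each entry of topic_word counts the words assigned to a topic. Evaluation and visualization are not covered by this sketch.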

A sample of the IMDb ratings dataset looks like this:

I have to search Wikipedia for each title and grab some information.
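
A minimal sketch of this enrichment loop, using only the standard library, could look like the following. The column names, the sample rows, the enrich helper, and the fetch_text callback are all assumptions made for illustration; the real export has more columns, and the real lookup talks to Wikipedia rather than to a dictionary. Dropping movies without text is also just one possible choice.

```python
import csv
import io

# Hypothetical excerpt of an IMDb ratings export. The column names are an
# assumption and the rows are made up for illustration.
SAMPLE_CSV = """Const,Your Rating,Title,Year
tt0000001,8,Example Movie,1999
tt0000002,6,Another Film,2005
"""


def enrich(rows, fetch_text):
    """Attach a 'Text' field to each row using fetch_text(title, year).

    Rows for which no text is found are dropped here, since they
    cannot contribute to the topic model.
    """
    enriched = []
    for row in rows:
        text = fetch_text(row["Title"], row["Year"])
        if text:
            enriched.append({**row, "Text": text})
    return enriched


# Stand-in for the real Wikipedia lookup, so the sketch runs offline.
FAKE_WIKI = {("Example Movie", "1999"): "A placeholder plot summary."}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
movies = enrich(rows, lambda title, year: FAKE_WIKI.get((title, year)))
print(movies)
```

Passing the lookup in as a function keeps the merge logic separate from the network calls, which makes it easy to test without hitting Wikipedia.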

Python has a simple library called wikipedia that wraps the Wikipedia API. We can perform a search query using wikipedia.search(term, results=20), where the search term is the title of the movie. However, most of the time the IMDb title does not exactly match the Wikipedia page title for that movie. I wrote a function called find_title_on_wikipedia (again, the code is available here) where I capture many corner cases and effectively find the…
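
To illustrate the idea, here is a much simpler, hypothetical helper called guess_wikipedia_title. It is not the find_title_on_wikipedia function from the linked code and handles only one corner case: Wikipedia's habit of naming film pages "Title (YEAR film)" or "Title (film)". The search parameter is an assumption added so the sketch can be tried without network access; by default it falls back to wikipedia.search.

```python
def guess_wikipedia_title(title, year, search=None):
    """Return the most plausible Wikipedia page title for a movie, or None."""
    if search is None:
        # Third-party package: pip install wikipedia
        import wikipedia
        search = wikipedia.search

    results = search(f"{title} {year} film", results=20)
    by_lower = {r.lower(): r for r in results}

    # Try page names from most to least specific.
    preferred = [f"{title} ({year} film)", f"{title} (film)", title]
    for wanted in preferred:
        if wanted.lower() in by_lower:
            return by_lower[wanted.lower()]
    return None


# Offline usage with a fake search function standing in for Wikipedia.
def fake_search(query, results=20):
    return ["Heat (perfume)", "Heat (1986 film)", "Heat (1995 film)", "Heat"]


print(guess_wikipedia_title("Heat", 1995, search=fake_search))
```

A real implementation needs many more rules (remakes, series, alternative titles, disambiguation pages), which is where most of the effort goes.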

Dimitris Effrosynidis
Analytics Vidhya

I read and learn every day. | Data Science • Personal Finance • Self-Improvement | Ph.D. https://www.linkedin.com/in/dimitrios-effrosynidis/