Topic Modeling on my Watched Movies
I rate every movie I watch on IMDb, and the site lets you download a .csv file with all your ratings. This file contains only basic information about each movie. To perform topic modeling, I need the plots and/or summaries of the movies, so I will grab them from Wikipedia and use them to enrich the IMDb dataset. Then I will run LDA (Latent Dirichlet Allocation) on the combined plot and summary text of each movie to find 6 topics.
I will keep the article clean of code. The code is available here.
The purpose of this article is to:
- Use Wikipedia to look up movies and, more specifically, grab their summaries and plots.
- Merge the IMDb data with the Wikipedia data.
- Build, evaluate, and visualize an LDA model.
A sample of the IMDb ratings dataset looks like this:
I have to search Wikipedia for each title and grab some information.
Python has a simple Wikipedia API wrapper, a package called wikipedia. We can perform a search query using wikipedia.search(term, results=20), where the search term is the title of the movie. However, most of the time, the IMDb title does not exactly match the title of the Wikipedia article for that movie. I wrote a function called find_title_on_wikipedia (again, the code is available here) where I handle many corner cases and effectively find the…
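To give a feel for the kind of matching involved, here is a minimal sketch. It is not the actual find_title_on_wikipedia function. The helper names candidate_titles and pick_wikipedia_title are hypothetical, and the sketch assumes that film articles often follow the "Title (YEAR film)" or "Title (film)" naming pattern. The search function is passed in as a parameter, so the logic can be tried without network access.

```python
from typing import Callable, List, Optional


def candidate_titles(title: str, year: int) -> List[str]:
    """Hypothetical helper: likely Wikipedia article titles, most specific first."""
    return [f"{title} ({year} film)", f"{title} (film)", title]


def pick_wikipedia_title(
    title: str,
    year: int,
    search: Callable[[str], List[str]],
) -> Optional[str]:
    """Return the first search hit that equals a candidate title (case-insensitive).

    `search` is any function that maps a query to a list of article titles,
    for example: lambda q: wikipedia.search(q, results=20)
    """
    results = search(title)
    # Index the hits by lowercase title so the comparison ignores case.
    by_lower = {hit.lower(): hit for hit in results}
    for candidate in candidate_titles(title, year):
        match = by_lower.get(candidate.lower())
        if match is not None:
            return match
    # No candidate matched; the caller decides how to handle the miss.
    return None
```

Note that the last candidate, the bare title, may point to an article that is not about the film at all, so a real implementation would need further checks for that case.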