Explore and Visualize IMDB Data
Find top-rated movies, series and episodes using these interactive dashboards
I find the IMDB ratings pretty accurate most of the time, and if somebody suggests a movie or series to watch, I usually check its IMDB rating to see whether it’s worth the time.
Although there are lists of top-rated movies and series available in IMDB website which are helpful for finding good movies or series to watch, but the filters available for these lists are limited. For example if you want to filter by year and find the best series in the last two years, you cannot do that. Also you cannot apply a minimum number of votes to the suggested lists. The minimum number of votes is important because sometimes poor movies/series have a high rating due to a small number of votes with high points.
These limitations persuaded me to create my own lists out of IMDB data to be able to easily find top-rated movies/series to watch. At first I thought I need to scrape the data off their website because they don’t provide an API for their website, but then I found out that there are daily data dumps available for download. In the sequel we go over the steps required to go from data dump files to a fully operational online dashboard.
Obtaining the Data
The data dump files are available here (also see this page for metadata). We download three files: title.basics, title.episodes, and title.ratings.
These files have the basic information that we need. If you want names of the directors, writers and cast of the movies/series, you can download the extra files title.crew and title.principals.
We then extract the files and rename them as needed.
Transferring Data to the Database
We create three tables in a relational database to store the content of these files (I used PostgreSQL in AWS RDS). We then transfer the data to the tables by making a connection to the database that the tables are stored in, and then opening the data files, reading them and bulk copying to the relevant tables:
We also need to create a helper table that holds values for different genres:
Then we create two tables to be used by the two dashboards mentioned earlier. We need to join some tables. The table for Movies/Series is fairly straightforward. For creating the table for Episodes, we need to join with source_basics table twice; once for getting the info of episodes and once for the info of the series:
Make sure to index the tables as well, because the size of the data is fairly large (one of the tables has ~500K rows and the other one ~400K rows). Of course the dashboard loading speed depends on the computing power of the DB as well. We need two types of indexing:
- One for filtering the data when a filter has been applied. This can be done using a regular index.
- One for text pattern search. For example when the user wants to filter by movie name and they type the first few letters of the movie, all the movies that start with those letters should be shown to the user so they can select from the list. In PostgreSQL, depending on the column data type, this kind of index is called
text_pattern_ops. You can find similar indices for the RDBMS that you are using.
Creating the Dashboards
Now it’s time to create dashboards from these tables. We used Metabase for this purpose (it is an awesome free and open-source dashboarding and data exploration tool), but you can use any tool that you prefer.
We create cards using the “Native Query” method of Metabase as follows, and then create dashboards and add the cards to the dashboards.
After creating the dashboards we create some filters so that users can filter the data based on title, number of votes, genres, years, etc. The final step is to enable the public links for the dashboards, so that people can see the dashboards using the links.
This sums up the process of creating the dashboards. Note that the above dashboards get updated with new data every month.
You can use the GitHub repository of the project for more details, proposing enhancements and reporting issues.