Analyzing IMDb’s Top 250 movies, Part 2: Extracting Useful Datasets

S Dhanush
Published in Analytics Vidhya · 8 min read · Jul 9, 2021

Extracting useful information from the IMDb Top 250 movies DataFrame

This post is part 2 of my journey into analyzing the IMDb Top 250 movies. In Part 1, way back in January, I went through how I scraped the data from IMDb’s top-rated movies chart and built an extensive DataFrame out of it. Sadly, life got in the way and I couldn’t get back to it until now. So, my joke about not doing anything with the data in Part 1 did come true 😝

Anyway, in this part, I will go over how I extracted information and created useful datasets out of the main DataFrame. This will allow me to plot and analyze these movies and help me understand what factors determine whether a movie is successful.

Again, for those who don’t want to read through the entire thing and are just interested in the code, here is the GitHub link to the Python Jupyter Notebook. Drop a ⭐️ if you liked it.

Before we begin extracting the data

The original DataFrame contains over 30 unique data points for each movie. This includes details such as the name, rating, and year of release; production details such as the production company, director, writers, and stars; and its budgetary details, languages, genres, and many others.

Before I start extracting datasets, to make my life a lot easier, I chose to group the movies into decades. The grouping can be added to the DataFrame with a single line of code: floor-divide the value in the year column by 10, then multiply the result by 10.

movie_data['decade'] = ((movie_data['year'] // 10).astype(int) * 10)

There is a good reason why I chose to group movies by decades. There were a couple of fields, such as the number of movies and the rating, that I wanted to compare with respect to time. Given that the data contains only 250 unique points, with release years ranging from 1921 to 2020, I might, in the worst case, have ended up with overly sparse data where a large number of years have zero or one data point.

Note: A decade starts with the year that defines it. For example, the 2010s start in 2010 and end in 2019.

Let’s get to extracting 📁

As mentioned earlier, I had shortlisted a few fields which I believed would provide useful information when compared with respect to time. Another type of comparison I was interested in was between categorical data. So broadly speaking, I’ll be splitting the extraction process into 2 parts:

  1. Extraction of datasets for time-based comparisons.
  2. Extraction of categorical datasets.

Extraction of datasets for time-based comparison ⏰

The first and easiest dataset to extract is the number of movies per decade. Now you might assume that I will need some sort of counter to count the number of movies for each unique decade and then store them against that decade. Yeah, that is one way of achieving it. But let me show you how it can be done in a much easier way.

# Count how many Top 250 movies fall into each decade
no_movies_per_decade = pd.DataFrame({
    "decade": movie_data['decade'].value_counts().index,
    "movies": movie_data['decade'].value_counts()
}).sort_values('decade').reset_index(drop=True)
no_movies_per_decade_json = no_movies_per_decade.to_dict('records')

As they say, “All this is quite elementary, my dear.” You see, I don’t need to count each movie against a unique decade; the decade field does it for me. All I have to do is count how many times each decade is repeated and voilà, I get my desired result. The last line in that code converts the dataset from a Pandas DataFrame into a list of records, which serializes directly to a JSON array.
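As an aside, the same table can be built as a single chained expression. This is just an equivalent sketch, not the code from the notebook:

no_movies_per_decade = (
    movie_data['decade']
    .value_counts()
    .sort_index()                # order by decade instead of by count
    .rename_axis('decade')       # name the index so reset_index labels it
    .reset_index(name='movies')  # promote the index to a regular column
)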

The next datasets I wanted to extract were the average rating and the average number of votes received by the movies in each decade. At this point, it is important to note that while Pandas can infer numerical and non-numerical types, it is pretty limited in how intelligently it can do so. Hence, while it correctly identified ratings as float64, the votes were identified as object, meaning I had to convert them to int before calculating the average.
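You can see this for yourself by inspecting the inferred dtypes (the column names here are the ones used throughout this post):

# Check how Pandas typed the two columns
print(movie_data[['rating', 'vote_count']].dtypes)
# rating        float64
# vote_count     object   <- scraped strings, not numbers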

# One row per decade, with an empty mean_rating column to fill in
movie_rating_per_decade = pd.DataFrame({
    "decade": movie_data['decade'].value_counts().index,
    "mean_rating": None
}).sort_values('decade').reset_index(drop=True)

# Fill in the mean rating for each decade
for i in range(len(movie_rating_per_decade)):
    decade_filter = movie_data['decade'] == movie_rating_per_decade.iloc[i, 0]
    filtered_movies = movie_data[decade_filter]
    movie_rating_per_decade.iloc[i, 1] = round(filtered_movies['rating'].mean(), 3)

The process of extracting the average movie rating per decade is pretty straightforward. I started with a DataFrame containing each unique decade and a mean rating column; however, I did not fill the mean rating column during the DataFrame definition. Instead, for each decade, I filter the movies down to that decade and then compute the mean rating of the filtered movies.
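For what it’s worth, Pandas can also do the filtering and averaging in one step with groupby. This is an alternative sketch, not what the notebook does:

# groupby computes the per-decade mean rating in a single pass
movie_rating_per_decade = (
    movie_data.groupby('decade')['rating']
    .mean()
    .round(3)
    .reset_index(name='mean_rating')
)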

Ok, now let's deal with the small annoyance of votes being identified as an object and not an integer.

The solution is honestly very simple and should have been taken care of earlier. All I had to do was filter the characters in each votes value with a simple lambda that allows only digits. Something like int(''.join(filter(lambda x: x.isdigit(), movie_data.iloc[i, 6])))
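To see what that lambda does, here it is applied to a made-up vote string (the real values come from the scraped vote_count column):

raw_votes = '1,234,567'  # hypothetical scraped value with separators
clean_votes = int(''.join(filter(lambda x: x.isdigit(), raw_votes)))
print(clean_votes)  # 1234567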

# Pull out just the decade and raw vote strings
decade_to_vote = pd.DataFrame({
    'decade': movie_data['decade'].values,
    'votes': movie_data['vote_count'].values
})

# Strip the non-digit characters and convert each vote count to int
for i in range(len(decade_to_vote)):
    decade_to_vote.iloc[i, 1] = int(''.join(filter(lambda x: x.isdigit(), decade_to_vote.iloc[i, 1])))

# Per-decade frame, built the same way as movie_rating_per_decade above
movie_votes_per_decade = pd.DataFrame({
    "decade": movie_data['decade'].value_counts().index,
    "mean_votes": None
}).sort_values('decade').reset_index(drop=True)

for i in range(len(movie_votes_per_decade)):
    decade_filter = decade_to_vote['decade'] == movie_votes_per_decade.iloc[i, 0]
    filtered_votes = decade_to_vote[decade_filter]
    movie_votes_per_decade.iloc[i, 1] = round(filtered_votes['votes'].mean(), 3)

Here, I added a little extra logic to help with extracting the dataset. To start with, I created a separate DataFrame holding just the decade and votes columns. Then, on this DataFrame, I performed the type conversion and the same per-decade filtering as for the ratings.
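Incidentally, the element-by-element loop can be replaced with Pandas’ vectorized string methods. A sketch under the same column names, assuming every row contains at least one digit:

# Vectorized cleanup: drop everything that isn't a digit, then cast to int
decade_to_vote['votes'] = (
    decade_to_vote['votes']
    .astype(str)
    .str.replace(r'\D', '', regex=True)
    .astype(int)
)

If some rows could end up as empty strings, pd.to_numeric with errors='coerce' would be the safer cast.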

The last dataset I wanted to create involved the box office data: the budget, the worldwide gross, the USA gross, and the USA opening-week gross. Retrieving these was the same as creating the average votes dataset, except that instead of one column, I have 4 of them here.

I knew that I would have to convert all four of these fields to integers, as I had chosen to keep the currency symbols in them when scraping.

movie_budget_per_decade = pd.DataFrame({
    "decade": movie_data['decade'].value_counts().index,
    "mean_budget": None,
    "mean_gross_worldwide": None,
    "mean_gross_usa": None,
    "mean_opening_week_usa": None
}).sort_values('decade').reset_index(drop=True)

This meant that the DataFrame looked something like this:

Box office details per decade
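The post doesn’t show the loop that fills these four columns, but it follows the same pattern as the votes. A rough sketch, assuming the four source columns are named budget, gross_worldwide, gross_usa, and opening_week_usa (those names are my guess):

# Hypothetical names for the four box office columns in movie_data
money_columns = ['budget', 'gross_worldwide', 'gross_usa', 'opening_week_usa']

for i in range(len(movie_budget_per_decade)):
    decade_filter = movie_data['decade'] == movie_budget_per_decade.iloc[i, 0]
    filtered_movies = movie_data[decade_filter]
    for j, col in enumerate(money_columns):
        # Strip currency symbols and commas, then average the decade's values
        values = filtered_movies[col].astype(str).str.replace(r'\D', '', regex=True)
        values = pd.to_numeric(values, errors='coerce')  # empty strings become NaN
        movie_budget_per_decade.iloc[i, j + 1] = round(values.mean(), 3)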

Categorical datasets and how to handle them

The columns I was interested in contain non-ordinal categorical data. Seeing this, many might go down the route of encoding the data and then dealing with that can of worms. If no one has said it before, let me be the first to tell you: you don’t have to perform encoding every time you come across categorical data. You can get a lot of information directly, without encoding anything.

With that out of the way, let me show you why Pandas is such a preferred library for handling data and how it was able to simplify my life a whole lot.

  1. value_counts() is a method provided by Pandas that returns a Series containing the counts of unique values. It completely simplifies counting unique values: you no longer need to perform one-hot encoding and then count the occurrences; Pandas does it for you.
  2. Pandas has a wide range of methods, such as isnull(), notnull(), dropna(), etc., that help in finding and dealing with missing data straight from the DataFrame. (Both points are demonstrated in the snippet below.)
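Here is a tiny, self-contained demonstration of both points on made-up data (not the movie DataFrame):

import pandas as pd
import numpy as np

languages = pd.Series(['English', 'French', 'English', np.nan, 'Hindi'])

print(languages.value_counts())     # counts per unique value, NaN excluded
print(languages.isnull().sum())     # number of missing entries: 1
print(languages.dropna().tolist())  # ['English', 'French', 'English', 'Hindi']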

Extraction of Categorical Datasets 🧮

For the categorical data, I wanted two main metrics: the number of occurrences of each unique value, and the average IMDb rating associated with it. As for what data I was interested in, I wanted to extract a few film-related details and a couple of general details. These are:

  1. Directors, production companies, censor ratings, and genres.
  2. Languages used and countries in which the movie was shot.

director_data = pd.DataFrame({
    'director': movie_data['director'].value_counts().index,
    'count': movie_data['director'].value_counts(),
    'mean_imdb_rating': None
}).sort_values('director').reset_index(drop=True)

For each of these datasets, the process is the same. To start with, I created a DataFrame with the unique values, their counts, and a field for the average IMDb rating. The value_counts() method gave me the unique values and the count for each one hassle-free. The process to calculate the average IMDb rating is similar to the one I wrote earlier.

Additionally, for fields that had no value, such as the second, third, and fourth language, I chose to convert the empty entries into NaN and then drop them using the dropna() method provided by Pandas.

secondary_language_data['language'].replace(' ', np.nan, inplace=True)
secondary_language_data.dropna(subset=['language'], inplace=True)

So in the end, the entire code to extract the categorical data (remember that all of them follow more or less the same premise) looks something like this:

# Unique primary languages with their counts and an empty rating column
primary_language_data = pd.DataFrame({
    'language': movie_data['language_1'].value_counts().index,
    'count': movie_data['language_1'].value_counts(),
    'mean_imdb_rating': None
}).sort_values('language').reset_index(drop=True)

# Turn empty entries into NaN and drop them
primary_language_data['language'].replace(' ', np.nan, inplace=True)
primary_language_data.dropna(subset=['language'], inplace=True)

# Fill in the mean IMDb rating for each language
for i in range(len(primary_language_data)):
    language_filter = movie_data['language_1'] == primary_language_data.iloc[i, 0]
    filtered_movies = movie_data[language_filter]
    primary_language_data.iloc[i, 2] = round(filtered_movies['rating'].mean(), 3)

primary_language_data_json = primary_language_data.to_dict('records')

Having extracted all the datasets I wanted, the last step was to save them in the desired forms so that I could get on with plotting and analyzing the data. Similar to what I did in Part 1, I saved all the datasets together in a JSON object and also saved each DataFrame as a CSV file. I am not going to go over how I saved them, as I already explained that in Part 1.
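For reference, the saving step boils down to calls like these (the file names here are placeholders, not the ones from the notebook):

import json

# Each DataFrame goes to its own CSV
no_movies_per_decade.to_csv('no_movies_per_decade.csv', index=False)

# The record-style datasets go together into a single JSON object
with open('datasets.json', 'w') as f:
    json.dump({'no_movies_per_decade': no_movies_per_decade_json}, f, indent=2)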

Well, this concludes the 2nd part of my journey. All that is left is to plot and analyze the data. I will post the findings as soon as I am done analyzing them. In the meantime, I hope you enjoyed this and found some useful pointers on extracting usable data. I also hope I was able to show just how powerful Pandas is as a library for data manipulation.

Once again, the complete code is available on GitHub as a Python Jupyter Notebook. Drop a ⭐️ if you liked it. If you have questions, doubts, or thoughts on this, please feel free to 👏 and comment. Thanks!
