Data Analysis End-to-End IMDb dataset

Ognish Banerjee
Analytics Vidhya
Published in
21 min readJul 29, 2020

--

Data analysis on IMDb data

Introduction

Since it’s my first article on Medium. I thought of writing a detailed explanation of my analysis of the very popular yet common dataset on the IMDB movie rating. Anyone who is a newbie and beginning a journey in data science surely came across this data set. Different types of analysis, predictive modeling, articles, etc are all over the net on this, but even if you haven’t, don’t worry. Before writing this one, I also went through many blogs, Kaggle kernels, and GitHub notebooks on it, I still believe that this analysis is quite different from the rest and I decided to share my readers who are interested in learning beyond the mainstream :-D

To give a brief about my background, I started my career as an analyst and now pursuing PG in Data Science and Machine Learning, and this analysis was an assignment as part of our curriculum. So all the different visualization and analysis that I’ll be going through in this article has been meticulously crafted by our faculties and industry experts and therefore the reason for it standing out from the rest.

How we will proceed

I would like to state that the whole analysis has been done on Python and the platform is Jupyter notebook. Anyone who has prelim knowledge of the language can easily understand the code. If anyone wants to have the R version of it, please let me know, I’ll start making an R- notebook. But I would suggest trying to replicate the same on your own! That will help you learn.

Before I begin I would like to set the expectations of this article. This article aims to make you think like an analyst, how to approach a business problem analytically, what is the business logic behind it, etc. It contains simple python codes, basic viz, and no predictive modeling — because the objective here is not to share knowledge on coding or any high-level stats. Since it would be a bit long blog, my suggestion to the readers would be not to read this all at once, try breaking them in parts, and whenever you think it’s too much, just stop! bookmark the page and come back after having a cup of coffee!!

Enough of talking let’s dive into it:

Link to IMDb dataset and data description

IMDb data set

“Your approach is your key to success”

A scientific approach should be followed for solving any business problem using data we have in hand to derive valuable insights and recommendations.

Analysis flow diagram
Data analysis flow

My suggestion to everyone would be not to see the code or explanation at first. Read the various steps and try to solve it on your own, give some time to it, and when you think its enough then look into the various code chunks and explanations!

Good Luck!! :-D

First, we’ll carry out the usual steps needed for the analysis after that slowly we’ll start building context and solve problems.

Importing libraries

Import numpy, pandas, seaborn and matplotlib and assign aliases to them for easy use

Reading the data

  • Read the movies data

Read the movies data file provided and store it in a dataframe movies and get the top 5 rows of the dataframe

Now Think!

After reading the data, what to do next? In fact, what should we do next? the answer is to understand the high-level structure of our data, and how do we do that?

These are some common ways analyst usually follow to get an uber understanding of their data by inspecting their:

  1. Volume
  2. Shape
  3. Feature types
  4. Feature statistics
  5. Missing values

Any idea why the analysts always do this? — At times when they handle huge datasets with 100’s of columns and millions of rows, it becomes a daunting task for them to manage them as a whole. So, it is important for them to know their data in a crisp format.

  • Inspecting dataframe

Inspecting the dataframe for dimensions, null-values, and summary of different columns

Now, certain code results are not added to restrict the size of the article.

Okay! So, what do we get here? We have the dimension, all the features with their data types, missing values, and basic statistics. These pieces of information are enough for now to get an idea of our dataframe

Now, Just pause for a moment….

Light up the mind!

Before we jump into the analysis part, I want you to brainstorm in the following manner:

  1. What is the business problem we are about to solve and what is my objective? Think!
  2. What kind of analysis do I want to do and why? Think!
  3. What business impact can I create with my results? Think!
  4. Who are my end-users or consumers of this report? Think!

There can be n- number of ways of approaching this. One of them is being mentioned here-

Imagine I am working under a big production house, — — — . And my manager knows nothing but ROI. S/he told me, ‘Ognish I am planning to invest in a new film. Give me a report that will help me in taking decisions on what kind of movies should I invest in so that I have the maximum profit’

That’s it!!

The first thing I will do is to break my problem in parts and think straight, First, let me see how these different movies performed in terms of ROI and then who are the top performers. For that I would be needing these basic information:

  1. Movie attributes
  2. The budget of the movie
  3. Profit earned by the movie

Cool!! Let's get started

“Torture your data and it will confess!!”

Data Analysis

  • Reducing digits for better readability

The numbers in the budget and gross are too big, compromising its readability. Let's convert the unit of the budget and gross columns from $ to million $ first

  • Let's talk Profit!
  1. Create a new column called profit which contains the difference between the two columns: gross and budget
  2. Sort the dataframe using the profit column as reference
  3. Extract the top ten profiting movies in descending order and store them in a new dataframe — top10

4. Plot a scatter or a joint plot between the columns budget and profit and write a few words on what you observed

You can ignore the above step and use the default style and color.

What exactly my data want to say…..?

So here our first visualization, What can we infer?

My observations are as follows:

  • Individually inspecting budget and profit, both are slightly skewed to right, which means most of the movies have positive profit and also budgets are high
  • We can observe that there are a couple of movies that have very high budgets but still having negative profits
  • We can see a slightly positive trend between ‘budget’ and ‘profit’ i.e, as budget increases, profit also increases
  • Between 0–100 mln budget, we can see a good number of movies providing profit ranging from 0–300 mln dollars (excluding negative profit movies)

The dataset contains the 100 best performing movies from the year 2010 to 2016. However, a scatter plot tells a different story. You can notice that there are some movies with negative profit. Although good movies do incur losses, there appear to be quite a few movies with losses. What can be the reason behind this? Let us have a closer look at this by finding the movies with negative profit.

5. Extract the movies with a negative profit and store them in a new dataframe — negative_profit

Can you spot the movie Tangled in the dataset? You may be aware of the movie 'Tangled'. Although its one of the highest-grossing movies of all time, it has a negative profit as per this result. If you cross-check the gross values of this movie (link: https://www.imdb.com/title/tt0398286/), you can see that the gross in the dataset accounts only for the domestic gross and not the worldwide gross. This is true for may other movies also on the list.

Great!! We have done our first analysis…..

What Next?

Be a data detective!

Now, let’s say I want to find out the popular movies, by popular I mean they have a good public response.

How do we find that?…

You might have noticed the column MetaCritic in this dataset. This is a very popular website where an average score is determined through the scores given by the top-rated critics. Second, you also have another column IMDb_rating that tells you the IMDb rating of a movie. This rating is determined by taking the average of hundred-thousands of ratings from the general audience.

  • The General Audience and the Critics

As a part of this analysis, we will figure out the highest rated movies which have been liked by critics and audiences alike.

  1. Firstly you will notice that the MetaCritic the score is on a scale of 100 whereas the IMDb_rating is on a scale of 10. First, convert the MetaCritic column to a scale of 10.
  2. Now, to find out the movies which have been liked by both critics and audiences alike and also have a high rating overall, you need to -
  • Create a new column Avg_rating which will have the average of the MetaCritic and Rating columns
  • Get the list of movies in descending order of Avg_rating

So, we have all the movies in the descending order of their average popularity

Now, I want to dig a bit more to find out which are the movies that both MetaCriticand IMDb_rating on a higher scale and that are very popular.

Any idea how to go about it? Think!

“Remember, not everything that you want will come from the data, at times you need to be innovative enough and think like a data detective to come up with your own formulas which you think can strengthen the art of analysis. This is the real power of an analyst. They always think out of the box!!”

Meanwhile here it goes:

  • Retain only the movies in which the absolute difference(using abs() function) between the IMDb_rating and Metacritic columns are less than 0.5. Refer to this link to know how abs() function works - abs()
  • Sort these values in descending order of Avg_rating and retain only the movies with a rating equal to higher than 8 and store these movies in a new dataframe UniversalAcclaim

Great! We have created a new metric of our own.

Now, your production company is looking to make a blockbuster movie. There will primarily be three lead roles in your movie and the company wishes to cast the most popular actors for it.

Since your manager doesn’t want to take a risk, he asked you to cast a trio that has already acted together in a movie before.

So our next task would be-

  • Find the Most Popular Trios — I

The metric that I’ve chosen to check the popularity is the Facebook likes of each of these actors

The dataframe has three columns to help you out for the same, viz. actor_1_facebook_likes, actor_2_facebook_likes, and actor_3_facebook_likes

Our objective is to:

  1. Find the trios which have the most number of Facebook likes combined. That is, the sum of actor_1_facebook_likes, actor_2_facebook_likes, and actor_3_facebook_likes should be maximum.
  2. Find out the top 5 popular trios, and output their names in a list

Hmm, I got the top popular actors, but I am not satisfied with the result. Let’s take it to the next level

  • Find the Most Popular Trios — II

In the previous subtask, you found the popular trio based on the total number of Facebook likes. Let’s add a small condition to it and make sure that all three actors are popular. The condition is none of the three actors’ Facebook likes should be less than half of the other two. For example, the following is a valid combo:

  • actor_1_facebook_likes: 70000
  • actor_2_facebook_likes: 40000
  • actor_3_facebook_likes: 50000

But the below one is not:

  • actor_1_facebook_likes: 70000
  • actor_2_facebook_likes: 40000
  • actor_3_facebook_likes: 30000

since in this case, actor_3_facebook_likes is 30000, which is less than half of actor_1_facebook_likes

Having this condition ensures that you aren’t getting any unpopular actor in your trio (since the total likes calculated in the previous question doesn’t tell anything about the individual popularities of each actor in the trio.)

You can do a manual inspection of the top 5 popular trios you have found in the previous subtask and check how many of those trios satisfy this condition. Also, which is the most popular trio after applying the condition above?

Again, time to apply logic!

The mash of logic and science

1. Apply the logic discussed above

2. Get all the trios that satisfy the condition

3. Also, check whether there are any common trios between this and the previous result

Now I’m quite happy with my result. And tbh the list even has my favorite superstars. What about you guys? Can you spot any of your favorite stars on the list?

There is a column named Runtime in the dataframe which primarily shows the length of the movie. It might be interesting to see how this variable is distributed. Plot a histogram or distplot of seaborn to find the Runtime range most of the movies fall into.

  • Runtime analysis

Okay, another observation we had, we can see most of the movies falls within 120–130 min of runtime.

Let’s do some analysis on R- rated movies. Although R rated movies are restricted movies for the under 18 age group, still there are vote counts from that age group. Among all the R rated movies that have been voted by the under-18 age group.

All the parents in the house, the result is for you guys to keep a check on your children:- P

  • R- rated movies

1. Filter out movies that are R- rated

2. Sort them by ‘CVotesU18’ in descending order

3. Get the top 5

Hmm. Seems like ‘Deadpool’, ‘Wolf of Wall Street’ are very popular amongst the young.

Now moving ahead, did you notice that there are plenty of columns names as ‘CVotes’ and ‘Votes’ across various demographics. Can we bring out something from them? Give a thought!

If I am not wrong my manager wants to make a movie that gives high ROI. (Always go back to the business objective) High ROI means that the movie should be popular amongst people so that they buy tickets for it. How do we do that? How can we make use of these demographics data to our analysis!

If you take a look at the last columns in the dataframe, they provided a good quantitative sense of the voters (in the last analysis we made use one of the column— CVotesU18) and also have three genre columns indicating the genres of a particular movie. Can we do something with them? Think!

Okay, this will be the final stage of our analysis wherein we will analyze the voters across all demographics and also see how these vary across various genres. So without further ado, let’s get started with demographic analysis

Demographic analysis

Demographic analysis
  • Combine dataframe by genre

There are 3 columns in the dataframe — genre_1, genre_2, and genre_3. As a part of this analysis, we need to aggregate a few values over these 3 columns.

  1. First, create a new dataframe df_by_genre that contains genre_1, genre_2, and genre_3 and all the columns related to CVotes/Votes from the movies data frame. There are 47 columns to be extracted in total.
  2. Now, add a column called cnt to the dataframe df_by_genre and initialize it to one. We will realize the use of this column by the end of this analysis
  3. Group the dataframe df_by_genre by genre_1 and find the sum of all the numeric columns such as cnt, columns related to CVotes and Votes columns and store it in a dataframe df_by_g1
  4. Performing the same operation for genre_2 and genre_3 and store it dataframes df_by_g2 and df_by_g3 respectively

5. Now that we have 3 dataframes performed by grouping over genre_1, genre_2, and genre_3 separately, it's time to combine them. For this, add the three dataframes and store it in a new dataframe df_add, so that the corresponding values of Votes/CVotes get added for each genre. There is a function called add() in pandas which lets you do this. You can refer to this link to see how this function works. Pandas add()

6. The column cnt on aggregation has basically kept the track of the number of occurrences of each genre. Subset the genres that have at least 10 movies into a new dataframe genre_top10 based on the cnt column value.

7. Now, take the mean of all the numeric columns by dividing them with the column value cnt and store it back to the same dataframe. We will be using this dataframe for further analysis in this task unless it is explicitly mentioned to use the dataframe movies

8. Since the number of votes can’t be a fraction, typecasting all the CVotes related columns to integers. Also, round off all the Votes related columns up to two digits after the decimal point.

If we take a look at the final dataframe that you have gotten, you will see that you now have the complete information about all the demographic (Votes- and CVotes-related) columns across the top 10 genres. We can use this dataset to extract exciting insights about the voters!

“Chart” time!!!

Analysis charts
  • Genre counts

We will now make a bar chart plotting different genres vs cnt using seaborn and derive some insights from this data frame

Okay!! The data set of the top 100 movies has Drama genre the most. For now, we can’t say anything about the popularity of genres simply by seeing this chart. This chart only gives the numbers and nothing else.

Let's deep dive further!!

If you have closely looked at the Votes- and CVotes-related columns, you might have noticed the suffixes F and M indicating Female and Male. Since we have the vote counts for both males and females, across various age groups, let's now see how the popularity of genres vary between the two genders in the dataframe

  • Gender and Genre

Note: Use genre_top10 dataframe for this analysis

  1. We will make the first heatmap to see how the average number of votes of males is varying across the genres. Use a seaborn heatmap for this analysis. The X-axis should contain the four age-groups for males, i.e., CVotesU18M,CVotes1829M, CVotes3044M, and CVotes45AM. The Y-axis will have the genres and the annotation in the heatmap tell the average number of votes for that age-male group
  2. Make the second heatmap to see how the average number of votes of females is varying across the genres. Use a seaborn heatmap for this analysis. The X-axis should contain the four age-groups for females, i.e., CVotesU18F,CVotes1829F, CVotes3044F, and CVotes45AF. The Y-axis will have the genres and the annotation in the heatmap tell the average number of votes for that age-female group

Great! Let’s now derive some insights and observations from the above chart

Inferences: A few inferences that can be seen from the heatmap above is that males have voted more than females, and Sci-Fi appears to be most popular among the 18–29 age group irrespective of their gender. What more can we infer from the two heatmaps that are plotted?

  • Inference 1: Age ranging from 18–44 have voted the most irrespective of their gender
  • Inference 2: Even if Sci-Fi has a lesser count compared to other genres (previous bar chart) still it has the highest number of votes
  • Inference 3: U18M voted more than U18F. In general, movies seem to be watched less by U18 and 45A irrespective of their age or they watched but didn’t vote
  • Inference 4: Females on average (ages ranging from 18 to 44) voted movies of Action, Adventure, Animation, Comedy, Romance and Thriller more (after Sci-Fi) compared to males
  • Inference 5: Romance is the least voted by Male irrespective of their age compared to other genres (taking only the gender male), the number of votes may be higher than females but if only male gender is considered, they voted least for romance.

3. Make the second heatmap to see how the average number of votes of females is varying across the genres. Use a seaborn heatmap for this analysis. The X-axis should contain the four age-groups for females, i.e., VotesU18F, Votes1829F, Votes3044F, and Votes45AF. The Y-axis will have the genres and the annotation in the heatmap tell the average number of votes for that age-female group

Inferences: Sci-Fi appears to be the highest rated genre in the age group of U18 for both males and females. Also, females in this age group have rated it a bit higher than the males in the same age group. Some of the other inferences can be -

  • Inference 1: It is interesting to see that though the average number of votes for romance is less by male the average rating is more or less same as females, this means romance movies, in general, are watched less or voted less by males but the movies are good as they are rated well irrespective of gender especially for U18
  • Inference 2: Irrespective of gender, age ranging from 30–45 have their average rating to different genres is around 7.7 to 7.8, most of them didn’t cross 8. A slight observation can be made (no causal relationship) that as your age increases you tend to become a critic
  • Inference 3: We can see Animation genres has been voted steadily in Female gender, whereas in the male there is a significant difference (decrease) as age increases, It is interesting to observe that females of all age likes animation movies

You might need the link for formatting your heatmap.

Can we something more? As an analyst, you should always be unsatisfied and curious :-P. Always try to hit the next mile!!

Can you see the dataset contains both the US and non-US movies? Let’s analyze how both the US and the non-US voters have responded to the US and the non-US movies

  • The USA vs non-USA cross-analysis

Note: Use the movies dataframe for this subtask. Make use of this documentation to format your boxplot - boxplot

  1. We will here create a column IFUS in the dataframe movies. The column IFUS should contain the value "USA" if the Country of the movie is "USA". For all other countries other than the USA, IFUS should contain the value non-USA.
  2. Now make a boxplot that shows how the number of votes from the US people i.e. CVotesUS is varying for the US and non-US movies. Make use of the column IFUS to make this plot. Similarly, make another subplot that shows how non-US voters have voted for the US and non-US movies by plotting CVotesnUS for both the US and non-US movies.

Inferences:

  • Inference 1: We can observe the number of votes on average is more from non-USA people compared to USA people irrespective of the origin of the USA or non-USA movies
  • Inference 2: We can observe non-USA movies the number of votes is uniformly distributed compared to USA movies
  • Inference 3: We can see there is a significant difference in the number of votes from USA people for USA and non-USA movies. USA people voted less in number for non-USA movies
  • Inference 4: From both the plots we can see some outliers for USA movies
  • Inference 5: Overall USA movies got a greater number of votes from both USA and non-USA people compared to non-USA movies

3. Again we’ll do a similar analysis but with the ratings. Make a boxplot that shows how the ratings from the US people i.e. VotesUS is varying for the US and non-US movies. Similarly, make another subplot that shows how VotesnUS is varying for the US and non-US movies.

Inferences:

  • Inference 1: On average we can see the median rating is higher (around 7.9 to 8) from USA people compared to that from non- USA people
  • Inference 2: It is interesting to see that both USA and non-USA people have rated non-USA movies significantly less than that of USA movies (difference for non-USA people is greater than that of USA people)
  • Inference 3: Distribution of rating for non-USA movies by USA people has a uniform distribution compared to others

Combined Inference: An important observation can be derived from both the pair of boxplots is that even though the median number of votes from non-USA people is much higher than of USA people but their median rating is less than that compared to US people, one of the many possible reasons can be

  • Population- Assuming, outside the USA means voters from Asian countries, European countries, etc which constitute large chunk, so as the population is more we can have more number of critics also
  • Top 1000 voters vs Genres (last analysis!!) :P

You might have also observed the column CVotes1000. This column represents the top 1000 voters on IMDb and gives the count for the number of these voters who have voted for a particular movie. Let's see how these top 1000 voters have voted across the genres.

  1. Sort the dataframe genre_top10 based on the value of CVotes1000in descending order.
  2. Make a seaborn barplot for genre vs CVotes1000

Inferences:

  • The romance genre has been voted the least from the top 1000 voters
  • Sci-Fi is the most popular amongst the top 1000 voters
  • There is not a significant difference in popularity amongst Action, Thriller and Adventure, which makes sense also like these three, in general, goes hand in hand
  • If we compare this bar-chart with that of the previous we can see that the top 1000 IMDb voters prefer Sci-Fi over Drama and also the dataset contains movies from Drama more compared to other genres
  • We can see that even if Sci-Fi has very few movies in the data set (as depicted in the previous count- bar chart). In general, they got the most number of votes from Male, Female and also Top 1000 Imdb voters and also they have the highest rating across genders across their age group

With these, our analysis is over. I know it is a lot, so again I’ll be stressing don’t go through this all at once, stop whenever you are feeling exhausted, come back again after some time. Again the aim of this analysis is to make you think like an analyst and not just doing an end to end project. Use your free time, do explore the dataset further on your own, and see what kind of other insights you can get across various other columns.

I would love to get your feedback on it as it will also keep me motivated. Do let me know what things I can modify so to make it more engaging and understandable. You can also suggest me on specific topics to cover, I'll work on that in my future medium articles.

You can contact me through my LinkedIn account

Happy Analyzing!

--

--