Exploratory Data Analysis on 25k IMDb Movies

Using the Pandas, Matplotlib, and Seaborn libraries in Python

Dipendu Pal
12 min read · May 11, 2023
Photo by Inga Seliverstova from Pexels

IMDb, short for Internet Movie Database, is an online database that encompasses a wide range of information regarding movies, TV shows, podcasts, home videos, video games, and streaming content. It serves as a hub for detailed data, including cast and production crew profiles, personal biographies, plot summaries, interesting trivia, user ratings, and both fan and critical reviews. Initially emerging as a fan-driven movie database within the “rec.arts.movies” Usenet group in 1990, IMDb made its transition to the World Wide Web in 1993. Since 1998, it has been owned and operated by IMDb.com, Inc., a subsidiary of Amazon.

For this exploratory data analysis (EDA), we’ll download and use the 25k IMDb Movie Dataset from Kaggle.

For this project, we will utilize essential Python libraries such as Pandas, Matplotlib, and Seaborn. Pandas will play a crucial role in conducting various operations on the dataset, including data cleaning and analysis. Additionally, Matplotlib and Seaborn will be employed to generate insightful visualizations.

The following steps will be performed to complete this project:

  • Downloading the dataset.
  • Data preparation and cleaning.
  • Exploratory analysis and visualizations.
  • Asking and answering questions.

Here is the link to the project notebook hosted on Jovian:

Downloading the dataset

We will download the dataset using the opendatasets library.
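A minimal sketch of the download step, assuming the `opendatasets` package is installed (`pip install opendatasets`) and using the dataset URL listed in the references. The helper name `download_dataset` is ours for illustration and is not taken from the notebook.

```python
# URL of the Kaggle dataset, as listed in the references of this article.
DATASET_URL = "https://www.kaggle.com/datasets/utsh0dey/25k-movie-dataset"


def download_dataset(url: str = DATASET_URL, data_dir: str = ".") -> None:
    """Download a Kaggle dataset into data_dir.

    opendatasets prompts for the Kaggle username and API key
    the first time it is called.
    """
    # Imported here so the function can be defined even when the
    # third-party package is not installed yet.
    import opendatasets as od

    od.download(url, data_dir=data_dir)


# Uncomment to run the download (requires network access and credentials):
# download_dataset()
```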

When you run the download cell, you will be asked to provide your Kaggle username and Kaggle API key. You can obtain these credentials from your Kaggle account by following these steps:

1. Go to your profile section on Kaggle.
2. Click on the account settings.
3. Scroll down to locate your username and API key.
4. Use the obtained username and API key to download the dataset.

Data preparation and cleaning

Before we analyze the data we’ve downloaded, we need to prepare and clean it. The dataset may contain missing values, and some of its values will be wrong, so we’ll remove a fair amount of data and also derive new data from the existing columns. We’ll use the Pandas library for this.

To begin, we will import the Pandas library and convert the dataset, which is currently in ‘.csv’ format, into a Pandas dataframe.

Let’s check how many rows and columns there are, along with some basic info.

Let’s remove all rows that contain empty cells.

The ‘path’ column is not needed, so let’s remove it.
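The loading and first clean-up steps could look roughly like this. The inline CSV is a tiny made-up stand-in for the downloaded file (only a few of its columns are shown), so the values are assumptions; with the real data you would pass the path of the downloaded ‘.csv’ file to `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up sample standing in for the downloaded CSV file.
sample_csv = io.StringIO(
    "movie title,Rating,User Rating,year,path\n"
    "Movie A,7.5,12K,2001,/title/a/\n"
    "Movie B,,3K,1999,/title/b/\n"
    "Movie C,6.1,2.5M,2010,/title/c/\n"
)

movie_raw_df = pd.read_csv(sample_csv)

print(movie_raw_df.shape)  # (number of rows, number of columns)
movie_raw_df.info()        # column names, dtypes and non-null counts

# Drop every row that has at least one empty cell.
movie_raw_df = movie_raw_df.dropna()

# Drop the column we do not need.
movie_raw_df = movie_raw_df.drop(columns=["path"])
```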

To gain an initial understanding of the data, we will retrieve a small sample from the dataset.

As we can see from the sample above, wrong data has been entered in the ‘Run Time’ column, so the column needs to be removed.

Next, we’ll remove all rows with a ‘User Rating’ of ‘0’ or a ‘Rating’ of ‘no-rating’.

Now we need to remove identical rows, if there are any.
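A sketch of these two filters on made-up rows. It assumes that both columns are still stored as strings at this point, which is why the comparisons use ‘0’ and ‘no-rating’ as text.

```python
import pandas as pd

# Made-up rows; both columns are assumed to hold strings at this stage.
movie_raw_df = pd.DataFrame({
    "movie title": ["Movie A", "Movie B", "Movie C", "Movie A"],
    "Rating": ["7.5", "no-rating", "6.1", "7.5"],
    "User Rating": ["12K", "0", "0", "12K"],
})

# Keep rows that have both a real rating and at least one vote.
has_votes = movie_raw_df["User Rating"] != "0"
has_rating = movie_raw_df["Rating"] != "no-rating"
movie_raw_df = movie_raw_df[has_votes & has_rating]

# Drop rows that are identical in every column, keeping the first copy.
movie_raw_df = movie_raw_df.drop_duplicates().reset_index(drop=True)
```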

Let’s replace the ‘K’ and ‘M’ suffixes in the ‘User Rating’ series (thousands and millions) and convert its values to numbers.
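One way to do this conversion is a small helper that multiplies by the factor the suffix stands for. The function name `votes_to_number` and the sample values are ours, so treat this as a sketch and not as the notebook's exact code.

```python
import pandas as pd


def votes_to_number(text: str) -> float:
    """Convert strings such as '12K' or '2.5M' to plain numbers."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip().upper()
    if text and text[-1] in multipliers:
        # Strip the suffix and scale by the factor it stands for.
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


user_rating = pd.Series(["12K", "2.5M", "845"])
user_rating_numeric = user_rating.map(votes_to_number)
print(user_rating_numeric.tolist())  # [12000.0, 2500000.0, 845.0]
```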

Now let’s check a sample again.

Now we are going to convert the values of the ‘Rating’ series to the ‘float64’ type.

Now we need to remove ‘-’ and all other unwanted non-numerical characters from the values of the ‘year’ series.

To identify and remove all rows with an identical ‘movie title’ and ‘year’, we’ll create a boolean series using the ‘.duplicated()’ method.

Now we will convert the ‘year’ series to a pandas datetime column, extract the year from it, and replace the ‘year’ series with the result.

Let’s check for ‘NaN’ values in the ‘year’ series and drop any we find.

Let’s convert the data type of the ‘year’ series to ‘int64’.
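The ‘year’ clean-up steps can be sketched end to end as follows. The raw year strings are made up to show the kinds of noise being removed; the real column may look different.

```python
import pandas as pd

# Made-up rows with noisy year strings.
movie_raw_df = pd.DataFrame({
    "movie title": ["Movie A", "Movie A", "Movie B", "Movie C"],
    "year": ["-2001", "-2001", "(1999)", "TV"],
})

# Strip everything that is not a digit, e.g. '-2001' becomes '2001'.
movie_raw_df["year"] = movie_raw_df["year"].str.replace(r"\D", "", regex=True)

# Boolean series marking repeated (title, year) pairs; keep the first of each.
is_repeat = movie_raw_df.duplicated(subset=["movie title", "year"])
movie_raw_df = movie_raw_df[~is_repeat].copy()

# Parse as dates and keep only the year; unparseable values become NaN.
parsed = pd.to_datetime(movie_raw_df["year"], format="%Y", errors="coerce")
movie_raw_df["year"] = parsed.dt.year

# Drop rows without a usable year, then store the year as an integer.
movie_raw_df = movie_raw_df.dropna(subset=["year"]).astype({"year": "int64"})
```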

Now we’ll sort the dataframe by the ‘year’ series in descending order and check it.

As we can see, ‘Aladdin 2’ has the year 2025, which means it has not been released yet and cannot have a valid rating. So we will remove the row.
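A sketch of this check on made-up rows. The cutoff constant `LAST_RELEASED_YEAR` is an assumption on our part (the year this article was written); the notebook may simply drop the single offending row instead.

```python
import pandas as pd

# Made-up rows; the rating values are placeholders.
movie_raw_df = pd.DataFrame({
    "movie title": ["Movie A", "Aladdin 2", "Movie B"],
    "year": [2001, 2025, 1999],
    "Rating": [7.5, 6.0, 6.1],
})

# Newest first, so suspicious future years appear at the top.
print(movie_raw_df.sort_values("year", ascending=False).head())

# Assumption: nothing dated after the year of writing can be released yet.
LAST_RELEASED_YEAR = 2023
movie_raw_df = movie_raw_df[movie_raw_df["year"] <= LAST_RELEASED_YEAR]
```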

Now we’ll sort the dataframe by the ‘year’ series in ascending order and check it.

Now the dataset has been cleaned and prepared for analysis.

Exploratory analysis and visualizations

Now that we’ve cleaned and prepared the dataset, we’ll move on to analyzing it. We’ll explore the columns of the DataFrame and check whether we can establish any relationships between them. We’ll use the matplotlib and seaborn libraries to plot the data and gain insight from the plots.

Let’s begin by importing `matplotlib.pyplot` and `seaborn`.

1. Exploring the ‘Generes’ series:

After cleaning the ‘movie_raw_df’ dataframe, each movie appears only once, so the frequency of a genre equals the number of movies of that genre. Let’s create a dataframe of the unique genre values.

Let’s create a DataFrame containing two columns: one for the genres and the other for the number of movies in each genre.

Let’s find the 10 most frequent genres.
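The genre counts can be built with `value_counts()`, as in this sketch on made-up rows. The count column name ‘no_of_movies’ is borrowed from the director table used later in the article and is an assumption here.

```python
import pandas as pd

# Made-up rows; each value is the full genre string of one movie.
movie_raw_df = pd.DataFrame({
    "Generes": [
        "Drama", "Drama", "Comedy",
        "Action,Crime,Drama", "Drama", "Comedy",
    ],
})

# value_counts() counts each distinct genre string, most frequent first.
genre_freq_df = (
    movie_raw_df["Generes"]
    .value_counts()
    .rename_axis("Generes")
    .reset_index(name="no_of_movies")
)

# The ten most frequent genres (the sample has only three).
top_genres = genre_freq_df.head(10)
print(top_genres)
```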

The bar plot depicts the number of movies in each of the ten most frequent genres. The x-axis represents the number of movies and the y-axis represents the genres.

We can draw the following insights from the above bar plot:

  • ‘Drama’ and ‘Action,Crime,Drama’ are the two genres with the most movies.
  • As we can see, a movie’s genre can be a combination of several genres. Among the top 10 genres by number of movies, ‘Drama’ is the most common component, followed by ‘Crime’ and ‘Comedy’.
  • The number of movies in each of ‘Drama’ and ‘Action,Crime,Drama’ is more than double that in ‘Comedy’ or ‘Crime,Drama,Mystery’.

2. Finding the relation between Rating and User Rating:

We’ll examine the relation between ‘Rating’ and ‘User Rating’ (here, ‘User Rating’ means the number of users who submitted a rating) using a scatter plot.

The scatter plot shows the relationship between the rating of each movie and its user rating. Each dot represents a movie. The x-axis represents the ‘User Rating’ and the y-axis represents the ‘Rating’.

We can see from the scatter plot that:

  • A cluster has formed between 0 and 1 million on the x-axis (‘User Rating’) and between 4 and 9 on the y-axis (‘Rating’). That means most of the movies fall within these ranges.

Now we’ll plot the regression line to see the relationship between them.
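To put numbers on the trend that a regression line shows, we can fit a straight line by least squares and compute the Pearson correlation. This sketch uses NumPy and made-up data points. It complements the plot but is not the plotting code itself; `seaborn.regplot` draws a line of the same kind by default.

```python
import numpy as np
import pandas as pd

# Made-up data points, for illustration only.
df = pd.DataFrame({
    "User Rating": [1_000, 50_000, 200_000, 800_000, 1_500_000],
    "Rating": [5.0, 6.0, 6.5, 8.0, 8.5],
})

# Least-squares line: Rating is approximated by slope * User Rating + intercept.
slope, intercept = np.polyfit(df["User Rating"], df["Rating"], deg=1)

# Pearson correlation: the sign gives the direction, the size the strength.
r = df["User Rating"].corr(df["Rating"])

print(f"slope={slope:.2e}, intercept={intercept:.2f}, r={r:.2f}")
```

A positive slope and a positive `r` both point to a positive relationship, but neither says that more votes cause a higher rating.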

In this regression plot, the x-axis represents the ‘User Rating’ of a movie and the y-axis represents its ‘Rating’. The red line represents the linear regression model fit to the data, which indicates a positive relationship between user rating and rating.

We can conclude from the regression plot that:

  • There is a positive relationship between user rating and rating: movies with more user ratings tend to have higher ratings.

3. Directors with the highest number of movies:

We will look for the top 10 directors with the highest number of movies.

First we’ll find the unique values in the ‘Director’ series.

Now we will create a dataframe of directors and the number of movies per director.

Now we’ll find the 10 directors with the highest number of movies and create a bar graph.

The bar plot depicts the number of movies for each of the ten directors with the most movies. The x-axis represents the number of movies and the y-axis represents the directors.

We can see from the above bar plot that:

  • Woody Allen has directed the highest number of movies, 46.
  • Clint Eastwood has directed the second highest number of movies, 38.
  • John Huston ranks tenth in this dataset, with 28 movies in total.

4. Number of movies with respect to the years:

We will find the number of movies with respect to the years using a histogram.

This histogram shows the distribution of movies over the years. The x-axis represents the year range, divided into 9 bins of equal width, while the y-axis represents the number of movies within each bin.
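The binning behind such a histogram can be reproduced without plotting. This sketch uses `pd.cut` on made-up years to split the range into 9 equal-width bins and count the movies in each; the bin edges are close to, though not necessarily identical to, the ones a plotting library would choose.

```python
import pandas as pd

# Made-up release years, for illustration only.
years = pd.Series([1925, 1950, 1975, 1990, 2000, 2005, 2012, 2015, 2018, 2019])

# Split the year range into 9 equal-width bins.
bins = pd.cut(years, bins=9)

# Count the movies per bin, keeping the bins in chronological order.
movies_per_bin = bins.value_counts(sort=False)
print(movies_per_bin)
```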

So from the histogram plot we can see that:

  • From roughly 1920 to 1930, the number of movies is at its minimum.
  • From roughly 2010 to 2020, the number of movies is at its maximum, nearly double that of the previous decade. The number of movies seems to grow exponentially from decade to decade. Let’s check with a distribution curve.

This distribution plot is an extension of the histogram above, drawn to check the distribution curve. The ‘kde’ parameter is set to ‘True’, which adds a kernel density estimate curve to the plot. The blue curve is the kernel density estimate.

So we can roughly estimate from the distribution curve that:

  • The number of movies produced appears to have grown roughly exponentially over the years.

Asking and answering questions

In this section we will ask five interesting questions and try to answer them.

Q1: Among directors who have directed more than 10 movies, who has the highest average rating and who has the lowest?

First we’ll group the movies by director using the ‘.groupby()’ method and then calculate each director’s average rating using the ‘.mean()’ method.

Now we’ll merge the ‘director_movie_df’ DataFrame, which we created earlier, with ‘direc_avg_df’ on the ‘Director’ series using the ‘.merge()’ method, which combines two DataFrames on a shared column.

Now we’ll create a DataFrame from ‘merged_df_direct’, sorted by the ‘no_of_movies’ series.
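The group, merge, and filter steps for this question could be sketched as follows. The rows are made up, and the threshold `MIN_MOVIES` is lowered from the article's 10 to 1 so that the tiny sample still leaves something to rank. The names `frequent` and `ranked` are ours.

```python
import pandas as pd

# Made-up rows, for illustration only.
movie_raw_df = pd.DataFrame({
    "Director": ["A", "A", "A", "B", "B", "C"],
    "Rating": [8.0, 9.0, 7.0, 5.0, 6.0, 9.5],
})
MIN_MOVIES = 1  # the article uses 10; lowered to fit the tiny sample

# Average rating per director.
direc_avg_df = movie_raw_df.groupby("Director", as_index=False)["Rating"].mean()

# Number of movies per director.
director_movie_df = (
    movie_raw_df.groupby("Director").size().reset_index(name="no_of_movies")
)

# Combine both tables on the shared 'Director' column.
merged_df_direct = director_movie_df.merge(direc_avg_df, on="Director")

# Keep directors with more than MIN_MOVIES movies, best average first.
frequent = merged_df_direct[merged_df_direct["no_of_movies"] > MIN_MOVIES]
ranked = frequent.sort_values("Rating", ascending=False)
print(ranked)
```

Director ‘C’ has the highest single rating but is excluded by the movie-count filter, which is the point of restricting the question to directors with enough movies.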

The bar plot shows the ten directors with the highest average rating. The x-axis represents the directors and the y-axis represents the average rating.

So we can say from the above analysis and the bar plot that:

  • Among the directors who have directed more than 10 films, Christopher Nolan has the highest average rating, 8.154.

The bar plot shows the ten directors with the lowest average rating. The x-axis represents the directors and the y-axis represents the average rating.

So we can say from the above analysis and the bar plot that:

  • Among the directors who have directed more than 10 films, Andrew Jones has the lowest average rating, 2.809.

Q2: Among decades with more than 100 movies, which decade has the highest average rating?

First we need to create a new series in the ‘movie_raw_df’ DataFrame for the decade.

Let’s create a DataFrame by grouping the data by the ‘decade’ series and applying the ‘mean()’ method to the ‘Rating’ column to find the average rating of each decade.

Let’s count the movies decade-wise.

Now we’ll create a dataframe from the ‘decade_movie_count’ series. The dataframe will contain two columns, one for the decades and the other for the number of movies in each decade.

Let’s merge ‘decade_rating_df’ with ‘decade_count_df’.

Let’s create a new dataframe from ‘merged_decade_df’, excluding all decades with fewer than 100 movies.
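A sketch of the decade analysis on made-up rows. Deriving the decade with integer division is one common approach and may differ from the notebook's; the threshold `MIN_MOVIES` is lowered from 100 to 2 to suit the sample, and the names `popular_decades` and `best` are ours.

```python
import pandas as pd

# Made-up rows, for illustration only.
movie_raw_df = pd.DataFrame({
    "year": [1941, 1947, 1949, 1955, 2011, 2015, 2019],
    "Rating": [7.0, 8.0, 6.0, 9.0, 5.0, 6.0, 7.0],
})
MIN_MOVIES = 2  # the article uses 100; lowered to fit the tiny sample

# Integer division drops the last digit of the year: 1947 becomes 1940.
movie_raw_df["decade"] = (movie_raw_df["year"] // 10) * 10

# Average rating per decade.
decade_rating_df = movie_raw_df.groupby("decade", as_index=False)["Rating"].mean()

# Number of movies per decade, turned into a two-column DataFrame.
decade_movie_count = movie_raw_df["decade"].value_counts()
decade_count_df = (
    decade_movie_count.rename_axis("decade").reset_index(name="no_of_movies")
)

merged_decade_df = decade_rating_df.merge(decade_count_df, on="decade")

# Keep decades with enough movies, best average first.
popular_decades = merged_decade_df[merged_decade_df["no_of_movies"] > MIN_MOVIES]
best = popular_decades.sort_values("Rating", ascending=False)
print(best)
```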

The above bar plot shows the decades with more than 100 movies, ranked by average rating. The x-axis represents the decades and the y-axis represents the average ratings.

So we can conclude from the above analysis and the bar plot that:

  • Movies from the 1940s have the highest average rating, 6.835, followed by movies from the 1930s and then the 1950s, and so on.

Q3: Which 5 directors have the highest average user rating?

First we’ll create a dataframe by grouping our ‘movie_raw_df’ dataframe by the ‘Director’ column and applying the mean() method to the ‘User Rating’ column to find the average user rating for each director.

Now let’s sort the ‘director_user_rating’ dataframe by ‘User Rating’ in descending order.

The above bar plot shows the top 5 directors with the highest average user rating. The x-axis represents the directors and the y-axis represents the average user ratings.

From the above analysis and the barplot we can conclude that:

  • Christopher Nolan has the highest average user rating, with a value of 1245000.
  • Frank Darabont has the second highest average user rating, with a value of 1064250.
  • Roger Allers has the third highest average user rating, with a value of 1000000.
  • David Fincher has the fourth highest average user rating, with a value of 847250.
  • Joss Whedon has the fifth highest average user rating, with a value of 844333.

Q4: Among genres with more than 100 movies, which 5 have the highest average rating?

Let’s group our ‘movie_raw_df’ dataframe by the ‘Generes’ column and apply the mean() method to the ‘Rating’ series to find the average rating of each genre.

We will merge ‘genre_rating_df’ with ‘genre_freq_df’, which we created earlier, to access the number of movies in each genre. We’ll merge them on the ‘Generes’ column, so the values in that column need to match between the two dataframes. The merge pairs rows by value and not by position, so sorting is not strictly required, but we’ll sort both dataframes by the ‘Generes’ series in ascending order to keep them easy to compare.

Now we’ll merge ‘genre_freq_df’ with ‘genre_rating_df’.

Now we will eliminate all genres with fewer than 100 movies.

Let’s sort the ‘new_merged_genre’ dataframe by ‘Rating’ in descending order.

Now we’ll create a new DataFrame from the ‘new_merged_genre’ DataFrame that includes only the five highest rated genres.
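The merge, filter, and top-five steps could be sketched like this on made-up numbers. The two input tables are deliberately in different row order to show that `.merge()` pairs rows by the value of the key column. The names `merged_genre_df` and `top_five` and the column ‘no_of_movies’ are assumptions.

```python
import pandas as pd

# Made-up average ratings per genre.
genre_rating_df = pd.DataFrame({
    "Generes": ["Drama", "Comedy", "Biography,Drama"],
    "Rating": [6.5, 6.0, 6.9],
})

# Made-up movie counts per genre, in a different row order.
genre_freq_df = pd.DataFrame({
    "Generes": ["Comedy", "Biography,Drama", "Drama"],
    "no_of_movies": [80, 150, 900],
})

# merge() pairs rows by the value in 'Generes', not by row position.
merged_genre_df = genre_rating_df.merge(genre_freq_df, on="Generes")

# Keep genres with more than 100 movies.
new_merged_genre = merged_genre_df[merged_genre_df["no_of_movies"] > 100]

# Highest average rating first, then keep the top five.
top_five = new_merged_genre.sort_values("Rating", ascending=False).head(5)
print(top_five)
```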

The above bar plot shows the top 5 genres by average rating among genres with more than 100 movies. The x-axis represents the average ratings and the y-axis represents the genres.

Now we can observe from the above analysis and the barplot that:

  • Movies in the genre [‘Biography’, ‘Drama’, ‘History’] have the highest average rating, 6.908.
  • Movies in the genre [‘Biography’, ‘Drama’] have the second highest average rating, 6.906.
  • Movies in the genre [‘Crime’, ‘Drama’, ‘Film-Noir’] have the third highest average rating, 6.8.
  • Movies in the genre [‘Biography’, ‘Crime’, ‘Drama’] have the fourth highest average rating, 6.73.
  • Movies in the genre [‘Drama’] have the fifth highest average rating, 6.54.

Q5: How many directors have both written and directed movies rated 8 or higher?

Let’s find all the directors who have written their own movies.

Now we will eliminate all movies with a rating below 8.

A director may have written more than one movie rated 8 or higher, so we will consider only the unique values of directors (or writers) in the ‘common_val’ dataframe.
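A sketch of this count on made-up rows. It assumes a ‘Writer’ column and treats a director as the writer only when both columns hold exactly the same name; the real columns may list several names per movie, in which case the comparison would need to be adapted.

```python
import pandas as pd

# Made-up rows, for illustration only.
movie_raw_df = pd.DataFrame({
    "Director": ["A", "A", "B", "C", "D"],
    "Writer": ["A", "A", "X", "C", "D"],
    "Rating": [8.5, 8.1, 9.0, 7.9, 8.0],
})

# Assumption: the director wrote the movie when both names are identical.
common_val = movie_raw_df[movie_raw_df["Director"] == movie_raw_df["Writer"]]

# Keep only movies rated 8 or higher.
common_val = common_val[common_val["Rating"] >= 8]

# A director with several such movies is counted once.
n_directors = common_val["Director"].nunique()
print(n_directors)  # 2
```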

Inferences and Conclusion

Now that we have completed our tasks, let us see what we’ve learned from cleaning and preparing the dataset.

  1. Sometimes data looks correct, but a quick web search can tell us whether it actually is.
  2. While analyzing a big dataset, we should check some of the columns in both ascending and descending order; this can help us find anomalies in the dataset.

And the conclusions drawn from analyzing the dataset are as follows:

  1. In this dataset, the ‘Drama’ genre has the highest number of movies, followed by ‘Action,Crime,Drama’ and ‘Comedy,Drama,Romance’, and so on.
  2. The majority of the movies are rated between 4 and 8.
  3. The majority of the user ratings are between 0 and 1 million.
  4. Ratings tend to increase as the number of user ratings increases.
  5. Woody Allen has directed the highest number of movies, followed by Clint Eastwood and Alfred Hitchcock, and so on.
  6. Movie production appears to have increased roughly exponentially over the last few decades.
  7. Among the directors who have directed more than 10 films, Christopher Nolan has the highest average rating, 8.154.
  8. Among the directors who have directed more than 10 films, Andrew Jones has the lowest average rating, 2.809.
  9. Movies from the 1940s have the highest average rating, 6.835.
  10. Christopher Nolan has the highest average user rating, with a value of 1245000.
  11. So among the directors who have directed more than 10 films, Christopher Nolan has both the highest average rating and the highest average user rating.
  12. Movies in the genre [‘Biography’, ‘Drama’, ‘History’] have the highest average rating, 6.908.
  13. There are 199 directors who are also the writers of their movies rated 8 or higher.

References and Future Work

In the future, this dataset can be extended with two new columns: one for the money spent to make each film and the other for its total box office collection. To do that, we need to find datasets that contain this information or a method for generating it. Then we can find the profit of each film, the average budget and the profit ratio, the relation between rating and box office collection, the relation between user rating and box office collection, the top directors and top writers by average profit, the genres with the highest average box office collection, and so on.

References:

  1. 25k IMDb Movie Dataset: https://www.kaggle.com/datasets/utsh0dey/25k-movie-dataset
  2. cleaning empty cells: https://www.w3schools.com/python/pandas/pandas_cleaning_empty_cells.asp
  3. ‘idxmax()’ method: https://www.geeksforgeeks.org/python-pandas-dataframe-idxmax/
  4. How to Count Distinct Values of a Pandas Dataframe Column?: https://www.geeksforgeeks.org/how-to-count-distinct-values-of-a-pandas-dataframe-column/
  5. Pandas removing all special characters from columns: https://stackoverflow.com/questions/55299583/pandas-removing-all-special-characters-from-columns
  6. Analyzing Tabular Data with Pandas: https://jovian.com/learn/data-analysis-with-python-zero-to-pandas/lesson/lesson-4-analyzing-tabular-data-with-pandas
  7. Visualization with Matplotlib and Seaborn: https://jovian.com/learn/data-analysis-with-python-zero-to-pandas/lesson/lesson-5-data-visualization-with-matplotlib-and-seaborn
  8. pandas official documentation: https://pandas.pydata.org/docs/getting_started/intro_tutorials
  9. Converting K and M to numerical form in Pandas DataFrame: https://www.skytowner.com/explore/converting_k_and_m_to_numerical_form_in_pandas_dataframe
  10. pandas dataframe group year index by decade: https://stackoverflow.com/questions/17764619/pandas-dataframe-group-year-index-by-decade
