Exploratory Data Analysis (EDA) and Visualization Using Python — Netflix Original Films & IMDB Scores

Alina Tabish
5 min read · Sep 4, 2021


Photo by Alexander Shatov on Unsplash

Netflix’s popularity continues to rise year after year. Netflix is a subscription-based streaming service that allows users to watch TV shows and movies without commercials on any internet-connected device.

Its content varies by location and is subject to change over time. You can view award-winning Netflix Originals, TV series, movies, documentaries, and more.

This EDA explores the Netflix Original Films and IMDB Score dataset through visualizations and graphs, using libraries such as Pandas, NumPy, Matplotlib, and Seaborn.

The Netflix Original Films & IMDB Scores dataset used for this EDA has been downloaded from Kaggle. This dataset consists of all Netflix original films released as of June 1st, 2021, and their IMDB Scores.

The columns included in this dataset are:

  • Title of the film
  • The genre of the film
  • Original premiere date
  • Runtime in minutes
  • IMDB scores
  • Languages available

Importing The Relevant Libraries

import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlretrieve

Loading The Dataset

Using the pandas library, we’ll read the CSV file and store it in the dataframe ‘df’.

# build the path to the data file by dropping the last 15 characters
# of the current working directory (this depends on the local folder layout)
abs_path = os.getcwd()[:-15]
data_path = abs_path + "NetflixOriginals.csv"

# read the CSV into a pandas dataframe
df = pd.read_csv(data_path, encoding='Windows-1252')

Let’s see the content of our dataframe.
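A quick way to do this in a notebook is to display the first few rows. The sketch below wraps that in a small helper; the name `preview` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def preview(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the first n rows so column names and value formats can be inspected."""
    return df.head(n)
```

With the dataframe loaded above, `preview(df)` would show the first five films.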

Data Preparation And Cleaning

Prepping the data for the exploratory analysis

We can see that this dataset is pretty small, with only 6 columns and 584 rows.

The dataset was already cleaned by its original author, so there are no null values in this dataframe.
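Both claims can be checked directly with pandas. The following is a minimal sketch; the helper name `summarize_shape_and_nulls` is hypothetical.

```python
import pandas as pd


def summarize_shape_and_nulls(df: pd.DataFrame):
    """Return the row count, the column count, and the number of nulls per column."""
    rows, cols = df.shape
    null_counts = df.isnull().sum()  # one entry per column
    return rows, cols, null_counts
```

For the dataframe described here, this should report 584 rows, 6 columns, and a null count of zero for every column.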

Since we will need to analyze the premiere years and months of the films later on, we’ll add columns for the release year and release month of each film.
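One way to derive these columns is sketched below. It assumes the premiere date lives in a column named `Premiere` and holds strings that `pd.to_datetime` can parse; the new column names `Year` and `Month` are illustrative choices, not necessarily the ones used in the original notebook.

```python
import pandas as pd


def add_release_parts(df: pd.DataFrame, date_col: str = "Premiere") -> pd.DataFrame:
    """Return a copy of df with 'Year' and 'Month' columns derived from date_col."""
    out = df.copy()  # leave the caller's dataframe untouched
    dates = pd.to_datetime(out[date_col])
    out["Year"] = dates.dt.year
    out["Month"] = dates.dt.month  # 1 = January, ..., 12 = December
    return out
```

Calling `df = add_release_parts(df)` would then make the year and month available for grouping.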

Exploratory Analysis And Visualizations

Examining the data, let’s see if we can find any interesting statistics and observations.

# calculating the average of the IMDB scores of the netflix films

df_avg = df['IMDB Score'].mean() # the pandas mean() function returns the mean of the selected column

print('The overall average of all the IMDB scores of netflix films out of 10 is: ', df_avg)

The overall average of all the IMDB scores of netflix films out of 10 is: 6.27174657534246

Our next step is to look at how the IMDB scores are distributed across the films using a histogram.

A histogram is a graph that shows the frequency distribution of a variable and is often used for numeric data.
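A histogram of the scores could be drawn along these lines. This is a sketch using plain Matplotlib; the number of bins and the helper name `plot_score_histogram` are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_score_histogram(df: pd.DataFrame, score_col: str = "IMDB Score", bins: int = 20):
    """Draw a histogram of the scores in score_col and return the Axes."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(df[score_col], bins=bins, edgecolor="black")
    ax.set_xlabel("IMDB score")
    ax.set_ylabel("Number of films")
    ax.set_title("Distribution of IMDB scores")
    return ax
```

In a notebook with `%matplotlib inline`, calling `plot_score_histogram(df)` displays the chart below the cell.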

This histogram shows us that most of the films’ IMDB scores lie roughly between 6.3 and 7.1.

Let’s see which genres the films belong to, and how many films each genre has, using a bar chart.

A bar chart is a type of graph that uses rectangular bars to represent data. One axis of a bar chart measures a value, while the other axis lists the categories being compared.
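Counting films per genre and plotting the largest groups might look like the sketch below. It assumes the genre is stored in a column named `Genre`; the cut-off `top_n` and the helper name are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_genre_counts(df: pd.DataFrame, genre_col: str = "Genre", top_n: int = 10):
    """Plot the top_n most frequent genres; return the counts and the Axes."""
    counts = df[genre_col].value_counts().head(top_n)
    fig, ax = plt.subplots(figsize=(8, 5))
    # sort ascending so the most common genre ends up at the top of the chart
    counts.sort_values().plot(kind="barh", ax=ax)
    ax.set_xlabel("Number of films")
    ax.set_ylabel("Genre")
    return counts, ax
```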

From this, we can see that the documentary genre accounts for the largest share of our data.

Next, let’s look at the IMDB scores of each genre using a bar chart. We’ll keep only the genres that have more than 10 films.
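The filtering and averaging step can be sketched as follows, assuming a `Genre` column alongside the `IMDB Score` column. The helper name is hypothetical, and the mean is one reasonable way to summarize a genre's scores.

```python
import pandas as pd


def mean_score_for_common_genres(
    df: pd.DataFrame,
    min_films: int = 10,
    genre_col: str = "Genre",
    score_col: str = "IMDB Score",
) -> pd.Series:
    """Mean score per genre, keeping only genres with more than min_films films."""
    stats = df.groupby(genre_col)[score_col].agg(["count", "mean"])
    common = stats[stats["count"] > min_films]
    return common["mean"].sort_values(ascending=False)
```

The returned Series can be passed straight to `.plot(kind="bar")` to produce the chart.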

Turning to the languages, we can see that the vast majority of the films in this dataset are in English.
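The share of films per language can be computed in one line with `value_counts(normalize=True)`. The sketch below assumes the language is stored in a column named `Language`.

```python
import pandas as pd


def language_share(df: pd.DataFrame, lang_col: str = "Language") -> pd.Series:
    """Fraction of films per language, largest first."""
    return df[lang_col].value_counts(normalize=True)
```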

Asking And Answering Questions About The Data

In this section, we’ll pose some intriguing questions about our data and see how we can address them using Pandas, Matplotlib, and Seaborn.

Question 1:

When it comes to movies, which genres have been the most popular?

By analyzing their IMDB scores, we want to see how the genres have evolved and which genres have been the most popular over the years.

Let’s analyze the genres of the movies over the years using a line plot.

A line plot connects data points with line segments, which makes it well suited to showing how a value changes over time.

Since there are more than 100 different genres, we’ll only analyze the common ones, which have a film count greater than 5.
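A sketch of this step is shown below. It assumes `Genre` and `Premiere` columns, uses the yearly mean score as the plotted value, and introduces two hypothetical helpers; the original notebook may have aggregated differently.

```python
import matplotlib.pyplot as plt
import pandas as pd


def yearly_mean_scores(
    df: pd.DataFrame,
    min_films: int = 5,
    genre_col: str = "Genre",
    date_col: str = "Premiere",
    score_col: str = "IMDB Score",
) -> pd.DataFrame:
    """Table of mean score per year (rows) and genre (columns) for common genres."""
    data = df.copy()
    data["Year"] = pd.to_datetime(data[date_col]).dt.year
    counts = data[genre_col].value_counts()
    common = counts[counts > min_films].index  # genres with more than min_films films
    data = data[data[genre_col].isin(common)]
    return data.pivot_table(index="Year", columns=genre_col, values=score_col, aggfunc="mean")


def plot_yearly_mean_scores(table: pd.DataFrame):
    """Draw one line per genre and return the Axes."""
    fig, ax = plt.subplots(figsize=(10, 6))
    table.plot(ax=ax, marker="o")
    ax.set_xlabel("Release year")
    ax.set_ylabel("Mean IMDB score")
    return ax
```

Usage would be `plot_yearly_mean_scores(yearly_mean_scores(df))`.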

The timeline graph shows the IMDB scores of each genre over the years. The top 3 genres with the highest IMDB Scores are Concert Film, Crime Drama, and Documentary.

We can use a pie chart to examine how many films fall into each of these categories.

A pie chart represents data as a circular graph. The entire “pie” represents 100 percent of a whole, while the pie “slices” represent portions of the whole.
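A pie chart of the film counts for a chosen set of genres could be sketched like this. The helper name is hypothetical, and the genre labels passed in must match the spelling used in the `Genre` column, which is an assumption here.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_genre_pie(df: pd.DataFrame, genres, genre_col: str = "Genre"):
    """Pie chart of film counts for the given genres; returns the counts and the Axes."""
    counts = df[df[genre_col].isin(genres)][genre_col].value_counts()
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.pie(counts.values, labels=list(counts.index), autopct="%1.1f%%", startangle=90)
    ax.set_title("Films per genre")
    return counts, ax
```

For the three genres named above, the call would be `plot_genre_pie(df, ["Concert Film", "Crime Drama", "Documentary"])`, adjusting the labels to the exact values in the data.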

We can see from this pie chart that about 90% of the films in these three genres are documentaries, which makes Documentary one of the best-loved genres over the years.

Question 2:

Does the type of movie depend on its release month?

Let’s try to answer whether the types or genres of films changed according to when they were released. For example, were Christmas movies primarily released in December, when Christmas is close, or were patriotic films released more often in July?

We will look at the release months of movies tied to common American public holidays and celebrations such as Christmas, Halloween, Veterans Day, Easter, and others.
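Counting the releases of one genre per calendar month is enough to check this. The sketch below assumes `Genre` and `Premiere` columns and a genre label such as `"Horror"`; the helper name is hypothetical.

```python
import pandas as pd


def release_month_counts(
    df: pd.DataFrame,
    genre: str,
    genre_col: str = "Genre",
    date_col: str = "Premiere",
) -> pd.Series:
    """Number of films of the given genre released in each month (1 to 12)."""
    subset = df[df[genre_col] == genre]
    months = pd.to_datetime(subset[date_col]).dt.month
    # reindex so that months without any release show up as 0
    return months.value_counts().reindex(range(1, 13), fill_value=0)
```

`release_month_counts(df, "Horror").plot(kind="bar")` would then show how horror releases are spread over the year.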

We can observe that the majority of horror films were released in October, the month of Halloween.

Next, we’ll look at Christmas movies. Since there is no specific genre for ‘Christmas’, we will select the movies whose Title column contains the string ‘Christmas’.
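The title filter can be written with `Series.str.contains`. This is a sketch that assumes a `Premiere` column for the dates; matching case-insensitively is an extra choice made here, and the helper name is hypothetical.

```python
import pandas as pd


def keyword_release_months(
    df: pd.DataFrame,
    keyword: str = "Christmas",
    title_col: str = "Title",
    date_col: str = "Premiere",
) -> pd.Series:
    """Count films whose title contains keyword, grouped by release month name."""
    mask = df[title_col].str.contains(keyword, case=False, na=False, regex=False)
    months = pd.to_datetime(df.loc[mask, date_col]).dt.month_name()
    return months.value_counts()
```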

Clearly, all of the Christmas movies have been released in November or December.

Question 3:

How well did movies in languages other than English do?

Finally, we’ll see if Netflix should continue to offer non-English language films.
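One way to compare languages is the mean IMDB score per language, excluding English. The sketch below assumes a `Language` column in which English-only films are labelled exactly `"English"`; the helper name is hypothetical.

```python
import pandas as pd


def mean_score_by_language(
    df: pd.DataFrame,
    exclude: str = "English",
    lang_col: str = "Language",
    score_col: str = "IMDB Score",
) -> pd.Series:
    """Mean score per language, highest first, leaving out the excluded language."""
    others = df[df[lang_col] != exclude]
    return others.groupby(lang_col)[score_col].mean().sort_values(ascending=False)
```

Languages with very few films will have noisy averages, so it can help to look at the film counts alongside the means.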

While the majority of non-English films are doing well on IMDB, the Italian and Indonesian films are not.

Netflix should continue to release films in Spanish, Japanese, and Portuguese, since these are among the best-rated non-English films.

Conclusion

Some really intriguing insights and visuals were presented in the previous section. So many questions remain to be answered regarding this data, but for the time being, just a handful have been addressed.

The main findings are summarised below:

  1. The majority of the films have IMDB ratings ranging from 6.3 to 7.1.
  2. The most popular genre of Netflix Originals is the Documentary genre.
  3. English is the most commonly used language in these films.
  4. The top 3 genres with the highest IMDB Scores are Concert Film, Crime Drama, and Documentary.
  5. Most of the horror films were released in October, the month of Halloween, and most of the Christmas movies were released in either November or December.
  6. Other than English-language films, Spanish, Japanese, and Portuguese films are among the highest rated.
