Netflix Data Analysis — part 2: EDA with Pandas and Matplotlib

Published in Women in Technology · 6 min read · Oct 23, 2023

The ability to extract meaningful insight from a dataset stands as the quintessential superpower of a data enthusiast. Exploratory Data Analysis, or EDA, serves as a vital compass leading data detectives through their vast journey among numbers and patterns.

In the preceding article (you can take a look here), I covered data cleaning with Python, which laid the groundwork for the second phase of this project. Now data visualization, our unbeatable superpower, will help us unlock the darkest secrets of our Netflix dataset.

Whether you’re an experienced data scientist, a Python beginner, or anywhere in between, I’d like to invite you to join me on this exciting journey, where together we’ll explore and unveil diverse insights about Netflix.

As we navigate through our EDA, we’ll be addressing questions like:

1. How many titles has Netflix collected

2. What type of content does Netflix offer to its members

3. Percentage distribution of each type of content

4. Which genres are most common on Netflix

5. Number of shows based on the type and rating

6. Which country has the highest production output of movies and TV shows?

7. How long the content is

8. Top 5 directors ranked by the number of titles they have produced

9. How Netflix has evolved, and the number of titles it has amassed from before 2000 to the present

So let’s get started 😃

1. Import the needed libraries and the dataset into Jupyter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(r".../Netflix_Cleaned_Dataset.csv")

The dataset I used is the one I cleaned in the previous article; you can find it on my GitHub.

2. Let’s take a closer look at our dataset

df

Result:

3. After that let’s determine the number of titles that Netflix possesses for our entertainment:

total_titles= df['show_id'].count()
print('Our dataset contains a total number of ' + total_titles.astype(str) + ' titles')

Result:

4. Now, let’s explore the types of content available on Netflix, addressing the second question

df_type = pd.DataFrame(df.groupby('type')['show_id'].count()).reset_index()
df_type.columns = ['Type','Number of titles']

vals = pd.Series(df_type['Number of titles'].values)

colors=['#D00202', '#564d4d']
plt.bar(df_type['Type'],height = df_type['Number of titles'], color = colors)
plt.xlabel('Type')
plt.ylabel('Number of titles')
plt.title('Titles by type')

for index,value in enumerate(vals):
    plt.text(index, value, str(value), ha='center', va='bottom', fontsize= 9, fontweight='semibold')
    
plt.show()

Result:

Here we can see that most of the titles available on Netflix are movies (6131), while only 2676 are TV shows.

5. How are those distributed? (question number 3)

# Share of each type, rounded to whole percentages (kept numeric so they can be plotted directly)
movies_percentage = round(df[df['type'] == 'Movie']['show_id'].count() / total_titles * 100)
tv_shows_percentage = round(df[df['type'] == 'TV Show']['show_id'].count() / total_titles * 100)

colors=['#D00202', '#564d4d']
slices = np.array([movies_percentage, tv_shows_percentage])
pie_labels=["Movie","TV Show"]
explode = (0.05, 0.05)
plt.pie(slices, labels=pie_labels , autopct='%1.0f%%', colors= colors, explode=explode, pctdistance=0.80, textprops={'color': 'white',  'weight': 'bold', 'fontsize': 11})

plt.legend(labels=pie_labels, loc='lower left')
plt.title(f'The Netflix dataset contains {movies_percentage:.0f}% Movies and {tv_shows_percentage:.0f}% TV Shows')
plt.show()

Result:

Approximately 70% of the titles available on Netflix belong to the category of movies, while only around 30% fall under the TV shows category.
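The counts and the percentage split can also be obtained with `value_counts`. The following is a minimal sketch on a small made-up frame (the `toy` data is illustrative and is not the Netflix dataset), assuming a `type` column like the one used above:

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; the values are illustrative only
toy = pd.DataFrame({
    'show_id': [f's{i}' for i in range(1, 11)],
    'type': ['Movie'] * 7 + ['TV Show'] * 3,
})

# Absolute number of titles per type, sorted in descending order
titles_by_type = toy['type'].value_counts()

# normalize=True returns fractions; scale to percentages and round
share_by_type = toy['type'].value_counts(normalize=True).mul(100).round(0)

print(titles_by_type)
print(share_by_type)
```

Because both results are indexed by the type label, they can be passed straight to `plt.bar` or `plt.pie` without building an intermediate DataFrame.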

6. Which genres are most common on Netflix?


df_movies_by_genre = pd.DataFrame(df.groupby('genre')['show_id'].count()).reset_index()
df_movies_by_genre.columns = ['genre','titles_number']
df_movies_by_genre = df_movies_by_genre.sort_values('titles_number',ascending=False).head(5)

# The grouped frame has no 'color' column, so a single color from the palette used above is passed instead
plt.bar(df_movies_by_genre['genre'],df_movies_by_genre['titles_number'], color = '#D00202')
plt.xlabel('Genre')
plt.ylabel('Number of titles')
plt.title('Most favored genres')
plt.xticks(rotation = 45)
plt.show()

Result:

Dramas and Comedies take the top two spots, while Action & Adventure, Documentaries, and International TV Shows fill the remaining three places in the top five.
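The ranking above treats each title as having a single genre. If the `genre` column held several comma-separated genres per title (an assumption; the cleaned dataset may already store only one), each genre could be counted separately. A rough sketch on made-up data:

```python
import pandas as pd

# Toy data; the multi-genre strings are an assumption for illustration
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's3', 's4'],
    'genre': ['Dramas, Comedies', 'Dramas', 'Documentaries', 'Comedies, Dramas'],
})

genre_counts = (
    toy['genre']
    .str.split(',')   # 'Dramas, Comedies' -> ['Dramas', ' Comedies']
    .explode()        # one row per (title, genre) pair
    .str.strip()      # drop the whitespace left over from the split
    .value_counts()   # count titles per genre, largest first
)

top_genres = genre_counts.head(5)
print(top_genres)
```

With this approach a title contributes to every genre it lists, so the counts add up to more than the number of titles.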

7. Number of shows based on the type/category and rating

df_titles_by_rating_and_type = pd.DataFrame(df.groupby(['rating','type'])['show_id'].count()) # move to data cleaning -- for verification
df_titles_by_rating_and_type

Result:

8. Which country has the highest production output of movies and TV shows?

# For multi-country titles keep the last listed country, then count titles per country
df_country = df.assign(country=df['country'].str.split(',').str[-1].str.strip())
df_titles_by_country = df_country.groupby('country')['show_id'].count().reset_index()
df_titles_by_country.columns = ['country', 'title no']

df_titles_by_country = df_titles_by_country.sort_values(by='title no', ascending= False).head(10)
top_10_countries_desc = df_titles_by_country.sort_values(by='title no', ascending=True)

plt.barh(top_10_countries_desc['country'], top_10_countries_desc['title no'], color='#D00202')
plt.xlabel('Number of titles')
plt.ylabel('Countries')
plt.title('Titles by country')

plt.show()

Result:

Based on the visual representation above we can see that the United States has been a major producer of both movies and TV shows, followed by India and the UK.

9. Duration of content based on type (Movies, TV Shows) — uncovering the top 10 based on the number of titles

df_titles_by_duration = pd.DataFrame(df.groupby(['type','duration'])['show_id'].count()).reset_index()
df_titles_by_duration.columns = ['type','duration','number of titles']
df_titles_by_duration

movies_data = df_titles_by_duration[df_titles_by_duration['type'] == 'Movie']
movies_data_sorted = movies_data.sort_values(by='number of titles', ascending= False).head(10) # TODO: revisit the sorting
top_10_movies_desc = movies_data_sorted.sort_values(by='number of titles', ascending=True)

tv_shows_data = df_titles_by_duration[df_titles_by_duration['type']=='TV Show']
tv_show_data_sorted = tv_shows_data.sort_values(by='number of titles', ascending= False).head(10)
top_10_tv_show_desc = tv_show_data_sorted.sort_values(by='number of titles', ascending=True)

colors_movies = top_10_movies_desc['number of titles'].apply(lambda y: '#564d4d' if y < 140 else '#D00202')
colors_tv_shows = top_10_tv_show_desc['number of titles'].apply(lambda y: '#564d4d' if y < 50 else '#D00202')

plt.figure(figsize=(12, 7))

plt.subplot(1, 2, 1) # 1 row, 2 columns, first panel
plt.barh(top_10_movies_desc['duration'], top_10_movies_desc['number of titles'], color=colors_movies)
plt.title("Top 10 Movies Length")
plt.xlabel('Number of titles')
plt.ylabel('Duration')

plt.subplot(1, 2, 2) # second panel
plt.barh(top_10_tv_show_desc['duration'], top_10_tv_show_desc['number of titles'], color=colors_tv_shows)
plt.title("Top 10 TV Shows Length")
plt.xlabel('Number of titles')
plt.ylabel('Duration')

plt.tight_layout()
plt.show()

Result:

We observe that the majority of movies require a minimum viewing time of 90 minutes, whereas only a few TV shows extend beyond a single season.
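To back up a statement like this with numbers, the duration strings can be converted to integers and summarized per type. The sketch below assumes values shaped like `'90 min'` and `'2 Seasons'` (an assumption about the column's format) and uses made-up rows:

```python
import pandas as pd

# Toy data; the duration format is assumed, the values are illustrative
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'Movie', 'TV Show', 'TV Show'],
    'duration': ['90 min', '104 min', '85 min', '1 Season', '3 Seasons'],
})

# Pull the leading integer out of each duration string
toy['duration_value'] = (
    toy['duration'].str.extract(r'(\d+)', expand=False).astype(int)
)

# Median minutes for movies, median number of seasons for TV shows
median_by_type = toy.groupby('type')['duration_value'].median()
print(median_by_type)
```

The two types use different units (minutes and seasons), so the medians should be read separately and never compared with each other.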

10. Top 5 directors ranked by the number of titles they have produced

titles_per_director = df.groupby('director')['show_id'].count().astype(int).reset_index()
titles_per_director.columns = ['Director','Titles No']

director_titles = titles_per_director.sort_values(['Titles No','Director'], ascending=False).iloc[1:6]
director_titles = director_titles.sort_values('Titles No', ascending = True)
plt.barh(director_titles['Director'], director_titles['Titles No'], color = '#D00202')
plt.title('Top 5 directors by number of Titles')
plt.ylabel('Director')
plt.xlabel('Number of Titles')


for index,value in enumerate(director_titles['Titles No']):
    plt.text(value, index, str(value), ha='left', va='center', fontsize= 9, fontweight='semibold')

plt.show()

Result:

In the ranking, Rajiv Chilaka clearly stands out with around 22 titles, while none of the others exceeded 20.
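The code above drops the first row of the sorted ranking with `iloc[1:6]`, presumably because that row is a placeholder for titles without a known director. A more explicit variant filters the placeholder by name. In this sketch the label `'Not Given'` is hypothetical and the data is made up:

```python
import pandas as pd

PLACEHOLDER = 'Not Given'  # hypothetical label for missing directors

# Toy data for illustration only
toy = pd.DataFrame({
    'show_id': [f's{i}' for i in range(1, 8)],
    'director': ['Not Given', 'Not Given', 'Not Given', 'A', 'A', 'B', 'C'],
})

# Drop rows without a real director, then rank by number of titles
known = toy[toy['director'] != PLACEHOLDER]
top_directors = known['director'].value_counts().head(5)
print(top_directors)
```

Filtering by value keeps the ranking correct even if the placeholder is not the most frequent entry.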

11. How Netflix has evolved

df_titles_per_year = df.groupby('release_year')['show_id'].count().reset_index()
df_titles_per_year.columns = ['Year','Number of titles']

df_titles_after_1997 = df_titles_per_year[(df_titles_per_year['Year'] > 1997)]
max_year = df_titles_per_year['Number of titles'].idxmax()
    
plt.plot(df_titles_after_1997['Year'], df_titles_after_1997['Number of titles'], color='white',linewidth=3)

plt.xlabel('Year')
plt.ylabel('Number of titles')
plt.title('Netflix evolution')
plt.gca().set_facecolor('black')
plt.show()

# idxmax() returns an index label, so look the year up with .loc rather than .iloc
message = '\033[1m{:^80}'.format('The year with the maximum number of titles is: {}'.format(df_titles_per_year['Year'].loc[max_year]))

print(message)

Result:

The growth was steady, and around 2018 the number of titles per release year surpassed 1000.
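One detail worth noting when locating the peak year: `idxmax()` returns an index label, not a position. The two coincide on a freshly reset index but diverge after filtering. A small sketch on made-up numbers:

```python
import pandas as pd

# Toy yearly counts; the values are illustrative only
titles_per_year = pd.DataFrame({
    'Year': [2016, 2017, 2018, 2019],
    'Number of titles': [5, 8, 12, 9],
})

# Filtering keeps the original index labels (1, 2, 3)
after_2016 = titles_per_year[titles_per_year['Year'] > 2016]

# idxmax() gives the label of the row with the largest count
peak_label = after_2016['Number of titles'].idxmax()

# .loc looks the row up by label, which matches what idxmax() returned
peak_year = after_2016.loc[peak_label, 'Year']
print(peak_year)
```

Using `.iloc[peak_label]` on the filtered frame would select a different row, which is why label-based lookup is the safer choice here.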

Conclusion

Based on this analysis we concluded the following:

Netflix offers more movies than TV shows
Dramas are the most common genre
The biggest producer is the US, followed by India and the UK
Most movies keep us in front of the screen for no less than 90 minutes, and most TV shows have only one season
One of the most prolific directors by number of titles is Rajiv Chilaka, the CEO of Green Gold Animation Pvt Ltd, followed by Raul Campos, Marcus Raboy, Suhas Kadav, and Jay Karas
The peak was 2018, the release year in which the number of movies and TV shows exceeded 1000 titles.

You can find the complete project on my GitHub repository.

Thank you so much for your support, it means a lot to me.

If you found this article interesting and helpful, you have the option to support my work here ☕😊

P.S.: Visit my medium and embark on an exciting journey of discovery today. Happy reading!

Written by Luchiana Dumitrescu