Exploring Netflix Titles: Data Analysis using Python

Marat Pekker
7 min read · Nov 1, 2023


[Importing my older blog posts into Medium]

Introduction

In this blog post, we will explore the Netflix Titles dataset and provide insights into how it can be analyzed. We will cover key concepts and techniques in data analysis, including data cleaning, exploratory data analysis, and data visualization.

Using Python and popular data analysis libraries such as Pandas, NumPy, and Matplotlib, we will demonstrate how to extract meaningful insights from the Netflix Titles dataset. Whether you’re a data scientist, a business analyst, or just someone interested in the world of entertainment, this post will show how data analysis can be applied to a streaming catalog.

Libraries

  1. Pandas is a powerful data analysis library for Python. It provides easy-to-use data structures and data analysis tools for handling and manipulating large datasets, such as those found in the Netflix Titles dataset. With Pandas, you can easily clean, transform, and aggregate data, and perform complex data analysis tasks with ease.
  2. NumPy is a popular numerical computing library for Python. It provides support for large, multi-dimensional arrays and matrices, and includes a wide range of mathematical functions for performing operations on these arrays. NumPy is widely used in scientific computing, data analysis, and machine learning, and is a key tool for working with numerical data in the Netflix Titles dataset.
  3. Matplotlib is a data visualization library for Python. It provides a wide range of visualization tools for creating high-quality charts, graphs, and other visual representations of data. With Matplotlib, you can create line plots, scatter plots, histograms, bar charts, and more, making it an invaluable tool for exploring and presenting insights from the Netflix Titles dataset.
  4. Seaborn is another Python data visualization library built on top of Matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics, and is particularly well-suited to visualizations that combine multiple layers of information, such as heatmaps, clustermaps, and pair plots. It also includes a wide range of customization options, such as color palettes, themes, and plot styles, allowing you to create visualizations tailored to your specific needs.
  5. WordCloud is a Python library for generating word clouds, which are graphical representations of word frequency in a text document.

Dataset

For our analysis of the Netflix Titles, we will be using a readily available dataset that was downloaded from Kaggle, a popular platform for data science and machine learning. The dataset contains information on over 13,000 titles available on Netflix as of September 2021, including movies, TV shows, documentaries, and other content from around the world.

The dataset is provided in a CSV (comma-separated values) file format, which can be easily imported into a data analysis tool like Pandas. The data includes information such as the title, director, cast, country, release year, rating, and more, making it a rich source of information for our analysis.

By using a readily available dataset like this, we can focus our analysis on the key concepts and techniques of data analysis, without the need to spend time collecting the data ourselves; only light cleaning will be needed.

Let’s begin!

Import Libraries

First, we import the libraries:

Now we need to get the data. Our dataset file is named ‘netflix_titles.csv’. Place it in the same directory as your Python file so that it can be loaded by name.

Load the Data

We need to load the data into a dataframe. In computer science and data analysis, a dataframe is a two-dimensional, table-like data structure that stores data organized into rows and columns.

In Pandas, a dataframe is a key data structure that provides several methods and functions for handling and analyzing data.

df = pd.read_csv('netflix_titles.csv')

We now have our data stored in “df” (a dataframe).

Let’s see the number of rows and the number of columns in this dataframe. In other words, we want to see the dimensions of the dataframe.

We can also list the column names of df.
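The commands for these two checks are not shown above. A minimal sketch, assuming the standard Pandas attributes `shape` and `columns`, and using a small made-up dataframe (`sample`, hypothetical) in place of the real file so the snippet runs on its own:

```python
import pandas as pd

# Tiny stand-in for the real dataset; the rows are made up for illustration.
sample = pd.DataFrame({
    "title": ["Show A", "Movie B", "Movie C"],
    "type": ["TV Show", "Movie", "Movie"],
    "release_year": [2020, 2019, 2020],
})

n_rows, n_cols = sample.shape          # (number of rows, number of columns)
column_names = list(sample.columns)    # column labels in order

print(n_rows, n_cols)    # 3 3
print(column_names)      # ['title', 'type', 'release_year']
```

On the real data, the same attributes would be read as `df.shape` and `df.columns`.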

If you want to see summary statistics for this df, you can also run the following command (note the parentheses; without them, Pandas only returns the method itself instead of calling it)

df.describe()

Cleaning the Data

df.isnull().sum()

This command shows the number of missing (NULL/NaN) values in each column, so we can see which fields need cleaning.

Since we already have the release year, we don’t need the date_added column for this analysis, so let’s remove it from our dataframe.
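The command for this step is not shown. One way to do it, sketched here on a small made-up dataframe (`sample`, hypothetical), is Pandas’ `drop` with the `columns` argument:

```python
import pandas as pd

# Made-up rows that mimic the relevant columns of the dataset.
sample = pd.DataFrame({
    "title": ["Show A", "Movie B"],
    "date_added": ["September 25, 2021", "September 24, 2021"],
    "release_year": [2020, 2019],
})

# drop() returns a new dataframe, so assign the result back.
sample = sample.drop(columns=["date_added"])

print(list(sample.columns))   # ['title', 'release_year']
```

On the real data this would read `df = df.drop(columns=['date_added'])`.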

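We assume that all titles appear on Netflix US, so we can replace missing (NaN) countries with the United States. Assigning the result back to the column is more reliable than calling replace with inplace=True on a single column, which recent Pandas versions warn about.

df['country'] = df['country'].fillna('United States')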

If the rating field is missing, let’s assume it is TV-MA, which stands for mature audiences only.
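The command for this step is not shown either. A minimal sketch using `fillna`, on a made-up dataframe (`sample`, hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up rows; one rating is missing.
sample = pd.DataFrame({
    "title": ["Show A", "Movie B", "Movie C"],
    "rating": ["TV-14", np.nan, "PG-13"],
})

# Replace missing ratings with the assumed default.
sample["rating"] = sample["rating"].fillna("TV-MA")

print(sample["rating"].tolist())   # ['TV-14', 'TV-MA', 'PG-13']
```

On the real data this would read `df['rating'] = df['rating'].fillna('TV-MA')`. Keep in mind that this is an assumption about the data, not a fact, and it will inflate the TV-MA share in later charts.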

Now we can preview the first rows of the cleaned dataframe (by default, 5 rows are returned).
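This is most likely Pandas’ `head` method, whose default is 5 rows. A small sketch on a made-up dataframe (`sample`, hypothetical):

```python
import pandas as pd

# Eight made-up rows so the effect of head() is visible.
sample = pd.DataFrame({"title": [f"Title {i}" for i in range(8)]})

first_five = sample.head()      # default: first 5 rows
first_three = sample.head(3)    # explicit row count

print(len(first_five), len(first_three))   # 5 3
```

On the real data, `df.head()` shows the first five titles and `df.head(10)` the first ten.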

Exploratory Data Analysis

Now that we have loaded and cleaned the data, let’s see what we can get from it.

For example, let’s see if Netflix released more TV Shows than movies back in 2020.

movies = df[df['type'] == 'Movie']
movies2020 = movies[movies['release_year'] == 2020]
movies2020.count()

Now let’s do the same for TV Shows

tvShows = df[df['type'] == 'TV Show']
tvShows2020 = tvShows[tvShows['release_year'] == 2020]
tvShows2020.count()
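Note that `count()` on a dataframe returns the number of non-null values for every column, so the two results have to be compared column by column. As an alternative sketch (not the approach used above), the 2020 titles can be filtered once and counted per type, giving one number for each; the dataframe `sample` is made up for illustration:

```python
import pandas as pd

# Made-up rows: three titles from 2020, two from 2019.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "release_year": [2020, 2020, 2019, 2020, 2019],
})

# Filter to 2020 once, then count rows per type.
counts_2020 = sample.loc[sample["release_year"] == 2020, "type"].value_counts()

print(counts_2020["Movie"], counts_2020["TV Show"])   # 2 1
```

On the real data the same expression would start from `df` instead of `sample`.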

What is the most common movie category released?

movies.listed_in.value_counts()

What is the most common movie category released in 2020?

movies2020.listed_in.value_counts()

What is the most common TV show category released?

tvShows.listed_in.value_counts()

What is the most common TV show category released in 2020?

tvShows2020.listed_in.value_counts()
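One caveat: in this dataset the listed_in column holds comma-separated lists of categories, so `value_counts()` on the raw column counts combinations of categories rather than individual ones. A sketch of how individual categories could be counted instead, using `str.split` and `explode` on a made-up dataframe (`sample`, hypothetical):

```python
import pandas as pd

# Made-up rows with comma-separated categories, as in listed_in.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies",
        "Dramas, Comedies",
        "Dramas",
    ],
})

genre_counts = (
    sample["listed_in"]
    .str.split(",")      # each cell becomes a list of categories
    .explode()           # one row per (title, category) pair
    .str.strip()         # remove the space left after each comma
    .value_counts()
)

print(genre_counts["Dramas"])     # 3
print(genre_counts["Comedies"])   # 2
```

The same approach should work for other comma-separated columns, such as country.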

What are the top 3 countries for movie releases?

movies.country.value_counts().head(3)

What are the top 3 countries for movie releases in 2020?

movies2020.country.value_counts().head(3)

What are the top 3 countries for TV show releases?

tvShows.country.value_counts().head(3)

What are the top 3 countries for TV Show releases in 2020?

tvShows2020.country.value_counts().head(3)

Data Visualization

Let’s build a pie chart that shows a breakdown of content ratings. We want to show percentages with one decimal place (for example, 18.5%), so we use the ‘%1.1f%%’ format.

df['rating'].value_counts().plot.pie(autopct='%1.1f%%', shadow=True, figsize=(10,8))
plt.show()

As you can see, this pie chart looks a bit messy, so let’s clean it up by removing ratings that account for less than 0.5% of titles. Note: depending on your project, you would need to decide whether it makes sense to drop some of the data. Also remember that we assumed every title with a missing rating is TV-MA (mature content), which might not be true, and that this dataset is somewhat outdated and used here only as an example, so it does not contain up-to-date information for 2023.

rating_counts = df['rating'].value_counts()
rating_counts_filtered = rating_counts[rating_counts / rating_counts.sum() >= 0.005]
rating_counts_filtered.plot.pie(autopct='%1.1f%%', shadow=False, figsize=(10,8))
plt.show()

Let’s create a word cloud of countries for TV shows that were released back in 2020.

plt.subplots(figsize=(25,15))
wordcloud = WordCloud(background_color='white', width=1920, height=1080).generate(" ".join(tvShows2020.country))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Looks great! But if we look closely, we can see that it shows “United States” and “States United” as separate entries. You could fix that by treating each country name as a single word.
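The split most likely happens because the word cloud is built from individual words and word pairs in the joined text. One possible fix, sketched here under that assumption, is to count whole country names first and build the cloud from those counts. The series `countries` is made up and stands in for `tvShows2020.country`:

```python
from collections import Counter

import pandas as pd

# Made-up values; the real column can hold several countries per title.
countries = pd.Series(["United States", "United States, India", "South Korea"])

# Split multi-country cells and count whole country names,
# so "United States" is never broken into "United" and "States".
freqs = Counter(
    name.strip()
    for cell in countries
    for name in cell.split(",")
)

print(freqs["United States"])   # 2
print(freqs["South Korea"])     # 1
```

The resulting counts can then be passed to `WordCloud(...).generate_from_frequencies(freqs)` instead of calling `generate` on the joined text.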

Show the relation between Type and Rating

plt.figure(figsize=(10,8))
sns.countplot(x='rating', hue='type', data=df)
plt.title('Relation between Type and Rating')
plt.show()

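Let’s clean it up and remove rating and type combinations that have fewer than 5 titles. To do that, we compute the size of each combination and keep only the rows whose combination is large enough.

combo_sizes = df.groupby(['rating', 'type'])['type'].transform('size')
df_filtered = df[combo_sizes >= 5]
plt.figure(figsize=(10,8))
sns.countplot(x='rating', hue='type', data=df_filtered)
plt.title('Relation between Type and Rating for counts that are more than or equal to 5')
plt.show()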

Let’s create a word cloud of cast names

df1 = df.dropna(subset=['cast'])  # keep only titles that list a cast
plt.subplots(figsize=(25,15))
wordcloud = WordCloud(background_color='white', width=1920, height=1080).generate(" ".join(df1['cast']))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('cast.png')
plt.show()

Closing notes

In this blog post, we explored the “Netflix Titles” dataset using Python and several libraries including pandas, numpy, matplotlib, seaborn, and wordcloud. We loaded the dataset into a pandas dataframe, cleaned and prepared the data for analysis, and visualized the data using various charts and graphs.

We learned about the different ratings of the shows and movies on Netflix, the most common genres and countries of origin, and the relationship between the type and rating of the content. We also used wordcloud to visualize the most common countries and cast members in the dataset.

Exploring datasets like this can help us gain insights into various trends and patterns in the data, and can also help us make more informed decisions based on the data. With the help of Python and its powerful libraries, we can quickly and easily perform data analysis and visualization tasks on large datasets, making it an essential tool for data scientists and analysts.

Originally published at https://mpdev.hashnode.dev.
