Netflix Data Analysis — Part 1: Data Cleaning with Python

Luchiana Dumitrescu
Women in Technology
5 min read · Oct 15, 2023

In this era of AI and streaming platforms, Netflix has become a leading global provider of entertainment that we consume every day, often accompanied by something sweet and a drink. With its vast reservoir of shows and movies, we have endless options and can enjoy a good and long-awaited marathon after a long week.

‘Just one more episode before bed’ — sounds pretty familiar, right? 😂

As you already know, Netflix is everywhere around us, but what we may not know is how it has evolved over time. So in this article, we will entertain ourselves with the trail of valuable data that Netflix provides alongside its vast collection of titles.

Let me put on my detective hat, and let’s dive into the mystery of the Netflix world together with the help of Python’s powerful libraries: Pandas, NumPy, and Matplotlib. In the upcoming section, we’ll carefully clean and prepare our dataset for the next big step — exploratory data analysis (EDA).

Are you ready? On your marks, get set, go!

1. Import the required libraries and load the dataset into the Jupyter environment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(r"..\netflix_titles.csv")

2. Verify the successful loading of the dataset by displaying it

df

3. Inspect the shape of the dataset

# The shape attribute tells us how many rows and columns our dataset has
df.shape

We’ll see that our dataset has 8807 rows and 12 columns.

4. Get some more details with info()

df.info() 

The structure of our dataset is as follows: 12 columns — show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, and description — all stored as text (object) except release_year, which is an integer.

The fun part now begins 😁

1. Verify the presence of duplicates in the dataset by applying the duplicated() method:
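The check itself comes down to one line; here it is on a tiny stand-in frame (the real df is the one loaded from netflix_titles.csv):

```python
import pandas as pd

# toy frame with one exact repeat, just to illustrate the check
sample = pd.DataFrame({'title': ['A', 'B', 'A'],
                       'type': ['Movie', 'Show', 'Movie']})

# duplicated() flags rows that exactly repeat an earlier row; sum() counts them
dup_count = sample.duplicated().sum()
print(dup_count)  # 1 here; 0 on the Netflix dataset
```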

The result is 0, which means there are no duplicate rows in the dataset.

2. Determine the number of missing values in the dataset by applying the isna() method

df.isna().sum()

Since there were some NULL values in the rating column, I decided to replace them.

  • See the rows with NULL values in the rating column:
df[df['rating'].isna()]
  • Replace them, targeting each row by its show_id with loc:
df.loc[df['show_id'] == 's5990', 'rating'] = 'PG-13'
df.loc[df['show_id'] == 's6828', 'rating'] = 'PG-13'
df.loc[df['show_id'] == 's7313', 'rating'] = 'PG'
df.loc[df['show_id'] == 's7538', 'rating'] = 'PG-13'

# see the rows after the replacement
df[df['show_id'].isin(['s5990','s6828','s7313','s7538'])]

Check that the values have been replaced using the isin() method (it acts like SQL’s IN operator).
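As a side note, the four loc assignments above could also be driven by a single mapping, which scales better if more IDs need fixing — a sketch on a toy frame:

```python
import pandas as pd

# toy frame mirroring the show_id/rating columns of the real dataset
df = pd.DataFrame({'show_id': ['s5990', 's6828', 's7313', 's7538'],
                   'rating': [None, None, None, None]})

# one dict holds all the fixes; the loop applies the same loc pattern per ID
rating_fixes = {'s5990': 'PG-13', 's6828': 'PG-13',
                's7313': 'PG', 's7538': 'PG-13'}
for show_id, rating in rating_fixes.items():
    df.loc[df['show_id'] == show_id, 'rating'] = rating
```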

3. Replace all NULL values in the remaining columns

As we have already seen, we have missing values in columns such as director, cast, country, and date_added; we’ll replace them with generic placeholders or, where it makes more sense, with the most frequent value (the mode), as below

df['director'] = df['director'].fillna('Unspecified')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna(df['country'].mode()[0])
df['date_added'] = df['date_added'].fillna(df['date_added'].mode()[0])
df['duration']= df['duration'].fillna(df['duration'].mode()[0])

and check again for missing values:
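The re-check is the same isna() call as before; on a toy frame that has already been filled, the per-column counts and the grand total should all come back 0:

```python
import pandas as pd

# toy frame standing in for the cleaned df (placeholders already filled in)
df = pd.DataFrame({'director': ['Greta Gerwig', 'Unspecified'],
                   'cast': ['Saoirse Ronan', 'Unknown']})

print(df.isna().sum())        # missing values per column
print(df.isna().sum().sum())  # grand total across the whole frame: 0
```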

4. In the show_id column, I observed that the values are prefixed (they all start with an “s”), so I removed the prefix using the str.replace() method:

# safe here because each ID is an "s" followed only by digits,
# so the one "s" removed is always the prefix
df['show_id'] = df['show_id'].str.replace('s', '')
df

Much better, right? I think so too 😉

5. In the cast column, which contains a comma-separated string of the actors who played in each show/movie, I wanted to keep only the main actor’s name, so I used the str.split() method to keep the first value before the comma:

df['cast'] = df['cast'].str.split(',').str[0]

This allows me to focus specifically on the main actors in our analysis.

6. For the sake of simplicity and clarity in our analysis, I decided to rename the cast and listed_in columns using rename():

I performed the renaming directly on the same DataFrame using inplace=True.
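On a toy frame, that rename looks like the sketch below — note that the article doesn’t show the new column names, so main_actor and genre are my illustrative guesses:

```python
import pandas as pd

# toy frame with the two columns being renamed
df = pd.DataFrame({'cast': ['Actor A'], 'listed_in': ['Dramas']})

# inplace=True modifies df directly instead of returning a renamed copy;
# the new names here are assumptions, not the article's actual choices
df.rename(columns={'cast': 'main_actor', 'listed_in': 'genre'}, inplace=True)
print(df.columns.tolist())  # ['main_actor', 'genre']
```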

7. Throughout this process, I noticed that the rating column contained some values that actually belong to the duration of shows/movies, so I replaced them too:
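One way to fix this — shown here on a toy frame, since the article omits the snippet — is to detect the misplaced “min” values, move them into duration, and put a placeholder in rating (the 'Unrated' placeholder is my choice, not necessarily the article’s):

```python
import pandas as pd

# toy frame showing the problem: a duration value leaked into rating
df = pd.DataFrame({'rating': ['PG-13', '74 min', 'TV-MA'],
                   'duration': ['90 min', None, '45 min']})

# flag rows where rating actually holds a duration like "74 min"
misplaced = df['rating'].str.contains('min', na=False)

# move the value to its rightful column, then blank out rating
df.loc[misplaced, 'duration'] = df.loc[misplaced, 'rating']
df.loc[misplaced, 'rating'] = 'Unrated'  # assumed placeholder
```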

Along with keeping the main director in the director column:
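This is the same split-and-keep-first trick used on the cast column earlier, applied to director — a sketch on a toy frame:

```python
import pandas as pd

# toy frame: one title with multiple directors, one with a single director
df = pd.DataFrame({'director': ['Jane Doe, John Roe', 'Solo Director']})

# keep only the first (main) director before the comma
df['director'] = df['director'].str.split(',').str[0]
print(df['director'].tolist())  # ['Jane Doe', 'Solo Director']
```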

8. Finally, I exported the cleaned dataset in CSV format:

df.to_csv('my_path/Netflix_Cleaned_Dataset.csv', index=False)

Here is how the dataset looks after all the cleaning and preprocessing steps:

Conclusion

Going into the depths of this dataset was an interesting journey (takes the detective hat off), but the best is yet to come. I have just scratched the surface and ensured that my analysis will be built on a solid foundation.

With the stage now set, the next step is another exciting part of the analysis — EDA using the Pandas and Matplotlib libraries. Until then, stay tuned!

You can find the complete project on my GitHub repository.

Thank you so much for your support, it means a lot to me.

If you found this article interesting and helpful, you have the option to support my work here ☕😊

P.S.: Visit my medium and embark on an exciting journey of discovery today. Happy reading!


I'm a BI Developer, bookworm, writer, and pet lover with a huge passion for coffee and data. Let's have fun exploring the world of data together! 🔎📈😊