Netflix Clusters

Published in INST414: Data Science Techniques, Dec 6, 2023

Insight

Netflix is a widely used streaming service with thousands of movies and TV shows available to subscribers. Its catalog spans many genres and categories, including drama, action, thriller, documentaries, kids' TV, and many more. After finishing a movie or TV show, people often look for something similar to what they just watched. To help recommend new titles to subscribers looking to add to their watch list, I used several features of movies and TV shows to measure the similarity between them and then cluster them. Besides supporting recommendations, these clusters can also help researchers and analysts study what types of content appeal to different audiences and demographics. For example, a researcher could use the clusters to predict what a subscriber is likely to watch next based on their previously watched content.

Data Source and Metrics

To conduct analysis on similar movies and TV shows, I used a pre-existing dataset from Kaggle. The dataset contains 8,807 rows of various movies and TV shows. There are 12 columns, including show ID, type, title, director, cast, country, date added, release year, rating, listed in (category), and description. The features I used to determine similarity were cast, listed in, and description. The similarity metric that I decided to use is cosine similarity. After calculating cosine similarity, I used KMeans to cluster similar movies and TV shows.
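The steps described above can be sketched on a few made-up rows. This is a minimal illustration, not the code from the analysis: the rows are hypothetical, only the column names come from the dataset, and fitting KMeans directly on the TF-IDF vectors is an assumption, since the article does not say exactly how the similarity scores were passed to the clustering step.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rows standing in for the Kaggle data; only the column names match.
toy = pd.DataFrame({
    "title": ["Kid Show A", "Kid Show B", "Crime Doc A", "Crime Doc B"],
    "cast": ["Ann Lee", "Ann Lee, Bo Kim", "", ""],
    "listed_in": ["Kids' TV", "Kids' TV, Comedies",
                  "Documentaries", "Documentaries, Docuseries"],
    "description": [
        "Friendly animals teach children about sharing.",
        "A family of animals goes on a silly adventure.",
        "Detectives revisit an unsolved murder case.",
        "A true crime series about a murder trial.",
    ],
})

# Join the three text features into one string per title.
text = toy["cast"] + " " + toy["listed_in"] + " " + toy["description"]

# TF-IDF turns each string into a sparse vector (L2-normalized by default).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(text)

# Pairwise cosine similarity: entry [i, j] compares title i with title j.
similarity = cosine_similarity(tfidf)

# Assumption: KMeans is fit on the TF-IDF vectors themselves.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(tfidf)

print(similarity.round(2))
print(toy[["title", "cluster"]])
```

Because the TF-IDF vectors are normalized to unit length, the Euclidean distance that KMeans minimizes grows as cosine similarity shrinks, so the two measures rank pairs of titles in the same order.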

K Value

For clustering, I used a pre-defined K value of 10 so that there would be enough diversity in what each cluster represents. I also limited the number of titles shown per cluster to 10 so that no cluster would be too crowded. Without this limit, each cluster would contain hundreds of titles, which would be overwhelming for Netflix subscribers to choose from.
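One way to cap the titles per cluster is a grouped head() in Pandas. The sketch below uses hypothetical labels assigned round-robin rather than real KMeans output, and the names K, MAX_TITLES, labeled, and preview are illustrative only.

```python
import pandas as pd

K = 10           # number of clusters, as in the article
MAX_TITLES = 10  # titles shown per cluster, as in the article

# Hypothetical labels: 250 titles assigned round-robin to K clusters.
labeled = pd.DataFrame({
    "title": [f"Title {i}" for i in range(250)],
    "cluster": [i % K for i in range(250)],
})

# Keep the first MAX_TITLES rows of each cluster, in original row order.
preview = labeled.groupby("cluster").head(MAX_TITLES)

print(labeled["cluster"].value_counts().max())  # largest cluster before the cap
print(preview["cluster"].value_counts().max())  # largest cluster after the cap
```

Note that head() keeps whichever rows come first in the dataset, not the titles closest to each cluster center, so the preview depends on row order.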

Cluster Representation

Based on the listings and descriptions for each movie or show, each cluster seems to represent a specific theme, genre, or even country. Cluster 0 appears to represent children's and family-appropriate content, since “children” and “family” commonly appear in the listed_in column. Based on the description and listed_in columns, Cluster 2 has more adventurous content while still leaning towards shows for kids. Cluster 3 mostly represents comedy, with “comedy” often appearing in the listed_in column. Cluster 4, again judging by the description and listed_in columns, represents movies and shows with action, adventure, and some comedy. Cluster 5 is made up entirely of international shows and movies, specifically Korean ones. Cluster 6 is made up entirely of documentaries on topics including music, true crime, and sports. Cluster 7 also consists only of documentaries and docuseries, similar to Cluster 6. Cluster 8 represents international content, mostly Spanish-language TV shows, including crime shows and reality shows. Lastly, Cluster 9 represents mostly action and adventure content, including a few international shows.

Below is a DataFrame containing each cluster and its respective titles:

Software Used

To facilitate this analysis, I used Pandas, a Python library, to read the Netflix dataset into a DataFrame and to create a new DataFrame. To visualize a target cluster as a word cloud, I imported matplotlib and WordCloud. I also used scikit-learn (sklearn), a Python library for machine learning. From scikit-learn, I imported TfidfVectorizer to convert the text columns into numerical vectors, cosine_similarity to calculate the similarity between movies and TV shows, and KMeans to cluster titles based on the chosen features.

Data Cleaning

The Netflix dataset had multiple columns that were not needed for this analysis. I started the data cleaning process by selecting only the title, cast, listed_in, and description columns. I then used the fillna() method to replace null values with an empty string. Of the selected columns, only cast appeared to have null values. After replacing null values, I used TfidfVectorizer from scikit-learn to convert the text into numerical vectors, which allowed cosine similarity to be applied to the text columns. Before printing the clusters, I limited the number of titles in each cluster to 10 so that the clusters would not be too crowded.

One error I encountered came from combining the text columns and applying TF-IDF. After some research and a closer look at the dataset, I realized that the cast column had many null values, which prevented TF-IDF from working. Replacing the null values with an empty string using fillna() fixed the issue.
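The fix can be illustrated with two made-up rows, one of which has no cast. The rows and the names raw, clean, and features are hypothetical; only the column names and the use of fillna() come from the analysis.

```python
import pandas as pd

# Hypothetical rows; the second one has no cast listed.
raw = pd.DataFrame({
    "title": ["Show A", "Doc B"],
    "cast": ["Ann Lee, Bo Kim", None],
    "listed_in": ["TV Comedies", "Documentaries"],
    "description": ["Two friends open a cafe.", "A look at deep-sea life."],
})

features = ["cast", "listed_in", "description"]

# Replace missing values with empty strings so every cell is text.
# Without this, adding a missing cast to the other columns yields a missing result.
clean = raw[["title"] + features].fillna("")

# Join the feature columns, then trim the space left behind by an empty cast.
clean["text"] = (
    clean["cast"] + " " + clean["listed_in"] + " " + clean["description"]
).str.strip()

print(clean["text"].tolist())
```

After this step every row has a plain string in the text column, which is what TfidfVectorizer expects as input.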

Word Cloud

Below is a word cloud visualizing common words found in Cluster 6:
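A word cloud is driven by word counts, so the underlying numbers can be sketched with the standard library alone. The descriptions and the stop-word list below are hypothetical stand-ins, not the actual Cluster 6 titles.

```python
import re
from collections import Counter

# Hypothetical descriptions standing in for the titles in one cluster.
descriptions = [
    "This documentary follows a music legend on tour.",
    "A true crime documentary about a famous heist.",
    "A sports documentary on a championship season.",
]

# A tiny illustrative stop-word list; a real run would use a fuller one.
stop_words = {"a", "an", "the", "on", "about", "this"}

words = []
for text in descriptions:
    # Lowercase the text, then keep alphabetic tokens that are not stop words.
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in stop_words:
            words.append(token)

frequencies = Counter(words)
print(frequencies.most_common(3))
```

Counts like these can be passed to the WordCloud class through its generate_from_frequencies method, which sizes each word by how often it appears.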

Limitations

One limitation of this analysis is that the clusters might be too small. Because the number of titles per cluster is limited to 10, the clusters may not be as representative or diverse as they could be. Another limitation is that the data is not up to date. The dataset was last updated 2 years ago, so it does not include anything released in the past 2 years, and it might include shows or movies that have since been removed from Netflix. Subscribers using these clusters may therefore not receive the most current recommendations.

You can find the code for this analysis here: https://github.com/adasti/INST414

You can find the Netflix dataset here: https://www.kaggle.com/datasets/shivamb/netflix-shows/data