Finding Similar Netflix Films/Shows

Kimberly Escobar
INST414: Data Science Techniques
4 min read · Dec 6, 2023

After finding a dataset on kaggle.com with information about the movies and TV shows on Netflix, I decided to try to create my own Netflix ranking system. My goal was to list the ten movies or shows most similar to a target title.

The specific insight I hoped to extract was which movies/TV shows are similar to one another based on the following features: cast, director, rating, and genres. This helps viewers decide what to watch next on Netflix when they want something similar to a title they already like, such as “Grown Ups.” It is mainly useful for people who are less knowledgeable about film and want to expand their options on what to watch next.

Data Collection and Cleaning

I obtained this dataset from Kaggle. It contains over 8,000 TV shows and movies, and I kept both types in this analysis to broaden the search for similarities. The original dataset did not need much cleaning, but I did have to fill all empty/NaN cells with an empty string. I also added some columns for my analysis process.
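The cleaning step described above can be sketched roughly as follows. This is a minimal illustration on a made-up two-row table, not the actual Kaggle file; the column names mirror the ones used later in this post, and the "index" column is my assumption about one of the columns that was added.

```python
import pandas as pd

# Tiny stand-in for the Netflix table; the rows and values are illustrative only.
df = pd.DataFrame({
    "title": ["Title A", "Title B"],
    "director": ["Some Director", None],
    "cast": [None, "Actor One, Actor Two"],
    "rating": ["PG-13", "TV-MA"],
    "listed_in": ["Comedies", "Dramas"],
})

# Replace missing values with empty strings so later string concatenation works.
df = df.fillna("")

# Assumed helper column: store each row's position so a title can be
# mapped to its row in the similarity matrix.
df["index"] = range(len(df))

print(df[["index", "title", "director", "cast"]])
```

Without the fill step, concatenating a missing cell with a string would raise an error or produce a missing result, so every feature needs to be a string before the features are combined.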

One note is that since the data contains both TV shows and movies, the “duration” column does not record total minutes for TV shows as it does for movies. It only lists the total number of seasons for a TV show.
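If the duration were ever needed as a feature, the two units would have to be separated first. Below is a rough sketch of one way to do that. It assumes the values look like "103 min" for movies and "2 Seasons" for TV shows; the function name parse_duration is hypothetical and not part of my analysis.

```python
def parse_duration(duration: str):
    """Split a duration string into (amount, unit), or return None if it does not fit.

    Assumes values shaped like '103 min' or '2 Seasons'.
    """
    parts = duration.split()
    if len(parts) != 2 or not parts[0].isdigit():
        return None  # empty cell or unexpected format
    amount, unit = int(parts[0]), parts[1].lower()
    # Treat 'season' and 'seasons' as the same unit.
    if unit.startswith("season"):
        unit = "seasons"
    return amount, unit


print(parse_duration("103 min"))
print(parse_duration("2 Seasons"))
print(parse_duration(""))
```

Keeping the unit next to the number makes it harder to accidentally compare minutes against seasons.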

Finding Similarity

I decided to find similarities between the following features: cast, director, rating, and genres. I originally also wanted to include the TV show/movie description, but this was recorded as a short paragraph in the dataset. Rather than form keywords from each description myself and risk adding bias to the analysis, I left this feature out of my final analysis. I used the cosine similarity metric to compare all titles.
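To make the metric concrete: cosine similarity is the dot product of two count vectors divided by the product of their lengths, so titles that share many tokens score close to 1 and titles that share none score 0. Here is a small hand-rolled sketch; the vocabulary and counts are invented for illustration, and the analysis itself uses sklearn's implementation rather than this function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a title with no tokens is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical token counts over the vocabulary ["comedies", "sandler", "james", "horror"].
title_a = [1, 1, 1, 0]
title_b = [1, 1, 0, 0]
title_c = [0, 0, 0, 1]

print(cosine(title_a, title_b))  # shares two tokens with title_a
print(cosine(title_a, title_c))  # shares no tokens with title_a
```

Because the result is computed with floating-point arithmetic, a title compared with itself can come out as a value extremely close to 1 rather than exactly 1.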

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# df is the cleaned Netflix DataFrame (empty/NaN cells already filled with "")

# Combine all features to compare titles with
def combined_features(row):
    return row['cast']+" "+row['director']+", "+row['rating']+", "+row['listed_in']

df['combined_features'] = df.apply(combined_features, axis=1)

#Count number of features present in each title
cv = CountVectorizer()
count_matrix = cv.fit_transform(df['combined_features'])
print("Count Matrix: ", count_matrix.toarray())

#Set similarity metric
cosine_sim = cosine_similarity(count_matrix)

As shown above, I used the pandas and sklearn Python libraries to complete this analysis. To populate the similarity matrix for all movies/TV shows, I created a new column that combines the values of the selected features into a single string for each row. CountVectorizer then counts how often each word appears in that string, and those counts are what cosine similarity compares.
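One detail worth seeing is what counts as a "feature" here. With its default settings, CountVectorizer lowercases the text and splits it into individual words, so a full name becomes two separate tokens and a rating like PG-13 is split in two. The sketch below compares that default with a comma-based tokenizer that keeps each name whole. The sample string is illustrative rather than a row from the dataset, split_on_commas is a hypothetical helper, and the comma-based variant is an alternative I did not use in this analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative combined-feature string with every value separated by commas.
combined = ["Adam Sandler, Kevin James, PG-13, Comedies"]

# Default behaviour: lowercase, then split into word tokens of 2+ characters.
default_cv = CountVectorizer()
default_cv.fit(combined)
print(sorted(default_cv.vocabulary_))


def split_on_commas(text):
    """Hypothetical tokenizer: one token per comma-separated value."""
    return [part.strip() for part in text.split(",") if part.strip()]


# token_pattern=None signals that the custom tokenizer replaces the default pattern.
phrase_cv = CountVectorizer(tokenizer=split_on_commas, token_pattern=None)
phrase_cv.fit(combined)
print(sorted(phrase_cv.vocabulary_))
```

With word-level tokens, two titles can look similar because their casts share a first name; phrase-level tokens avoid that, but they only work if every value in the combined string is separated by a comma.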

Testing Three Queries

Below is the code I used to execute a query.

# Set target show
target_title = 'Grown Ups'

# Show target features
list(df[df['title']==target_title]['combined_features'])

# Get movie index (relies on an 'index' column holding each row's position)
def get_index_from(title):
    return df[df['title'] == title]['index'].values[0]

movie_index = get_index_from(target_title)

# Generate similarity scores from target title to all other titles
similar_movies = list(enumerate(cosine_sim[movie_index]))

# Sort to get most similar titles first
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)

def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

# Print the target itself plus its ten most similar titles
i = 0
for movies in sorted_similar_movies:
    print(get_title_from_index(movies[0]), movies[1])
    i = i + 1
    if i > 10:
        break
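The query steps above could also be wrapped into one reusable function. The following is a sketch under a few assumptions: top_similar is a hypothetical name, titles is a plain list in the same row order as the similarity matrix, and the toy 3x3 matrix stands in for the real cosine_sim. Unlike the loop above, it leaves the target title out of its own results.

```python
import numpy as np


def top_similar(titles, sim_matrix, target, n=10):
    """Return the n (title, score) pairs most similar to target, excluding target itself."""
    idx = titles.index(target)  # row of the target title (first match)
    scores = np.asarray(sim_matrix[idx], dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    ranked = [(titles[i], float(scores[i])) for i in order if i != idx]
    return ranked[:n]


# Toy data standing in for the real titles and similarity matrix.
titles = ["A", "B", "C"]
sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])

print(top_similar(titles, sim, "A", n=2))
```

Skipping the target by its row position, rather than by dropping the first result, still works when another title ties with the target's score.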

I ran this query for three titles, comparing every other movie/TV show on Netflix to each: “Grown Ups,” “The Flash,” and “Insidious.” Below are the ten most similar movies/TV shows for each. The first line of each list is the target title itself, with a score of approximately 1.

Grown Ups 0.9999999999999997
Hubie Halloween 0.4125143236626951
Sandy Wexler 0.3335621924974955
50 First Dates 0.32732683535398854
Big Daddy 0.32142857142857134
Beverly Hills Ninja 0.2760262237369417
Hotel Transylvania 3: Summer Vacation 0.2760262237369417
You Don't Mess with the Zohan 0.2594372608313854
How to Be a Latin Lover 0.2593756879669057
The Wrong Missy 0.2545875386086578
The Last Days 0.253546276418555

The Flash 1.0000000000000004
The Umbrella Academy 0.6487446070815475
Motown Magic 0.5883484054145521
Border Security: America's Front Line 0.5773502691896258
Pioneers: First Women Filmmakers* 0.5773502691896258
The Mole 0.5602794333886092
Anjaan: Rural Myths 0.5554920598635309
L.A.’s Finest 0.5547001962252291
Khan: No. 1 Crime Hunter 0.5547001962252291
The Disappearance of Madeleine McCann 0.5499999999999999
Making a Murderer 0.5499999999999999

Insidious 1.0000000000000007
Creep 0.2910427500435996
The Conjuring 0.28
The Boy 0.27456258919345766
The Last Days 0.2683281572999748
Sweetheart 0.2683281572999748
Rising Phoenix 0.26666666666666666
Apollo 18 0.2618614682831909
The Conjuring 2 0.2514474228374849
In the Tall Grass 0.25021729686848976
In The Deep 0.25021729686848976

Limitations

One limitation on accuracy is that the dataset has no user ratings. Including them would have helped produce a more accurate list of similar titles. Another important piece missing from this analysis is the description of each movie. While comparing genres and cast members helps find similar movies/TV shows, I believe the resulting list is not as accurate as one that also compared plots/descriptions.

One example of this is the output for “The Flash.” While most of the titles are known to be similar to “The Flash,” I do not believe this output is as accurate as the output for the movie “Grown Ups.” The cast of “Grown Ups” have acted in many movies together, most of which are of similar genres. This is not the case for “The Flash,” which has a more diverse cast whose members act in many different kinds of movies/TV shows. It is also listed under many different genres, which may group the TV show with titles that in reality are not very similar (like “The Disappearance of Madeleine McCann”).
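For anyone who wants to address the description limitation, one possible approach is to weight the words in each description with TF-IDF and compare those vectors with the same cosine metric. The sketch below is only an illustration of that idea and was not part of my analysis; the three descriptions are made up, and TfidfVectorizer with English stop words is my assumption about a reasonable starting point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up descriptions, not taken from the dataset.
descriptions = [
    "A group of childhood friends reunite for a summer weekend at a lake house.",
    "Old friends reunite at a lake house and relive their childhood summer.",
    "A family moves into a haunted house and is tormented by a dark spirit.",
]

# TF-IDF down-weights words that appear in many descriptions;
# stop_words="english" drops common words such as "a" and "the".
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(descriptions)

# Pairwise cosine similarity between all descriptions.
desc_sim = cosine_similarity(matrix)
print(desc_sim.round(2))
```

In this toy case the two reunion plots score higher with each other than either does with the haunted-house plot, which is the kind of signal that cast and genre alone can miss.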

GitHub Repository
