Building a Movie Recommendation Engine in Python

Ansh Bordia
Published in Analytics Vidhya
5 min read, Sep 26, 2020

Build your very own movie recommender in a matter of minutes using Python and SentenceBERT.

Photo by Romain MATHON on Unsplash

While browsing Netflix or Prime Video, you must have come across a section that suggests movies and TV shows based on your watch history or on what is trending in your location at the moment. Ever wondered how this is done? The answer is a movie recommendation engine.

In this post, I will show you how to build your very own movie recommender from scratch. There will be some theory to help you understand what is being done and a whole lot of hands-on work. So let’s get started!

Typically, a movie recommender is one of the following types:

  1. Popularity Based: A simple system that recommends the most popular videos in the user’s region. Yes, you guessed it: the YouTube trending page.
  2. Content Based: Finds similar movies based on attributes like movie plot, genre, actors, language, etc.
  3. Collaborative Filtering: Based on the watch history/pattern of a user. Many services mix approaches; Netflix, for example, uses a combination of Content Based and Collaborative Filtering.

Now let’s get started with our recommender! It will be content based: given a particular movie, it will suggest similar movies based on that movie’s overview.

Dataset

We will use ‘The Movies Dataset’ on Kaggle. There’s no need to download the entire dataset; just download the ‘movies_metadata.csv’ file. This file contains a range of information on over 45,000 movies, such as the movie name, rating, genre, overview, actors and much more. We load the file using the following code:

import pandas as pd
# Load only the three columns we need (selected by position).
movies = pd.read_csv("movies_metadata.csv", usecols=[5, 9, 20])
movies.head()
Figure 1: Movies Dataset first 5 movies
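Positional column numbers are easy to get wrong if the file layout ever changes. As a minimal sketch, the same columns can be requested by name instead, assuming positions 5, 9 and 20 correspond to `id`, `overview` and `title` as they do in the Kaggle file. The small in-memory CSV below is made up purely so the snippet runs without the download.

```python
import io
import pandas as pd

# Tiny made-up stand-in for movies_metadata.csv (hypothetical rows).
sample_csv = io.StringIO(
    "id,genres,overview,title\n"
    "1,Animation,Toys come to life.,Toy Story\n"
    "2,Adventure,,Jumanji\n"
)

# Select columns by name instead of by position.
sample = pd.read_csv(sample_csv, usecols=["id", "overview", "title"])
print(sample.head())
```

With the real file, the call would take "movies_metadata.csv" in place of the in-memory buffer.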

Before we progress to building our recommender, let’s clean up the dataset. Some movies do not have an overview (Figure 2), so we drop those rows first. We then number the remaining movies by assigning each one an index, from 0 up to the number of movies minus one, so that a movie’s index matches its row in the embeddings and the similarity matrix we compute later. We will also define two helper functions to retrieve the name of a movie and its associated index in the dataset. These functions will be used later by the recommender when suggesting movies.

def get_title(index):
    # Return the title of the movie with the given index.
    return movies[movies.index == index]["title"].values[0]
def get_index(title):
    # Return the index of the first movie with the given title.
    return movies[movies.title == title]["index"].values[0]
# Drop incomplete rows first, then renumber so indexes run 0..n-1 without gaps.
movies = movies.dropna().reset_index(drop=True)
movies['index'] = list(range(len(movies)))
Figure 2: NaN/Null Values
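To see how many rows are affected before dropping them, one option is to count the missing values in each column. The sketch below uses a made-up three-row frame in place of the real data; `toy_movies`, `missing` and `cleaned` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset.
toy_movies = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Heat"],
    "overview": ["Toys come to life.", np.nan, "A heist crew in LA."],
})

# Number of missing values in each column.
missing = toy_movies.isna().sum()
print(missing)

# Drop incomplete rows and renumber so the row labels have no gaps.
cleaned = toy_movies.dropna().reset_index(drop=True)
print(cleaned)
```

Renumbering after the drop matters because any array built from the cleaned rows is addressed by position, not by the original row labels.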

Recommendation Engine

Now that we have our dataset ready, let’s start building our model. But before the coding, let’s understand how we are going to do this. We will be making use of sentence embeddings for this task. Sentence embeddings help us represent one or more sentences (in our case movie overviews) and their semantic information as vectors. In simple words, these vectors represent the meaning, context and other subtle nuances of the movie overview.

We will use the much acclaimed SentenceBERT model to get our sentence embeddings. For a basic understanding of this model, refer to this article; for an in-depth understanding, see this research paper.

Ok, so back to coding now! We now load the SentenceBERT model. The parameter inside SentenceTransformer() is the name of a pre-trained model. A full list of pre-trained models can be found here.

Note: The download and installation process can take a while.

!pip install sentence-transformers
from sentence_transformers import SentenceTransformer
bert = SentenceTransformer('bert-base-nli-mean-tokens')

Now we use this model to get the vectors for the movie overviews in our dataset.

sentence_embeddings = bert.encode(movies['overview'].tolist())
Figure 3: The Embeddings for the movie overviews

Now that we have our vectors, we need to compute the similarity between them. How do we do that? Simple: using cosine similarity, which gives a similarity score between two vectors based on the angle between them. A score close to 0 indicates little or no similarity, while a score of 1 indicates that the vectors point in exactly the same direction (negative scores, down to -1, are possible for vectors pointing in opposite directions). Using the code below, we compute the similarity score between every pair of movies in our dataset.

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(sentence_embeddings)

The similarity variable is a 2D array. Each row corresponds to a movie and contains the similarity of that movie with every movie in the dataset, including itself.
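For intuition, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on three made-up 3-dimensional vectors (real SentenceBERT embeddings have hundreds of dimensions) and builds the same kind of square matrix. The values and the name `toy_similarity` are illustrative only.

```python
import numpy as np

# Made-up "embeddings" for three movies (illustrative values only).
vectors = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as the first vector
    [0.0, 3.0, 0.0],   # perpendicular to the first two
])

# Scale each row to unit length; dot products of unit vectors are cosines.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
toy_similarity = unit @ unit.T
print(toy_similarity)
```

Each diagonal entry is 1 because every vector points in the same direction as itself, which is why a recommender should skip the query movie’s own entry when picking suggestions.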

Voila, that’s it. Congratulations, your recommender is now ready! Let’s test it in action.

notOver = True
while(notOver):
    user_movie = input("Enter the movie for which you want recommendations: ")
    # Generate recommendations: (index, score) pairs sorted from most to least similar
    recommendations = sorted(list(enumerate(similarity[get_index(user_movie)])), key = lambda x:x[1], reverse = True)
    print("The top 3 recommendations for" + " " + user_movie + " " + "are: ")
    # Position 0 is the input movie itself, so we start from position 1
    print(get_title(recommendations[1][0]), get_title(recommendations[2][0]), get_title(recommendations[3][0]), sep = "\n")
    decision = input("Press 1 to enter another movie, 0 to exit: ")
    if(int(decision) == 0):
        print("Bye")
        notOver = False

The above code finds the 3 most similar movies in the dataset to the movie we input.
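The ranking step can also be wrapped in a small reusable function. The sketch below is one possible way to do it with `np.argsort`; `top_k_similar` is a hypothetical helper name, and the 4x4 matrix is made up so the snippet runs on its own. It removes the query movie explicitly instead of assuming it always lands in first place.

```python
import numpy as np

def top_k_similar(similarity_matrix, movie_index, k=3):
    """Return the indices of the k most similar movies, excluding the movie itself."""
    scores = similarity_matrix[movie_index]
    # Sort indices from highest to lowest score.
    ranked = np.argsort(scores)[::-1]
    # Remove the query movie explicitly rather than assuming it ranks first.
    ranked = ranked[ranked != movie_index]
    return ranked[:k].tolist()

# Made-up 4x4 similarity matrix for illustration.
toy_matrix = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
print(top_k_similar(toy_matrix, 0, k=2))  # [2, 3]
```

With the real data, the returned indices could be passed to get_title to print the movie names.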

Note: The movie you input must be present in the dataset. Let’s see what recommendations we get for ‘Toy Story’.
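Since an unknown or misspelled title would make the lookup fail, it can help to resolve the user’s input against the known titles first. The following is a minimal sketch using Python’s standard `difflib` module; `find_title` and `known_titles` are hypothetical names, and the cutoff of 0.6 is simply the library default, not a tuned value.

```python
import difflib

def find_title(user_input, titles):
    """Return the dataset title closest to user_input, or None if nothing is close."""
    # First try an exact match that ignores case and surrounding spaces.
    wanted = user_input.strip().lower()
    for title in titles:
        if title.lower() == wanted:
            return title
    # Otherwise fall back to the closest fuzzy match, if any.
    close = difflib.get_close_matches(user_input, titles, n=1, cutoff=0.6)
    return close[0] if close else None

known_titles = ["Toy Story", "Jumanji", "Heat"]
print(find_title("toy story", known_titles))  # Toy Story
print(find_title("Jumanjii", known_titles))   # Jumanji
print(find_title("zzzz", known_titles))       # None
```

With the real dataset, the list of titles would come from movies['title'].tolist(), and a None result could prompt the user to try again.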

Figure 4: Similar movies to Toy Story

The recommendations look pretty decent here for our model! Both ‘Candleshoe’ and ‘Snow White’ are movies for kids. The second suggestion is a comedy, which is what Toy Story is as well (well, partly). Go give other movies a try and you will be surprised to see how well this model recommends!

Now this is not the end. You can build much more sophisticated recommenders that take into account other metrics like movie cast, language, duration, genre or even your past watch history.
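One simple way to take a second signal into account, offered here only as a sketch and not as the method used in this post, is to blend two similarity matrices with a weight. The `blend` function, the `genre_sim` matrix and the weight value are all assumptions made for illustration; how the second matrix is computed (for example, from genre overlap) is left open.

```python
import numpy as np

def blend(overview_sim, other_sim, weight=0.7):
    """Weighted average of two similarity matrices of the same shape."""
    return weight * overview_sim + (1.0 - weight) * other_sim

# Made-up 2x2 matrices: overview similarity and a second, hypothetical signal.
overview_sim = np.array([[1.0, 0.8], [0.8, 1.0]])
genre_sim = np.array([[1.0, 0.0], [0.0, 1.0]])

combined = blend(overview_sim, genre_sim, weight=0.5)
print(combined)
```

The combined matrix can then be ranked row by row in the same way as the overview-only similarity matrix.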

Thanks for reading this post and feel free to use the code in my Jupyter Notebook if you need any help. Cheers!
