Build A Movie Recommendation Engine Using Python

randerson112358
Aug 21, 2019 · 11 min read

In this article I will show you how to create your very own movie recommendation engine using the Python programming language and machine learning!

About Recommendation Engines

A recommendation engine, also known as a recommender system, is software that analyzes available data to make suggestions for something that a user might be interested in.

A recommendation engine can be used for things other than movies, such as comic books or t-shirts. Amazon’s e-commerce site, for example, recommends products based on what similar customers bought. You see this in the “Customers who viewed this item also viewed” and “Customers who bought this item also bought” lists.


There are basically four types of recommendation engines:

  1. Content based recommendation engine
  2. Collaborative filtering based recommendation engine
  3. Popularity based recommendation engine
  4. Hybrid recommendation engine

Content based recommendation engine: A content based recommendation engine (the type we will use in this article) takes the content or attributes of a product you like, for example a movie’s genre, cast, director and keywords, and then ranks other products by how similar they are to the liked product. In this case, we rank movies by how similar they are to the liked movie using something called similarity scores.

Collaborative filtering based recommendation engine: Collaborative filtering is a family of algorithms that tries to find similar users based on similar preferences, actions and activities. It then looks at the movies one user has seen and recommends them to a similar user. Take, for example, user A and user B, who are considered similar because they both like the same video games, comic books, etc. If user A has seen a movie that user B hasn’t, then this recommendation engine will recommend that movie to user B.
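As a rough illustration of the idea above, the sketch below finds the user most similar to a target user and suggests what that neighbor liked. The user names, the liked items, and the choice of Jaccard overlap as the similarity measure are all assumptions made for this example; real collaborative filtering systems are usually more involved.

```python
# Hypothetical data: each user maps to the set of items they liked.
likes = {
    "user_a": {"Halo", "Batman comics", "Iron Man"},
    "user_b": {"Halo", "Batman comics"},
    "user_c": {"Chess", "Cooking show"},
}

def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b| (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_neighbor(target, likes):
    """Suggest items liked by the most similar other user but not by target."""
    others = [u for u in likes if u != target]
    neighbor = max(others, key=lambda u: jaccard(likes[target], likes[u]))
    return sorted(likes[neighbor] - likes[target])

suggestions = recommend_from_neighbor("user_b", likes)
print(suggestions)
```

Here user_b shares two liked items with user_a and none with user_c, so the item that only user_a liked is suggested to user_b.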


Popularity based recommendation engine: A popularity based recommendation engine makes recommendations based on how popular a product or item is. For example, it could take the view counts for movies or videos and list them from the highest view count to the lowest. The Netflix and YouTube trending lists use this type of recommendation engine, or at least a similar one. It is also considered one of the simplest recommendation engines to implement.
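A minimal sketch of that ranking step, assuming the view counts are already available as a dictionary. The titles and numbers here are made up for illustration.

```python
# Hypothetical view counts; in practice these would come from usage logs.
view_counts = {
    "Movie X": 120_000,
    "Movie Y": 950_000,
    "Movie Z": 430_000,
}

# Rank titles from the highest view count to the lowest.
trending = sorted(view_counts, key=view_counts.get, reverse=True)
print(trending)
```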


Hybrid recommendation engine: A hybrid recommendation system combines two or more types of recommendation systems, and research suggests that it can be more effective than using the engines separately. It is likely that Google uses this type of recommendation engine to find similar movies.


In this article we will be creating a content based recommendation engine using Python and machine learning.

If you prefer not to read this article and would like a video representation of it, you can check out the video below. It goes through everything in this article, and will help make it easy for you to start programming your own movie recommendation engine even if you don’t have the programming language Python installed on your computer. Or you can use both the video and this article as supplementary materials for learning!

How To Find The Movie Similarity?

The type of recommendation engine that will be created in this article is the content based recommendation engine. This means that we need to find movies similar to a given movie that a user likes and then recommend those similar movies to the user. But how? How do we know which movies are similar, and how do we measure how similar they are?

To explain this I will start with an example. Let’s say we have some text summary from movie A and movie B.

Text From Movie A: ‘Amazing Spiderman Amazing’

Text From Movie B: ‘Spiderman Spiderman Amazing’

I know Spider-Man is spelled with a hyphen, but for simplicity’s sake I will use ‘Spiderman’ instead.

Let’s examine how similar the text from each movie is.

  1. Text From Movie A: Contains the word “Amazing” twice and the word “Spiderman” one time.
  2. Text From Movie B: Contains the word “Amazing” one time and the word “Spiderman” twice.

Now, let’s plot this word count on a 2-Dimensional graph. Text From Movie A will have the point (1,2) and Text From Movie B will have the point (2,1) where the X-axis on the graph is the number of times the word “Spiderman” appears and the Y-axis is the number of times the word “Amazing” appears. The origin point for both vectors is (0,0).

We can convert any text to a similar vector of word counts by using the CountVectorizer class from scikit-learn, or just by counting by hand as we did above.
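To make the hand count concrete, here is a small sketch that uses only Python's standard library. The variable names are my own, and the axis order (Spiderman first, Amazing second) simply follows the graph described above.

```python
from collections import Counter

text_a = "Amazing Spiderman Amazing"
text_b = "Spiderman Spiderman Amazing"

# Count how often each word appears in each text.
counts_a = Counter(text_a.split())
counts_b = Counter(text_b.split())

# Fix the axis order used in the article: x = "Spiderman", y = "Amazing".
vocabulary = ["Spiderman", "Amazing"]
vector_a = tuple(counts_a[word] for word in vocabulary)
vector_b = tuple(counts_b[word] for word in vocabulary)

print(vector_a, vector_b)
```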


Now the two texts have been converted to vectors, and the smaller the angle between the two vectors, the more similar the texts are. This angle is called theta and is represented by the symbol θ, so we can use it to measure the similarity between the two vectors.

Rather than using θ directly, it makes sense to use cos θ as the similarity score. Word counts are never negative, so the angle between two count vectors lies between 0° and 90°, and since cos 90° = 0 and cos 0° = 1, the score always falls in the range from 0 to 1, just like a probability. A score of 1 means the vectors point in the same direction, and a score of 0 means the texts share no words.

Now we understand how to get the similarity in 2 dimensions for text represented as vectors, and the same method works in N dimensions, where N is an arbitrary positive integer. In summary, we can get the similarity of two texts by converting them into vectors of word counts and taking cos θ of the angle between them, which gives a similarity value between 0 and 1.
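The calculation can be checked by hand for the two example vectors. The sketch below uses the standard formula cos θ = (u · v) / (|u| |v|). The helper name cosine is my own, and it assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """cos θ = (u · v) / (|u| |v|) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vector_a = (1, 2)  # Text From Movie A: (Spiderman count, Amazing count)
vector_b = (2, 1)  # Text From Movie B

similarity = cosine(vector_a, vector_b)
theta = math.degrees(math.acos(similarity))
print(round(similarity, 2), round(theta, 1))
```

The dot product is 1·2 + 2·1 = 4 and both vectors have length √5, so cos θ = 4/5 = 0.8, which corresponds to an angle of roughly 37°.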


How To Find Similarity Using Python?

We need to get Text From Movie A and Text From Movie B into our program, for example as a list of two strings.


We can use the class CountVectorizer() to convert the text to vectors.


Now the text has been converted to a count matrix, cm. Print the feature names (the columns, or word list) before printing out the count matrix.


The printed output lists the feature names first, followed by the count matrix, which has one row per text and one column per word.

The output shows that the word ‘Amazing’ occurs twice in the text from movie A and one time in the text from movie B. Similarly, the word ‘Spiderman’ occurs one time in the text from movie A and twice in the text from movie B.

The cosine_similarity() function will give us a similarity score between these vectors and show us how similar they are to each other.
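A short sketch of that call, assuming the same two example texts and the variable names text, cm and cs (cs is the name used later in the article):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = ["Amazing Spiderman Amazing", "Spiderman Spiderman Amazing"]
cm = CountVectorizer().fit_transform(text)

# Pairwise cosine similarity between every row of the count matrix.
cs = cosine_similarity(cm)
print(cs.round(2))
```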


The code above will output a similarity matrix with one row and one column per text.

The output shows the similarity scores between the two texts. Text From Movie A is the first row and contains the similarity scores [1, 0.8], in that order. Text From Movie B is the second row and contains the similarity scores [0.8, 1], in that order.

The first column represents Text From Movie A. This means that the similarity score of the first row (Text From Movie A) with the first column (Text From Movie A) is 1, or 100%. This is position [0,0].

The second column represents Text From Movie B. This means that the similarity score of the first row (Text From Movie A) with the second column (Text From Movie B) is 0.8, or 80%. This is position [0,1].

This can be visualized like this:

                    Text From Movie A:   Text From Movie B:
Text From Movie A:  [[1.                 0.8]
Text From Movie B:   [0.8                1. ]]

Now that we have a better understanding of cosine similarity, converting text to vectors and getting the similarity scores between them, let’s start programming the movie recommendation engine!

Programming The Movie Recommendation Engine:

The first thing that I like to do before writing a single line of code is to put in a description in comments of what the code does. This way I can look back on my code and know exactly what it does.

#Description: Build a movie recommendation engine (more specifically a content based recommendation engine)

Next, we will import the dependencies. Pandas will be used to read in the data, numpy will be used to support multi-dimensional arrays and matrices, sklearn will be used to get the CountVectorizer() class and the cosine_similarity() function.

#Import the libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

I am using Google Colab so I need to use Google’s library to upload the data set.

#Load the data 
from google.colab import files
uploaded = files.upload()

Load the data from the data set.

df = pd.read_csv('movie_data.csv')

Print the first 3 rows from the data set.

df.head(3) #Print the first 3 rows

Get the number of rows and columns in the data set.

df.shape

Now, I can see that this data set contains 1,000 movies.

Create and print a list of important columns to use (the main content of the movie). We don’t need the rest of the columns in the data set.

#Create a list of important columns 
columns = ['Actors', 'Director', 'Genre', 'Title']
df[columns].head(3)

Check for any null values for the specific columns that we are interested in.

df[columns].isnull().values.any()

Create a function to combine the values of the important columns into a single string.


def get_important_features(data):
    important_features = []
    for i in range(0, data.shape[0]):
        important_features.append(data['Actors'][i]+ ' '+data['Director'][i]+ ' '+data['Genre'][i]+' '+data['Title'][i] )
    return important_features

Apply the function to each row in the data set to store the combined strings into a new column.

df['important_features'] = get_important_features(df)
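As an alternative to the explicit loop, pandas can join the columns row by row. The sketch below is one possible approach, not the article's method: the rows are made up (only the column names follow the article), and filling missing values with an empty string is an assumption about how you might want to handle nulls.

```python
import pandas as pd

# Made-up rows that only mimic the article's column names.
df = pd.DataFrame({
    "Actors": ["Actor One, Actor Two", None],
    "Director": ["Director One", "Director Two"],
    "Genre": ["Action,Adventure", "Comedy"],
    "Title": ["Movie One", "Movie Two"],
})

columns = ["Actors", "Director", "Genre", "Title"]

# Replace missing values with empty strings, join each row's columns with a
# space, then trim any leading/trailing space left by an empty column.
df["important_features"] = (
    df[columns].fillna("").agg(" ".join, axis=1).str.strip()
)

print(df["important_features"].tolist())
```

With string concatenation as in the loop version, a single missing value would make the combined string fail or become missing, which is why the null check earlier matters.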

Print the data frame to show the new column.

df.head(3)

Convert the text from the new column to a matrix of word counts, and store it in a variable called cm.

cm = CountVectorizer().fit_transform(df['important_features'])

Get the cosine similarity matrix from the count matrix. This will give us a similarity score for each movie (row of data) to every other movie in the data set (the columns) including itself.

#Get the cosine similarity matrix from the count matrix 
cs = cosine_similarity(cm)
#Print the similarity score
print(cs)

Get the number of rows and columns in the cosine similarity matrix. Both should be equal to the number of movies in the original data set.

A value in a row is the similarity of the movie in that row to the movie represented by the column. The movie in row position 0 is also represented by the column at position 0, the movie in row position 1 is also represented by the column at position 1, and so on.

The similarity score for the row at position 0 and the column at position 0 should always be 1, since they are the same movie. The score for the row at position 0 and the column at position 5 will usually be lower, depending on how similar the movies at positions 0 and 5 are.

cs.shape
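These properties are easy to verify on a toy example. The sketch below computes cosine similarity with plain NumPy instead of scikit-learn, purely so the arithmetic is visible; the count matrix is made up.

```python
import numpy as np

# Toy count matrix: 3 "movies" (rows) described by 4 word counts (columns).
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 1],
], dtype=float)

# Normalize each row to unit length; the dot products are then cosines.
unit_rows = counts / np.linalg.norm(counts, axis=1, keepdims=True)
cs = unit_rows @ unit_rows.T

print(cs.shape)                       # one row and one column per movie
print(np.allclose(np.diag(cs), 1.0))  # each movie matches itself exactly
print(np.allclose(cs, cs.T))          # similarity is symmetric
```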

Get the title of the movie that the user likes and store it into a variable.

title = 'The Amazing Spider-Man'

Find the row id / movie id of the movie the user likes and store it into a variable.

movie_id = df[df.Title == title]['Movie_id'].values[0]
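Note that the lookup above raises an IndexError when the title is not in the data set. A slightly more defensive version might look like the sketch below, where find_movie_id is a hypothetical helper and the rows are made up.

```python
import pandas as pd

# Made-up rows; 'Movie_id' is assumed to equal the row position, as in the article.
df = pd.DataFrame({
    "Movie_id": [0, 1, 2],
    "Title": ["Movie One", "The Amazing Spider-Man", "Movie Three"],
})

def find_movie_id(df, title):
    """Return the Movie_id for an exact title match, or None if absent."""
    matches = df.loc[df["Title"] == title, "Movie_id"]
    return None if matches.empty else int(matches.iloc[0])

print(find_movie_id(df, "The Amazing Spider-Man"))
print(find_movie_id(df, "Not In The Data"))
```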

Create a list of tuples in the form (movie id, similarity score). The code below takes the array of similarity scores for the liked movie, like [0.1, 0.3, 0.8], and returns a list of tuples in the form (movie id, similarity score), like [(0, 0.1), (1, 0.3), (2, 0.8)]. This assumes that each Movie_id matches the movie’s row position in the data set.


scores = list(enumerate(cs[movie_id]))

Print the similar movies list, (movie id, similarity score).

print(scores)

Sort the list of similar movies according to the similarity scores in descending order. Since the most similar movie is the movie itself, we will discard the first element after sorting. The lambda function takes an element ‘x’ and returns ‘x[1]’, which is the similarity score from the tuple.

sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
sorted_scores = sorted_scores[1:]
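Dropping the first element relies on the liked movie sorting to the front. If another movie happens to have identical features, it also scores 1.0 and may sort ahead of the liked movie. One way around this, sketched below with made-up scores, is to filter out the liked movie by id before sorting.

```python
# Hypothetical similarity row for the liked movie, whose id is 4.
movie_id = 4
similarity_row = [0.1, 0.3, 1.0, 0.8, 1.0]  # movie 2 happens to tie at 1.0

scores = list(enumerate(similarity_row))

# Remove the liked movie by id instead of relying on it sorting first.
others = [(i, s) for i, s in scores if i != movie_id]
sorted_scores = sorted(others, key=lambda x: x[1], reverse=True)

print(sorted_scores)
```

In this example, slicing off the first element would have discarded movie 2 and kept the liked movie in the list.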

Print the sorted list.

print(sorted_scores)

Create a loop to print the first 7 movies from the sorted similar movies list.

#Create a loop to print the first 7 movies from the sorted similar movies list
j = 0
print('The 7 most recommended movies to', title, 'are:\n')
for item in sorted_scores:
    movie_title = df[df.Movie_id == item[0]]['Title'].values[0]
    print(j+1, movie_title)
    j = j+1
    if j > 6:
        break
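To reuse these steps for any title, they can be wrapped in a function. The sketch below is one way to do that under some assumptions: recommend is a hypothetical helper, the data set is made up (only the column names follow the article), and Movie_id is assumed to equal the row position.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up data set; only the column names follow the article.
df = pd.DataFrame({
    "Movie_id": [0, 1, 2, 3],
    "Title": ["Hero Rises", "Hero Returns", "Quiet Garden", "Hero Forever"],
    "important_features": [
        "actor_a director_x action hero rises",
        "actor_a director_x action hero returns",
        "actor_b director_y drama quiet garden",
        "actor_c director_x action hero forever",
    ],
})

cs = cosine_similarity(CountVectorizer().fit_transform(df["important_features"]))

def recommend(title, df, cs, n=7):
    """Return up to n titles most similar to `title`, best match first."""
    movie_id = df[df.Title == title]["Movie_id"].values[0]
    # Skip the liked movie itself, then sort by similarity score.
    scores = [(i, s) for i, s in enumerate(cs[movie_id]) if i != movie_id]
    top = sorted(scores, key=lambda x: x[1], reverse=True)[:n]
    return [df[df.Movie_id == i]["Title"].values[0] for i, _ in top]

for rank, movie_title in enumerate(recommend("Hero Rises", df, cs, n=2), start=1):
    print(rank, movie_title)
```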

Conclusion and Resources

That is it, you are done creating your movie recommendation program!

If you are interested in reading about machine learning to immediately get started with problems and examples, I recommend you read Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.

It is a great book for helping beginners learn to write machine-learning programs and understanding machine-learning concepts.


Thanks for reading this article, I hope it’s helpful to you!


Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…


randerson112358

Written by

A programmer that loves Computer Science: https://www.youtube.com/user/randerson112358 https://www.youtube.com/channel/UCbmb5IoBtHZTpYZCDBOC1

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
