In this article I will show you how to create your very own movie recommendation engine using the Python programming language and machine learning!
About Recommendation Engines
A recommendation engine, also known as a recommender system, is software that analyzes available data to make suggestions for something that a user might be interested in.
A recommendation engine can be used for things other than movies, such as comic books or T-shirts. Amazon’s e-commerce site, for example, recommends products based on what similar customers bought. You can see this in the “Customers who viewed this item also viewed” and “Customers who bought this item also bought” lists.
There are basically four types of recommendation engines:
- Content based recommendation engine
- Collaborative filtering based recommendation engine
- Popularity based recommendation engine
- Hybrid recommendation engine
Content based recommendation engine: A content based recommendation engine (the type we will build in this article) takes the content or attributes of a product you like, for example a movie’s genre, cast, director and keywords, and then ranks other products by how similar they are to the liked product. In this case we rank movies by how similar they are to the liked movie, using something called similarity scores.
Collaborative filtering based recommendation engine: Collaborative filtering is a family of algorithms that tries to find similar users based on their preferences, actions and activities. It then looks at the movies one user has watched and recommends them to a similar user. For example, suppose user A is similar to user B because they both like the same video games, comic books, etc. If user A has seen a movie that user B hasn’t, this recommendation engine will recommend that movie to user B.
Popularity based recommendation engine: A popularity based recommendation engine ranks items by how popular they are. For example, it could take the view counts for movies or videos and list them from the highest view count to the lowest. The trending lists on Netflix and YouTube use this type of recommendation engine, or at least a similar one. It is also considered one of the simplest recommendation engines to implement.
Hybrid recommendation engine: A hybrid recommendation system combines two or more types of recommendation systems, and according to recent research it can be more effective than using the engines separately. It is likely that Google uses this type of recommendation engine to find similar movies.
In this article we will be creating a content based recommendation engine using Python and machine learning.
If you prefer not to read this article and would like a video representation of it, you can check out the video below. It goes through everything in this article, and will help make it easy for you to start programming your own movie recommendation engine even if you don’t have the programming language Python installed on your computer. Or you can use both the video and this article as supplementary materials for learning!
How To Find The Movie Similarity?
The type of recommendation engine that will be created in this article is the content based recommendation engine. This means that we need to find movies similar to a given movie that a user likes and then recommend those similar movies to the user. But how? How do we know which movies are similar, and how do we measure how similar they are?
To explain this I will start with an example. Let’s say we have some text summary from movie A and movie B.
Text From Movie A: ‘Amazing Spiderman Amazing’
Text From Movie B: ‘Spiderman Spiderman Amazing’
I know Spider-Man is spelled with a dash in it, but for simplicity’s sake I will use ‘Spiderman’ instead.
Let’s examine how similar the text from each movie is.
- Text From Movie A: Contains the word “Amazing” twice and the word “Spiderman” one time.
- Text From Movie B: Contains the word “Amazing” one time and the word “Spiderman” twice.
Now, let’s plot this word count on a 2-Dimensional graph. Text From Movie A will have the point (1,2) and Text From Movie B will have the point (2,1) where the X-axis on the graph is the number of times the word “Spiderman” appears and the Y-axis is the number of times the word “Amazing” appears. The origin point for both vectors is (0,0).
We can convert any text to a vector of word counts like this, either with scikit-learn’s CountVectorizer class or by counting the words by hand as we did above.
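To make the hand-counting step concrete, here is a minimal sketch that uses only the Python standard library. The helper name count_vector and the fixed vocabulary list are assumptions made for this illustration; the axis order follows the graph described above (X = “Spiderman”, Y = “Amazing”).

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Return how often each vocabulary word occurs in text (case-insensitive)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Axis order follows the graph: x = "spiderman", y = "amazing"
vocabulary = ["spiderman", "amazing"]

vector_a = count_vector("Amazing Spiderman Amazing", vocabulary)
vector_b = count_vector("Spiderman Spiderman Amazing", vocabulary)

print(vector_a)  # [1, 2]
print(vector_b)  # [2, 1]
```

CountVectorizer does the same counting, but it builds the vocabulary for you and orders the columns alphabetically, so its column order can differ from the axis order used here.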
Now that the two texts have been converted to vectors, we can compare them: the smaller the angle between the two vectors, the more similar the texts are. This angle is called theta and is represented by the symbol θ.
Rather than using θ directly, it makes sense to use cos θ as the similarity score. Word counts are never negative, so the angle between two count vectors is always between 0° and 90°, and since cos 90° = 0 and cos 0° = 1, the score always falls in the range from 0 to 1, just like a probability.
Now we understand how to get similarities in 2 dimensions for text represented as vectors. The same method works in N dimensions, where N is an arbitrary positive integer. In summary, we can get the similarity of two texts by converting them into vectors of word counts and taking the cosine of the angle θ between them, which gives a similarity value between 0 and 1.
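As a quick check of the idea, the cosine of the angle between two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand for the two example points; the function name cosine is just a label chosen for this illustration, and it works for vectors of any length N.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

vector_a = [1, 2]  # Text From Movie A: (Spiderman, Amazing) counts
vector_b = [2, 1]  # Text From Movie B: (Spiderman, Amazing) counts

similarity = cosine(vector_a, vector_b)
theta = math.degrees(math.acos(similarity))

print(round(similarity, 2))  # 0.8
print(round(theta, 1))       # 36.9 (degrees)
```

Here the dot product is 1×2 + 2×1 = 4 and each vector has length √5, so cos θ = 4 / 5 = 0.8.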
How To Find Similarity Using Python?
We need to get Text From Movie A and Text From Movie B in our program:
We can use the class CountVectorizer() to convert the text to vectors.
Now the text has been converted to a count matrix cm. Print the features (the columns, or word list) before printing out the count matrix.
The printed output of the code above is below.
The statement above shows that the word ‘Amazing’ occurs twice in the text from movie A and one time in the text from movie B. Similarly, the word ‘Spiderman’ occurs one time in the text from movie A and twice in the text from movie B.
The cosine_similarity() function will give us a similarity score between these vectors and show us how similar they are to each other.
The code above will output a similarity matrix, which will look like the statement below.
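For reference, this step could look like the sketch below. The variable name similarity_scores is an assumption for this illustration, and the matrix is rounded only to keep the printed values tidy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = ["Amazing Spiderman Amazing", "Spiderman Spiderman Amazing"]
cm = CountVectorizer().fit_transform(text)

# Pairwise cosine similarity between every row of the count matrix
similarity_scores = cosine_similarity(cm)

# Approximately [[1, 0.8], [0.8, 1]] after rounding
print(similarity_scores.round(2))
```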
The output above shows the similarity scores between the two texts. The Text From Movie A is row one and contains the similarity scores [1, 0.8] in that order. The Text From Movie B is the second row and contains the similarity scores [0.8, 1] in that order.
The first column represents Text From Movie A. This means that the similarity score of the first row (Text From Movie A) with the first column (Text From Movie A) is 1, or 100%. This is position [0,0].
The second column represents Text From Movie B. This means that the similarity score of the first row (Text From Movie A) with the second column (Text From Movie B) is 0.8, or 80%. This is position [0,1].
This can be visualized as follows:
                     Text From Movie A    Text From Movie B
Text From Movie A    1.0                  0.8
Text From Movie B    0.8                  1.0
Now that we have a better understanding of cosine similarity, converting text to vectors, and getting the similarity scores between them, let’s start programming the movie recommendation engine!
Programming The Movie Recommendation Engine:
The first thing that I like to do before writing a single line of code is to put in a description in comments of what the code does. This way I can look back on my code and know exactly what it does.
#Description: Build a movie recommendation engine (more specifically a content based recommendation engine)
Next, we will import the dependencies.
Pandas will be used to read in the data, numpy will be used to support multi-dimensional arrays and matrices, and sklearn will be used to get the CountVectorizer() class and the cosine_similarity() function.
#Import the libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
I am using Google Colab so I need to use Google’s library to upload the data set.
#Load the data
from google.colab import files
uploaded = files.upload()
Load the data from the data set.
df = pd.read_csv('movie_data.csv')
Print the first 3 rows from the data set.
df.head(3) #Print the first 3 rows
Get the number of rows and columns in the data set, for example with df.shape.
Now, I can see that this data set contains 1,000 movies.
Create and print a list of important columns to use (the main content of the movie). We don’t need the rest of the columns in the data set.
#Create a list of important columns
columns = ['Actors', 'Director', 'Genre', 'Title']
df[columns].head(3)
Check for any null values in the specific columns that we are interested in, for example with df[columns].isnull().values.any().
Create a function to combine the values of the important columns into a single string.
def get_important_features(data):
    important_features = []
    for i in range(0, data.shape[0]):
        important_features.append(data['Actors'][i] + ' ' + data['Director'][i] + ' ' + data['Genre'][i] + ' ' + data['Title'][i])
    return important_features
Apply the function to each row in the data set to store the combined strings into a new column.
df['important_features'] = get_important_features(df)
Print the data frame to show the new column.
Convert the text from the new column to a matrix/vector of word counts, and store it into a variable called cm.
cm = CountVectorizer().fit_transform(df['important_features'])
Get the cosine similarity matrix from the count matrix. This will give us a similarity score for each movie (row of data) to every other movie in the data set (the columns) including itself.
#Get the cosine similarity matrix from the count matrix
cs = cosine_similarity(cm)
#Print the similarity score
print(cs)
Get the number of rows and columns in the cosine similarity matrix, for example with cs.shape. Both should be equal to the number of movies in the original data set.
A value in a row corresponds to the similarity of that movie in that row to the movie represented by the columns. For example the movie in row position 0 is also represented by the column at position 0, the movie in row position 1 is also represented by the column at position 1 and so on and so forth.
The similarity score for the row at position 0 and the column at position 0 should always be 1, since they are the same movie. The score for row 0 and column 5 will usually be lower, depending on how similar the movies at positions 0 and 5 are.
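These properties can be checked on a small example. The three feature strings below are hypothetical placeholders, not rows from the article’s data set; the sketch only shows that the matrix is square, has 1 on the diagonal, and is symmetric.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined feature strings, one per movie
features = [
    "actor one director one action first movie",
    "actor one director two action second movie",
    "actor three director three comedy third movie",
]
cs = cosine_similarity(CountVectorizer().fit_transform(features))

print(cs.shape)                       # (3, 3): one row and one column per movie
print(np.allclose(np.diag(cs), 1.0))  # True: every movie fully matches itself
print(np.allclose(cs, cs.T))          # True: score(i, j) equals score(j, i)
```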
Get the title of the movie that the user likes and store it into a variable.
title = 'The Amazing Spider-Man'
Find the row id / movie id of the movie the user likes and store it into a variable.
movie_id = df[df.Title == title]['Movie_id'].values[0]
Create a list of tuples in the form (movie id, similarity score). The code below takes the array of similarity scores for the liked movie, like [0.1, 0.3, 0.8], and returns a list of tuples in the form (movie id, similarity score), like [(0, 0.1), (1, 0.3), (2, 0.8)].
scores = list(enumerate(cs[movie_id]))
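To see what enumerate() does here in isolation, this small sketch applies it to the example scores from the text. The name similarity_row is an assumption standing in for one row of the similarity matrix.

```python
# Stand-in for one row of the cosine similarity matrix
similarity_row = [0.1, 0.3, 0.8]

# Pair each score with its position, which doubles as the movie id
scores = list(enumerate(similarity_row))

print(scores)  # [(0, 0.1), (1, 0.3), (2, 0.8)]
```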
Print the similar movies list, (movie id, similarity score).
Sort the list of similar movies according to the similarity scores in descending order. Since the most similar movie is the liked movie itself, we discard the first element after sorting. The lambda function takes an element ‘x’ and returns ‘x[1]’, which is the similarity score from the tuple.
sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
sorted_scores = sorted_scores[1:]
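The sorting step can be tried on a few made-up pairs. The scores below are hypothetical, with movie id 2 playing the role of the liked movie (it has a score of 1.0 with itself).

```python
# Hypothetical (movie id, similarity score) pairs; movie 2 is the liked movie
scores = [(0, 0.1), (1, 0.3), (2, 1.0), (3, 0.8)]

# Sort by the second tuple element (the score), highest first
sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
print(sorted_scores)  # [(2, 1.0), (3, 0.8), (1, 0.3), (0, 0.1)]

# Drop the first entry, which is the liked movie itself
sorted_scores = sorted_scores[1:]
print(sorted_scores)  # [(3, 0.8), (1, 0.3), (0, 0.1)]
```

Dropping the first element assumes the liked movie ranks first. If another movie also scored exactly 1.0, filtering out the liked movie’s id instead would be a safer alternative.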
Print the sorted list.
Create a loop to print the first 7 movies from the sorted similar movies list.
#Create a loop to print the first 7 movies from the sorted similar movies list
j = 0
print('The 7 most recommended movies to', title, 'are:\n')
for item in sorted_scores:
    movie_title = df[df.Movie_id == item[0]]['Title'].values[0]
    print(j+1, movie_title)
    j = j+1
    if j > 6:
        break
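To tie the steps together, here is a compact sketch that wraps them in one function and runs on a tiny made-up data set. Everything in the DataFrame is hypothetical (the titles and names are invented for illustration), the function name recommend is an assumption, and, as in the article, Movie_id is assumed to match the row position.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data using the same column names as the article
df = pd.DataFrame({
    "Movie_id": [0, 1, 2, 3],
    "Title": ["Star Quest", "Star Return", "Kitchen Tales", "Moon Run"],
    "Genre": ["Action", "Action", "Comedy", "Action"],
    "Director": ["Cara Moss", "Cara Moss", "Finn Hale", "Gus Ward"],
    "Actors": ["Ann Lee, Bob Ray", "Ann Lee, Dan Fox", "Eve Poe", "Bob Ray"],
})

def recommend(df, title, n=7):
    """Return up to n titles most similar to the given title."""
    # Combine the important columns into one string per movie
    features = df["Actors"] + " " + df["Director"] + " " + df["Genre"] + " " + df["Title"]
    cs = cosine_similarity(CountVectorizer().fit_transform(features))
    # Assumes Movie_id matches the row position in the data frame
    movie_id = df[df.Title == title]["Movie_id"].values[0]
    # Keep every other movie's (id, score) pair, then sort by score
    scores = [(i, s) for i, s in enumerate(cs[movie_id]) if i != movie_id]
    scores.sort(key=lambda x: x[1], reverse=True)
    return [df[df.Movie_id == i]["Title"].values[0] for i, _ in scores[:n]]

print(recommend(df, "Star Quest", n=2))  # ['Star Return', 'Moon Run']
```

This version filters out the liked movie by id rather than dropping the first sorted element, which is one possible design choice and not necessarily how the original program does it.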
Conclusion and Resources
That is it, you are done creating your movie recommendation program!
If you are interested in reading about machine learning to immediately get started with problems and examples, I recommend you read Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.
It is a great book for helping beginners learn to write machine-learning programs and understanding machine-learning concepts.
Thanks for reading this article, I hope it’s helpful to you!