Beginner’s Guide to Building a Recommendation Engine in Python

Published in The Startup · 12 min read · May 4, 2020

Introduction

Not long ago, when we bought a product, it was usually because a friend or someone we trusted had recommended it. That has changed: the product recommendations we get on Amazon or the movie recommendations on Netflix are now based on our own interests. By analyzing a customer’s current site usage and previous browsing history, a recommendation engine learns the customer’s behavior and interests, and uses this information to deliver relevant product recommendations. The data is collected in real time, so when the customer’s interests change, the recommendation model is updated with the new data and the recommendations change with it.

Naturally, every customer benefits from product recommendations based on his or her interests, because it takes less time to sort through the many choices and pick the best one. Recommendation engines are among the most widely used applications of machine learning, and they contribute to company productivity and marketing growth.

Here are a few benefits of a recommendation system for your business:

Drive Traffic to your website.
Deliver Relevant Content to customers.
Create Customer Satisfaction.
Generate Revenue.
Reduce Workload and Overhead on your staff and budget.

In this article, we will go through the main types of recommendation engine algorithms and the fundamentals of designing them in Python. Along the way, we will build our own movie recommendation engine using different approaches and produce some meaningful recommendations.

1. What is a Recommendation Engine?
2. Content-based Filtering
3. Collaborative Filtering
4. Hybrid Recommendation System
5. Matrix Factorization
6. Recommendation Engine Model
7. Observations
8. Conclusion

Let’s understand the power of the recommendation engine.

1. What is a Recommendation Engine?

A recommendation engine suggests products that customers might wish to purchase, based on each customer’s behavior, interests, browsing history, and similarity to other customers.

First, the recommendation engine collects data from customers in the form of ratings and comments. It then stores the data in a database and filters it to extract the relevant information needed to predict the final recommendations.

There are different types of recommendation engines. Let’s look at them one by one.

2. Content-based Filtering

Content-based filtering relies on the description of a product or the keywords used to describe it. This technique studies a user’s preferences and then delivers the most relevant recommendations. If you like a product in category ‘A’, most of the recommendations you get will be products from category ‘A’. For example, when a user usually watches movies in the romance genre, the engine assumes that he mostly likes romantic movies and suggests the most popular romantic movies to him.

There are two main vectors: the profile vector, which contains the past behavior of the user, and the item vector, which contains the details of each movie, such as genre and overview.

To deliver the most relevant recommendations, the content-based filtering technique computes the cosine of the angle between the profile vector and the item vector, known as the cosine similarity.

Suppose A = profile vector and B = item vector, then the similarity between them can be calculated as:

similarity(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

Cosine similarity values range from -1 to 1. The movies are arranged in descending order of similarity, and the topmost ones are delivered to the user as recommendations. The main drawback of this technique is that it tends to recommend movies from the same genre only. If we want recommendations from another genre, it may not perform well.
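To make the ranking step concrete, here is a minimal sketch with NumPy. The three-dimensional genre weights, the movie names, and the helper name cosine_sim are all made up for illustration; a real profile vector would be built from the user's history.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical genre weights: [romance, action, comedy]
profile = [0.9, 0.1, 0.4]  # what the user has liked so far (assumed values)
items = {
    "Romantic drama": [1.0, 0.0, 0.2],
    "Action thriller": [0.0, 1.0, 0.1],
    "Romantic comedy": [0.7, 0.0, 0.8],
}

# Score every item against the profile, then sort from most to least similar
scores = {title: cosine_sim(profile, vec) for title, vec in items.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The two romance-heavy items end up ahead of the action item, which also illustrates the drawback mentioned above: the user keeps getting more of the same genre.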

3. Collaborative Filtering

Collaborative filtering techniques mainly deal with users’ preferences, activities, and behavior. They give recommendations to a user based on similarity with other like-minded users. Collaborative filtering needs no additional information about the items; we just need to collect and analyze user behavior.

There are several types of collaborative filtering algorithms:

User-User collaborative filtering

[Figure: Deep Learning for Collaborative Filtering (using FastAI). Source: mc.ai]

As the name suggests, this technique determines the similarity between users. Based on the similarity score, it finds similar users and then recommends to each of them the products that the others have purchased before.

Users with a higher correlation tend to be more similar. Let’s understand this with an example. Suppose user1 watched movie1, movie2, and movie3, and user2 watched movie1, movie3, and movie4. Based on the similarity with user1, user2 gets a recommendation to watch movie2, and similarly user1 gets a recommendation to watch movie4.

Calculating the similarity for every pair of users and a prediction for each similarity score takes a lot of time. This technique is feasible only when the number of users is small.
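The user1/user2 example can be sketched in a few lines. This is only a toy illustration: the watch matrix, the extra user3, and the helper recommend_for are assumptions added here, and cosine similarity stands in for whatever similarity measure a real system would use.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are movies; 1 means "watched" (toy data)
watched = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [1, 0, 1, 1, 0],
     [0, 0, 0, 0, 1]],
    index=["user1", "user2", "user3"],
    columns=["movie1", "movie2", "movie3", "movie4", "movie5"],
)

# Pairwise user-user similarity matrix
user_sim = pd.DataFrame(
    cosine_similarity(watched), index=watched.index, columns=watched.index
)

def recommend_for(user):
    """Recommend what the most similar other user watched and `user` did not."""
    neighbour = user_sim[user].drop(user).idxmax()
    unseen = (watched.loc[neighbour] == 1) & (watched.loc[user] == 0)
    return list(unseen[unseen].index)

print(recommend_for("user1"))  # ['movie4']
print(recommend_for("user2"))  # ['movie2']
```

Note that the similarity matrix has one entry per pair of users, which is why the cost grows quickly as the number of users increases.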

Item-Item collaborative filtering

This technique is very similar to user-user collaborative filtering, but instead of finding similarities between users, we compute the similarity score between items.

In our movie example, we find similarities between the movies and then recommend them based on the user’s interests and behavior. Suppose movie1 and movie3 have the highest similarity score and the user has watched movie3; then movie1 will be recommended to the user.

When a new user is introduced, it is hard to recommend products because he has no browsing history, so we suggest the most popular products in the overall dataset. When a new product is introduced, we have to wait for users to interact with it before we can recommend it. This is known as the cold-start problem.
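A small sketch of the item-item idea follows, assuming a made-up ratings table in which 0 means "not rated". Treating unrated entries as zeros is a simplification for illustration only, and most_similar_item is a hypothetical helper.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are movies; 0 = not rated (toy data)
ratings = pd.DataFrame(
    {"movie1": [5, 4, 0, 1],
     "movie2": [0, 1, 5, 4],
     "movie3": [4, 5, 1, 0]},
    index=["u1", "u2", "u3", "u4"],
)

# Transpose so that each row is an item, then compare items pairwise
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T), index=ratings.columns, columns=ratings.columns
)

def most_similar_item(movie):
    """Return the item with the highest similarity to `movie`, excluding itself."""
    return item_sim[movie].drop(movie).idxmax()

print(most_similar_item("movie3"))  # movie1
```

Here movie1 and movie3 are rated alike by the same users, so a user who watched movie3 would be pointed to movie1.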

4. Hybrid Recommendation System

A hybrid recommendation system combines collaborative and content-based recommendations. It can be implemented by making content-based and collaborative predictions separately and then combining them, or by adding content-based capabilities to a collaborative approach (and vice versa).
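One simple way to combine separate predictions is a weighted blend. The sketch below is only one possible scheme; the function name hybrid_scores, the weight alpha, and the score values are assumptions for illustration, and both score lists are assumed to be on the same scale.

```python
import numpy as np

def hybrid_scores(content_scores, collab_scores, alpha=0.5):
    """Blend two score arrays; alpha is the weight of the content-based part."""
    content = np.asarray(content_scores, dtype=float)
    collab = np.asarray(collab_scores, dtype=float)
    return alpha * content + (1 - alpha) * collab

# Hypothetical scores for three candidate movies from each model
content = [0.9, 0.2, 0.5]
collab = [0.1, 0.8, 0.6]

blended = hybrid_scores(content, collab, alpha=0.5)
best = int(np.argmax(blended))  # index of the movie with the highest blended score
print(blended, best)
```

Neither model ranks the third movie first on its own, but it wins once both signals are taken into account.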

5. Matrix Factorization

In the user-item matrix, there are two dimensions:

The number of users
The number of items

When the user-item matrix is sparse (mostly empty), we can improve the algorithm’s performance by reducing the dimensions of the matrix. Matrix factorization serves that purpose.

In mathematical terms, factorization simply means breaking a number into a product of its factors, for example 120 = 12 * 10.

Here, we break the user-item matrix down into a product of two smaller matrices: the user matrix and the item matrix. If the user-item matrix has dimensions a x b, it can be approximated by a product of two matrices with dimensions a x k and k x b respectively.

Here k is the number of latent features that describe how a user rates the products.
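The shapes involved can be illustrated with a truncated SVD in NumPy. This is a sketch under simplifying assumptions: the 4 x 4 matrix is toy data, and SVD treats the zeros as real ratings, whereas practical recommenders usually fit the factors on the observed ratings only.

```python
import numpy as np

# Toy user-item matrix (a = 4 users, b = 4 items); 0 = not rated
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

k = 2  # number of latent features to keep

# Full SVD, then keep only the k strongest components
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # shape a x k
item_factors = Vt[:k, :]          # shape k x b

# The product of the two small matrices approximates the original matrix
R_approx = user_factors @ item_factors
print(user_factors.shape, item_factors.shape)  # (4, 2) (2, 4)
print(np.round(R_approx, 1))
```

Each user and each item is now described by only k numbers, and the predicted rating for a user-item pair is the dot product of the two k-length vectors.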

6. Recommendation Engine Model

Let’s build the recommendation engine model. First, import all the required dependencies in your code editor:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import ast 
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
%matplotlib inline

After importing the dependencies, we have to load the dataset. I am using the MovieLens datasets to build the recommendation engine model. You will get the required dataset here.

movies_df = pd.read_csv('movies_metadata.csv')
ratings = pd.read_csv('ratings_small.csv')
movies_df.columns

As you can see, there are plenty of columns that are not required to make predictions, so it is better to drop them from the movies_df data frame.

movies_df = movies_df.drop(['belongs_to_collection', 'budget', 'homepage', 'original_language', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'video', 'poster_path', 'production_companies', 'production_countries'], axis = 1)
movies_df.head()

Here, you can check that all the unneeded columns are dropped and the data frame contains the required columns only.

Each entry in the ‘genres’ column is a string holding a list of dictionaries with the keys id and name, so we have to extract the name, which is the movie genre, from each dictionary and keep only the name values in the column.

movies_df['genres'] = movies_df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
movies_df['genres'].head()

You can see the output above. Now we have the modified data and we can make recommendations.
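To see what this transformation does without loading the full dataset, here is the same chain applied to a tiny hand-made frame. The id values and genre strings are invented for illustration; only the shape of the data mirrors the genres column.

```python
from ast import literal_eval

import pandas as pd

# Each cell is a string that looks like a Python list of dicts (toy data)
toy = pd.DataFrame({"genres": [
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 18, 'name': 'Drama'}]",
    None,  # missing value
]})

toy["genres"] = (
    toy["genres"]
    .fillna("[]")          # missing values become an empty list literal
    .apply(literal_eval)   # safely parse the string into Python objects
    .apply(lambda x: [d["name"] for d in x] if isinstance(x, list) else [])
)
print(toy["genres"].tolist())
# [['Animation', 'Comedy'], ['Drama'], []]
```

literal_eval only evaluates Python literals, so it is a safer choice than eval for parsing strings that come from a data file.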

Approach 1: By Weighted Average

All we have to do is sort our movies based on vote_count and vote_average and display the top movies on our list.

# vote_count
V = movies_df[movies_df['vote_count'].notnull()]['vote_count'].astype('float')
# vote_average
R = movies_df[movies_df['vote_average'].notnull()]['vote_average'].astype('float')
# mean of vote_average
C = R.mean()
# minimum votes required to get in the list
M = V.quantile(0.95)

Here we define an appropriate value for M, the minimum number of votes (vote_count) a movie needs in order to appear in the weighted_average ranking.
We use the 95th percentile as our cutoff. In other words, for a movie to feature in the ranking, it must have at least as many votes as 95% of the movies on the list.

df = pd.DataFrame()
df = movies_df[(movies_df['vote_count'] >= M) & (movies_df['vote_average'].notnull())][['title','vote_count','vote_average','popularity','genres','overview']]

We defined M earlier, so now we keep only those movies whose ‘vote_count’ is at least the required number of votes.

Now we are all set to calculate Weighted_average:

df['Weighted_average'] = ((R*V) + (C*M))/(V+M)
recm_movies = df.sort_values('Weighted_average', ascending=False).head(500)
recm_movies.head()

We calculated the Weighted_average for every movie. After sorting the df data frame by Weighted_average in descending order, we store the 500 movies with the highest Weighted_average values in the recm_movies data frame.

From the output, The Shawshank Redemption has the highest weighted_average, and so on. Visualizing the results gives more meaningful insights.

Here you get the topmost movie recommendations based on the weighted_average calculated from vote_count and vote_average.
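The effect of the formula is easier to see on plain numbers. In this sketch R is a movie's vote_average, V its vote_count, C the mean vote_average over all movies, and M the minimum vote count; the concrete values below are invented for illustration and do not come from the dataset.

```python
def weighted_average(R, V, C, M):
    """Blend a movie's own rating R with the global mean C, weighted by votes."""
    return (R * V + C * M) / (V + M)

# Assumed global mean rating and vote cutoff
C, M = 6.0, 1000

# A highly rated movie that barely clears the cutoff...
few_votes = weighted_average(R=9.0, V=1200, C=C, M=M)
# ...versus a slightly lower rated movie with far more votes
many_votes = weighted_average(R=8.5, V=20000, C=C, M=M)

print(round(few_votes, 2), round(many_votes, 2))  # 7.64 8.38
```

With few votes the score is pulled strongly towards the global mean C, so the movie with the larger and therefore more trustworthy vote count ranks higher even though its raw average is lower.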

Approach 2: By Popularity and Genre

Now that we have covered approach 1 by weighted_average, let’s move on to approach 2 by popularity and genre.

Here, we define a data frame popular, which contains the movies sorted from high popularity to low popularity.

popular = pd.DataFrame()
popular = recm_movies.copy()
popular['popularity'] = recm_movies[recm_movies['popularity'].notnull()]['popularity'].astype('float')
popular = popular.sort_values('popularity',ascending = False)
popular.head()

The output above shows that Wonder Woman has the highest popularity, and so on. Let’s visualize it.

Now you can analyze it better, and these popular movies can also be recommended to users.

You may also want recommendations from a particular genre of movies, sorted by Weighted_average from highest to lowest.

As you know, the genres column in the dataset holds a list of multiple values, so first we have to give each row a single genre value.

s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_movies = recm_movies.drop('genres', axis=1).join(s)
gen_movies.head(10)
#gen_movies.columns

Looking at the genre column in the output, you can see what the code above does: each row now holds a single genre value.
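The same "one genre per row" reshaping can be tried on a tiny frame. The sketch below uses pandas' DataFrame.explode, which is an alternative to the apply/stack/join code above rather than what this article's notebook uses; the titles and genres are toy data.

```python
import pandas as pd

# Toy frame with a list of genres per movie
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [["Action", "Drama"], ["Comedy"]],
})

# explode() repeats the row once for every element of the list
per_genre = toy.explode("genres").rename(columns={"genres": "genre"})
print(per_genre)
```

Movie A now appears twice, once as Action and once as Drama, so a filter such as genre == 'Action' picks up every movie that lists Action among its genres.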

Let us see our method in action by displaying the Top 10 Action Movies,

df_w = gen_movies[(gen_movies['genre'] == 'Action') & (gen_movies['vote_count'] >= M)]
df_w.sort_values('Weighted_average', ascending=False).head(10)

We get the top 10 Action movie recommendations, as seen above. Let’s visualize the results.

df_w = df_w.sort_values('Weighted_average', ascending = False)
plt.figure(figsize=(12,6))
axis1=sns.barplot(x=df_w['Weighted_average'].head(10), y=df_w['title'].head(10), data=df_w)
plt.xlim(4, 10)
plt.title('Best Action Movies by weighted average', weight='bold')
plt.xlabel('Weighted Average Score', weight='bold')
plt.ylabel('Action Movie Title', weight='bold')

Here we get the top 10 Action movie recommendations based on the Weighted_average score. You can also try to get movie recommendations for other genres such as drama, romance, or crime.

Approach 3: By Content-based filtering

Let us first try to build a recommendation engine using a movie overview.

Whenever we create a recommendation engine, we have to represent each and every movie as a vector. The reason is that our recommendation engine depends on pairwise similarity, and to compute this similarity we need a vector for each movie.

The overview column in the dataset contains sentences, i.e. free text, so we use a TF-IDF vectorizer to create a document-term matrix from these sentences.

cont_recm = recm_movies.copy()
cont_recm.head()

from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words='english')

# Replace missing overviews with an empty string
cont_recm['overview'] = cont_recm['overview'].fillna('')

We can’t pass raw strings directly to our recommendation model. The TF-IDF vectorizer first converts each overview into a numeric vector, with one weight for every term in the vocabulary. We can then apply a measure such as cosine similarity, which works on this matrix of vectors. In short, the TF-IDF vectorizer gives us numeric values for the entire overview column.

# Fitting the TF-IDF on the 'overview' text
tfv_matrix = tfv.fit_transform(cont_recm['overview'])

# Finding the cosine similarity
cos_sim = linear_kernel(tfv_matrix, tfv_matrix)

# Map each movie title to its row position
cont_recm = cont_recm.reset_index()
indices = pd.Series(cont_recm.index, index=cont_recm['title'])
indices.head(20)

Since the TF-IDF vectorizer produces vectors normalized to unit length by default, taking the dot product with linear_kernel directly gives us the cosine similarity score.

We now have a pairwise cosine similarity matrix for all the movies in cont_recm.
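The claim that the dot product of TF-IDF vectors equals their cosine similarity can be checked on a few sentences. The three overview-like strings below are made up for illustration, and the vectorizer uses its default settings apart from the English stop-word list.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy "overviews" (invented text)
docs = [
    "a young farm boy joins the rebellion in space",
    "rebels fight an evil empire in space",
    "two strangers fall in love on a train",
]

# Rows are L2-normalized by default (norm='l2')
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # normalizes the rows first, then dot products

print(np.allclose(dot, cos))  # True
print(np.round(dot, 2))
```

Because the rows already have unit length, linear_kernel skips the normalization step that cosine_similarity performs and returns the same matrix.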

Now we write a function that returns the 10 most similar movies based on the cos_sim score.

def sugg_recm(title):
    # Get the index corresponding to the title
    idx = indices[title]
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cos_sim[idx]))
    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return cont_recm['title'].iloc[movie_indices]

Now we call the function with a movie title:

sugg_recm('Star Wars').head(10)

We got movie recommendations related to Star Wars, based on the cos_sim score. Let’s check another movie.

sugg_recm('Dilwale Dulhania Le Jayenge').head(10)

Great, we accomplished our goal. When you implement this in your own code editor, you can verify the recommendation engine with one or more examples and check the results. These movie recommendations are based entirely on the cosine similarity score between the movies, as discussed earlier.

7. Observations

From approach 1, we found that The Shawshank Redemption has the highest weighted_average score.
From approach 2, we found that Wonder Woman is the most popular of all the movies in the dataset. When a new user is introduced, this movie will be recommended to him, as he has no browsing history yet.
From approach 2, we also got movie recommendations for a particular genre, ranked by Weighted_average score from highest to lowest.
From approach 3, we got recommendations based on content-based filtering using the TF-IDF vectorizer and cosine similarity.

8. Conclusion

A product recommendation engine runs mainly on data, in the form of users’ ratings, comments, behavior, preferences, and more. A recommendation engine helps drive customers to your business. It also helps to generate revenue, create customer satisfaction, discover new shopping trends, personalize content to individual interests, and provide reports more effectively.

That’s all folks !!

See you in my next post !!


All the dataset files and the Python notebook are available on my GitHub repo.

Amey23/Recommention-System
