Nagesh Singh Chauhan
Jul 5 · 12 min read

In this two-part article series, we will learn how to build a recommendation engine in Python, from basic models to content-based and collaborative filtering recommender systems.

Recommender engines, or recommender systems, are among the most popular applications of data science today. They are used to predict the “rating” or “preference” that a user would give to an item. Almost every major tech company has applied them in some form or other: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow. What’s more, for some companies (think Netflix and Spotify), the business model and its success revolve around the potency of their recommendations.

Netflix’s long list of suggested movies and TV shows is a fantastic example of personalized user experience. In fact, about 70 percent of everything users watch is a personalized recommendation, according to the company.

Getting to that point hasn’t been easy, and improving on its recommendation system is an ongoing process. Netflix has spent well over a decade developing and refining its recommendations.

In a very general way:

Recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy or anything else depending on industries).

Why are recommender systems needed?

As the World Wide Web continues to grow at an exponential rate, the size and complexity of many web sites grow along with it. For the users of these web sites, it becomes increasingly difficult and time-consuming to find the information they are looking for. User interfaces could help users find the information that is in accordance with their interests by personalizing a web site.

Some web sites present users with personalized information by letting them choose from a set of predefined topics of interest. Users, however, do not always know what they are interested in beforehand, and their interests may change over time, which would require them to change their selection frequently. Recommender systems instead provide personalized information by learning the user’s interests from traces of that user's interactions.

Broadly, Recommender Systems can be classified into 3 types:

  • Simple recommenders: offer generalized recommendations to every user, based on the popularity of the restaurant. The basic idea behind this system is that restaurants that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.
  • Content-based recommenders: suggest similar items based on a particular item. This system uses item metadata, such as Locality, Cuisine, rating, etc. for restaurants, to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.
  • Collaborative filtering engines: these systems try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

In the first part of this article series, we will see how to build basic models of simple as well as content-based recommender systems. While these models will be nowhere close to the industry standard in terms of complexity, quality or accuracy, they will help you get started with building more complex models that produce even better results.

In the second part, we will create a collaborative filtering recommender system using the K-Nearest Neighbour (K-NN) machine learning algorithm. You may want to get comfortable with the K-NN algorithm before going ahead with collaborative filtering.

Let's start building a Restaurant Recommendation Engine using the techniques discussed above, which should be capable of recommending the restaurants that best suit you.

We will use Zomato restaurants data and it can be downloaded from here.


Simple Recommenders

As described in the previous section, simple recommenders are basic systems that recommend the top items based on a certain score. In this section, you will build a simplified model which will give you the top 10 restaurants of your city based on user rating and score.

The following are the steps involved:

  • Decide on the metric or score to rate restaurants.
  • Calculate the score for every restaurant.
  • Sort the restaurants based on the score and output the top results.
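
The three steps above can be sketched on a tiny, made-up table before we touch the real data. The DataFrame `toy`, its column names and the `naive_score()` helper are all hypothetical and exist only for this illustration; the score used here is deliberately the simplest possible one (the raw rating).

```python
import pandas as pd

# Hypothetical toy data; names and numbers are made up for illustration
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": [4.5, 3.9, 4.8, 4.1],
    "votes": [120, 900, 15, 400],
})

# Step 1: decide on a score (here, a deliberately naive one: the raw rating)
def naive_score(row):
    return row["rating"]

# Step 2: calculate the score for every restaurant
toy["score"] = toy.apply(naive_score, axis=1)

# Step 3: sort by score and output the top results
top = toy.sort_values("score", ascending=False).head(3)
print(top[["name", "rating", "votes", "score"]])
```

Note that restaurant C comes out on top with only 15 votes, which is exactly the weakness of raw ratings discussed later in this section.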

Load the libraries

import numpy as np 
import pandas as pd
import re
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
import plotly.graph_objs as go
import seaborn as sns
import matplotlib.pyplot as plt

Load the datasets

data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/zomato.csv', encoding ='latin1')
country = pd.read_excel("/Users/nageshsinghchauhan/Downloads/ML/recommend/Country-Code.xlsx")

Now, let's merge the two files.

data1 = pd.merge(data, country, on='Country Code')
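
To see what this merge does, here is a minimal sketch on two tiny stand-in tables. The frames `restaurants` and `codes` and every value in them are assumptions made up for illustration; only the shared `Country Code` key mirrors the real files.

```python
import pandas as pd

# Hypothetical stand-ins for zomato.csv and Country-Code.xlsx
restaurants = pd.DataFrame({
    "Restaurant Name": ["R1", "R2", "R3"],
    "Country Code": [1, 216, 1],
})
codes = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# Inner join (the default) on the shared key column:
# every restaurant row gains the matching 'Country' value
merged = pd.merge(restaurants, codes, on="Country Code")
print(merged)
```

Because the default is an inner join, a restaurant whose code is missing from the country table would be dropped from the result.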

Start with some data exploration.

  1. Let us check countries where the maximum number of restaurants are registered on Zomato.
labels = list(data1.Country.value_counts().index)
values = list(data1.Country.value_counts().values)
fig = {
    "data": [
        {
            "labels": labels,
            "values": values,
            "hoverinfo": 'label+percent',
            "domain": {"x": [0, .9]},
            "hole": 0.6,
            "type": "pie",
            "rotation": 120,
        },
    ],
    "layout": {
        "title": "Zomato's Presence around the World",
        "annotations": [
            {
                "font": {"size": 20},
                "showarrow": True,
                "text": "Countries",
                "x": 0.2,
                "y": 0.9,
            },
        ]
    }
}
iplot(fig)
pie chart showing countries where the maximum number of restaurants are registered on Zomato

2. Let us check Zomato’s presence in the top 10 Indian cities.

res_India = data1[data1.Country == 'India']
labels1 = list(res_India.City.value_counts().index)
values1 = list(res_India.City.value_counts().values)
# Keep only the 10 cities with the most restaurants
labels1 = labels1[:10]
values1 = values1[:10]
fig = {
    "data": [
        {
            "labels": labels1,
            "values": values1,
            "hoverinfo": 'label+percent',
            "domain": {"x": [0, .8]},
            "hole": 0.6,
            "type": "pie",
            "rotation": 120,
        },
    ],
    "layout": {
        "title": "",
        "annotations": [
            {
                "font": {"size": 20},
                "showarrow": True,
                "text": "Cities",
                "x": 0.2,
                "y": 0.9,
            },
        ]
    }
}
iplot(fig)
pie chart showing the top 10 cities where the maximum number of restaurants are registered on Zomato

3. Number of restaurants in NCR (the four cities New Delhi, Gurgaon, Noida and Faridabad, together called NCR) at each aggregate rating, ranging from 1.9 to 4.9

NCR = ['New Delhi','Gurgaon','Noida','Faridabad']
# Keep only the restaurants located in one of the four NCR cities
res_NCR = res_India[res_India.City.isin(NCR)]
# Drop unrated restaurants (aggregate rating of 0)
agg_rat = res_NCR[res_NCR['Aggregate rating'] > 0]
f, ax = plt.subplots(1,1, figsize = (14, 4))
# Keyword arguments are used here because they work across seaborn versions
sns.countplot(x='Aggregate rating', data=agg_rat, ax=ax)
plt.show()
Distribution of aggregate ratings in the NCR region

4. Top 10 Cuisines served by restaurants.

res_India['Cuisines'].value_counts().sort_values(ascending=False).head(10)
res_India['Cuisines'].value_counts().sort_values(ascending=False).head(10).plot(kind='pie',figsize=(10,6),
title="Most Popular Cuisines", autopct='%1.2f%%')
plt.axis('equal')
Top 10 Cuisines served by restaurants.

Now, we are going to select “India” as the country and NCR (New Delhi, Gurgaon, Noida, Faridabad) as the region.

res_India = data1[data1.Country == 'India']
NCR = ['New Delhi','Gurgaon','Noida','Faridabad']
res_NCR = res_India[(res_India.City == NCR[0])|(res_India.City == NCR[1])|(res_India.City == NCR[2])|
(res_India.City == NCR[3])]

One of the most basic metrics you can think of is the rating. However, using this metric has a few caveats. For one, it does not take into consideration the popularity of a restaurant. Therefore, a restaurant with a rating of 9 from 10 voters will be considered ‘better’ than a restaurant with a rating of 8.9 from 10,000 voters.

On a related note, this metric will also tend to favor restaurants with a small number of voters and skewed and/or extremely high ratings. As the number of voters increases, the rating of a restaurant stabilizes and approaches a value that reflects the restaurant’s quality. It is more difficult to discern the quality of a restaurant with extremely few voters.

Taking these shortcomings into consideration, it is necessary that you come up with a weighted rating that takes into account the average rating and the number of votes it has garnered.

Here, we will use the following weighted rating formula as our metric/score. Mathematically, it is represented as:

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where WR is the weighted rating,

  • v is the number of votes for the restaurant;
  • m is the minimum votes required to be listed in the chart;
  • R is the average rating of the restaurant; and
  • C is the mean rating across the whole dataset.
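
As a quick sanity check, the formula can be applied to the two restaurants from the example above: one rated 9 by 10 voters and one rated 8.9 by 10,000 voters. The values `m = 100` and `C = 7.0` are assumptions picked purely for illustration; they are not taken from the Zomato data.

```python
def weighted_rating(v, R, m, C):
    """WR = v/(v+m) * R + m/(v+m) * C"""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical cutoff and global mean, chosen only for this example
m, C = 100, 7.0

wr_few = weighted_rating(v=10, R=9.0, m=m, C=C)        # 9 from 10 voters
wr_many = weighted_rating(v=10_000, R=8.9, m=m, C=C)   # 8.9 from 10,000 voters

print(round(wr_few, 2), round(wr_many, 2))  # 7.18 8.88
```

With few votes the score is pulled towards the global mean C, so the restaurant with 10,000 voters now ranks higher even though its raw rating is slightly lower.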

You already have the values for v (“Votes”) and R (“Aggregate rating”) for each restaurant in the dataset. It is also possible to calculate C directly from this data.

What you need to determine is an appropriate value for m, the minimum votes required to be listed in the chart. There is no right value for m. You can view it as a preliminary negative filter that ignores restaurants which have fewer than a certain number of votes. The selectivity of your filter is up to your discretion.

In this case, you will use the 90th percentile as your cutoff. In other words, for a restaurant to feature in the charts, it must have more votes than at least 90% of the restaurants in the list. (On the other hand, if you had chosen the 75th percentile, you would have considered the top 25% of the restaurants in terms of the number of votes garnered. As percentile decreases, the number of restaurants considered increases. Feel free to play with this value and observe the changes in your final chart).
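
The effect of the cutoff is easy to see on a small example. The vote counts below are made up; the sketch only shows how lowering the percentile lets more restaurants through the filter.

```python
import pandas as pd

# Hypothetical vote counts for ten restaurants
votes = pd.Series([5, 12, 30, 45, 80, 150, 200, 320, 500, 1200])

qualifying = {}
for q in (0.90, 0.75):
    cutoff = votes.quantile(q)                     # vote count at quantile q
    qualifying[q] = int((votes >= cutoff).sum())   # restaurants that pass the filter
    print(f"quantile={q:.2f} cutoff={cutoff:.1f} qualifying={qualifying[q]}")
```

Here one restaurant qualifies at the 90th percentile and three qualify at the 75th.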

As a first step, let’s calculate the value of C, the mean rating across all restaurants:

data_new_delphi=res_NCR[['Restaurant Name','Cuisines','Locality','Aggregate rating', 'Votes']]
C = data_new_delphi['Aggregate rating'].mean()
print(C)
#2.39583438526

The average rating of a restaurant in the NCR region is around 2.39, on a scale of 5.

Next, let’s calculate the number of votes, m, received by a restaurant at the 90th percentile. The pandas library makes this task trivial with the .quantile() method of a pandas Series:

m = data_new_delphi['Votes'].quantile(0.90)
print(m)
#234.0

Next, you can filter the restaurants that qualify for the chart, based on their vote counts:

# Filter out all qualified restaurants into a new DataFrame
q_restaurant = data_new_delphi.copy().loc[data_new_delphi['Votes'] >= m]
q_restaurant.shape
#(795, 5)

You use the .copy() method to ensure that the new q_restaurant DataFrame is independent of the original data_new_delphi DataFrame. In other words, any changes made to q_restaurant do not affect data_new_delphi.
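
A minimal sketch of that independence, using a made-up three-row frame (the names `original` and `qualified` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the restaurant DataFrame
original = pd.DataFrame({"Votes": [10, 300, 500]})

# Filter and take an explicit copy, as in the step above
qualified = original.loc[original["Votes"] >= 234].copy()

# Adding a column to the copy leaves the original untouched
qualified["score"] = 0.0
print(list(original.columns))   # ['Votes']
print(list(qualified.columns))  # ['Votes', 'score']
```

Taking the copy up front also avoids pandas' SettingWithCopyWarning when the new column is added.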

You see that there are 795 restaurants which qualify to be in this list. Now, you need to calculate your metric for each qualified restaurant. To do this, you will define a function, weighted_rating() and define a new feature score, of which you'll calculate the value by applying this function to your DataFrame of qualified restaurants:

# Function that computes the weighted rating of each restaurant
def weighted_rating(x, m=m, C=C):
    v = x['Votes']
    R = x['Aggregate rating']
    # Calculating the score
    return (v/(v+m) * R) + (m/(m+v) * C)

# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_restaurant['score'] = q_restaurant.apply(weighted_rating, axis=1)
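
As an aside, the same score can be computed without a row-wise apply by operating on whole columns, which is usually faster on large frames. This is an alternative sketch, not the approach used in the rest of this article; the frame `toy` is made up, and `m` and `C` are rounded from the values printed earlier.

```python
import pandas as pd

# Hypothetical qualified restaurants
toy = pd.DataFrame({
    "Votes": [250, 1000, 4000],
    "Aggregate rating": [4.9, 4.2, 4.5],
})
m, C = 234.0, 2.4  # rounded from the values computed earlier in this article

# Column-wise (vectorized) version of the weighted rating
v, R = toy["Votes"], toy["Aggregate rating"]
toy["score_vectorized"] = (v / (v + m)) * R + (m / (v + m)) * C

# Row-wise version, equivalent to applying a function with axis=1
toy["score_apply"] = toy.apply(
    lambda x: (x["Votes"] / (x["Votes"] + m)) * x["Aggregate rating"]
    + (m / (m + x["Votes"])) * C,
    axis=1,
)

print(toy[["score_vectorized", "score_apply"]])
```

Both columns hold the same values; only the way they are computed differs.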

Finally, let’s sort the DataFrame based on the score feature and output the Restaurant Name, Votes, Aggregate rating and weighted rating or score of the top 10 restaurants.

#Sort restaurant based on score calculated above
q_restaurant = q_restaurant.sort_values('score', ascending=False)
#Print the top 10 restaurants in Delhi NCR
q_restaurant[['Restaurant Name','Cuisines', 'Locality','Votes', 'Aggregate rating', 'score']].head(10)
Top 10 restaurants in NCR region

As you can see, the chart lists the top restaurants ranked by the weighted score, which combines the aggregate rating with the number of votes.


Content-Based Recommender

The recommender we just built suffers from a severe limitation: it gives the same recommendations to everyone, regardless of the user’s personal preferences.


In this section, we are going to build an engine that computes the similarity between restaurants based on certain metrics and suggests the restaurants most similar to a particular restaurant that the user liked (the user input).

Let's start building by loading all the required libraries and our original Zomato restaurants data.

import numpy as np 
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from nltk.tokenize import word_tokenize
import seaborn as sns
import matplotlib.pyplot as plt
#load the dataset
data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/zomato.csv', encoding ='latin1')

Check for NULL values in the City column.

data['City'].value_counts(dropna = False)

Select one city. I’ll go ahead with New Delhi because it has the maximum number of Zomato restaurants.

data_city =data.loc[data['City'] == 'New Delhi']

Now get the Restaurant Name, Cuisines, Locality and Aggregate rating columns for the restaurants in Delhi.

data_new_delphi=data_city[['Restaurant Name','Cuisines','Locality','Aggregate rating']]

Check for NULL values in the Locality column.

data_new_delphi['Locality'].value_counts(dropna = False).head(5)

Now select a locality in Delhi, let us say “Connaught Place” (you can choose any locality as per your choice).

data_new_delphi.loc[data_new_delphi['Locality'] == 'Connaught Place']

Our next step is to create a function that will take Locality and Restaurant Name as input parameters and will give the top 10 recommended restaurants.

In its current form, the cuisine text cannot be compared directly between two restaurants. To do this, you need to compute a word vector for each restaurant's cuisine list (the "document").

You will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This will give you a matrix where each column represents a word in the cuisine vocabulary (all the words that appear in at least one document) and each row represents a restaurant.

TF-IDF is a statistical method for evaluating the significance of a word in a given document.

TF: Term frequency (tf) refers to how many times a given term appears in a document.

IDF: Inverse document frequency (idf) measures how rare the word is across the entire collection of documents; the fewer documents that contain it, the higher its weight.

The intuition behind TF-IDF is that terms which appear in many documents are less informative than terms which appear in only a few.

Fortunately, scikit-learn gives you a built-in TfidfVectorizer class that produces the TF-IDF matrix quite easily.
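
Here is a minimal sketch of what the vectorizer produces, on three made-up cuisine "documents" (the strings in `docs` are assumptions for illustration, written in the same space-free token style used later in this article):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cuisine documents, one per restaurant
docs = [
    "NorthIndian Mughlai",
    "NorthIndian Chinese",
    "NorthIndian Pizza",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

# Rows are documents (restaurants), columns are vocabulary terms
print(matrix.shape)  # (3, 4)

# Compare the weights of a common and a rare term in the first document
vocab = tfidf.vocabulary_          # maps each (lowercased) term to its column
row0 = matrix[0].toarray().ravel()
print(row0[vocab["northindian"]] < row0[vocab["mughlai"]])  # True
```

"NorthIndian" appears in every document, so it receives a lower weight than "Mughlai", which appears in only one.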

Now that we have this matrix, we can easily compute a similarity score. There are several options for this, such as the Euclidean, the Pearson, and the cosine similarity scores. Again, there is no right answer as to which score is the best.

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between the two restaurants. You use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores). Mathematically, it is defined as follows:

$$\text{cosine}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

Since the TF-IDF vectorizer L2-normalizes each row by default, the dot product of two rows directly gives their cosine similarity score. Therefore, you will use sklearn's linear_kernel() instead of cosine_similarity(), since it skips the normalization step and is faster.
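
This equivalence can be checked on a small example. The documents in `docs` are made up, and the sketch assumes the vectorizer's default settings (in particular the default L2 normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical cuisine documents
docs = ["Pizza Italian", "Pizza FastFood", "NorthIndian Mughlai"]
X = TfidfVectorizer().fit_transform(docs)

# With the default settings every row already has unit length
row_norms = np.sqrt(X.multiply(X).sum(axis=1))

dot = linear_kernel(X, X)       # plain dot products
cos = cosine_similarity(X, X)   # dot products divided by the row norms

print(np.allclose(row_norms, 1.0), np.allclose(dot, cos))  # True True
```

The first two documents share the term "Pizza" and get a positive score, while the third shares no term with either of them and scores 0.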

Steps to follow :

1. Filter the data down to the selected locality
2. Reset the index so that row positions match the rows of the TF-IDF matrix
3. Feature extraction
4. Apply the TF-IDF vectorizer
5. Compute the cosine similarity
6. Collect the aggregate rating together with the cosine score in a list
7. Sort the restaurants based on the cosine similarity scores

def restaurant_recommend_func(location, title):
    # Keep only the restaurants in the requested locality and reset the index
    # so that row positions line up with the rows of the TF-IDF matrix
    data_sample = data_new_delphi.loc[data_new_delphi['Locality'] == location].copy()
    data_sample = data_sample.reset_index(drop=True)

    # Feature extraction: turn "North Indian, Fast Food" into "NorthIndian FastFood"
    # so that each cuisine becomes a single token (missing cuisines become '')
    data_sample['Split'] = data_sample['Cuisines'].fillna('').apply(
        lambda s: ' '.join(c.replace(' ', '') for c in s.split(',')))

    # TF-IDF vectorizer (English stop words removed)
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(data_sample['Split'])

    # Compute the cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

    # Construct a reverse map from restaurant name to row position,
    # keeping the first row when a name appears more than once
    indices = pd.Series(data_sample.index, index=data_sample['Restaurant Name'])
    indices = indices[~indices.index.duplicated(keep='first')]

    # Row position of the restaurant the user liked
    idx = indices[title]

    # Collect (row, cosine score, aggregate rating) for every restaurant
    # with a non-zero similarity
    sim_scores = []
    for i, j in enumerate(cosine_sim[idx]):
        k = data_sample['Aggregate rating'].iloc[i]
        if j != 0:
            sim_scores.append((i, j, k))

    # Sort by similarity, break ties by rating, and keep the top 10
    sim_scores = sorted(sim_scores, key=lambda x: (x[1], x[2]), reverse=True)[:10]
    rest_indices = [i[0] for i in sim_scores]

    data_x = data_sample[['Restaurant Name', 'Aggregate rating']].iloc[rest_indices].copy()
    data_x['Cosine Similarity'] = [round(s[1], 2) for s in sim_scores]
    return data_x

# Top 10 restaurants with cuisines similar to 'Pizza Hut' in Connaught Place
restaurant_recommend_func('Connaught Place', 'Pizza Hut')

As an input, we provided Connaught Place as Locality and Pizza Hut as the Restaurant Name.

Top 10 restaurants similar to Pizza Hut

As we can see, our engine returns the top 10 restaurants in Connaught Place, Delhi, that are most similar to Pizza Hut.

Now change only the restaurant name, let's say Barbeque Nation.

Top 10 restaurants similar to Barbeque Nation

Conclusion

In this article, we have learned how to make a Simple Recommender Engine and Content-based Recommendation Engine. In the next article, we will see how to build a collaborative-filtering Recommendation Engine using K-NN machine learning algorithm.

Hope you guys have enjoyed reading this article, let me know about your views/suggestions/questions in the comment section.

You can also reach out to me on LinkedIn with any queries.

How to build a Restaurant Recommendation Engine (Part-2)

Thanks for reading !!!

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Nagesh Singh Chauhan

Written by

Data Science | Big Data | Machine Learning | Python | https://www.linkedin.com/in/nagesh-singh-chauhan-6936bb13b/
