# How to build a Restaurant Recommendation Engine (Part-1)

In this two-part article series, we will learn how to build a recommendation engine with Python, moving from basic models to content-based and collaborative filtering recommender systems.

Recommender engines, or recommender systems, are among the most popular applications of data science today. They are used to predict the “rating” or “preference” that a user would give to an item. Almost every major tech company has applied them in some form: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow. What’s more, for some companies (think Netflix and Spotify), the business model and its success revolve around the potency of their recommendations.

Netflix’s long list of suggested movies and TV shows is a fantastic example of personalized user experience. In fact, about 70 percent of everything users watch is a personalized recommendation, according to the company.

Getting to that point hasn’t been easy, and improving on its recommendation system is an ongoing process. Netflix has spent well over a decade developing and refining its recommendations.

In a very general way:

Recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy or anything else depending on industries).

**Why are recommender systems needed?**

As the World Wide Web continues to grow at an exponential rate, the size and complexity of many web sites grow along with it. For the users of these web sites, it becomes increasingly difficult and time-consuming to find the information they are looking for. User interfaces could help users find the information that is in accordance with their interests by personalizing a web site.

Some web sites present users with personalized information by letting them choose from a set of predefined topics of interest. Users, however, do not always know what they are interested in beforehand and their interests may change over time which would require them to change their selection frequently. **Recommender systems** provide personalized information by learning the user’s interests from traces of interaction with that user.

Broadly, Recommender Systems can be classified into 3 types:

- **Simple recommenders**: offer generalized recommendations to every user, based on the popularity of the restaurant. The basic idea behind this system is that restaurants that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.
- **Content-based recommenders**: suggest similar items based on a particular item. This system uses item metadata, such as locality, cuisine and rating for restaurants, to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.
- **Collaborative filtering engines**: these systems try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Collaborative filters do not require item metadata like their content-based counterparts.

In this **1st part** of the series, we will see how to build a basic model of a simple recommender as well as a content-based recommender. While these models will be nowhere close to the industry standard in terms of complexity, quality or accuracy, they will help you get started with building more complex models that produce even better results.

In the **2nd part**, we will create a collaborative filtering recommender system using the K-Nearest Neighbour (K-NN) machine learning algorithm. You may want to get comfortable with the K-NN algorithm before going ahead with collaborative filtering.

Let's start building a restaurant recommendation engine using the techniques discussed above, one that is capable of recommending the restaurants that best suit you.

We will use the Zomato restaurants dataset, which can be downloaded from here.

# Simple Recommenders

As described in the previous section, simple recommenders are basic systems that recommend the top items based on a certain score. In this section, you will build a simplified model which will give you the top 10 restaurants of your city based on user rating and score.

The following are the steps involved:

- Decide on the metric or score to rate restaurants.
- Calculate the score for every restaurant.
- Sort the restaurants based on the score and output the top results.
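
The three steps above can be sketched on a tiny, invented dataset before touching the real one. This is only an illustration: the restaurant names, numbers and the placeholder score (rating scaled by the share of votes) are assumptions, not the metric used later in this article.

```python
import pandas as pd

# Hypothetical toy data; names and numbers are invented for illustration
toy = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Aggregate rating": [4.9, 4.5, 3.8, 4.2],
    "Votes": [12, 950, 400, 2300],
})

# Step 1: decide on a score (placeholder: rating scaled by the share of votes)
# Step 2: calculate the score for every restaurant
toy["score"] = toy["Aggregate rating"] * toy["Votes"] / toy["Votes"].sum()

# Step 3: sort by the score and output the top results
top = toy.sort_values("score", ascending=False).head(3)
print(top[["Restaurant Name", "score"]])
```

Restaurant A has the best rating but very few votes, so this placeholder score pushes it to the bottom. The weighted rating introduced later handles that trade-off more carefully.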

Load the libraries

    import numpy as np
    import pandas as pd
    import re
    from plotly.offline import init_notebook_mode, iplot
    import plotly.graph_objs as go
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Enable Plotly rendering inside the notebook
    init_notebook_mode()

Load the datasets

    # The paths below are specific to the author's machine; point them at your own copies
    data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/zomato.csv', encoding='latin1')
    country = pd.read_excel("/Users/nageshsinghchauhan/Downloads/ML/recommend/Country-Code.xlsx")

Now, let's merge the two files.

`data1 = pd.merge(data, country, on='Country Code')`
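
To see what this merge does, here is a minimal sketch on hypothetical toy frames. The rows and the code-to-country mapping are invented for illustration; only the `Country Code` join key mirrors the real files.

```python
import pandas as pd

# Hypothetical stand-ins for zomato.csv and Country-Code.xlsx
restaurants = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C"],
    "Country Code": [1, 1, 216],
})
countries = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# Inner join on the shared key: every restaurant row gains a 'Country' column
merged = pd.merge(restaurants, countries, on="Country Code")
print(merged)
```

Because `pd.merge` performs an inner join by default, a restaurant whose code is missing from the country table would be dropped from the result.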

Start with some data exploration.

First, let us check the countries where the largest number of restaurants are registered on Zomato.

    labels = list(data1.Country.value_counts().index)
    values = list(data1.Country.value_counts().values)

    fig = {
        "data": [
            {
                "labels": labels,
                "values": values,
                "hoverinfo": 'label+percent',
                "domain": {"x": [0, .9]},
                "hole": 0.6,
                "type": "pie",
                "rotation": 120,
            },
        ],
        "layout": {
            "title": "Zomato's Presence around the World",
            "annotations": [
                {
                    "font": {"size": 20},
                    "showarrow": True,
                    "text": "Countries",
                    "x": 0.2,
                    "y": 0.9,
                },
            ]
        }
    }

    iplot(fig)

Next, let us check Zomato’s presence in the top 10 Indian cities.

    res_India = data1[data1.Country == 'India']

    labels1 = list(res_India.City.value_counts().index)
    values1 = list(res_India.City.value_counts().values)

    # Keep only the 10 cities with the most restaurants
    labels1 = labels1[:10]
    values1 = values1[:10]

    fig = {
        "data": [
            {
                "labels": labels1,
                "values": values1,
                "hoverinfo": 'label+percent',
                "domain": {"x": [0, .8]},
                "hole": 0.6,
                "type": "pie",
                "rotation": 120,
            },
        ],
        "layout": {
            "title": "",
            "annotations": [
                {
                    "font": {"size": 20},
                    "showarrow": True,
                    "text": "Cities",
                    "x": 0.2,
                    "y": 0.9,
                },
            ]
        }
    }

    iplot(fig)

Then, let us look at the number of restaurants in the NCR (the four cities New Delhi, Gurgaon, Noida and Faridabad are together called the NCR) with an aggregate rating ranging from 1.9 to 4.9.

    NCR = ['New Delhi', 'Gurgaon', 'Noida', 'Faridabad']
    res_NCR = res_India[res_India.City.isin(NCR)]

    # Keep only restaurants with a non-zero aggregate rating
    agg_rat = res_NCR[res_NCR['Aggregate rating'] > 0]

    f, ax = plt.subplots(1, 1, figsize=(14, 4))
    # Pass the column as x= so that recent seaborn versions accept it
    sns.countplot(x=agg_rat['Aggregate rating'], ax=ax)
    plt.show()

Finally, let us look at the top 10 cuisines served by restaurants.

    res_India['Cuisines'].value_counts().sort_values(ascending=False).head(10)

    res_India['Cuisines'].value_counts().sort_values(ascending=False).head(10).plot(
        kind='pie', figsize=(10, 6), title="Most Popular Cuisines", autopct='%1.2f%%')
    plt.axis('equal')

Now, we are going to select the country as “India” and the cities that make up the NCR (New Delhi, Gurgaon, Noida, Faridabad).

    res_India = data1[data1.Country == 'India']
    NCR = ['New Delhi', 'Gurgaon', 'Noida', 'Faridabad']
    res_NCR = res_India[res_India.City.isin(NCR)]

One of the most basic metrics you can think of is the rating. However, using this metric has a few caveats. For one, it does not take into consideration the popularity of a restaurant. Therefore, a restaurant rated 4.9 by 10 voters will be considered ‘better’ than a restaurant rated 4.8 by 10,000 voters.

On a related note, this metric will also tend to favor restaurants with a small number of voters and skewed and/or extremely high ratings. As the number of voters increases, the rating of a restaurant stabilizes and approaches a value that reflects the restaurant’s quality. It is more difficult to discern the quality of a restaurant with extremely few voters.

Taking these shortcomings into consideration, it is necessary that you come up with a weighted rating that takes into account the average rating and the number of votes it has garnered.

Here, we will use a weighted rating formula as our metric/score. Mathematically, it is represented as:

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where *WR* is the weighted rating and:

- *v* is the number of votes for the restaurant;
- *m* is the minimum number of votes required to be listed in the chart;
- *R* is the average rating of the restaurant; and
- *C* is the mean rating across all restaurants in the dataset.
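
A small worked example may help show how the weighted rating treats vote counts. This is a sketch under assumed values: `m = 234` and `C = 2.4` are picked here only for illustration, and the two restaurants are hypothetical.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating: pulls R toward the global mean C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed values, for illustration only
m, C = 234, 2.4

few_votes = weighted_rating(v=10, R=4.9, m=m, C=C)        # high rating, few voters
many_votes = weighted_rating(v=10_000, R=4.8, m=m, C=C)   # slightly lower rating, many voters

print(f"4.9 from 10 voters     -> {few_votes:.2f}")
print(f"4.8 from 10,000 voters -> {many_votes:.2f}")
```

With these numbers the restaurant with only 10 voters lands at about 2.5, close to the mean *C*, while the one with 10,000 voters keeps a score close to its own rating. That is the behavior the plain rating lacks.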

You already have the values of *v* (`Votes`) and *R* (`Aggregate rating`) for each restaurant in the dataset. It is also possible to calculate *C* directly from this data.

What you need to determine is an appropriate value for *m*, the minimum number of votes required to be listed in the chart. There is no right value for *m*. You can view it as a preliminary negative filter that ignores restaurants with fewer than a certain number of votes. The selectivity of your filter is up to your discretion.

In this case, you will use the 90th percentile as your cutoff. In other words, for a restaurant to feature in the charts, it must have more votes than at least 90% of the restaurants in the list. (On the other hand, if you had chosen the 75th percentile, you would have considered the top 25% of the restaurants in terms of the number of votes garnered. As percentile decreases, the number of restaurants considered increases. Feel free to play with this value and observe the changes in your final chart).
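
To get a feel for how the percentile choice changes the size of the chart, here is a minimal sketch on invented vote counts (the numbers are hypothetical, not taken from the Zomato data):

```python
import pandas as pd

# Hypothetical vote counts for ten restaurants
votes = pd.Series([3, 8, 15, 40, 75, 120, 260, 500, 900, 2500])

qualified_counts = {}
for q in (0.90, 0.75, 0.50):
    m = votes.quantile(q)  # cutoff at this percentile
    qualified_counts[q] = int((votes >= m).sum())
    print(f"percentile={q:.2f}  m={m:.1f}  qualified={qualified_counts[q]}")
```

Lowering the percentile lowers the cutoff *m*, so more restaurants qualify for the chart.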

As a first step, let’s calculate the value of *C*, the mean rating across all restaurants:

    data_new_delphi = res_NCR[['Restaurant Name', 'Cuisines', 'Locality', 'Aggregate rating', 'Votes']]
    C = data_new_delphi['Aggregate rating'].mean()
    print(C)
    # 2.39583438526

The average rating of a restaurant in the **NCR region** is around 2.39, on a scale of 5.

Next, let’s calculate *m*, the number of votes received by a restaurant at the 90th percentile. The `pandas` library makes this task trivial with the `.quantile()` method of a pandas Series:

    m = data_new_delphi['Votes'].quantile(0.90)
    print(m)
    # 234.0

Next, you can filter the restaurants that qualify for the chart, based on their vote counts:

    # Filter out all qualified restaurants into a new DataFrame
    q_restaurant = data_new_delphi.loc[data_new_delphi['Votes'] >= m].copy()
    q_restaurant.shape
    # (795, 5)

You use the `.copy()` method to ensure that the new `q_restaurant` DataFrame is independent of the original `data_new_delphi` DataFrame. In other words, any changes made to `q_restaurant` do not affect the original data.
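
The effect of `.copy()` can be checked on a tiny example. The frame and the cutoff below are hypothetical stand-ins for the article's data:

```python
import pandas as pd

# Hypothetical stand-in for the restaurant DataFrame
original = pd.DataFrame({"Votes": [10, 300, 500]})
m = 234  # assumed cutoff, for illustration

# Filter, then copy, so that the result owns its own data
qualified = original.loc[original["Votes"] >= m].copy()
qualified["score"] = 0.0  # modifies only the copy

print(original.columns.tolist())
print(qualified.columns.tolist())
```

The new `score` column appears only on `qualified`; `original` is left untouched.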

You see that there are 795 restaurants which qualify to be in this list. Now, you need to calculate your metric for each qualified restaurant. To do this, you will define a function, `weighted_rating()`, and a new feature, `score`, whose value you'll calculate by applying this function to your DataFrame of qualified restaurants:

    # Function that computes the weighted rating of each restaurant
    def weighted_rating(x, m=m, C=C):
        v = x['Votes']
        R = x['Aggregate rating']
        # Calculating the score
        return (v / (v + m) * R) + (m / (m + v) * C)

    # Define a new feature 'score' and calculate its value with `weighted_rating()`
    q_restaurant['score'] = q_restaurant.apply(weighted_rating, axis=1)
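
As a side note, the same score can be computed with column arithmetic instead of a row-wise `apply`. The sketch below compares the two on a hypothetical three-row frame; the values of `m` and `C` are assumed stand-ins for the ones computed above.

```python
import pandas as pd

# Assumed stand-ins for the values computed in the article
m, C = 234.0, 2.4
q = pd.DataFrame({
    "Votes": [250, 1000, 5000],
    "Aggregate rating": [4.5, 4.0, 3.9],
})

def weighted_rating(x, m=m, C=C):
    v, R = x["Votes"], x["Aggregate rating"]
    return (v / (v + m) * R) + (m / (m + v) * C)

# Row-wise version, as in the article
row_wise = q.apply(weighted_rating, axis=1)

# Column-wise version: the same formula applied to whole columns at once
v, R = q["Votes"], q["Aggregate rating"]
vectorized = (v / (v + m)) * R + (m / (v + m)) * C

print(pd.concat([row_wise.rename("apply"), vectorized.rename("vectorized")], axis=1))
```

Both columns agree. The column-wise form usually runs faster on large frames because it avoids calling a Python function once per row.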

Finally, let’s sort the DataFrame based on the `score` feature and output the `Restaurant Name`, `Cuisines`, `Locality`, `Votes`, `Aggregate rating` and weighted rating (`score`) of the top 10 restaurants.

    # Sort restaurants based on the score calculated above
    q_restaurant = q_restaurant.sort_values('score', ascending=False)

    # Print the top 10 restaurants in Delhi NCR
    q_restaurant[['Restaurant Name', 'Cuisines', 'Locality', 'Votes', 'Aggregate rating', 'score']].head(10)

As you can see, the chart recommends the top-rated restaurants according to the weighted `score`, which combines `Aggregate rating` with the number of `Votes`.

# Content-Based Recommender

The recommender we just built suffers from a severe limitation: *it gives the same recommendations to everyone, regardless of the user’s personal preferences.*

In this section, we are going to build an engine that computes the similarity between restaurants based on certain parameters and suggests the restaurants that are most similar to a particular restaurant that the user liked (the user input).

Let's start building by loading all the required libraries and our original Zomato restaurants data.

    import numpy as np
    import pandas as pd
    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel
    from nltk.tokenize import word_tokenize
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load the dataset
    data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/recommend/zomato.csv', encoding='latin1')

Check the `City` column for NULL values.

    data['City'].value_counts(dropna=False)

Select one city. I’ll go ahead with New Delhi because it has the largest number of Zomato restaurants.

    data_city = data.loc[data['City'] == 'New Delhi']

Now get the `Restaurant Name`, `Cuisines`, `Locality` and `Aggregate rating` of every restaurant in Delhi.

    data_new_delphi = data_city[['Restaurant Name', 'Cuisines', 'Locality', 'Aggregate rating']]

Check the `Locality` column for NULL values.

    data_new_delphi['Locality'].value_counts(dropna=False).head(5)

Now select a locality in Delhi, let us say **“Connaught Place”** (you can choose any locality you like).

    data_new_delphi.loc[data_new_delphi['Locality'] == 'Connaught Place']

Our next step is to create a function that takes `Locality` and `Restaurant Name` as input parameters and returns the top 10 recommended restaurants.

In its current form, it is not possible to compute the similarity between the `Cuisines` of any two restaurants. To do this, you need to compute a word vector for each restaurant's `Cuisines` entry, which we will call a document.

You will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This will give you a matrix where each column represents a word in the cuisine vocabulary (all the words that appear in at least one document) and each row represents a restaurant.
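
The shape of such a matrix can be inspected on a handful of documents. This is a minimal sketch: the cuisine strings are hypothetical, with multi-word cuisines already squashed into single tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cuisine strings, one document per restaurant
docs = [
    "NorthIndian Chinese",
    "Pizza Italian FastFood",
    "Chinese Thai",
    "Pizza FastFood",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

print(matrix.shape)               # (number of restaurants, number of distinct words)
print(sorted(tfidf.vocabulary_))  # the learned vocabulary, lower-cased
```

Four documents with six distinct words give a 4 x 6 matrix: one row per restaurant and one column per word.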

TF-IDF is a statistical method for evaluating the significance of a word in a given document.

TF: term frequency (tf) refers to how many times a given term appears in a document.

IDF: inverse document frequency (idf) measures how informative the word is, i.e. whether it is common or rare across the entire collection of documents.

The TF-IDF intuition is that terms appearing in many documents are less informative than terms appearing in only a few, so a term scores highest when it is frequent in one document but rare across the rest.
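
The effect of the IDF part can be seen by printing the learned weights. In this sketch the documents are hypothetical: "Chinese" appears in every document, "Pizza" in two and the remaining cuisines in one each.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cuisine documents
docs = [
    "Chinese NorthIndian",
    "Chinese Thai",
    "Chinese Pizza",
    "Chinese Italian Pizza",
]

tfidf = TfidfVectorizer().fit(docs)

# Map each term to its learned inverse document frequency
idf = {term: tfidf.idf_[col] for term, col in tfidf.vocabulary_.items()}
for term, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:12s} idf={weight:.3f}")
```

The term that occurs everywhere gets the lowest weight, and the terms that occur in a single document get the highest.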

Fortunately, scikit-learn gives you a built-in `TfidfVectorizer` class that produces the TF-IDF matrix quite easily.

Now we have this matrix, we can easily compute a similarity score. There are several options to do this; such as the Euclidean, the Pearson, and the cosine similarity scores. Again, there is no right answer to which score is the best.

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two restaurants. You use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores). Mathematically, it is defined as follows:

$$\text{cosine}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$
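
The claim that cosine similarity is independent of magnitude can be checked with a few lines of NumPy. The vectors below are arbitrary illustrative values, and `cosine` is a hypothetical helper defined only for this sketch.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

print(cosine(a, b))
print(cosine(a, 10 * b))  # scaling b changes its length, not its direction

# Euclidean distance, by contrast, does change when b is scaled
print(np.linalg.norm(a - b), np.linalg.norm(a - 10 * b))
```

The first two values are identical, whereas the Euclidean distance grows with the scaling. This is one reason cosine similarity is a common choice for comparing text vectors.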

Since the TF-IDF vectorizer L2-normalizes each vector by default, calculating the dot product will directly give you the cosine similarity score. Therefore, you will use `sklearn`'s `linear_kernel()` instead of `cosine_similarity()` since it is faster.
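
The equivalence of the two functions on TF-IDF output can be verified directly. This is a minimal sketch on hypothetical cuisine strings, relying on the default `norm='l2'` setting of `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical cuisine documents
docs = ["Pizza Italian FastFood", "Pizza FastFood", "Chinese Thai"]

# Rows are L2-normalized by default (norm='l2')
X = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(X, X)      # plain dot products
cos = cosine_similarity(X, X)  # normalizes first, then takes dot products

print(np.allclose(dot, cos))
print(dot.round(2))
```

The two matrices match because every row already has unit length. Documents that share no words, such as the first and the third here, get a similarity of 0.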

Steps to follow:

1. Filter the data to the selected locality.
2. Reset the index so that row positions match the rows of the TF-IDF matrix and of the cosine similarity matrix.
3. Feature extraction.
4. Apply the TF-IDF vectorizer.
5. Compute the cosine similarity.
6. Pair each cosine score with the restaurant's aggregate rating in a list.
7. Sort the restaurant names based on the cosine similarity scores.

These steps are implemented in the following function:

    def restaurant_recommend_func(location, title):
        # Keep only the restaurants in the requested locality and reset the index
        # so that row positions line up with the rows of the TF-IDF matrix
        data_sample = data_new_delphi.loc[data_new_delphi['Locality'] == location].copy()
        data_sample.reset_index(drop=True, inplace=True)

        # Feature extraction: turn "North Indian, Chinese" into "NorthIndian Chinese"
        # so that every cuisine becomes a single token (NaN becomes an empty string)
        data_sample['Split'] = data_sample['Cuisines'].fillna('').apply(
            lambda s: ' '.join(c.replace(' ', '') for c in s.split(',')))

        # Apply the TF-IDF vectorizer, removing English stop words
        tfidf = TfidfVectorizer(stop_words='english')
        tfidf_matrix = tfidf.fit_transform(data_sample['Split'])

        # Compute the cosine similarity matrix
        cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

        # Reverse map from restaurant name to row position (first occurrence wins)
        indices = pd.Series(data_sample.index, index=data_sample['Restaurant Name'])
        indices = indices[~indices.index.duplicated(keep='first')]
        idx = indices[title]

        # Pair each non-zero cosine score with the restaurant's aggregate rating
        sim_scores = []
        for i, score in enumerate(cosine_sim[idx]):
            rating = data_sample['Aggregate rating'].iloc[i]
            if score != 0:
                sim_scores.append((i, score, rating))

        # Sort by similarity, then by rating, and keep the 10 most similar restaurants
        sim_scores = sorted(sim_scores, key=lambda x: (x[1], x[2]), reverse=True)[:10]
        rest_indices = [i[0] for i in sim_scores]

        data_x = data_sample[['Restaurant Name', 'Aggregate rating']].iloc[rest_indices].copy()
        data_x['Cosine Similarity'] = [round(s[1], 2) for s in sim_scores]
        return data_x

    # Top 10 restaurants with cuisines similar to 'Pizza Hut' in Connaught Place
    restaurant_recommend_func('Connaught Place', 'Pizza Hut')

As input, we provided `Connaught Place` as the `Locality` and `Pizza Hut` as the `Restaurant Name`.

As we can see, our engine returns the top 10 restaurants in Delhi that are most similar to `Pizza Hut`.

Now change only the restaurant name, let's say to `Barbeque Nation`.

    restaurant_recommend_func('Connaught Place', 'Barbeque Nation')

# Conclusion

In this article, we have learned how to make a Simple Recommender Engine and Content-based Recommendation Engine. In the next article, we will see how to build a collaborative-filtering Recommendation Engine using K-NN machine learning algorithm.

I hope you enjoyed reading this article. Let me know your views, suggestions or questions in the comment section.

You can also reach out to me on LinkedIn with any queries.

**How to build a Restaurant Recommendation Engine (Part-2)**

Thanks for reading !!!