Zomato Restaurant Rating Prediction

Pranay Sawant
12 min read · Oct 28, 2019


In this blog, we are going to learn the following:
- How to handle a dataset with missing values, and how to fill them in.
- How to deal with categorical variables in a regression model.
- How to perform one-hot encoding on categorical features.
- How to do feature engineering on categorical variables.
- How to do univariate analysis and NLP tasks using text data.

GitHub link: the source code can be found here.

1. Business Problem

1.1 Problem Description

Restaurants from all over the world can be found in Bengaluru. From the United States to Japan, Russia to Antarctica, you get all types of cuisine here. Delivery, dine-out, pubs, bars, drinks, buffets, desserts: you name it and Bengaluru has it. Bengaluru is a great place for foodies. The number of restaurants is increasing day by day and currently stands at approximately 12,000. Even with such a high number of restaurants, the industry hasn't been saturated yet, and new restaurants open every day. However, it has become difficult for them to compete with already established restaurants. The key issues that continue to pose a challenge include high real estate costs, rising food costs, a shortage of quality manpower, a fragmented supply chain and over-licensing. This Zomato data aims at analyzing the demography of each location. Most importantly, it can help new restaurants decide their theme, menu, cuisine, cost etc. for a particular location. It also aims at finding similarities between neighborhoods of Bengaluru on the basis of food.

- Does the demography of an area matter?
- Does the location of a particular type of restaurant depend on the people living in that area?
- Does the theme of a restaurant matter?
- Is a food-chain restaurant likely to have more customers than its standalone counterpart?
- Are any neighborhoods similar based on the type of food they offer?
- Is a particular neighborhood famous for its own kind of food?
- If two neighborhoods are similar, does that mean they are related, or that a particular group of people lives in both, or that these are simply places to eat?
- What kind of food is famous in each locality?
- Does an entire locality love veg food? If yes, is the locality populated by a particular community, e.g. Jain, Gujarati or Marwadi people, who are largely vegetarian?

1.2 Problem Statement

The dataset also contains reviews for each restaurant, which will help in finding an overall rating for the place. So we will try to predict the rating of a particular restaurant.

1.3 Real-world/Business Objectives

We need to predict the rating based on different features such as the average cost for two people, online order availability, food, menu list, most-liked dishes etc.

1.4 Machine Learning Formulation

Here we are supposed to predict the rating of a restaurant, so it is basically a regression problem.

1.5 Performance Metric

We will try to make the Mean Squared Error (MSE) as small as possible. So it is a regression problem minimizing MSE.
- The ideal MSE is 0.
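For reference, MSE is the mean of the squared differences between the true and predicted ratings: MSE = (1/n) Σ (y_i − ŷ_i)². The snippets below call a small mse helper; here is a minimal sketch of it, assuming scikit-learn is available:

from sklearn.metrics import mean_squared_error

def mse(y_true, y_pred):
    # Mean of the squared differences between actual and predicted ratings.
    return mean_squared_error(y_true, y_pred)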

2. Data Acquisition:

source: https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants

2.1 Data Understanding:

Data is available in .csv format. The dataset columns are as follows:

(Figure: columns of the zomato.csv file)

We have 51717 rows and 17 columns.

2.1.1. Data Preprocessing

  1. Remove Duplicate values
  2. Remove Null Values

We observed that 48.22% of the data in 'dish_liked' is missing. Similarly, in the 'Rate' column, 10.22% of the data is missing. If we directly throw out all NULL rows, we have to discard up to 48.22% of the original data. Can we somehow fill in the missing data? We have two approaches:

1. Fill the missing values with appropriate values, then operate.
2. Throw away all null values, then operate.

1. Filling the missing values with appropriate values, then operating.

The 'reviews_list' column contains rating values that are close to the 'Rate' column values, so we can fill missing values in 'Rate' from 'reviews_list'.

Similarly, we can fill missing values in 'dish_liked' from 'menu_list'. Keep in mind that we are replacing only the missing values, not the whole column. If both 'menu_list' and 'dish_liked' are missing, we throw away such rows.

After removing null values and filling missing values we are left with 36,832 rows.
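A minimal sketch of this filling logic in pandas, assuming the column names used in this post ('Rate', 'reviews_list', 'dish_liked', 'menu_list') and a hypothetical helper that averages the per-review scores embedded in 'reviews_list':

import re
import numpy as np

def rating_from_reviews(reviews):
    # Entries look like "[('Rated 4.0', 'RATED ...'), ...]";
    # average whatever per-review scores we can extract, else NaN.
    scores = re.findall(r"Rated\s+(\d+\.\d+)", str(reviews))
    return round(np.mean([float(s) for s in scores]), 1) if scores else np.nan

df = df.drop_duplicates()
df['Rate'] = df['Rate'].fillna(df['reviews_list'].apply(rating_from_reviews))
df['dish_liked'] = df['dish_liked'].fillna(df['menu_list'])
df = df.dropna(subset=['Rate', 'dish_liked'])  # rows where both sources were empty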

2.1.2. Data Visualization

  1. Explore the 'Rate' column and its distribution.
The mean value of the 'Rate' column is 3.7.

2. Finding the top 20 restaurants in Bangalore city.

Onesta, CCD and Kanti Sweets are among the top restaurants in Bangalore.

3. How many restaurants offer an online ordering service?

Almost 25k restaurants accept online orders; 11k do not.

4. How many restaurants have a table-booking service?

Almost 30k restaurants do not provide a table-booking facility.

5. Which area of Bangalore has the maximum number of restaurants?

  • Most of the restaurants are located in the BTM area, followed by Koramangala.
  • Note: the pie chart shows the top 10 locations.

Now that we know most of the restaurants are in BTM, let's find the exact number of restaurants in each specific area.

The BTM area has 3.1k restaurants.

6. What types of restaurant are there in Bangalore? Also their percentages and counts.

These are the top 10 types of restaurant in Bangalore city; the top three are:

  • Quick Bites
  • Casual Dining
  • Cafes

7. What is the average cost in restaurants? This is the average cost for two people.

The average cost across restaurants is almost 300–400, i.e. the average cost for two people is in the 300–400 range.

8. Which dishes are the most famous/favorite in restaurants?

Chicken is the favorite dish in Bangalore, followed by biryani.
This is a word cloud of favorite dishes: the more famous the dish, the larger its font, and vice versa.

9. Let's look at the 'Rate' vs. 'Restaurant type' graph.

10. Top 10 cuisines in Bangalore

North Indian food is the most famous, followed by Chinese food.

3. Model

So far we have spent a lot of time understanding and visualizing the data; the actual machine learning part starts here.

After deep-diving into the data, we can clearly say that 'online_order', 'book_table', 'votes', 'location', 'rest_type', 'cuisines' and 'average_cost' are the important columns; we can drop the rest. 'Rate' is the output column, as discussed earlier.

3.1 Split Data

Always remember: we should split the data first and only then apply featurization, to avoid data leakage. Divide the data into train and test parts.
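A minimal sketch of the split, assuming the cleaned dataframe df with 'Rate' as the target column:

from sklearn.model_selection import train_test_split

# Split before any featurization so that encodings are learned
# from the training set only (no leakage into the test set).
X = df.drop(columns=['Rate'])
y = df['Rate']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)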

3.2. Data featurization

We will treat 'online_order', 'book_table', 'location', 'rest_type' and 'cuisines' as categorical features, and featurize them with the one-hot encoding technique.

We will try two types of featurization:
1. One-hot encoding
2. Response coding (mean value replacement)
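A minimal sketch of the one-hot step with pandas get_dummies, assuming the train/test frames from the split above (the original notebook may use scikit-learn's OneHotEncoder instead):

import pandas as pd

categorical_cols = ['online_order', 'book_table', 'location', 'rest_type', 'cuisines']
X_train_ohe = pd.get_dummies(X_train, columns=categorical_cols)
X_test_ohe = pd.get_dummies(X_test, columns=categorical_cols)
# Align the test columns to the training vocabulary;
# categories unseen during training become all-zero columns.
X_test_ohe = X_test_ohe.reindex(columns=X_train_ohe.columns, fill_value=0)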

3.3 Build a Random Model (finding worst-case MSE)

import random
import numpy as np

# Random baseline: predict a uniformly random rating in [1.0, 5.0]
# for every test point.
rand_pred = np.zeros(y_test.shape[0])
for i in range(y_test.shape[0]):
    rand_pred[i] = round(random.uniform(1.0, 5.0), 2)

mse(y_test, rand_pred)

In the above code, we implemented a random model, which chooses values uniformly at random between 1.0 and 5.0. This random model gives an MSE of 2.12.

Now 2.12 is our threshold or guiding value: an MSE below 2.12 tells us the model has learned something. The ideal minimum MSE is 0, and if a model gives an MSE greater than 2.12 it is worse than the random model.

3.4 Apply different Models

First, we will apply Linear Regression.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)
y_pred_lr = lr.predict(X_test)

mse(y_test, y_pred_lr)

Then we apply the SGDRegressor model.

from sklearn import linear_model

sgdReg = linear_model.SGDRegressor()
sgdReg.fit(X_train,y_train)
y_pred_sgdr = sgdReg.predict(X_test)

mse(y_test, y_pred_sgdr)

Finally, we apply the Random Forest Regressor model.

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(X_train,y_train)
y_pred_rfr = rfr.predict(X_test)

mse(y_test, y_pred_rfr)

Linear Regression (MSE) = 0.1278
SGD Regressor (MSE) = 2.68e+30
Random Forest Regressor (MSE) = 0.03706

The above values were obtained without hyper-parameter tuning, so we will now tune the Random Forest Regressor, since it is the model that is actually learning something.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error

# Negated internally because GridSearchCV maximizes the score.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
tuned_parameters = {'n_estimators': [250, 500, 1000, 1200]}

grd_regressor = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=10,
                             n_jobs=-1, verbose=1, scoring=mse_scorer)
grd_regressor.fit(X_train, y_train)
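After fitting, the best setting and the tuned model can be read off the search object; a small usage sketch (the printed value is illustrative):

print(grd_regressor.best_params_)         # e.g. {'n_estimators': 1000}
best_rfr = grd_regressor.best_estimator_  # refit on the full training set
y_pred_tuned = best_rfr.predict(X_test)
mse(y_test, y_pred_tuned)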

Now we take the second approach: throwing away all rows with missing values.

Till now, we have considered one-hot encoding of the features below:

- rest_type
- location
- cuisines
- online_order
- book_table
Here we are also going to include the features below:

- dish_liked
- cuisines
Obviously, we then have to deal with a much larger feature dimensionality.

This time we will drop all null values. Last time we tried to save some rows by filling the missing values with related values; in this run, we discard all null values. Initially there are 51k rows; after removing NULLs we are left with around 23k. Frankly speaking, 23k points are still good enough to experiment with.

We will use the same One-Hot encoding featurization on 23k data points.

After applying the same models to the processed data we get the results below:

Linear Regression (MSE) = 0.04308
SGD Regressor (MSE) = 9.86e+28
Random Forest Regressor (MSE) = 0.01542

Again, after hyper-parameter tuning, the RFR gives MSE = 0.01410.

Figure: feature importance for the RFR model.

As discussed in section 3.2, we apply two featurizations. We have seen how the one-hot encoded features perform; now we will test the response-coded features.

3.5 Feature Engineering: Response Coding

Let's try response coding of the categorical variables for the regression model. Basically, we are going to replace the categorical features with response-coded features. In simple words, we consider each category of a categorical feature and compute the mean value of the 'Rate' column for that category.

E.g.
Consider the 'online_order' feature, which has two categories, 'Yes' and 'No'. We will do a small trick, explained below:
1. For the rows with category 'Yes' in 'online_order', take the mean of 'Rate'.
2. Similarly, for the rows with category 'No' in 'online_order', take the mean of the 'Rate' column.
3. We perform the above logic using group_by on the desired categorical column and simply take the mean of the 'Rate' column.
4. Then we create a new column that contains these mean values.
5. We call this MEAN VALUE REPLACEMENT.

def provide_response_coded_features(groupByVal, columnName, df):
    '''
    Convert a categorical feature into a response-coded feature
    via MEAN VALUE REPLACEMENT: each category is mapped to the
    mean 'Rate' of the rows belonging to that category.
    '''
    mean_df = df.groupby([groupByVal]).mean()
    mean_dict = mean_df['Rate'].to_dict()
    # key_dict is a global cache of per-feature category means,
    # kept so the same means can be reused on unseen data.
    key_dict.update([(groupByVal, mean_dict)])
    for k, v in mean_dict.items():
        mean_dict[k] = round(v, 2)
    df[columnName] = df[groupByVal].map(mean_dict)
    return df

'online_order' is a categorical variable with the categories 'Yes' and 'No'.

We calculated the mean of the 'Rate' column per category: first for the rows where the category is 'Yes', then for the rows where it is 'No'. For 'Yes' the mean value is 3.89 and for 'No' it is 3.93. We then create a new column 'mean_online_order' and place these values in it; observe that 3.89 appears wherever online_order = Yes, and vice versa.

Similarly, we can do this for the 'book_table' column. Here again we have two categories, Yes and No. The mean value for No is 3.81 and the mean value for Yes is 4.16.

Create a new column named 'mean_book_table' and place all the values there.
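A hedged usage sketch of the function above (train_df is a hypothetical name for the training split; the per-feature means should be computed on training data only to avoid leakage, and key_dict is the global cache the function updates):

key_dict = {}
train_df = provide_response_coded_features('online_order', 'mean_online_order', train_df)
train_df = provide_response_coded_features('book_table', 'mean_book_table', train_df)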

Similarly, we can do this for the 'rest_type', 'location', 'cuisines' and 'dish_liked' features.

A small question remains: how can we apply response coding to 'unseen future' data, i.e. categories not present at training time? It depends on the existing features and dataset; in our case, we can simply ignore such categories.

Again we apply the three models; the results are as follows:
Linear Regression (MSE) = 0.00948
SGD Regressor (MSE) = 3.39e+30
Random Forest Regressor (MSE) = 0.00318

This is brilliant: we reduced the MSE to about 0.003, the best among all the models we have tried so far.

4. Model Comparison

Five one-hot encoded features, with missing values filled with approximate values:
Linear Regression (MSE) = 0.1278
SGD Regressor (MSE) = 2.68e+30
Random Forest Regressor (MSE) = 0.03706

Seven one-hot encoded features (null values removed):
Linear Regression (MSE) = 0.04308
SGD Regressor (MSE) = 9.86e+28
Random Forest Regressor (MSE) = 0.01542

Response-coded features (null values removed):
Linear Regression (MSE) = 0.00948
SGD Regressor (MSE) = 3.39e+30
Random Forest Regressor (MSE) = 0.00318

We have built models using the categorical features, but we have so far ignored the most precious feature: 'reviews_list', the list of customer reviews about each restaurant. We can apply NLP to this text data. Just read the customer reviews below:

1. I would totally recommend to visit this place once, the place is nice and comfortable  food wise alla beautiful place to dine in the interiors take you back to the mughal era  the lightings are just perfect. 
2. We went there on the occasion of christmas and so they had only limited items available but the taste and service was not compromised at all the only complaint is that the breads could have been better would surely like to come here again.
3. I was here for dinner with my family on a weekday the restaurant was completely empty ambience is good with some good old hindi music seating arrangement are good too.
4. we ordered masala papad, panner and baby corn starters, lemon and corrionder soup, butter roti, olive and chilli paratha food was fresh and good, service is good too good for family hangout cheers.

5. NLP

So, after reading the above reviews, let's try NLP features.

First we do text preprocessing: clean the text with regular expressions, remove stopwords and apply stemming, then store the processed text in a dataframe.
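A minimal preprocessing sketch along those lines, assuming NLTK's English stopword list and Porter stemmer (the dataframe and column names follow this post):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))  # requires nltk.download('stopwords')
stemmer = PorterStemmer()

def preprocess(text):
    # Keep letters only, lowercase, drop stopwords, stem the rest.
    text = re.sub(r'[^a-zA-Z\s]', ' ', str(text).lower())
    return ' '.join(stemmer.stem(w) for w in text.split() if w not in stop_words)

df['processed_reviews'] = df['reviews_list'].apply(preprocess)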

We tried bag-of-words (CountVectorizer) featurization and fed it to a Random Forest Regressor:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestRegressor

count_vect = CountVectorizer(ngram_range=(1, 1), min_df=10)

# train data
X_train_bow = count_vect.fit_transform(x_tr_txt)

# cv and test data
x_cv_bow = count_vect.transform(x_cv_txt)
x_test_bow = count_vect.transform(x_test_txt)

rfr = RandomForestRegressor()
rfr.fit(X_train_bow, y_tr)
y_pred_rfr = rfr.predict(x_cv_bow)

mse(y_cv, y_pred_rfr)  # gives MSE 0.04505

Then we tried an LSTM model.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

# create the model
embedding_vector_length = 256
model = Sequential()
model.add(Embedding(top_words + 1, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(200))  # final hidden state: a vector of dimension 200
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
Figure: weight distributions of the top three LSTM layers.
Figure: LSTM MSE loss vs. epoch.

Then we tried a GRU.

from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense, Dropout

# create the model
embedding_vector_length = 256
model = Sequential()
model.add(Embedding(top_words + 1, embedding_vector_length, input_length=max_review_length))
model.add(GRU(200))  # final hidden state: a vector of dimension 200
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
Figure: GRU MSE loss vs. epochs.

6. All Features

Let’s try the last experiment.

We have all types of features: categorical features, numerical features and text features (NLP). We can combine them all and check the output.

  • Text data Features:

1. Review List

  • Categorical features :

1. Book Table
2. Online Order
3. Location

  • Numerical Features:

1. Votes
2. Average cost (average cost for two persons)
3. Word count (a feature-engineering trick: we counted the number of words in each 'reviews_list' entry.)

Plan of attack:

  • Step 1: For the text features, use the pre-trained Word2Vec model, introduce word embeddings, pass them through an LSTM layer, then flatten the output.
  • Step 2: Similarly, embed the categorical data and flatten it.
  • Step 3: Merge all numerical features and scale them.
  • Step 4: Concatenate all the features into one block.
  • Step 5: Pass this block through a neural network and observe the output. (A sketch of this architecture follows the figures below.)
Figure: model architecture.
Figure: all-features NN model performance, MSE vs. epoch.
Figure: all-features NN model weight distributions.
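A minimal sketch of such a multi-input network in Keras, under stated assumptions: max_review_length and top_words come from the earlier tokenization, while n_categorical (number of label-encoded categorical columns), n_categories (size of the category vocabulary) and the layer widths are illustrative; the pre-trained Word2Vec weights used in the actual experiment are omitted here:

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, Dropout, Flatten, Concatenate

# Text branch: embed review tokens, run an LSTM, flatten its sequence output.
text_in = Input(shape=(max_review_length,))
text_x = Embedding(top_words + 1, 256)(text_in)
text_x = Flatten()(LSTM(64, return_sequences=True)(text_x))

# Categorical branch: embed label-encoded categorical columns and flatten.
cat_in = Input(shape=(n_categorical,))
cat_x = Flatten()(Embedding(n_categories, 8)(cat_in))

# Numerical branch: votes, average cost, word count (already scaled).
num_in = Input(shape=(3,))

# Concatenate all branches into one block and regress the rating.
merged = Concatenate()([text_x, cat_x, num_in])
hidden = Dense(64, activation='relu')(merged)
hidden = Dropout(0.5)(hidden)
out = Dense(1, activation='linear')(hidden)

model = Model(inputs=[text_in, cat_in, num_in], outputs=out)
model.compile(loss='mean_squared_error', optimizer='adam')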

NLP feature (BoW) with Random Forest Regressor = 0.045 (MSE)
NLP feature with LSTM = 0.05 (MSE)
NLP feature with GRU = 0.044 (MSE)
All features with NN = 0.106 (MSE)

7. Summary

Great Learning..!!!

We collected the data from a CSV file. Nearly half of the values were missing; instead of throwing away every row with a NULL value, we tried to fill in estimated values using related columns.
First we designed a random model, one which randomly chooses values between 1.0 and 5.0; such a model gives an MSE of 2.12.

We first tried only 5 one-hot encoded features with different models; the Random Forest Regressor learned the most, so we tuned it using the grid-search technique, reaching a minimal MSE of 0.03485.
Then we tried 7 one-hot encoded features with the different models. Again the Random Forest Regressor won the race;
we achieved MSE = 0.01404.

Then we did some feature engineering and used response-coded features. This time Linear Regression performed better than before, but the Random Forest Regressor was still winning the race as usual; we achieved MSE = 0.00353.

Then we moved our focus to NLP features and tried simple BoW, LSTM, GRU and NN models. Single-handedly, the NLP feature reduced the MSE to a great extent. We then gave a last try to combining all the features and training one model, but that didn't give the expected result. Still, we have to keep experimenting with the data.

At the end of the day, the model below is the best among all the versions:

  • Random Forest Regressor with response-coded features ==> 0.00318 (MSE)

GitHub link: the source code can be found here.

You can check out a similar interesting blog here.

8. References:

  1. https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e
  2. https://www.appliedaicourse.com/
  3. https://www.kaggle.com/hindamosh/funny-banglore-restaurants-analysis

Please appreciate our work if you like it.
