# Zomato Restaurant Rating Prediction

In this blog, we will learn the following:

- How to handle a dataset with missing values, and how to fill them in.

- How to deal with categorical variables in a regression model.

- How to perform one-hot encoding on categorical variables.

- How to do feature engineering on categorical variables.

- How to do univariate analysis and NLP tasks using text data.

Github link. Source code can be found here

# 1. Business Problem

## 1.1 Problem Description

Restaurants from all over the world can be found in Bengaluru. From the United States to Japan, Russia to Antarctica, you get all types of cuisine here. Delivery, dine-out, pubs, bars, drinks, buffets, desserts: you name it and Bengaluru has it. Bengaluru is the best place for foodies. The number of restaurants is increasing day by day and currently stands at approximately 12,000. Even with such a high number of restaurants, the industry hasn’t been saturated yet, and new restaurants are opening every day. However, it has become difficult for them to compete with already established restaurants. The key issues that continue to pose a challenge to them include high real estate costs, rising food costs, a shortage of quality manpower, a fragmented supply chain and over-licensing. This Zomato data aims at analyzing the demography of each location. Most importantly, it will help new restaurants decide their theme, menu, cuisine, cost, etc. for a particular location. It also aims at finding similarities between neighborhoods of Bengaluru on the basis of food.

- Does the demography of an area matter?

- Does the location of a particular type of restaurant depend on the people living in that area?

- Does the theme of a restaurant matter?

- Is a food-chain restaurant likely to have more customers than its counterparts?

- Are any neighborhoods similar based on the type of food?

- Is a particular neighborhood famous for its own kind of food?

- If two neighborhoods are similar, does that mean they are related, that a particular group of people lives there, or that they are simply places to eat?

- What kind of food is famous in a locality?

- Does an entire locality love vegetarian food? If yes, is that locality populated by a particular set of people, e.g. Jain, Gujarati or Marwadi communities, who are mostly vegetarian?

## 1.2 Problem Statement

The dataset also contains reviews for each restaurant, which help in finding an overall rating for the place. We will therefore try to predict the rating of a particular restaurant.

## 1.3 Real-world/Business Objectives

We need to predict the rating based on features such as the average cost for two people, whether online ordering is available, foods, menu list, most liked dishes, etc.

## 1.4 Machine Learning Formulation

Here we are supposed to predict the rating of a restaurant, so it is basically a **Regression** problem.

## 1.5 Performance Metric

We will try to make the Mean Squared Error (**MSE**) as small as possible. So it is a **Regression** problem in which we minimize **MSE**.

- Ideal MSE is 0.
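As a quick illustration of the metric, here is a minimal sketch of how MSE is computed. The function name `mean_squared_error_manual` and the ratings are made up for this example; they are not taken from the dataset.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical ratings, for illustration only
y_true = [4.0, 3.5, 4.5]
y_pred = [4.0, 3.0, 4.0]

# Errors are 0, 0.5 and 0.5, so the MSE is (0 + 0.25 + 0.25) / 3
example_mse = mean_squared_error_manual(y_true, y_pred)

# A perfect prediction gives the ideal MSE of 0
perfect_mse = mean_squared_error_manual(y_true, y_true)
```

Because the errors are squared, a few large misses are penalized more heavily than many small ones.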

# 2. Data Acquisition

source: https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants

## 2.1 Data Understanding:

The data is available in .csv format. It has 51,717 rows and 17 columns.

## 2.1.1. Data Preprocessing

- Remove Duplicate values
- Remove Null Values

We observed that 48.22% of the ‘dish_liked’ values are missing. Similarly, 10.22% of the ‘Rate’ column is missing. If we simply throw away every row with a NULL value, we lose at least 48.22% of the original data. Can we somehow fill in the missing data? We can take two approaches:

1. Fill the missing values with appropriate values, then operate.

2. Throw away all null values, then operate.

**Approach 1: fill the missing values with appropriate values, then operate.**

In the ‘reviews_list’ column there are rating values that are close to the ‘Rate’ column value, so we can fill missing values in the ‘Rate’ column from ‘reviews_list’.

Similarly, we can fill missing values in the ‘dish_liked’ column from ‘menu_list’. Keep in mind that we replace only the missing values, not the whole column. If both ‘menu_list’ and ‘dish_liked’ are missing, we throw such rows away.

After removing null values and filling missing values, we are left with 36,832 rows.
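The filling step can be sketched with pandas as follows. This is only an illustration on toy rows. The helper `mean_review_rating` is hypothetical, and the `'Rated x.y'` pattern inside ‘reviews_list’ is an assumption about how the ratings are stored in that column.

```python
import re
import numpy as np
import pandas as pd

# Toy rows invented for illustration; column names follow the text above
df = pd.DataFrame({
    "Rate": [4.1, np.nan, np.nan],
    "reviews_list": ["[('Rated 4.0', 'Good food')]",
                     "[('Rated 3.0', 'Okay'), ('Rated 4.0', 'Nice')]",
                     "[]"],
    "dish_liked": ["Pasta, Pizza", None, None],
    "menu_list": ["Pasta, Pizza, Burger", "Dosa, Idli", None],
})

def mean_review_rating(reviews):
    """Average the 'Rated x.y' values found in a reviews string (NaN if none)."""
    ratings = [float(r) for r in re.findall(r"Rated (\d+(?:\.\d+)?)", str(reviews))]
    return float(np.mean(ratings)) if ratings else np.nan

# Replace only the missing entries; existing values are left untouched
df["Rate"] = df["Rate"].fillna(df["reviews_list"].apply(mean_review_rating))
df["dish_liked"] = df["dish_liked"].fillna(df["menu_list"])

# Drop rows where a value is still missing after filling
df = df.dropna(subset=["Rate", "dish_liked"]).reset_index(drop=True)
```

The first row keeps its original rating, the second row receives the mean of its review ratings, and the third row is dropped because nothing is available to fill it.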

## 2.1.2. Data Visualization

1. Explore the ‘Rate’ column: rate distribution.

2. Finding the top 20 restaurants in Bangalore city.

3. How many restaurants offer an online order service?

4. How many restaurants have a book-table service?

5. Which area of Bangalore has the most restaurants?

- Most of the restaurants are located in the BTM area, followed by Koramangala.
- Note: the pie chart shows the top 10 locations.

Now that we know most of the restaurants are in BTM, let's find the exact number of restaurants in each area.

6. What types of restaurants are there in Bangalore? Also percentages and counts.

The top 10 types of restaurants in Bangalore city include:

- Quick Bites
- Casual Dining
- Cafes

7. What is the average cost in restaurants? It is the average cost for two people.

The average cost in most restaurants is around 300–400, i.e. the average cost for two people is 300–400.

8. Which dishes are the most famous/favourite in restaurants?

9. Let's see the ‘Rate’ vs ‘Restaurant type’ graph.

10. Top 10 cuisines in Bangalore
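The counts behind charts such as the restaurants-per-area one can be obtained with `value_counts`. The sketch below uses a handful of invented rows, so the numbers are illustrative only.

```python
import pandas as pd

# Invented rows for illustration; the real dataset has one row per restaurant
df = pd.DataFrame({
    "location": ["BTM", "BTM", "Koramangala", "BTM", "Koramangala", "Indiranagar"]
})

# Number of restaurants per area, most frequent first
location_counts = df["location"].value_counts()

# Keep the 10 most common areas and express them as a percentage of all rows
top_locations = location_counts.head(10)
share = (top_locations / len(df) * 100).round(2)
```

The same pattern works for ‘rest_type’ or ‘cuisines’ by swapping the column name.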

# 3. Model

So far we have spent a lot of time understanding and visualizing the data; the actual machine learning part starts here.

After this deep dive, we can say that ‘online_order’, ‘book_table’, ‘vote’, ‘location’, ‘rest-type’, ‘cuisines’ and ‘average_cost’ are the important columns; we can drop the rest. ‘Rate’ is the output column, as discussed earlier.

## 3.1 Split Data

Always remember to split the data first and only then apply featurization, to avoid data leakage. Divide the data into train and test parts.
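A minimal sketch of the split with scikit-learn is shown below. The toy dataframe, the 80/20 ratio and the `random_state` are assumptions made for this example; the original post does not state which split was used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "online_order": rng.choice(["Yes", "No"], size=100),
    "votes": rng.integers(0, 5000, size=100),
    "Rate": rng.uniform(2.5, 4.9, size=100).round(1),
})

X = df.drop(columns=["Rate"])   # input features
y = df["Rate"]                  # output column

# Split first; encoders and scalers are fitted on X_train only afterwards
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Anything learned from the data, such as category lists or mean ratings, should then come from `X_train` and `y_train` alone.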

## 3.2. Data featurization

We will convert the ‘online_order’, ‘book_table’, ‘location’, ‘rest-type’ and ‘cuisines’ features into categorical features. Then we will use the **one-hot encoding** technique for featurization.

We will try two types of featurization:

1. One-Hot encoding

2. Response coding (Mean Value Replacement)

## 3.3 Build a Random Model (finding worst-case MSE)

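```python
import random
import numpy as np
# Assumption: `mse` is an alias for scikit-learn's mean_squared_error
from sklearn.metrics import mean_squared_error as mse

rand_pred = np.zeros(y_test.shape[0])
for i in range(y_test.shape[0]):
    # pick a random rating between 1.0 and 5.0, rounded to two decimals
    rand_pred[i] = round(random.uniform(1.0, 5.0), 2)

mse(y_test, rand_pred)
```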

In the code above, we implemented a random model, which picks a value uniformly at random between 1.0 and 5.0 for each restaurant. This random model gives an MSE of 2.12.

Now 2.12 is our threshold or guiding value: it is a baseline rather than a strict worst case. The ideal MSE is 0; the further a model's MSE falls below 2.12, the more it has learned, and a model with an MSE greater than 2.12 is worse than the random model.
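To put the 2.12 figure in context: for a guess U drawn uniformly from [1, 5] independently of the true rating y, the expected squared error is Var(U) + Var(y) + (E[y] - E[U])², where Var(U) = 16/12 ≈ 1.33 and E[U] = 3. The sketch below checks this on synthetic ratings and also computes a second baseline that always predicts the mean training rating. The rating distribution used here is an assumption for illustration, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ratings clustered around 3.8 (an assumption made for illustration)
y_train = np.clip(rng.normal(3.8, 0.4, size=2000), 1.0, 5.0)
y_test = np.clip(rng.normal(3.8, 0.4, size=2000), 1.0, 5.0)

# Baseline 1: uniform random guess between 1.0 and 5.0, as in the random model
random_pred = rng.uniform(1.0, 5.0, size=y_test.shape[0])
random_mse = float(np.mean((y_test - random_pred) ** 2))

# Baseline 2: always predict the mean rating seen in training
mean_pred = np.full(y_test.shape[0], y_train.mean())
mean_mse = float(np.mean((y_test - mean_pred) ** 2))

# Value predicted by the decomposition above
expected_random_mse = 16 / 12 + y_test.var() + (y_test.mean() - 3.0) ** 2
```

The mean predictor is a much tougher baseline than the random one, so it is a useful second reference point when judging a model.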

## 3.4 Apply different Models

We will first apply Linear Regression.

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse(y_test, y_pred_lr)  # mse: assumed alias for sklearn.metrics.mean_squared_error
```

Then we apply the SGDRegressor model.

```python
from sklearn import linear_model

sgdReg = linear_model.SGDRegressor()
sgdReg.fit(X_train, y_train)
y_pred_sgdr = sgdReg.predict(X_test)
mse(y_test, y_pred_sgdr)  # mse: assumed alias for sklearn.metrics.mean_squared_error
```

Finally, we apply the Random Forest Regressor model.

```python
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
y_pred_rfr = rfr.predict(X_test)
mse(y_test, y_pred_rfr)  # mse: assumed alias for sklearn.metrics.mean_squared_error
```

- Linear Regression (MSE) = 0.1278

- SGDRegressor (MSE) = 2.68e+30

- Random Forest Regressor (MSE) = 0.03706
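An MSE of the order of 1e+30 suggests that the SGD model diverged. A likely cause, though this is an assumption since the original notebook is not shown, is that large-valued numerical columns such as votes and cost were not scaled before training. The sketch below, on synthetic data, shows how `SGDRegressor` can be wrapped in a pipeline with `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features on very different scales (illustrative values only)
votes = rng.uniform(0, 5000, size=n)
cost = rng.uniform(100, 2000, size=n)
X = np.column_stack([votes, cost])
y = 3.0 + 0.0002 * votes + rng.normal(0, 0.1, size=n)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# The scaler is fitted on the training data only, inside the pipeline
sgd_scaled = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
sgd_scaled.fit(X_train, y_train)

scaled_mse = float(np.mean((y_test - sgd_scaled.predict(X_test)) ** 2))
```

With standardized inputs the gradient steps stay well behaved, so the error remains finite and comparable to the other models.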

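The values above were obtained without **hyper-parameter tuning**, so we will tune the Random Forest Regressor, because it is the model that is learning the most.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error

# Assumption: mse_scorer was defined like this (lower MSE is better)
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

tuned_parameters = {'n_estimators': [250, 500, 1000, 1200]}
grd_regressor = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=10,
                             n_jobs=-1, verbose=1, scoring=mse_scorer)
grd_regressor.fit(X_train, y_train)
```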

Now we take the 2nd approach: throw away all missing values.

So far, we have one-hot encoded the following features:

- rest_type

- location

- cuisines

- online_order

- book_table

Here we are going to include the following features as well:

- dish_liked

- cuisines

Obviously, we now have to deal with a large feature dimension.

This time we drop all null values. Last time we tried to save some rows by filling missing values with related values; in this run, we discard every row that contains a null. Initially there are about 51k rows; after removing nulls, about 23k remain. Frankly speaking, 23k points is still enough to experiment with.

We will use the same **One-Hot encoding** featurization on the 23k data points.

After applying the same models to the processed data, we get the following results:

- Linear Regression (MSE) = 0.04308

- SGDRegressor (MSE) = 9.86e+28

- Random Forest Regressor (MSE) = 0.01542

Again, after hyper-parameter tuning, the Random Forest Regressor gives MSE = 0.01410.

As discussed in section 3.2, we apply two featurizations. We have seen how the one-hot encoded features perform; now we will test response-coded features.

## 3.2.2 Feature Engineering

Let’s try **response coding** of the categorical variables in the regression model. Basically, we replace each categorical feature with a response-coded feature: for every category of a feature, we compute the mean value of the ‘Rate’ column.

E.g.

Consider the “online_order” feature, which has two categories, ‘Yes’ and ‘No’. We will use a small hack, explained below:

1. Take the rows where ‘online_order’ is ‘Yes’ and compute the mean of the ‘Rate’ column.

2. Similarly, take the rows where ‘online_order’ is ‘No’ and compute the mean of the ‘Rate’ column.

3. We implement the above logic with a **group_by** on the desired categorical column, simply taking the mean of the ‘Rate’ column.

4. Then we create a new column that contains the mean values.

5. We will call this **MEAN VALUE REPLACEMENT.**

```python
# Maps each categorical column to its {category: mean Rate} lookup
key_dict = {}

def provide_response_coded_features(groupByVal, columnName, df):
    '''
    This function is used to convert categorical features into response coded features.
    It simply performs MEAN VALUE REPLACEMENT.
    '''
    # Mean of 'Rate' for every category, rounded to two decimals
    mean_dict = df.groupby(groupByVal)['Rate'].mean().round(2).to_dict()
    # Remember the lookup so it can be reused later
    key_dict[groupByVal] = mean_dict
    # New column holding the mean rating of each row's category
    df[columnName] = df[groupByVal].map(mean_dict)
    return df
```

‘online_order’ is a categorical variable with the categories ‘Yes’ and ‘No’.

We calculate the mean of the ‘Rate’ column for each category. For ‘Yes’ the mean is 3.89 and for ‘No’ it is 3.93. We then create a new column, ‘mean_online_order’, holding these values: 3.89 wherever online_order is ‘Yes’ and 3.93 wherever it is ‘No’.

Similarly for the ‘book_table’ column, which again has two categories, Yes and No: the mean value for No is 3.81 and for Yes it is 4.16.

We create a new column named ‘mean_book_table’ and place the values there.

We can do the same for the following features:

- rest_type

- location

- cuisines

- dish_liked

One open question is how to apply a response-coded feature to unseen future data, where a category may not have appeared in training. The answer depends on the features and the dataset. In our case, we simply ignore such categories.
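The sketch below illustrates the unseen-category case on invented rows. The category means are learned from the training rows only and then mapped onto the test rows. Option 1 mirrors the "ignore" choice described above; option 2, falling back to the global training mean, is an alternative that is not part of the original approach.

```python
import pandas as pd

# Invented rows for illustration
train = pd.DataFrame({
    "location": ["BTM", "BTM", "Koramangala", "Koramangala"],
    "Rate": [3.5, 4.0, 4.0, 4.5],
})
test = pd.DataFrame({"location": ["BTM", "Whitefield"]})  # 'Whitefield' is unseen

# Learn the mean rating per category from the training data only
category_means = train.groupby("location")["Rate"].mean().round(2)
global_mean = round(train["Rate"].mean(), 2)

train["mean_location"] = train["location"].map(category_means)
test["mean_location"] = test["location"].map(category_means)  # unseen -> NaN

# Option 1 (as in the text): ignore rows whose category was never seen
test_ignored = test.dropna(subset=["mean_location"])

# Option 2 (alternative, not from the original): use the global training mean
test_filled = test.assign(mean_location=test["mean_location"].fillna(global_mean))
```

Computing the means on the training split alone also keeps the target values of the test rows out of the features.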

We again apply the three models; the results are as follows:

- Linear Regression (MSE) = 0.00948

- SGD Regression (MSE) = 3.39e+30

- Random Forest Regressor (MSE) = 0.00318

This is brilliant: we reduced the MSE to 0.003, which is the best among all the models we have tried so far.

# 4. Model Comparison

| Features | Linear Regression (MSE) | SGD Regressor (MSE) | Random Forest Regressor (MSE) |
| --- | --- | --- | --- |
| Five one-hot encoded features (missing values filled with approximate values) | 0.1278 | 2.68e+30 | 0.03706 |
| Seven one-hot encoded features (null values removed) | 0.04308 | 9.86e+28 | 0.01542 |
| Response-coded features (null values removed) | 0.00948 | 3.39e+30 | 0.00318 |

We have built models using categorical features, but we ignored the most precious feature: ‘reviews_list’. It is a list of customer reviews about the restaurant, and we can apply NLP to this text data. Just read the customer reviews below.

1. I would totally recommend to visit this place once, the place is nice and comfortable food wise alla beautiful place to dine in the interiors take you back to the mughal era the lightings are just perfect.

2. We went there on the occasion of christmas and so they had only limited items available but the taste and service was not compromised at all the only complaint is that the breads could have been better would surely like to come here again.

3. I was here for dinner with my family on a weekday the restaurant was completely empty ambience is good with some good old hindi music seating arrangement are good too.

4. we ordered masala papad, panner and baby corn starters, lemon and corrionder soup, butter roti, olive and chilli paratha food was fresh and good, service is good too good for family hangout cheers.

# 5. NLP

After reading the reviews above, let’s try NLP features.

First we preprocess the text: remove unwanted characters with regular expressions, remove stop words, and apply stemming. The processed text is then stored in the dataframe.
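The preprocessing steps can be sketched as below. This is a simplified stand-in, not the original code: the stop-word list is a tiny hand-picked sample, and `crude_stem` is a hypothetical suffix stripper used in place of a real stemmer (for example NLTK's Porter stemmer), so its output will differ from proper stemming.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller list
STOP_WORDS = {"i", "the", "is", "was", "a", "an", "and", "to", "with",
              "are", "too", "on", "my", "for", "here", "some"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase and replace everything except letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Drop stop words, then stem what remains
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(crude_stem(t) for t in tokens)

cleaned = preprocess("The seating arrangements are good, food was fresh!")
# cleaned == "seat arrangement good food fresh"
```

Applied to a dataframe, the same function could be used with something like `df["reviews_list"].apply(preprocess)`.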

We tried Bag-of-Words featurization with CountVectorizer and fed it to a Random Forest Regressor:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestRegressor

count_vect = CountVectorizer(ngram_range=(1, 1), min_df=10)  # in scikit-learn

# train data
X_train_bow = count_vect.fit_transform(x_tr_txt)

# test data
x_cv_bow = count_vect.transform(x_cv_txt)
x_test_bow = count_vect.transform(x_test_txt)

rfr = RandomForestRegressor()
rfr.fit(X_train_bow, y_tr)
y_pred_rfr = rfr.predict(x_cv_bow)
mse(y_cv, y_pred_rfr)  # which gives MSE 0.04505
```

Then we tried an LSTM model.

```python
# Assumption: the layers come from Keras; the original imports are not shown
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

# create the model
embedding_vector_length = 256
model = Sequential()
model.add(Embedding(top_words + 1, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(200))  # returns one 200-dimensional vector per review
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
```

After that, we tried a GRU model.

```python
# Assumption: the layers come from Keras; the original imports are not shown
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dropout, Dense

# create the model
embedding_vector_length = 256
model = Sequential()
model.add(Embedding(top_words + 1, embedding_vector_length, input_length=max_review_length))
model.add(GRU(200))  # returns one 200-dimensional vector per review
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
```

# 6. All Features

Let’s try the last experiment.

We have all types of features: **categorical features, numerical features and text features** (NLP). We can combine all the features and check the output.

- Text data Features:

1. Review List

- Categorical features :

1. Book Table

2. Online Order

3. Location

- Numerical Features:

1. Votes

2. Average cost (average cost for two persons)

3. Word count (a feature engineering hack: we count the number of words in each ‘review list’ entry.)
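The word-count feature can be derived with pandas string methods. The sketch below uses invented review text and assumes the reviews are stored as plain strings in a ‘reviews_list’ column.

```python
import pandas as pd

# Invented review text for illustration
df = pd.DataFrame({
    "reviews_list": ["food was fresh and good", "service is good too", ""]
})

# Split each review on whitespace and count the resulting words
df["word_count"] = df["reviews_list"].str.split().str.len()
```

An empty review yields a count of 0, so the feature is defined for every row.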

## Plan of attack:

- Step 1: For text features, we will use the pre-trained **Word2Vec** model, introduce a word embedding, pass it through the LSTM layer, then flatten the output.
- Step 2: Similarly, we will perform embedding on the categorical data and flatten it.
- Step 3: Merge all numerical features and scale them.
- Step 4: Concatenate all the features into one block.
- Step 5: Pass this block through a NN and look at the output.

- NLP feature, BoW + Random Forest Regressor = 0.045 (MSE)
- NLP feature, LSTM = 0.05 (MSE)
- NLP feature, GRU = 0.044 (MSE)
- All features, NN = 0.106 (MSE)

# 7. Summary

Great Learning..!!!

We loaded the data from a CSV file. Nearly half of the ‘dish_liked’ values were missing; instead of throwing away every row with a NULL value, we tried to fill in estimated values using related columns.

First we designed a **random model**, which randomly picks values from 1.0 to 5.0; such a model gives an **MSE of 2.12**.

We first tried 5 one-hot encoded features with different models. The Random Forest Regressor learned the most, so we tuned it using the **grid-search** technique, reaching a **minimal MSE** = 0.03485.

Then we tried 7 one-hot encoded features with different models. Again the Random Forest Regressor won the race.

We achieved **MSE = 0.01404**.

Then we did some **Feature Engineering** using response-coded features. This time Linear Regression performed better than before, but the Random Forest Regressor won the race as usual. We achieved **MSE = 0.00353**.

Then we moved our focus to NLP features and tried simple **BOW, LSTM, GRU, and NN** models. The NLP features on their own bring the MSE far below the random baseline. We then made a last attempt, combining all features to train one model, but that did not give the expected result. We have to keep experimenting with the data.

At the end of the day, the model below is the best among all the versions.

- Random Forest Regressor with response-coded features ==> 0.00318


