Exploratory Data Analysis of Zomato Data in Python: Unveiling Insights for Restaurant Success
Find the project files, Jupyter notebook with code, and output on the GitHub repository.
Introduction
In the bustling world of online food delivery, platforms like Zomato have become integral to connecting hungry customers with a plethora of dining options. Beyond facilitating online orders, Zomato offers a wealth of information about restaurants, reservations, and more, making it a resource for both food enthusiasts and restaurant owners.
As someone who frequently used food delivery services in India, I was intrigued by the sight of numerous delivery personnel crowding around apartment complexes. This led me to ponder a series of questions: What would be the ideal city for opening a restaurant in India? Which cuisine is most popular? What food items are trending, and what price range attracts the most customers? Through extensive analysis, I aimed to uncover insights that could not only satisfy my curiosity but also offer valuable information for anyone considering entering the restaurant business.
Aims and Objectives
In this project, my aim is to provide comprehensive insights for aspiring restaurant owners in India. I will achieve this through:
- City Analysis: unveil the best city to open a restaurant
- Cuisine Analysis: discover the most-loved cuisines in the market
- Dining Type Analysis: understand the dining preferences of potential customers
- Best Selling Items Analysis: discover dishes with high customer appeal
- Price Range Analysis: determine the sweet spot for pricing dishes
- Competitor Analysis: gain insights into the local restaurant landscape
Steps of the Project
- Data Preprocessing: clean and organize the dataset, address missing values and refine data quality
- Exploratory Data Analysis (EDA) + Feature Engineering: explore the dataset through EDA to uncover patterns, trends, and insights. Simultaneously perform feature engineering to derive new variables relevant to the project objectives.
- Drawing Insights: extract meaningful insights to address the aims and objectives
Technology Used
- Python: serves as the primary programming language, providing versatility and efficiency in data analysis.
Dataset
I sought a restaurant dataset for analysis and chose Zomato due to its industry prominence in India. As Zomato doesn’t publicly release data, I searched Kaggle for publicly available datasets. Most of them did not fit my requirements, as they were either too small or had too few features. The “Zomato Restaurants Dataset for Metropolitan Areas” [1] from Kaggle fit my requirements perfectly. The dataset, comprising 123,657 rows and 12 columns, provides insights on food, dining, delivery, prices, and more across 13 Indian cities. The data description is given below.
- Restaurant Name: name of the restaurant
- Dining Rating: rating given by customers for the dining experience at the restaurant
- Delivery Rating: rating given by customers for the delivery service provided by the restaurant
- Dining Votes: number of votes or reviews received for the dining experience
- Delivery Votes: number of votes or reviews received for the delivery service
- Cuisine: type of cuisine or culinary style offered by the restaurant
- Place Name: name of the metropolitan area
- City: name of the metropolitan city where the restaurant is located
- Item Name: name of a specific dish or item offered by the restaurant
- Best Seller: indicates whether the item is a best-selling dish or not
- Votes: number of votes or reviews received for the specific item
- Prices: prices associated with each item offered by the restaurant
One limitation of this dataset is that it does not cover all of India; it is limited to 13 major cities. No ethical concerns are associated with using this dataset as it is publicly available on Kaggle.
Pre-requisites
Firstly, I imported all the necessary libraries. A description of each of them is given below.
# data manipulation
import pandas as pd
import numpy as np
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# others
from fuzzywuzzy import fuzz
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
- Pandas: a powerful data manipulation library in Python that provides data structures for efficiently storing large datasets and tools for working with them
- NumPy: a fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays
- Matplotlib: a comprehensive data visualization library in Python that produces static visualizations
- Seaborn: a data visualization library based on Matplotlib that provides a high-level interface for creating informative and attractive statistical graphics
- Plotly Express: a high-level interface for creating interactive and expressive visualizations. It allows for the creation of dynamic and engaging charts with ease
- Fuzzywuzzy: a library for string matching and comparison, offering functions like fuzzy string matching to measure the similarity between strings
Additionally, I installed the following package, as it is not part of the standard Python environment.
#install packages
!pip install fuzzywuzzy
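With fuzzywuzzy installed, a quick sanity check (the example below is illustrative and not part of the original notebook) shows the kind of 0–100 similarity score that the typo detection later in the project relies on.
# quick illustration of fuzzy string matching (example strings only)
from fuzzywuzzy import fuzz
print(fuzz.ratio("Santosh Dhaba", "New Santosh Dhaba"))  # high score: the names overlap heavily
print(fuzz.ratio("Santosh Dhaba", "Wangs Kitchen"))      # low score: unrelated names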
Read Data
I loaded the dataset and examined the first five rows along with the dataframe information to get a feel for the data.
# read data
df = pd.read_csv("/content/drive/MyDrive/Portfolio Projects/Zomato Restaurants/zomato_restaurants.csv")
# view top 5 rows
df.head()
# view dataframe info
df.info()
Data Preprocessing
Before the data can be used for analysis, a few preprocessing steps are required.
Change Column Names
In the data preprocessing phase, I began by examining the column names. I noticed variations in casing and spaces between words. To ensure consistency, I converted all column names to lowercase and used underscores to separate words.
# check column names
df.columns
# alter column names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
Additionally, I renamed the “votes” column to “item_votes” for uniformity with “dining_votes” and “delivery_votes.”
# rename "votes" column
df = df.rename(columns={'votes': 'item_votes'})
Check Null Values
Addressing null values, I observed that “dining_rating” and “delivery_rating” had missing values, likely due to users not providing information. I replaced these nulls with 0.
# check null values
df.isnull().sum()
# fill null values with 0
df["dining_rating"].fillna(0, inplace=True)
df["delivery_rating"].fillna(0, inplace=True)
Regarding the “best_seller” column, I identified various values such as “BESTSELLER,” “MUST TRY,” “SPICY,” “VEGAN”, etc. To streamline its purpose, I encoded “BESTSELLER” as 1 and assigned 0 to all other values.
# check value counts
df["best_seller"].value_counts()
# encode bestseller as 1, and others as 0
df['best_seller'] = np.where(df['best_seller'] == 'BESTSELLER', 1, 0)
# check value counts after encoding
df["best_seller"].value_counts()
# check null values after the changes
df.isnull().sum()
Check Duplicate Values
Next, to ensure data quality, I checked for and removed duplicate rows.
# duplicate rows
df[df.duplicated()].head()
# number of duplicate rows
df[df.duplicated()].shape[0]
# drop duplicates and check dataframe shape
df = df.drop_duplicates()
print(df.shape)
Check Irregularities in Columns
In addition, I conducted a check for irregularities in the columns. In the “restaurant_name” column, I observed that some restaurants had similar names with slight variations. To identify and verify potential typos, I utilized the fuzzywuzzy library, setting a similarity threshold greater than 85. While there were some results, such as “Santosh Dhaba” and “New Santosh Dhaba,” upon verification, they turned out to be distinct restaurants with different cuisines or branches at separate locations. Consequently, no corrective action was deemed necessary.
# check value counts
df["restaurant_name"].value_counts()
# check for typos in names
restaurant_names = df["restaurant_name"].unique().tolist() # list of unique restaurant names
# initialize a dictionary to store potential typos
potential_typos = {}
# compare each pair of names
for i in range(len(restaurant_names)):
    for j in range(i + 1, len(restaurant_names)):
        similarity = fuzz.ratio(restaurant_names[i].lower(), restaurant_names[j].lower())
        if similarity > 85:  # adjust the threshold
            potential_typos[(restaurant_names[i], restaurant_names[j])] = similarity
# print potential typos
print("Potential Typos:")
for names, similarity in potential_typos.items():
    print(f"Similarity: {similarity}% - Names: {names}")
# cross check restaurants with similar names
df[df["restaurant_name"].isin(['Santosh Dhaba', 'New Santosh Dhaba'])].iloc[[0, -1]]
Furthermore, in the “city” column, I identified a few place names like Banaswadi, Ulsoor, etc. Upon closer inspection, these were found to be places in Bangalore. To ensure uniformity, I standardized these names to “Bangalore.”
# check value counts
df["city"].value_counts()
# change city names
df['city'] = df['city'].str.replace('Banaswadi|Ulsoor|Magrath Road|Malleshwaram', 'Bangalore', regex=True)
# check value counts after changes
df["city"].value_counts()
Exploratory Data Analysis (EDA) + Feature Engineering
This section is dedicated to unravelling insights from the data through a series of visualizations to address the outlined objectives and provide a comprehensive understanding of factors influencing the restaurant industry’s dynamics.
City Analysis
I aimed to identify an ideal city for a new restaurant venture. A bar chart showcasing restaurant frequency revealed Hyderabad as the leader with over 120 restaurants, followed closely by Jaipur, Chennai, and Bangalore, each boasting 80–100 restaurants.
# group by unique restaurants in each city
restaurant_city_counts = df.groupby('restaurant_name')['city'].value_counts().sort_values(ascending=False).reset_index(name='frequency')
# set plot size
plt.figure(figsize=(8,4))
# plot countplot
sns.countplot(data=restaurant_city_counts, x='city', palette='viridis')
# add labels and title
plt.title('Frequency of Restaurants per City')
plt.xlabel('City')
plt.ylabel('Frequency of Restaurants')
# rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# save the plot
plt.savefig('1.1.png')
# show the plot
plt.show()
To delve deeper, I introduced a new metric, “total_votes,” the sum of dining, delivery, and item votes. This metric captures the total public engagement with a particular item at a restaurant.
# calculate total_votes from dining_votes, delivery_votes, item_votes
df["total_votes"] = df["dining_votes"] + df["delivery_votes"] + df["item_votes"]
When grouped by city, Chennai emerged as the frontrunner, surpassing Hyderabad and Mumbai.
# group by city
city_votes = df.groupby('city')['total_votes'].sum().sort_values(ascending=False).reset_index()
# set plot size
plt.figure(figsize=(8,4))
# plot barplot
sns.barplot(data=city_votes, x='city', y='total_votes', palette='viridis')
# add labels and title
plt.title('Total Votes by City')
plt.xlabel('City')
plt.ylabel('Total Votes')
# rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# save the plot
plt.savefig('1.2.png')
# show the plot
plt.show()
Chennai’s ability to achieve high engagement with comparatively fewer restaurants suggested its potential for new establishments. Hence, Chennai was chosen as the ideal city for a new restaurant venture.
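To make the “high engagement with fewer restaurants” reasoning explicit, a rough votes-per-restaurant ratio can be computed for each city. This is a supplementary sketch, not part of the original notebook, and it assumes the df and total_votes column defined above.
# supplementary sketch: average engagement per restaurant for each city
votes_per_city = df.groupby('city')['total_votes'].sum()
restaurants_per_city = df.groupby('city')['restaurant_name'].nunique()
engagement_per_restaurant = (votes_per_city / restaurants_per_city).sort_values(ascending=False)
print(engagement_per_restaurant)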
Cuisine Analysis
After narrowing down the data to Chennai city, I proceeded to group the data by restaurant_name and cuisine.
# filter data for Chennai city (use a copy so later column assignments don't raise SettingWithCopyWarning)
df_chennai = df[df['city'] == 'Chennai'].copy()
Subsequently, I created a plot depicting the frequency count of each cuisine.
# group by restaurant name and cuisine
chennai_cuisine_counts = df_chennai.groupby(['restaurant_name', 'cuisine']).size().reset_index(name='frequency')
cuisine_value_counts = chennai_cuisine_counts['cuisine'].value_counts()
# set plot size
plt.figure(figsize=(8, 4))
# plot barplot
sns.barplot(x=cuisine_value_counts.index, y=cuisine_value_counts.values, palette='viridis')
# add labels and title
plt.title('Cuisine Value Counts in Chennai')
plt.xlabel('Cuisine')
plt.ylabel('Frequency')
# rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# save the plot
plt.tight_layout()
plt.savefig('2.1.png')
# show the plot
plt.show()
The initial plot revealed 19 diverse cuisines, with “Beverages” leading. However, the data presented challenges: restaurants often offered more than one cuisine, and the cuisine types overlapped; for instance, one restaurant could be tagged as “Kerala Cuisine” while another category existed as “South Indian”. To address this, I manually examined each cuisine and displayed its associated food items. Based on the cuisine of the majority of a restaurant’s food items, tags were corrected and overlapping cuisines were merged to enhance clarity.
# display restaurant names for each cuisine
df_chennai[df_chennai['cuisine'] == 'Kebab'].groupby('restaurant_name').first()
# check the food item for each restaurant
df_chennai[(df_chennai['restaurant_name'] == "Ibrahim Biriyani")][["restaurant_name", "item_name"]]
# create a mapping for the incorrect restaurant's cuisine
restaurant_cuisine_mapping = {
'The Chef Company': 'South Indian',
'Basheer Bai Biriyani': 'Biryani',
'Five Star Chicken': 'Fast Food',
'A2B - Adyar Ananda Bhavan': 'South Indian',
'Potful - Claypot Biryanis': 'Biryani',
'Zam Zam Briyani': 'Biryani',
'Armaani Biryani': 'Biryani',
'Salem RR Biriyani Unavagam': 'Biryani',
'WOW! China': 'Chinese',
'A1 Ambur Briyani & Kababs': 'Biryani',
'Khalids Biriyani': 'Biryani',
'Little Hut': 'Biryani',
'Madurai Pandian Mess': 'South Indian',
'Ibrahim Biriyani': 'Biryani',
'Keifer': 'Beverages',
'Mathsya': 'South Indian',
'Sangeetha Veg Restaurant - Gandhi Irwin Road': 'South Indian',
'The Burger Cafe': 'Fast Food',
'Ambur Star Briyani': 'Biryani',
'BBQ Master': 'Chinese',
'Bismillah Briyani & Fast Foods': 'Biryani',
'Navrang Restaurant': 'North Indian',
'Nxtdoor': 'Chinese',
'Savoury Sea Shell': 'Chinese',
'Taj Mughalai': 'North Indian',
'Ambaram Fast Food': 'Fast Food',
'Little India': 'North Indian',
"AB's - Absolute Barbecues": 'BBQ',
"UBQ by Barbeque Nation": 'BBQ',
"Domino's Pizza": 'Fast Food',
'Oyalo Pizza': 'Fast Food',
'Little Italy': 'Fast Food',
"Shero Home Food - Chettinad": 'South Indian',
'The Old Mirchi Biryani': 'Biryani',
'Wangs Kitchen': 'Chinese',
'Riyas Biriyani': 'Biryani',
'Tower Burger': 'Fast Food',
"Writer's Cafe": 'Cafe',
"Toscano": 'Cafe',
'The High Joint': 'Fast Food',
"Thanjai Veeraswamy": 'South Indian',
"Samco": 'South Indian',
'Ponnusamy Hotel': 'Chinese',
'Meat And Eat': 'Fast Food',
'Liza Restaurant': 'North Indian',
'Kim Ling Chinese Restaurant': 'Chinese',
'KMS Hakkim Kalyana Biriyani': 'Biryani',
'Junior Kuppanna': 'South Indian',
'Hotel Sennthur': 'Fast Food',
'Hotel Paramount': 'South Indian',
'Grill A Delics': 'Chinese',
'Great Indian Khichdi by EatFit': 'North Indian',
'EatFit': 'North Indian',
'Delhi Dhaba': 'North Indian',
"Chidambaram New Moorthy Cafe": 'Cafe',
"Cafe Amin": 'Cafe',
'BurgerMan': 'Fast Food',
'Babal Da Punjabi Dabha': 'North Indian',
'Al Faham': 'Biryani',
'Al Baik': 'Fast Food',
'Anjappar': 'South Indian',
'Buhari Vintage Hotel': 'Biryani',
'Tibet Momo - World Class Pan Asian Cuisine': 'Fast Food',
'The Kati Roll Shop': 'Fast Food',
"Reena Paul's Kitchen": 'South Indian',
"Agarwal's Desi Khana Khazana": 'North Indian'
}
# apply the corrected cuisine to each restaurant
for restaurant_name, cuisine in restaurant_cuisine_mapping.items():
    df_chennai.loc[df_chennai['restaurant_name'] == restaurant_name, 'cuisine'] = cuisine
# plot the cuisine value counts again
# group by restaurant name and cuisine
chennai_cuisine_counts = df_chennai.groupby(['restaurant_name', 'cuisine']).size().reset_index(name='frequency')
cuisine_value_counts = chennai_cuisine_counts['cuisine'].value_counts()
# set plot size
plt.figure(figsize=(8, 4))
# plot barplot
sns.barplot(x=cuisine_value_counts.index, y=cuisine_value_counts.values, palette='viridis')
# add labels and title
plt.title('Cuisine Value Counts in Chennai')
plt.xlabel('Cuisine')
plt.ylabel('Frequency')
# rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# save the plot
plt.tight_layout()
plt.savefig('2.2.png')
# show the plot
plt.show()
The refined bar plot showcased 10 cuisines, with 23 fast food, 21 biryani, and 14 South Indian restaurants.
To assess each cuisine’s potential in Chennai, additional metrics were incorporated. A bar plot with “total_votes” as hue revealed Chinese cuisine, prevalent in only 8 restaurants, attracting the highest engagement, followed by Biryani and Fast Food.
# plot cuisine vs restaurant count with total_votes as hue
# group by cuisine and get count of restaurants
restaurant_count = df_chennai.groupby('cuisine')['restaurant_name'].nunique().reset_index()
# group by cuisine and get total votes
total_votes = df_chennai.groupby('cuisine')['total_votes'].sum().reset_index()
# merge the two dataframes on cuisine
merged_data = pd.merge(restaurant_count, total_votes, on='cuisine', how='inner')
# create the plot
fig = px.bar(merged_data, x='cuisine', y='restaurant_name', color='total_votes',
             labels={'cuisine': 'Cuisine', 'restaurant_name': 'Restaurant Count', 'total_votes': 'Total Votes'},
             title='Restaurant Count and Total Votes by Cuisine in Chennai',
             color_continuous_scale='RdYlGn')
# order in descending
fig.update_xaxes(categoryorder='total descending')
# set the plot size
fig.update_layout(width=800, height=500)
# show the plot
fig.show()
Subsequent visualization, incorporating “best_seller” items as hue, revealed that restaurants with the highest frequency also had the highest number of best-selling items. Chinese cuisine, despite having only 80 best-selling items, commanded significant engagement.
# plot cuisine vs restaurant count with best_seller as hue
# group by cuisine and get count of restaurants
restaurant_count = df_chennai.groupby('cuisine')['restaurant_name'].nunique().reset_index()
# group by cuisine and get best seller counts
best_seller_counts = df_chennai.groupby('cuisine')['best_seller'].sum().reset_index()
# merge the two dataframes on cuisine
merged_data = pd.merge(restaurant_count, best_seller_counts, on='cuisine', how='inner')
# create the plot
fig = px.bar(merged_data, x='cuisine', y='restaurant_name', color='best_seller',
             labels={'cuisine': 'Cuisine', 'restaurant_name': 'Restaurant Count', 'best_seller': 'Best Seller'},
             title='Restaurant Count and Best Seller Count by Cuisine in Chennai',
             color_continuous_scale='RdYlGn')
# order in descending
fig.update_xaxes(categoryorder='total descending')
# set the plot size
fig.update_layout(width=800, height=500)
# show the plot
fig.show()
Given the lower number of Chinese restaurants, there appears to be an opportunity in this category. Therefore, opening a Chinese restaurant in Chennai seems to be the most viable option.
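Mirroring the city-level check earlier, a quick engagement-per-restaurant ratio by cuisine (a supplementary sketch, not part of the original notebook) helps quantify this opportunity.
# supplementary sketch: average engagement per restaurant for each cuisine in Chennai
cuisine_votes = df_chennai.groupby('cuisine')['total_votes'].sum()
cuisine_restaurants = df_chennai.groupby('cuisine')['restaurant_name'].nunique()
print((cuisine_votes / cuisine_restaurants).sort_values(ascending=False))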
Dining Type Analysis
For the dining type analysis, I started by visualizing the relationships among ‘dining_rating’, ‘dining_votes’, ‘delivery_rating’, and ‘delivery_votes’ using a heatmap.
# plot heatmap between 'dining_rating', 'dining_votes', 'delivery_rating', 'delivery_votes'
# select the columns
correlation_data = df_chennai[['dining_rating', 'dining_votes', 'delivery_rating', 'delivery_votes']]
# calculate the correlation matrix
correlation_matrix = correlation_data.corr()
# set plot size
plt.figure(figsize=(4, 4))
# create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='Reds', fmt=".2f")
# add title
plt.title('Correlation Heatmap')
# save the plot
plt.tight_layout()
plt.savefig('3.1.png')
# show the plot
plt.show()
The observations suggested no clear trend, but a hypothesis emerged that customers tend to provide reviews for dining experiences while favoring ratings for delivery experiences. To validate this, I examined the frequency of zeros in each column, confirming that dining rating had more zeros than delivery rating, and vice versa for votes.
# count the number of zeros in each column
num_zeros_dining_rating = (df_chennai['dining_rating'] == 0).sum()
num_zeros_delivery_rating = (df_chennai['delivery_rating'] == 0).sum()
num_zeros_dining_votes = (df_chennai['dining_votes'] == 0).sum()
num_zeros_delivery_votes = (df_chennai['delivery_votes'] == 0).sum()
# print the results
print(f'Number of zeros in dining_rating: {num_zeros_dining_rating}')
print(f'Number of zeros in dining_votes: {num_zeros_dining_votes}')
print(f'Number of zeros in delivery_rating: {num_zeros_delivery_rating}')
print(f'Number of zeros in delivery_votes: {num_zeros_delivery_votes}')
Leveraging this insight, I created two new columns, ‘dining_preferred’ and ‘delivery_preferred,’ flagging a record as preferred when its rating was at least 4 or its votes exceeded the 0.75 quantile for that cuisine.
# create 2 new columns, dining_preferred & delivery_preferred: flagged when the rating is at least 4
# or the votes are above that cuisine's 0.75 quantile
# dining_preferred
threshold_dining_votes = df_chennai.groupby('cuisine')['dining_votes'].quantile(0.75)
df_chennai['dining_preferred'] = df_chennai.apply(
    lambda row: 1 if (row['dining_rating'] >= 4 or row['dining_votes'] >= threshold_dining_votes[row['cuisine']]) else 0,
    axis=1
)
# delivery_preferred
threshold_delivery_votes = df_chennai.groupby('cuisine')['delivery_votes'].quantile(0.75)
df_chennai['delivery_preferred'] = df_chennai.apply(
    lambda row: 1 if (row['delivery_rating'] >= 4 or row['delivery_votes'] >= threshold_delivery_votes[row['cuisine']]) else 0,
    axis=1
)
Subsequently, I introduced a new column, ‘restaurant_type,’ classifying establishments as dining, delivery, both (dining + delivery), or takeaway based on the preferences identified.
# create the restaurant_type column based on conditions
df_chennai['restaurant_type'] = 'Takeaway'
df_chennai.loc[(df_chennai['delivery_preferred'] == 1) & (df_chennai['dining_preferred'] == 0), 'restaurant_type'] = 'Delivery'
df_chennai.loc[(df_chennai['delivery_preferred'] == 0) & (df_chennai['dining_preferred'] == 1), 'restaurant_type'] = 'Dining'
df_chennai.loc[(df_chennai['delivery_preferred'] == 1) & (df_chennai['dining_preferred'] == 1), 'restaurant_type'] = 'Dining & Delivery'
Redundant columns were then removed for clarity, and the data was filtered to Chinese cuisine.
# drop redundant columns
df_chennai = df_chennai.drop(['dining_rating', 'delivery_rating', 'dining_votes', 'delivery_votes', 'dining_preferred', 'delivery_preferred'], axis=1)
df_chennai.head()
# filter data for chinese cuisine in chennai
df_chennai_chinese = df_chennai[df_chennai['cuisine'] == 'Chinese']
A pie chart was employed to showcase the distribution of restaurant types, revealing that delivery and dining + delivery held an equal share.
# visualize the distribution of restaurant_type
restaurant_type_counts = df_chennai_chinese.groupby('restaurant_type')['restaurant_name'].nunique()
# set plot size
plt.figure(figsize=(4, 4))
# plot pie chart
plt.pie(restaurant_type_counts, labels=restaurant_type_counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('viridis'))
# add title
plt.title('Restaurant Count by Type for Chinese Cuisine in Chennai')
# save the plot
plt.tight_layout()
plt.savefig('3.2.png')
# show the plot
plt.show()
Further analysis, incorporating total_votes and best_seller as hue in separate bar charts, highlighted that dining + delivery restaurants garnered the highest engagement and boasted the most best-selling items.
# plot restaurant type vs restaurant count with total_votes as hue
# group by restaurant type
chinese_restaurant_type_counts = df_chennai_chinese.groupby('restaurant_type')['restaurant_name'].nunique().reset_index()
# add a new column for total votes
chinese_restaurant_type_counts['total_votes'] = df_chennai_chinese.groupby('restaurant_type')['total_votes'].sum().values
# create bar plot
fig = px.bar(chinese_restaurant_type_counts, x='restaurant_type', y='restaurant_name',
             color='total_votes', color_continuous_scale='Viridis',
             labels={'restaurant_type': 'Restaurant Type', 'restaurant_name': 'Count', 'total_votes': 'Total Votes'},
             title='Restaurant Count by Type for Chinese Cuisine in Chennai with Total Votes as Hue')
# order in descending
fig.update_xaxes(categoryorder='total descending')
# set the plot size
fig.update_layout(width=800, height=500)
# show the plot
fig.show()
# plot restaurant type vs restaurant count with best_seller as hue
# group by restaurant type
chinese_restaurant_type_counts = df_chennai_chinese.groupby('restaurant_type')['restaurant_name'].nunique().reset_index()
# add a new column for best seller counts
chinese_restaurant_type_counts['best_seller'] = df_chennai_chinese.groupby('restaurant_type')['best_seller'].sum().values
# create bar plot
fig = px.bar(chinese_restaurant_type_counts, x='restaurant_type', y='restaurant_name',
             color='best_seller', color_continuous_scale='Viridis',
             labels={'restaurant_type': 'Restaurant Type', 'restaurant_name': 'Count', 'best_seller': 'Best Seller'},
             title='Restaurant Count by Type for Chinese Cuisine in Chennai with Best Seller as Hue')
# order in descending
fig.update_xaxes(categoryorder='total descending')
# set the plot size
fig.update_layout(width=800, height=500)
# show the plot
fig.show()
Consequently, for the chosen Chinese restaurant in Chennai, offering both dining and delivery options would be the optimal strategy.
Best Selling Items Analysis
I filtered the best-selling items, aggregated their votes, and presented the top 10 items in descending order of votes.
# identify the best selling items based on item_votes
# filter the best selling items
filtered_data = df_chennai_chinese[df_chennai_chinese['best_seller'] == 1]
# group data by item_name and calculate the sum of item_votes
grouped_data = filtered_data.groupby('item_name')['item_votes'].sum().reset_index()
# sort the data by item_votes in descending order
sorted_data = grouped_data.sort_values(by='item_votes', ascending=False)
# 10 best selling items
sorted_data[['item_name', 'item_votes']].head(10)
item_name item_votes
Chicken Fried Rice 549
Grilled Chicken 514
Chicken Fried Rice Combo 197
Tandoori Chicken 166
Dragon Chicken 144
Chicken Biryani with Chicken 65 Combo 140
BBQ Chicken 117
Chicken Schezwan Fried Rice 102
Chicken Malabar Biryani 86
Chilli Chicken 79
Notably, Chicken Fried Rice and Grilled Chicken emerged as the best-selling items with the highest engagement. The majority of items in the top 10 are chicken-based, underscoring their popularity among non-vegetarian customers. Additionally, a couple of biryani dishes made it to the top 10, suggesting that these restaurants likely specialize in both Chinese and biryani cuisine.
This list provides insights into the most popular items among the public, offering a clear understanding of the dishes that resonate the most with customers.
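As a small supplementary check (not part of the original notebook), the “mostly chicken-based” observation can be verified directly from the sorted data.
# supplementary check: how many of the top 10 best sellers are chicken-based?
top_10_items = sorted_data.head(10)
chicken_count = top_10_items['item_name'].str.contains('chicken', case=False).sum()
print(f"{chicken_count} of the top 10 best-selling items mention chicken")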
Price Range Analysis
I plotted box plots to analyse the ideal price range for Chinese cuisine items in Chennai, starting with all items.
# plot boxplot of prices of all items
fig = px.box(df_chennai_chinese, x='prices', title='Box Plot of Prices of all Items')
fig.update_layout(xaxis_title='Prices')
# set the plot size
fig.update_layout(width=800, height=500)
# show the plot
fig.show()
It revealed that the median price for all items is 260 rupees, with the majority falling within the range of 195 to 355 rupees. However, several outliers are present, indicating dishes with significantly higher prices, with one reaching as high as 2699 rupees.
When focusing solely on the best-selling dishes, the median price increases to 310 rupees.
# plot boxplot of prices of best-selling items
fig = px.box(df_chennai_chinese[df_chennai_chinese['best_seller'] == 1], x='prices', title='Box Plot of Prices for Best Sellers')
fig.update_layout(xaxis_title='Prices')
# set the plot size
fig.update_layout(width=800, height=500)
# show the plot
fig.show()
Most of these items are priced between 260 and 365 rupees, suggesting that despite the higher costs, customers are willing to pay a premium for these popular dishes.
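The quartile figures quoted above can also be read off numerically; the short supplementary sketch below (not part of the original notebook) computes them for all items and for best sellers side by side.
# supplementary sketch: price quartiles for all items vs best-selling items
all_item_prices = df_chennai_chinese['prices'].quantile([0.25, 0.5, 0.75])
best_seller_prices = df_chennai_chinese.loc[df_chennai_chinese['best_seller'] == 1, 'prices'].quantile([0.25, 0.5, 0.75])
print(pd.DataFrame({'all_items': all_item_prices, 'best_sellers': best_seller_prices}))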
Competitor Analysis
In the evaluation of Chinese restaurants in Chennai, “Savoury Sea Shell” stands out as a leader with 4151 votes, 14 best-selling items, and an average price of 342 rupees — falling within the range identified for best-selling items. “Kim Ling Chinese Restaurant” and “BBQ Master” closely follow with approximately 1400 votes each, offering a respectable 12 best-selling items. Their average prices of 293 and 215 rupees, respectively, align with the majority price range for popular items.
# group data by restaurant name
grouped_data = df_chennai_chinese.groupby('restaurant_name')
# calculate the number of best-selling items and average price for each restaurant
summary_data = grouped_data.agg(
total_votes=pd.NamedAgg(column='item_votes', aggfunc='sum'),
best_selling_items=pd.NamedAgg(column='best_seller', aggfunc='sum'),
average_price=pd.NamedAgg(column='prices', aggfunc='mean')
)
# sort the data in descending order based on total votes
summary_data = summary_data.sort_values(by='total_votes', ascending=False)
# show the results
summary_data[['total_votes', 'best_selling_items', 'average_price']]
Wangs Kitchen shares the distinction of having 14 best-selling items but stands out with only 76 votes, suggesting potential as a premium restaurant. However, its average price of 403 rupees is on the higher side.
In summary, each competitor provides valuable insights, offering potential learnings for optimizing engagement, best-selling items, and pricing strategies.
Conclusions
This project has successfully accomplished its aim of providing thorough insights for aspiring restaurant owners, addressing key facets crucial for business success.
The City Analysis revealed the best-suited locations for opening a restaurant, taking into account factors like engagement and competition. Chennai surfaced as the optimal location, showcasing substantial engagement potential even with fewer restaurants.
Cuisine Analysis delved into customer preferences, highlighting the most-loved cuisines in the market, guiding restaurant owners in strategic menu planning. Chinese cuisine emerged as the most promising culinary option, marked by significant engagement and best-selling items, indicating a lucrative market opportunity.
Dining Type Analysis offered valuable insights into customers’ inclination towards dining or delivery services. Versatility in service, offering both dining and delivery options, proved to be a key factor in maximizing engagement.
Best Selling Items Analysis identified dishes with high customer appeal, aiding menu curation. These items predominantly featured chicken-based dishes, guiding strategic menu decisions.
Price Range Analysis determined the sweet spot for pricing dishes: for Chinese cuisine in Chennai, this fell within the range of 260 to 365 rupees, striking a balance between customer acceptance and profitability.
Lastly, the Competitor Analysis provided a comprehensive understanding of the local restaurant landscape, offering insights and lessons from existing players. Savoury Sea Shell led the competition with high engagement, numerous best-selling items, and competitive pricing.
These collective findings equip prospective restaurant owners with a well-rounded understanding, enabling them to make informed decisions and navigate the intricacies of the restaurant business successfully.
Future Work
Improvements in the Current Project:
While the current project lays a foundation for restaurant analysis, several enhancements can elevate its accuracy and comprehensiveness.
- Data Authenticity: Consider using the Zomato API to directly source data, ensuring real-time and authenticated information instead of relying on a contributed dataset from Kaggle.
- Broader Dataset: Expand the dataset beyond the 13 major cities to provide a more comprehensive representation of the diverse restaurant landscape across India.
- Sales and Profit Analysis: While engagement is assessed through total votes, incorporating data on sales and profits for each food item or restaurant would offer a more accurate measure of success.
- Address Data Ambiguity: Mitigate data ambiguities, such as overlapping cuisine categories or miscellaneous information in the “best_seller” column, for more precise categorization and analysis.
- Holistic Analyses: Extend analyses to cover additional factors like customer demographics, seasonal trends, and external economic factors for a more comprehensive understanding of the restaurant business.
Recommendations for Future Work:
Looking forward, exploring additional avenues can enrich the depth and scope of restaurant data analysis, providing valuable insights for future endeavors.
- Sentiment Analysis: Extract and analyze review columns to incorporate sentiment analysis, providing insights into customer satisfaction and preferences.
- Real-time Data Feeds: Explore the integration of real-time data feeds to enable dynamic adjustments to menu offerings and pricing strategies, aligning with current market trends and customer preferences.
References / Resources Used:
[1] Zomato Restaurants Dataset for Metropolitan Areas, Kaggle.
Connect on LinkedIn to share suggestions, collaborate, or just drop a casual “hi.” Look forward to hearing from you!