Exploring New York City to Open an Indian Restaurant

Published in Analytics Vidhya, Jun 28, 2020 (9 min read)

This post covers the Capstone Project of the IBM Applied Data Science Capstone Course hosted on Coursera. The project draws on the knowledge gained from the previous courses in the Specialization and applies it to a real-world, data-driven problem.

1. INTRODUCTION

I am using a hypothetical scenario: an Indian entrepreneur who wants to open an Indian Restaurant in New York City (NYC). It might present a good opportunity for an Indian American who already lives in NYC and is well versed with its places and neighborhoods. As Indian cuisine is popular with Americans and Indian Americans alike, there are already many restaurants, most of which are franchises or family-owned businesses.

Why New York City?
The New York City region is home to the largest Indian American population among metropolitan areas by a significant margin, and it represents the second-largest metropolitan Asian national diaspora both outside of Asia and within the New York City metropolitan area (source: Wikipedia).

New York City is home to numerous Ethnic groups

Hospitality Industry

Ambience, menu, hygiene and, of course, taste are all important factors to keep in mind before getting into the Hospitality Industry, but these are problems that the person(s) in charge can tackle internally. The location of a restaurant is also of utmost importance, regardless of the history of the business or the taste of the food. If people don’t come in to eat, then none of the preparation matters. That is the problem I am tackling in this project.

2. PROBLEM STATEMENT

The objective is to find one or more suitable locations to open an Indian Restaurant in New York City, USA. This project uses Data Science and Machine Learning methods (k-means Clustering) to provide a solution to the client by answering the question: ‘Where should you consider opening an Indian Restaurant in New York City?’

3. DATA

I have used the following data for the project:

List of Boroughs and Neighborhoods in NYC: gives the coordinates of all the neighborhoods and is used to call the Foursquare API.
List of Places and Venues in NYC: contains data about all the nearby venues such as Restaurants, Bars, Gyms etc.
Demographics of Indian Americans in New York City: vital to understand the distribution of the target audience in NYC.
Latitude and Longitude Data of the neighborhood(s): used to plot and visualize our data.

The Data Sources are linked at the end of the post.

4. METHODOLOGY

A) Boroughs

As described in the Data section, our NYC data consists of Boroughs (administrative districts of the city) and the Neighborhoods within them. The data contains 5 Boroughs (Queens, Brooklyn, Bronx, Manhattan and Staten Island) and over 300 neighborhoods in total. Before we begin analyzing the Neighborhoods, we select an appropriate Borough, which involves looking into all 5 of them. The data is filtered for each Borough and used to make the calls to the Foursquare API.
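The per-Borough filtering can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the helper `neighborhoods_in` and the four sample rows (with approximate, rounded coordinates) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the neighborhood table; the real one has 300+ rows.
# Coordinates are rounded approximations used only for illustration.
neighborhoods = pd.DataFrame(
    {
        "Borough": ["Brooklyn", "Brooklyn", "Queens", "Manhattan"],
        "Neighborhood": ["Carroll Gardens", "Cobble Hill", "Astoria", "Chinatown"],
        "Latitude": [40.68, 40.69, 40.77, 40.72],
        "Longitude": [-73.99, -74.00, -73.92, -74.00],
    }
)


def neighborhoods_in(df, borough):
    """Return only the rows belonging to one borough, with a fresh index."""
    return df[df["Borough"] == borough].reset_index(drop=True)


# One DataFrame per borough, keyed by borough name.
by_borough = {
    name: neighborhoods_in(neighborhoods, name)
    for name in neighborhoods["Borough"].unique()
}

print(by_borough["Brooklyn"])
```

Keeping one frame per Borough makes it easy to loop over the neighborhoods of a single Borough when calling the API.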

B) Foursquare API

*The data returned by the API for Brooklyn*

The central part of this project involves using the Foursquare API to get details of nearby venues: the Category (Pizza Place, Monument etc), the coordinates of the place (Latitude and Longitude) and the Name of the Venue. We need to declare our Foursquare credentials, namely the Client ID and Client Secret. We use a radius value of 500 (metres), which returns venues within half a kilometer of each neighborhood. To prevent too many records being returned by a single call, a limit of 100 is set.

The URL is constructed with our declared credentials and a request is made to the API. The data returned is a JSON payload, and a pandas dataframe is constructed by reading parts of it. This is repeated for every Borough, so 5 data frames are made in total.
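A rough sketch of the URL construction and JSON parsing is shown below. It assumes the legacy Foursquare v2 `venues/explore` endpoint and the nested layout `response -> groups -> items -> venue`. The helper names, the output column names and the sample payload are hypothetical, and no network call is made here.

```python
import pandas as pd

# Assumed legacy endpoint; check the current Foursquare documentation before use.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20200601", radius=500, limit=100):
    """Build the request URL for one neighborhood (radius is in metres)."""
    return (
        f"{EXPLORE_URL}?client_id={client_id}&client_secret={client_secret}"
        f"&v={version}&ll={lat},{lng}&radius={radius}&limit={limit}"
    )


def venues_from_payload(neighborhood, payload):
    """Flatten the assumed JSON layout into one row per venue."""
    items = payload["response"]["groups"][0]["items"]
    rows = [
        {
            "Neighborhood": neighborhood,
            "Venue": item["venue"]["name"],
            "Venue Latitude": item["venue"]["location"]["lat"],
            "Venue Longitude": item["venue"]["location"]["lng"],
            "Venue Category": item["venue"]["categories"][0]["name"],
        }
        for item in items
    ]
    return pd.DataFrame(rows)


# Hand-written sample payload with made-up venues, standing in for a real response.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Sample Pizzeria",
                   "location": {"lat": 40.68, "lng": -73.99},
                   "categories": [{"name": "Pizza Place"}]}},
        {"venue": {"name": "Sample Curry House",
                   "location": {"lat": 40.681, "lng": -73.991},
                   "categories": [{"name": "Indian Restaurant"}]}},
    ]}]}
}

url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", 40.68, -73.99)
venues = venues_from_payload("Carroll Gardens", sample_payload)
print(venues)
```

In a real run, the payload would come from something like `requests.get(url).json()`, repeated for every neighborhood in the Borough and concatenated into one frame.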

Now that the data has been structured for preprocessing, we need to decide on a Borough for the analysis, so we look into 2 aspects:

Pre-existing Indian Restaurants
Demographics of the Indian American Population

C) Pre-existing Indian Restaurants

Since we wish to open a new Indian Restaurant, it helps to look into ones that are already present. So we get the count of Indian Restaurants (from the Venue Category) in each Borough and merge them together to get an idea of the distribution or concentration of them. Logically, to avoid competition it would make sense to select a Borough with few Indian Restaurants.

It can be seen that Manhattan and Queens have the most Indian Restaurants and Staten Island has the fewest.
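Counting and merging the per-Borough totals might look like the sketch below. The three tiny venue tables are hypothetical stand-ins for the frames built from the API, and `count_category` is a helper introduced only for this example.

```python
import pandas as pd

# Hypothetical venue tables, one per borough (real ones hold thousands of rows).
borough_venues = {
    "Brooklyn": pd.DataFrame(
        {"Venue Category": ["Indian Restaurant", "Pizza Place", "Bar"]}),
    "Queens": pd.DataFrame(
        {"Venue Category": ["Indian Restaurant", "Indian Restaurant", "Gym"]}),
    "Staten Island": pd.DataFrame(
        {"Venue Category": ["Pizza Place", "Bar"]}),
}


def count_category(venues, category="Indian Restaurant"):
    """Number of venues whose category matches exactly."""
    return int((venues["Venue Category"] == category).sum())


# Merge the per-borough counts into one table, largest count first.
restaurant_counts = (
    pd.DataFrame(
        [{"Borough": name, "Indian Restaurants": count_category(df)}
         for name, df in borough_venues.items()]
    )
    .sort_values("Indian Restaurants", ascending=False)
    .reset_index(drop=True)
)

print(restaurant_counts)
```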

D) Demographics of the Indian American Population

An Indian Restaurant would primarily cater to the Indian American population and Indian tourists, so we look into the Indian American population in NYC. This data was scraped from Wikipedia and comes from the 2014 American Community Survey (which gathers census-style data, including ethnicity). It helps us narrow down the location to where the target population lives.

The raw scraped data contains some Wikipedia formatting and unnecessary columns, so it has to be cleaned before it can be used.
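A cleaning step of this kind could be sketched as follows. The rows below are placeholders that only imitate the shape of a scraped table: the borough names, the numbers and the `Notes` column are invented for illustration and are not real figures. In practice the raw table could be loaded with `pandas.read_html`.

```python
import pandas as pd

# Placeholder rows imitating a scraped table; the values are NOT real figures.
raw = pd.DataFrame(
    {
        "Borough": ["Borough A[1]", "Borough B[2]"],
        "Indian Americans": ["1,234", "5,678[3]"],
        "Notes": ["n/a", "n/a"],  # stands in for an unneeded column
    }
)


def clean_population_table(df):
    """Drop unused columns, strip footnote markers, and parse the counts."""
    cleaned = df.drop(columns=["Notes"]).copy()
    # Remove Wikipedia-style footnote markers such as "[1]" from every column.
    for col in cleaned.columns:
        cleaned[col] = (
            cleaned[col].str.replace(r"\[\d+\]", "", regex=True).str.strip()
        )
    # Drop thousands separators and convert the counts to integers.
    cleaned["Indian Americans"] = (
        cleaned["Indian Americans"].str.replace(",", "", regex=False).astype(int)
    )
    return cleaned


population = clean_population_table(raw)
print(population)
```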

E) Initial Analysis

Although Queens has the largest Indian American population and the highest percentage of Indian Americans, we don’t consider it, as it already has numerous Indian Restaurants.
Manhattan has very few Indian Americans, who make up a low percentage of its population, and it also has the highest number of Indian Restaurants, so we eliminate it.
Brooklyn seems like a good first choice to begin our analysis, as it does not have too many restaurants and has a decent Indian American population.
Staten Island can be looked into next (a high density of Indian Americans with very few Indian Restaurants).

*Merged data table showing Population and Indian Restaurants*

F) Preprocessing

1. One-hot Encoding

As mentioned above, the data contains details of the nearby venues: Location, Category etc. This data needs to be transformed into a suitable format before Clustering. One-hot Encoding is first performed on the ‘Venue Category’ attribute using the pandas get_dummies() function. Encoding creates a separate 0/1 column for each category, so the categories are treated as nominal values and the model does not read any ordering, importance or weight into them.
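As a small illustration of this step, the sketch below applies `pd.get_dummies` to a hypothetical three-row venue table. The sample rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical venue rows; the real table has one row per venue returned by the API.
venues = pd.DataFrame(
    {
        "Neighborhood": ["Carroll Gardens", "Carroll Gardens", "Cobble Hill"],
        "Venue Category": ["Pizza Place", "Indian Restaurant", "Bar"],
    }
)

# One 0/1 column per category; dtype=int avoids boolean columns in newer pandas.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighborhood back as the first column so rows can be grouped later.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(onehot)
```

Each row now has exactly one 1 among the category columns, marking the category of that venue.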

2. Grouping the Categories

The new data frame is now grouped by Neighborhood and the mean of each Category column is taken. Since the columns hold only 0s and 1s, this mean is the share of a neighborhood’s venues that fall into each Category.

*Grouping the Categories by Neighborhood*

Once this is done, we keep only the Neighborhood and Indian Restaurant columns, as the other attributes are not of concern to us. This data frame is used to cluster the data points.
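The grouping and column selection can be sketched like this, starting from a hypothetical one-hot table with two neighborhoods and three categories.

```python
import pandas as pd

# Hypothetical one-hot encoded venues: one row per venue, one 0/1 column per category.
onehot = pd.DataFrame(
    {
        "Neighborhood": ["Carroll Gardens", "Carroll Gardens",
                         "Cobble Hill", "Cobble Hill"],
        "Bar": [0, 0, 1, 1],
        "Indian Restaurant": [0, 1, 0, 0],
        "Pizza Place": [1, 0, 0, 0],
    }
)

# Mean of each 0/1 column per neighborhood = share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()

# Keep only what the clustering needs.
indian = grouped[["Neighborhood", "Indian Restaurant"]]

print(indian)
```

Here half of the sample venues in Carroll Gardens are Indian Restaurants, so its value is 0.5, while Cobble Hill gets 0.0.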

G) Clustering

1. Selecting k-value

The ‘k’ stands for the number of clusters or, more specifically, the number of centroids. Its value in k-means Clustering is selected with the “Elbow Method”, which involves plotting the Cost against the k-value, where k is an integer > 1. The point where the curve bends (the “elbow”) is generally chosen as the k-value. I have used a k value of 3 for the Analysis: although there was a transition at k=2, the cost decreased further at k=3, and this gives us more diverse Clusters to examine.

The aim is to minimize the Cost, i.e. the Within-Cluster Sum of Squares, which the sklearn library reports as the inertia of a fitted k-means model.
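A minimal sketch of how the costs for the Elbow Method could be computed is given below. The frequency values are hypothetical, `elbow_costs` is a helper introduced for this example, and the plotting itself is left out.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical share of Indian restaurants per neighborhood (one feature).
freq = np.array(
    [0.0, 0.0, 0.0, 0.0, 0.02, 0.03, 0.03, 0.08, 0.09, 0.10]
).reshape(-1, 1)


def elbow_costs(X, k_values, seed=0):
    """Fit k-means for each k and record the within-cluster sum of squares."""
    costs = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        costs[k] = model.inertia_  # sklearn's name for the cost
    return costs


costs = elbow_costs(freq, range(2, 6))
for k, cost in costs.items():
    print(k, round(cost, 5))
```

Plotting `costs` against k (for example with matplotlib) shows where the curve flattens, and the k at that bend is the usual choice.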

Next, Clustering is performed with k = 3 and the Cluster Labels are saved.

2. Cluster Labels

The Cluster Labels are merged with the previous data frame containing only Indian Restaurants.

This dataframe is then joined with the Brooklyn Venues dataframe.
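The clustering, labelling and join could be sketched as follows. The neighborhood names `N1` to `N6`, the frequencies and the venue rows are all hypothetical, and joining on the `Neighborhood` column is an assumption about how the two frames line up.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical per-neighborhood frequencies of Indian restaurants.
indian = pd.DataFrame(
    {
        "Neighborhood": ["N1", "N2", "N3", "N4", "N5", "N6"],
        "Indian Restaurant": [0.0, 0.0, 0.04, 0.05, 0.10, 0.11],
    }
)

# Cluster on the single frequency column with k = 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(indian[["Indian Restaurant"]])

# Attach the labels to the frame that holds only Indian restaurant frequencies.
clustered = indian.copy()
clustered["Cluster Labels"] = kmeans.labels_

# Hypothetical venue table for the same borough.
venues = pd.DataFrame(
    {
        "Neighborhood": ["N1", "N3", "N5", "N5"],
        "Venue": ["Venue A", "Venue B", "Venue C", "Venue D"],
        "Venue Category": ["Bar", "Indian Restaurant", "Indian Restaurant", "Gym"],
    }
)

# Join on the neighborhood name so every venue carries its cluster label.
merged = clustered.merge(venues, on="Neighborhood", how="inner")
print(merged)
```

Note that k-means numbers its clusters arbitrarily, so which label ends up meaning "no Indian restaurants" has to be read off the data rather than assumed.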

H) Clusters

5. RESULTS

Based on the Clustering:

Cluster 2: has the highest number of Indian Restaurants and is therefore not considered.

Cluster 1: has a medium number of Indian Restaurants.

Cluster 0: is ideal, as no Indian Restaurants are present. Therefore we can look into the places in this Cluster.

*Most common Neighborhoods in the Cluster*

Looking at nearby venues, Cluster 0 seems to be a good location, as there are not a lot of Indian restaurants in these areas. There are around 60 neighborhoods in the Cluster, the most common ones being Carroll Gardens, South Side, North Side, Downtown and Cobble Hill in Brooklyn.

Therefore our Indian Restaurant can be opened in any of these neighborhoods with little to no competition. Nonetheless, if the food is affordable, authentic and has good taste, I am confident that it will have a great following everywhere.

6. DISCUSSION

Based on the analysis, Carroll Gardens, South Side, North Side, Downtown and Cobble Hill are some of the neighborhoods to consider for opening our restaurant. I also looked into Staten Island, as its Indian population is similar to Brooklyn’s. Since there are only 2 Indian restaurants there, competition is very low. Staten Island also has a high density of Indian Americans per sq mile, so foot traffic should not be a problem, but the lack of Indian Restaurants can also hint at other problems such as licensing or stringent community norms, which should be looked into before making a decision. Manhattan has the highest number of Indian Restaurants but the fewest Indian Americans, something that might be interesting to look into.

Some drawbacks of this analysis: the clustering is based only on data obtained from the Foursquare API, and the data about the Indian American population distribution comes from the 2014 American Community Survey, which is not up to date. Thus there is a considerable gap in the population distribution data. Even though there are many areas where the analysis can be improved, it has provided us with some good insights, preliminary information on possibilities and a head start on this business problem by laying the groundwork properly.

7. CONCLUSION

We have worked on a business problem the way a real data scientist would. We used Python libraries to fetch the data (json, requests etc), to manipulate the contents (pandas) and to analyze and visualize the datasets (matplotlib, Folium). We made use of the Foursquare API to explore the venues in the neighborhoods of New York, and we scraped demographic data from Wikipedia using the pandas library. We also applied a machine learning technique (Clustering) to group the neighborhoods and used Folium to visualize the result on a map.

The analysis can be improved further by using more recent data and more complex Machine Learning algorithms. This process can nevertheless serve as a baseline and be replicated for other cuisines, gyms, etc.

8. CITATIONS

[1] https://geo.nyu.edu/catalog/nyu_2451_34572

[2] https://foursquare.com/developers/apps

[3] https://en.wikipedia.org/wiki/Indians_in_the_New_York_City_metropolitan_region

The entire project can be found on my Github along with the Course content assignments and relevant data. Hope this was insightful or helpful to you!