New York City — The battle of the Neighbourhoods

Patrick Eidemüller
12 min read · Aug 8, 2020

IBM Data Science Capstone Project

Analyzing and visualizing the structure of New York City in relation to the requirements of a Clothing Store Investor

This picture shows a colourful New York City from above in the morning, with the Empire State Building in the center.
New York City — Empire State Building

Introduction

I used my semester break to acquire more data science skills by completing the IBM Data Science Professional Certificate course on Coursera. The last module is a capstone project and the highlight of the course: applying the learned skills to an individual real-life problem has the greatest learning effect. As you will see from the business problem part, I decided to add some more complexity to the standard course assignment.

Here I present the summary of my project and briefly explain the different methods. If you are interested in the extended code, check out my Jupyter notebook.

Business Problem

The project is based on a hypothetical business case. A Canadian Investor who recently made a fortune with an investment in a Clothing Store in Toronto wants to repeat his success in New York City.

  1. As his brand is exclusive and expensive, the location should be in one of the most crowded districts, with a high employment rate and above-average income. He does not want only tourists to buy in the store; he would also like to gain many regular customers.
  2. As his brand has a touch of Italian design, he prefers a location close to Italian restaurants: the store benefits from window shopping, and the chance is pretty high that people who go out for Italian food also have a liking for Italian fashion.
  3. Tourists and business travellers are well known for spending money generously. Being as close as possible to hotels is therefore highly important, because guests of the city's hotels are more likely to buy clothes nearby and bring in more walk-in customers.
  4. The store should be as close as possible to the city center or other tourist hotspots to benefit from walk-in customers, within approximately 20 minutes' walking distance of the center of the district, and if possible far away from other clothing stores.
  5. The Investor wishes to invest in a flat in New York City to be near the store. His criteria for his place of residence are: low crime rate, high community trust, and proximity to parks, theatres and art galleries.

The Investor first wants a macro overview of New York City. So we are exploring the community districts.

1. Business Problem Understanding

The task seems very clear: find the perfect district for an Italian brand clothing store, taking into account that the location should also match the investor's idea of the perfect place of residence, where he feels safe at the same time.

2. Analytical Approach

The core of the project will be the socio-economic data frame. To complement it, we build a venues data frame fetched from Foursquare and explore these venues. The final venues frame will contain the most common venues of each district, which we will get through one hot encoding. This data frame is the basis for the k-means algorithm, which clusters the districts by their features so that we can compare how similar the districts are.

features weighted matrix which quantifies the requirements of the investor with a weight from 0 to 1
features weighted matrix

For the best result, the analytical solution to the business problem is to quantify and evaluate the thoughts of the client so that his requirements are fulfilled completely. To evaluate his criteria, we will create a features weighted matrix that expresses the investor's desires in a scientific way. We will multiply it with the normalized final data frame and add an extra column with the weighted results, which gives us an indication of the best districts.

3. Data requirements and collection

To ensure the best location for the store I decided to add some more complexity to the standard course problem. As you can see from the criteria given by the investor we need some more data.

At the beginning of the project I found data from many different sources, but I decided to get the data mainly from cccnewyork.org, because the source of their data is the U.S. Census Bureau and the data was collected through the American Community Survey https://data.census.gov/. So we can be sure the data is up to date, consistent and reliable.

  • the socio-economic data will be obtained from various csv files from cccnewyork.org
  • the venues will be fetched from Foursquare through an API
  • the Geo-coordinates will be obtained with nominatim and geopy

4. Data understanding and preparing

First of all, we will build a clean socio-economic data frame with all the necessary information related to the business problem. To do that, we need to load all the files and drop all unnecessary columns and rows.

final socioeconomic data frame consisting of income, population, crime, trust, unemployment, latitude and longitude
final socio economic data frame

Getting latitudes and longitudes with geocoder

As mentioned before, the geopy geocoder is a great tool for getting latitudes and longitudes. If you have a large number of queries, you can use the rate limiter to fetch them successfully:

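from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Nominatim needs a user_agent; the name used here is only a placeholder
geolocator = Nominatim(user_agent="nyc_districts")
# wait at least one second between two requests
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

D_lat = []
D_long = []
for district in socio["Area"]:
    # go through the rate-limited wrapper, not geolocator.geocode directly
    location = geocode(district)
    if location:
        D_lat.append(location.latitude)
        D_long.append(location.longitude)
    else:
        # keep the lists as long as the data frame and report the miss
        D_lat.append(None)
        D_long.append(None)
        print(district)

# Append to df
socio["D_lat"] = D_lat
socio["D_long"] = D_long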

Mapping with Folium

Folium is a great package to make beautiful maps. We will use it for a general overview of the districts of New York City to get familiar with the structure of the City and for interactive choropleth maps.

import folium

map_NY_Nsimple = folium.Map(location=[40.730610, -73.935242], zoom_start=10)

# for each Community District add a marker to map
for lat, long, district in zip(socio['D_lat'], socio['D_long'], socio['Area']):
    label = '{}'.format(district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=15,
        popup=label,
        color='#3186cc',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_NY_Nsimple)

map_NY_Nsimple
simple map of the 59 community districts New York City

By adding choropleth layers for each column, the map gets more interactive and informative. Simply add a layer for each column, similar to the code below.

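# Map.choropleth() was removed from newer folium releases;
# folium.Choropleth takes the same arguments and is added to the map
income = folium.Choropleth(
    geo_data=nyc_geo,
    data=socio,
    columns=['boro_cd', 'Income'],
    key_on='feature.properties.boro_cd',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='income',
    smooth_factor=0,
    name="income",
    highlight=True,
).add_to(map_NY_N)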
choropleth heatmap of New York City with checkbox to visualize by clicking different columns
choropleth heatmap with checkbox for visualizing different columns

With the Layer Control you get the checkbox for selecting the different data.

folium.LayerControl(collapsed=False).add_to(map_NY_N)

Getting the venue data with Foursquare

With Foursquare we can get up to 100 venues for each district, which is great for a free service. We will fetch the data and create a venues data frame. The pandas built-in function get_dummies lets us easily apply one hot encoding to quantify the venues. After grouping the frame by district and calculating the mean values, we can compare the different districts directly. You can follow the venue exploration in detail in the Jupyter notebook. But what is one hot encoding again?

One hot encoding is a data processing step that is applied to categorical data to convert it into a binary vector representation for use in machine learning algorithms.

One hot encoding simply creates one column for every possible value and puts a 1 or 0 in the appropriate column.
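As a minimal sketch of this step, the snippet below one hot encodes a tiny, made-up venues frame and averages it per district. The district names, the column names "District" and "Venue Category" and the venues themselves are assumptions for illustration, not the data from the notebook.

```python
import pandas as pd

# Hypothetical venues frame: one row per venue returned for a district
venues = pd.DataFrame({
    "District": ["District A", "District A",
                 "District B", "District B", "District B", "District B"],
    "Venue Category": ["Hotel", "Italian Restaurant",
                       "Hotel", "Park", "Park", "Clothing Store"],
})

# One column per category, with a 1 where the venue has that category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "District", venues["District"])

# The mean per district is the share of each category among its venues
grouped = onehot.groupby("District").mean()
print(grouped)
```

Each row of `grouped` sums to 1, so districts with different numbers of fetched venues stay comparable.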

the picture shows a one hot encoded heatmap data frame with the mean values, which highlights the occurrence of the features
one hot encoded heatmap data frame with mean values

The most common venues

To compare the districts, we would like to create a table, using a function that gives us the most common venues of each district. We can use this function later to explore the different clusters by their venues.

Data Frame of the most common venues by districts
data frame of the most common venues by community districts
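A function like this could look roughly as follows. This is a sketch under assumptions: the function name `most_common_venues`, the column labels and the example frequencies are hypothetical, and the input is assumed to be a frame of mean venue frequencies indexed by district.

```python
import pandas as pd

def most_common_venues(grouped, num_top=3):
    """Return the num_top most frequent venue categories per district."""
    rows = {}
    for district, freqs in grouped.iterrows():
        # sort the categories of this district by their mean frequency
        top = freqs.sort_values(ascending=False).head(num_top)
        rows[district] = top.index.tolist()
    columns = [f"{i + 1}. Most Common Venue" for i in range(num_top)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)

# Hypothetical mean frequencies per district
grouped = pd.DataFrame(
    {"Hotel": [0.5, 0.1], "Park": [0.2, 0.6], "Italian Restaurant": [0.3, 0.3]},
    index=["District A", "District B"],
)
common = most_common_venues(grouped, num_top=2)
print(common)
```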

Heat-map of the target venues

In line with the requirements of the customer, we will have a closer look at the distribution of the Clothing Stores, Boutiques and Italian Restaurants in the City.

Coloured heat map which shows the Distribution of Clothing Stores in the Community Districts of New York City
Coloured heat map which shows the Distribution of Boutiques in the Community Districts of New York City
Coloured heat map which shows the Distribution of Italian Restaurants in the Community Districts of New York City
Distribution of Clothing Stores, Boutiques and Italian Restaurants in the Community Districts of New York City

5. Analysing and Modelling

You can find the detailed code here

This project relies mainly on data analysis through data exploration. We will only use a simple clustering algorithm; a machine learning model is not the main part. We will use k-means clustering, followed by more data exploration and visualisation, to expand our feeling for the data and our understanding of the city.

K-means is a method that aims to partition n data points into k clusters where each data point is assigned to the cluster with the nearest mean. The goal is to minimize the sum of all squared distances within a cluster.

The most common approach for finding the right number of clusters is the elbow method. We run the algorithm multiple times with different numbers of clusters and then plot the resulting scores.

elbow curve for determining the perfect number of clusters for k-means
The elbow method for determining number of clusters
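The elbow method can be sketched as follows, assuming scikit-learn's KMeans and its `inertia_` attribute (the sum of squared distances to the nearest cluster center) as the score. The synthetic data stands in for the district features and is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature matrix: three blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Run k-means for several values of k and record the inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k gives the elbow curve: the "elbow" is the point after which adding more clusters reduces the score only slightly.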

As you can see, the elbow method is sometimes not very conclusive. But there are numerous other methods to determine the best number of clusters. The second method I used is the Silhouette coefficient.

The Silhouette coefficient is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. For each point p, first find the average distance between p and all other points in the same cluster; this is a measure of cohesion (A). Then find the average distance between p and all points in the nearest other cluster; this is a measure of separation from the closest other cluster (B). The silhouette coefficient for p is defined as the difference between B and A (B-A) divided by the greater of the two (max(A,B)).

silhouette coefficient for determining the perfect number of clusters for k-means
The Silhouette coefficient method for determining number of clusters
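To make the definition concrete, here is a from-scratch sketch with NumPy that follows the description of A and B above. It assumes Euclidean distances and at least two clusters; the function name `silhouette_values` and the four example points are made up for illustration, and a library implementation may well be what the notebook uses.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette coefficient (B - A) / max(A, B) for every point."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # a cluster with a single point keeps the value 0
        a = dist[i, same].mean()  # cohesion (A)
        # separation (B): mean distance to the nearest other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

# Two tight, well separated clusters on a line
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
values = silhouette_values(X, labels)
score = values.mean()
print(values, score)
```

For the first point, A is 1 and B is 10.5, so its coefficient is 9.5 / 10.5. Values close to 1 indicate that a point sits well inside its own cluster.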

There are numerous quantitative methods for evaluating clustering results. Use them as tools, with a full understanding of their limitations: combining contrasting methods raises the quality of your choice. You also need to actually examine the results yourself. With this kind of human inspection, based on an understanding of what the data represents, what a cluster represents and what the clustering is intended to achieve, you will find a suitable number of clusters.

This is the clustered map of each Community District by the venue structure and similarity.

map of each Community District by the venue structure and similarity.

Analyzing the Investor requirements

where solving the Business Problem begins

The clustered map above is based on all venues we have fetched from Foursquare, including the irrelevant ones, but not on the socio-economic data. For the quality of the result it is important to deal only with relevant features, which have an impact on the decision of the Investor. From this part on, we will use the previously mentioned features weight matrix.

Initially we prepare and merge the data frames to include only the necessary columns.

data frame of all customer requirements
data frame of all customer requirements before feature scaling

For the next part Feature Scaling is very important.

Feature scaling is a technique to change the values of columns in the dataset to use a common scale, without losing information or distorting the differences in the ranges of the values. This can be achieved through Normalization and Standardization.

Normalization is a scaling technique which rescales the features so that the data falls in the range [0,1], which makes them comparable.

Standardization is a scaling technique which rescales the features so that they have the properties of a standard normal distribution, with mean μ=0 and standard deviation σ=1. Unlike normalization, the resulting values are not bounded to a fixed range.
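Both techniques can be written in a line of pandas each. The following is a small sketch with made-up numbers; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw features for three districts
df = pd.DataFrame(
    {"Income": [40000.0, 60000.0, 100000.0], "Crime": [10.0, 30.0, 20.0]},
    index=["District A", "District B", "District C"],
)

# Normalization (min-max scaling): every column ends up in [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): every column gets mean 0 and standard deviation 1
standardized = (df - df.mean()) / df.std(ddof=0)

print(normalized)
print(standardized)
```

Here `ddof=0` uses the population standard deviation, so the standardized columns have a standard deviation of exactly 1.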

So after the normalization and setting the index on the Area our data frame looks like the following:

data frame of all customer requirements after feature scaling

Now we can multiply the data frame by the features weight matrix and calculate the total score column. With some simple visualisation, the data frame looks pretty informative.

heat map data frame based on the feature scaled frame multiplied by the weighted matrix
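The weighting step might be sketched like this. The weights, column names and scaled values below are invented for illustration, and inverting "lower is better" columns such as the crime rate before weighting is an assumption of this sketch, not necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical normalized features, all in [0, 1]
scaled = pd.DataFrame(
    {"Income": [1.0, 0.5, 0.0], "Hotel": [0.0, 1.0, 0.2], "Crime": [0.2, 1.0, 0.0]},
    index=["District A", "District B", "District C"],
)

# Hypothetical investor weights between 0 and 1, one per feature
weights = pd.Series({"Income": 1.0, "Hotel": 0.8, "Crime": 0.5})

# Assumption: a low crime rate is desirable, so flip the column first
scaled["Crime"] = 1 - scaled["Crime"]

# Multiply every column by its weight and sum up the rows
weighted = scaled.mul(weights, axis=1)
weighted["Total Score"] = weighted.sum(axis=1)

ranking = weighted.sort_values("Total Score", ascending=False)
print(ranking)
```

In this toy example District A ranks first with a total score of 1.4, ahead of District B with 1.3.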

Applying the k-means method to this data frame (after dropping the total score column) does not give us a numerical ranking of the best districts, but it shows us which districts are similar with respect to the investor's requirements. We repeat the same process as before, finding the number of clusters with the two methods previously explained.

NYC map showing the clusters based on the scaled and weighted data frame

Red Cluster 0 is the medium-level cluster: the total mean of the features is mediocre. The mean total score is 1.39, but it is notable that it includes 5 of the top-scored districts, especially South Beach and Tottenville, which are located in Staten Island. There are also 3 high-ranked districts from Manhattan included. The rest of the cluster is moderate.

The purple Cluster 1 is the high-ranked cluster. It consists of only 2 districts, with a median total score of 1.97. The districts of this cluster, Battery Park and Midtown Business District, score with a high occurrence of hotels but a low population.

The blue Cluster 2 is the substandard group, with a median total score of 0.67. Except for population, trust and parks, the mean values are very low.

6. Evaluation

As you may see, a clustered map on its own is not a good foundation for finding the perfect location of the Store. The weighted heatmap, however, is great to work with, so we are going to explore this data frame further. A map of the total scores gives the customer much more information for his decision, and combining it with the choropleth map of the socio-economic data is a superb way to visualize the data frames interactively.

heat map data frame of the top 15 community districts

Visualize the total score

choropleth visualization of the community districts by the total score
choropleth map of the total scores of the community districts
choropleth heatmap of New York City with checkbox to visualize by clicking different columns
choropleth heatmap with checkbox for visualizing different columns

Bar plots of the top 15 districts

The following bar plots of the sorted top 15 results complement the maps and give a contrasting view of the data.

bar plots of each column of the top 15 districts
bar plot of top 15 community districts sorted by total score

7. Discussion of the Result

Our analysis shows that there are several promising districts for the store. South Beach, Upper West Side and Battery Park in particular are highly rated. The density of Clothing Stores was highest in St. George and that of Boutiques in Midtown Business District.

As you can see from the map, Cluster 0 (red) is the medium cluster for the requirements of the Customer. It is a pretty big cluster and includes some of the best-scored districts. The purple Cluster is mostly located in Manhattan and consists of only 2 high-ranked districts. The blue cluster should be ignored.

South Beach, located in Staten Island, gained the highest score. There is a high frequency of Italian restaurants, and the fact that it is a good place to live with a low Crime Rate compensates for the medium socio-economic data. Choosing this location could mean that the Store will profit from regular customers, but there won't be as many tourists and walk-in customers as in Manhattan.

Upper West Side scores with high income, population and parks, but there are no hotels directly in the district, which could lead to fewer tourist customers. On the other hand, Central Park, a tourist hotspot, is close by, which could compensate for the lack of hotels, and the district is probably more touristic than South Beach. A lot of wealthy people live here, so the store could also benefit from regular customers. It could be a great place to live if the customer prefers to live right in the city center.

Battery Park is a tourist hotspot in New York. Despite its low population, it is in the top 3 districts and got the highest income score. The few people who can afford to live in this part of Manhattan have a high income. Furthermore, there are lots of hotels located in and around the area, which promises a great mix of tourists and regular customers.

Tottenville, in fourth place, is also in Staten Island and has the lowest crime rate and the lowest unemployment rate of the top 15. The trust score is also one of the highest. Furthermore, it has a high overall score and is pretty similar to South Beach.

There is one main decision to make:

Manhattan or Staten Island

8. Conclusion

The purpose of this project was to identify the districts which best fit the diverse requirements of the customer. By evaluating and quantifying his ideas with the weighted matrix, it was possible to identify several districts which combine his requirements for the location of the store with his personal living wishes.

To find the perfect location, we now have to go deeper and analyse the top 10 to 15 districts in more detail. We could compare specific neighbourhoods and add more detailed data, like tourism frequency, to finally find the perfect neighbourhood or even the best street for the store.
