Clustering Population in London to Find a Suitable Location for Ethnic Restaurant
As a digital communication professional, data science is not new to me, but I have never systematically learned it, just picked up the necessary skills on the job. So one of my COVID-19 lockdown commitments was to look into data science and machine learning in a more structured way. Google must have guessed my intention, as I got a few targeted ads — which led me to IBM’s Data Science Program on Coursera. This post is not a review, it’s about my capstone project.
The idea was to find a suitable location for a traditional ethnic restaurant in London using exploratory data analysis and machine learning. It made sense to look into London’s Chinese population — if you live in the city you must have heard of Chinatown, possibly the worst place to open a new traditional restaurant, right? But then where else? Are there other parts of the city where the concentration of Chinese people* is high and there aren’t many popular Chinese restaurants?
*in the census data people are classified according to their own perceived ethnic group and cultural background
To get things started we need quite a few datasets to work with. Most of the files can be found in my repo, except the data files, which are quite big — check the notebook for the links:
- London’s census data (2011 is the latest) broken down to Middle Layer Super Output Areas (MSOAs) level
- Population weighted centroids for MSOAs
- Land area of MSOAs
- Shapefiles for MSOAs
- Foursquare to get the most popular venue types
Before jumping in…
If you are using any APIs, database connections, or anything else of that kind in your code, it’s a really good practice not to include the credentials in the files that you share, upload to GitHub, or to any other code sharing platform. Personally, I like to use dotenv for Python-related projects — it’s easy to set up: create a .env file in your root folder, store your credentials there, and then add the following to the beginning of your Jupyter Notebook:
%load_ext dotenv
%dotenv

# get your keys
client_id = %env CLIENT_ID
client_secret = %env CLIENT_SECRET
Initial exploratory data analysis
I won’t include here the data wrangling that I did for the combined dataframe of the census, centroids, and land areas datasets — please check the repo if you are interested. I ended up with the following table:
The MSOA name and total columns are in the table for reference only; we are not going to use those in our analysis. The Chinese population column contains the combined population of all Chinese-related ethnic groups, while the latitude and longitude columns are the population-weighted centroids of the MSOAs.
The idea is to find out which neighborhoods have a higher Chinese population, and then explore the most popular restaurant types in those areas. The Office of National Statistics works with different kinds of output areas: Middle Layer Super Output Areas (MSOAs) is a sensible choice as it provides the desired granularity to define our own ‘neighborhoods’. Using the MSOAs, folium’s choropleth map showing the Chinese population will look like the following.
Looks good, doesn’t it? Now we have a better understanding of London’s Chinese population. It seems that there are indeed some preferred neighborhoods, and the most popular part of the city is not even Chinatown but the Isle of Dogs!
For the next step, we could manually start exploring the regions where we see a higher population count but that wouldn't be too objective. Instead, we could cluster the MSOAs based on some attributes and only explore those that are a match.
Our business case is to open a traditional ethnic restaurant so we need to think about our restaurant’s catchment area and the target population that will generate sufficient income for the venue to thrive. As our restaurant’s target population is local Chinese people, I estimated (without knowing anything about the restaurant business) we need around 2000 Chinese residents in a 2km radius, which would translate into 12.57 sq km.
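The back-of-the-envelope area calculation behind those numbers, with both thresholds labelled as the assumptions they are:

```python
import math

RADIUS_KM = 2          # assumed catchment radius around the restaurant
MIN_RESIDENTS = 2000   # assumed viable number of local Chinese residents

# area of a circle with a 2 km radius
catchment_area_sqkm = math.pi * RADIUS_KM ** 2  # ≈ 12.57 sq km
```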
For the sake of simplicity let’s use k-means (not a perfect choice, as it works with Euclidean distance, which doesn’t take into account Earth’s curvature). We create a function that iterates over a range of cluster numbers to find the optimum cluster size and number.
and then to plot it…
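A minimal sketch of that iterate-and-plot step, assuming a dataframe with `latitude`, `longitude`, `chinese_population`, and `area_sqkm` columns (the function name, the thresholds, and the synthetic data below are all illustrative assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def count_viable_clusters(df, n_range, min_pop=2000, max_area=12.57):
    """For each candidate number of clusters, run k-means on the MSOA
    centroids and count the clusters passing the population/area filter."""
    counts = []
    for n in n_range:
        labels = KMeans(n_clusters=n, random_state=0, n_init=10).fit_predict(
            df[['latitude', 'longitude']])
        grouped = df.assign(cluster=labels).groupby('cluster').agg(
            pop=('chinese_population', 'sum'), area=('area_sqkm', 'sum'))
        viable = (grouped['pop'] >= min_pop) & (grouped['area'] <= max_area)
        counts.append(int(viable.sum()))
    return counts

# Synthetic MSOA-like data, just to exercise the function
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'latitude': rng.uniform(51.3, 51.7, 200),
    'longitude': rng.uniform(-0.5, 0.3, 200),
    'chinese_population': rng.integers(0, 1500, 200),
    'area_sqkm': rng.uniform(0.5, 5.0, 200),
})
n_range = range(10, 101, 10)
counts = count_viable_clusters(toy, n_range)

# plot viable-cluster count against the number of k-means clusters
plt.plot(list(n_range), counts)
plt.xlabel('number of k-means clusters')
plt.ylabel('clusters passing the population/area filter')
```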
The maximum number of filtered clusters that we can get is 17, which we can achieve with several pre-determined cluster numbers. After checking a few of them I decided to stick with n_clusters=92.
A quick detour…
In order to visualize the filtered clusters, we have to combine our selected MSOAs based on the cluster labels. Folium cannot do this out of the box, so I turned to another library, GeoPandas. It has a handy function to “dissolve our geometries within a given group together into a single geometric feature”. The trick here is to add back the cluster labels to the original geojson or geopandas file (by matching the MSOA property id) and then dissolve it by the cluster label.
# dissolve the MSOA geometries into a single polygon per cluster label
lon_geojson_dissolved = geopandas.GeoDataFrame(
    lon_geojson_cluster, geometry='geometry'
).dissolve(by='cluster', as_index=False)
To get the most popular venue types within the clusters (around the centroid), I queried Foursquare. The initial idea was to get the total number of restaurants broken down to venue type, but unfortunately, the API doesn’t allow that. The search endpoint limits the number of returned venues, while the explore endpoint only shows recommended locations. For our current problem, the latter works better although it’s not perfect.
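A sketch of the query against the (since-retired) v2 explore endpoint; the parameter-building helper, version date, and parsing function are my assumptions about how the call is wired up:

```python
FSQ_EXPLORE = 'https://api.foursquare.com/v2/venues/explore'

def explore_params(lat, lon, client_id, client_secret,
                   radius=2000, limit=100, version='20200601'):
    """Query parameters for the v2 explore endpoint (version date assumed)."""
    return {'client_id': client_id, 'client_secret': client_secret,
            'v': version, 'll': f'{lat},{lon}',
            'radius': radius, 'limit': limit}

def venue_categories(response_json):
    """Primary category name of each recommended venue in the response."""
    items = response_json['response']['groups'][0]['items']
    return [item['venue']['categories'][0]['name'] for item in items]

# Usage against the live API (needs credentials):
# import requests
# resp = requests.get(FSQ_EXPLORE,
#                     params=explore_params(51.49, -0.02,
#                                           client_id, client_secret)).json()
# categories = venue_categories(resp)
```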
I created a new dataframe to run the function on — it contains the cluster label, the latitude and longitude of the cluster’s centroid, and the total Chinese population. The function returns an array of arrays that can easily be transformed into a crosstab using pandas’ similarly named function. Basically, we get a frequency table:
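The crosstab step, as a minimal sketch with toy rows standing in for the Foursquare results:

```python
import pandas as pd

# Assumed shape of the collected rows: [cluster_label, venue_category]
rows = [[0, 'Chinese Restaurant'], [0, 'Pub'], [0, 'Pub'],
        [1, 'Coffee Shop'], [1, 'Chinese Restaurant']]
venues = pd.DataFrame(rows, columns=['cluster', 'category'])

# frequency table: one row per cluster, one column per venue category
freq = pd.crosstab(venues['cluster'], venues['category'])
```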
Reviewing the new dataframe, I noticed that there were several types of Chinese restaurants. Ideally, we should have one category, so I queried Foursquare’s categories endpoint for all sub-types — which I used to combine the relevant columns under a single label. Technically, we now have all the information to figure out which cluster has the highest Chinese population and the lowest number of popular Chinese restaurants.
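Folding the sub-types into one column might look like this; the sub-type names below are assumptions (in the real notebook they come from the categories endpoint):

```python
import pandas as pd

# Toy frequency table with two Chinese sub-types (column names assumed)
freq = pd.DataFrame({'Cantonese Restaurant': [1, 0],
                     'Szechuan Restaurant': [2, 1],
                     'Pub': [3, 2]})
chinese_subtypes = ['Cantonese Restaurant', 'Szechuan Restaurant']

# fold every Chinese sub-type into a single 'Chinese Restaurant' column
freq['Chinese Restaurant'] = freq[chinese_subtypes].sum(axis=1)
freq = freq.drop(columns=chinese_subtypes)
```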
…but I also wanted to know what the most popular restaurant types were, and see whether the clusters can be further grouped based on restaurant type popularity. First, I created a rank table, then using the original frequency table I ran another k-means with the following result.
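Those two steps, sketched with a toy frequency table standing in for the real one (the data and the choice of four clusters here are illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frequency table: rows are the geographic clusters, columns venue types
freq = pd.DataFrame({'Chinese Restaurant': [9, 0, 8, 1, 0],
                     'Pub': [2, 7, 3, 6, 5],
                     'Coffee Shop': [1, 5, 2, 4, 6]})

# rank venue types within each row (1 = most popular in that cluster)
ranks = freq.rank(axis=1, ascending=False)

# group the geographic clusters by their venue-frequency profiles
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(freq)
```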
It’s not perfect, but out of four clusters, one nicely groups together our ‘neighborhoods’ where Chinese restaurants are among the most popular venue types, while another contains only those that do not fit into the top ten.
(A quick way to improve the model is to recategorize some of the restaurant types, i.e. to use only the parent category and not the sub-category — as we did for the Chinese restaurants. In order to do that we would need to run a recursive function on Foursquare’s categories endpoint to fully map out the category hierarchy.)
But back to our clusters… I chose cluster no 1, where Chinese restaurants don’t make it to the top ten venue types. Throwing our data on a map gives us the following.
The red circles (i.e. their radius) mark the number of Chinese restaurants in the regions. We are looking for darker regions with smaller circles, like the one on the western end of the choropleth map. Looking up the centroid we get:
Earl’s Court — our winner!
I hope you enjoyed this exercise and that it gave you some inspiration on how to use population data, choropleth maps, and clustering algorithms. Please leave a comment if you have any observations or suggestions to improve the code.
Naturally, and I hate to have to write this down, the outcome of this exercise is by no means a recommendation to open a traditional Chinese restaurant around the identified location. Do your own research!
…but if you do open one and it becomes successful, please let me know.