Identifying Commercial Centers Using Machine Learning
A quick walk-through from identifying the commercial center using ML to deploying it as a Streamlit app.
A commercial center, also known as a downtown, is an area with a high concentration of business, civic, and cultural activity. Knowing a city’s commercial centers is essential before starting any business, as it helps you identify customer needs and grow your business. To identify the commercial centers of a city, we cluster the city’s Points of Interest (POI) data, filtered to the relevant amenities. A Point of Interest is generally any place that a person finds useful, usually represented by a latitude and longitude along with some attributes, say the name of the place and the category it belongs to. In this article, we will use the POI data of a city to identify its commercial centers using Machine Learning.
Machine Learning (ML) also deals with clustering data points to find insights. Unsupervised machine learning algorithms are commonly used for this kind of geospatial analysis to identify commercial centers. Scikit-learn, a Python library for ML, contains clustering algorithms for such unsupervised learning problems.
We’ll use the Python libraries Overpy (to query data from OSM), Folium (to plot the map and clusters), and Scikit-learn (to implement the ML algorithms), plus a few basic libraries like NumPy and Pandas for our project.
Geographic Information Systems (GIS) provide the spatial data of any city; some popular GIS data providers include OpenStreetMap (OSM), Natural Earth Data, and OpenTopography.
The spatial data of a city can be queried from OpenStreetMap (OSM) using the Python package Overpy. Overpy is a Python wrapper around OSM’s Overpass API that fetches the POIs of the city. It returns a list of nodes with the node_id, latitude, longitude, and other details of each POI, along with the JSON tags of the particular node.
Python provides various packages for spatial data visualization; one such package is Folium. Folium is a Python wrapper for Leaflet.js that helps plot interactive geospatial maps. It also provides various base maps and lets you easily draw polygons over the map, which will help us show the clusters of commercial centers of any city. In this article, we will build a simple web application using Streamlit, an open-source Python app framework: a city name is taken from the user, its data is fetched from OpenStreetMap (OSM), outliers are removed during pre-processing, and the clusters are plotted on the map. Along with identifying the commercial centers, the app also clusters the top 5 amenities in the city.
To remove outliers we use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. For the clustering part, we tried various algorithms on the resultant dataframe, namely KMeans, KMeans++, K-Medoids, OPTICS, and DBSCAN, and the best one is used for clustering the coordinates.
You can find the source code of this project here: GitHub
Here is the step-by-step outline of the project:
- Fetch City details from OSM
- Remove outliers using DBSCAN
- Cluster using KMeans++
- Plot cluster in Folium Map
- Group the amenities and cluster the top 5 amenities
The project comprises 5 different modules:
app.py- comprises the Streamlit UI and the function calls to cluster_model.py that identify the commercial centers.
cluster_model.py- has the functions to get the city details, remove the outliers, form the clusters, and plot them on the map. It also contains functions to group the various amenities, cluster them, and plot them on the map.
config.py- contains the configuration values for a few variables
convex_hull.py- uses Jarvis’s algorithm for creating the convex hull and defines a function, apply_convex_hull(), that returns the coordinates of the convex hull polygon.
maplegend.py- adds a ‘legend’ to the folium map.
1. Fetch City details from OSM
To begin, let’s install overpy and streamlit:
pip install streamlit
pip install overpy
In app.py, let’s import streamlit and create a simple UI to get the city name from the user.
app.py
Now, in cluster_model.py, import overpy and define fetch_city_data(), which takes city_name as a parameter and uses an Overpass API query to fetch the city details. The query returns the city details as JSON and contains unnecessary nodes that don't contribute to the city's commercial centers, so it is necessary to remove blank nodes and convert the city details to a dataframe (for easier access). For this, let's define another function, say df_preprocess(), that takes the results of the API query as input.
cluster_model.py
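A sketch of what fetch_city_data() might look like; the exact Overpass query used by the project is an assumption, and build_overpass_query() is a helper introduced here purely for illustration:

```python
# cluster_model.py - sketch of fetch_city_data(); the query text is an assumption
def build_overpass_query(city_name):
    # Select the named area, then every amenity-tagged node inside it
    return f"""
    area[name="{city_name}"]->.searchArea;
    (
      node["amenity"](area.searchArea);
    );
    out body;
    """

def fetch_city_data(city_name):
    import overpy  # imported lazily so the query builder is usable offline
    api = overpy.Overpass()
    res = api.query(build_overpass_query(city_name))
    # df_preprocess() (defined in this same module) cleans the raw result
    return df_preprocess(res)
```

The query fetches every node tagged with an amenity inside the named area; the raw result still needs the cleanup that df_preprocess() performs.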
df_preprocess(), which takes res as a parameter, converts the JSON to a DataFrame and subsets only the necessary columns: ‘node_id’, ‘lat’, ‘lon’, ‘name’, and ‘amenity’. It also removes the unnecessary amenities that don't contribute to commercial centers and returns the resultant dataframe to fetch_city_data().
cluster_model.py
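A sketch of df_preprocess() along those lines; the exact blocklist of non-commercial amenities is an assumption made for illustration:

```python
# cluster_model.py - sketch of df_preprocess(); DROP_AMENITIES is an assumed blocklist
import pandas as pd

DROP_AMENITIES = {"bench", "waste_basket", "toilets"}  # assumed non-commercial tags

def df_preprocess(res):
    rows = []
    for node in res.nodes:  # overpy result nodes
        rows.append({
            "node_id": node.id,
            "lat": float(node.lat),
            "lon": float(node.lon),
            "name": node.tags.get("name"),
            "amenity": node.tags.get("amenity"),
        })
    df = pd.DataFrame(rows, columns=["node_id", "lat", "lon", "name", "amenity"])
    df = df.dropna(subset=["name", "amenity"])        # drop blank nodes
    df = df[~df["amenity"].isin(DROP_AMENITIES)]      # drop unhelpful amenities
    return df.reset_index(drop=True)
```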
Again in app.py, let’s make the data frame visible to the user.
app.py
When fetching data from OpenStreetMap with an Overpass query, the data for a city may not be found, so it's better to call fetch_city_data() inside a try block.
In the above code snippet, if a city name is provided by the user, fetch_city_data() gets called from cluster_model.py and finally returns a dataframe that gets displayed using st.dataframe.
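One way to structure that guard; load_city() is a helper introduced here for illustration, not the article’s exact code. In app.py you would pass fetch_city_data as the fetcher, show the result with st.dataframe() when it is not None, and call st.warning() otherwise:

```python
# app.py - sketch of guarding the Overpass call; load_city() is a hypothetical helper
def load_city(city_name, fetcher):
    """Call the (possibly failing) fetcher inside try/except.

    `fetcher` stands in for fetch_city_data from cluster_model.py.
    Returns a dataframe, or None when the city data cannot be fetched.
    """
    try:
        return fetcher(city_name)
    except Exception:
        return None
```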
2. Remove outliers using DBSCAN
We take the resultant dataframe df from the previous section, subset only the ‘lat’ and ‘lon’ fields, and apply DBSCAN to form clusters. This leaves us with the coordinate points (lat, lon) of the POIs that belong to clusters, removing the outliers, and it also gives us the number of clusters to form later when applying KMeans++ for efficient clustering.
Why DBSCAN for removal of outliers?
Density-Based Spatial Clustering of Applications with Noise, aka DBSCAN, is a density-based unsupervised machine learning clustering algorithm that is robust to outliers. DBSCAN uses a distance and a minimum number of points per cluster to classify a point as an outlier: it draws a circle of radius epsilon around each data point and classifies points as core points, border points, and noise points.
Core point- if a data point has at least ‘minPoints’ points within its epsilon radius, it is a core point.
Border point- if a data point has fewer than ‘minPoints’ points within its epsilon radius but lies within the epsilon radius of a core point, it is a border point.
Noise point- if there are no other data points within a point’s epsilon radius, it is treated as noise.
DBSCAN takes two important parameters: epsilon and minPoints.
Epsilon is defined as the radius of the circle to be created around the data point to check the density.
minPoints is defined as the minimum number of data points required inside the epsilon of that data point to be classified as a Core point.
Let’s define outlier_dbscan(), which takes a dataframe as a parameter, say data, subsets the ‘lat’ and ‘lon’ fields, converts them to a NumPy array, and stores them in the variable coords.
Next, we compute DBSCAN. The epsilon parameter is the maximum distance (0.5 km in this example) that points can be from each other to be considered a cluster. The min_samples parameter is the minimum cluster size (everything else gets classified as noise). We’ll set min_samples to 10 so that every data point is either assigned to a cluster of at least 10 points or classified as noise. We use the haversine metric and the ball tree algorithm to calculate great-circle distances between points: haversine computes the distance between two points on Earth from their latitude and longitude, while the ball tree, a metric tree structure, spatially partitions the data points for efficient neighbor lookups.
Notice that our epsilon and coordinates get converted to radians, because scikit-learn’s haversine metric needs radian units:
cluster_model.py
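A sketch of outlier_dbscan() along these lines; here epsilon and min_samples are inlined as default parameters rather than read from config.py:

```python
# cluster_model.py - sketch of outlier_dbscan(); epsilon/min_samples mirror the
# article's values but are inlined here instead of coming from config.py
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

KMS_PER_RADIAN = 6371.0088  # mean Earth radius, converts km to radians

def outlier_dbscan(data, epsilon_km=0.5, min_samples=10):
    x = data[["lat", "lon"]]
    coords = x.to_numpy()
    eps = epsilon_km / KMS_PER_RADIAN  # haversine works in radian units
    db = DBSCAN(eps=eps, min_samples=min_samples,
                algorithm="ball_tree", metric="haversine").fit(np.radians(coords))
    labels = pd.Series(db.labels_, index=x.index)
    s = labels != -1                    # label -1 marks DBSCAN noise points
    num_clusters = labels[s].nunique()  # clusters that survived outlier removal
    return [x[s.values], num_clusters]
```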
In the above code snippet, outlier_dbscan() returns a list with a dataframe containing only the coordinates that belong to clusters, subsetted from the dataframe x, i.e. x[s.values], and the number of clusters formed by DBSCAN, num_clusters.
If you notice the code, we imported something called config
and used it to pass the values for epsilon and min_samples.
For modularity, let’s place all our configurations in a separate file config.py.
config.py
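The configuration file likely holds values along the following lines; the variable names here are assumptions:

```python
# config.py - sketch; variable names are assumptions
EPSILON = 0.5       # DBSCAN search radius in kilometres
MIN_SAMPLES = 10    # minimum points per DBSCAN cluster
RANDOM_STATE = 42   # fixed seed for reproducible KMeans runs
```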
3. Cluster Using KMeans++
Now, let’s cluster the coordinates of the resultant dataframe from the previous section using KMeans++, a centroid initialization technique for KMeans. We do this because DBSCAN clusters can be of any shape, so we cannot plot them as polygons on the map, whereas KMeans++ generates convex clusters that can be plotted as polygons on the folium map, distinctly showing the commercial centers of the city. KMeans++ is generally used through scikit-learn’s KMeans algorithm by setting its initialization parameter to ‘k-means++’.
What is KMeans?
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. It partitions the data points into ‘k’ clusters by selecting ‘k’ random centroids and assigning each data point to the closest centroid. It then recomputes the centroids so that the points within a cluster have minimum distance to their centroid. The number of clusters to be formed is explicitly defined. Scikit-learn offers two initialization schemes: “random” and “k-means++”, where k-means++ generally shows better results than random.
Why KMeans++?
The main drawback of the K-Means algorithm is that it depends on the initialization of the centroids. For example, if a centroid is initialized at a “far away” point, it may well end up with no data points associated with it, while more than one cluster may end up associated with a single centroid. Likewise, more than one centroid may be initialized within the same group, resulting in poor clustering. To overcome this centroid initialization problem, we use KMeans++.
K-Means++, a centroid initialization technique for KMeans clustering, initializes the centroids distant from each other and generally shows better results than random initialization. Here, the centroids are initialized before the KMeans algorithm is applied to the data.
We also tested 10 random cities, taking the dataframe resulting from DBSCAN outlier removal and applying KMeans with random initialization, KMeans++, K-Medoids, OPTICS, and DBSCAN. To measure the goodness of each clustering technique, we use the Silhouette coefficient, a metric whose value ranges between -1 and 1. The following are the Silhouette coefficients for the algorithms mentioned above:
From the above table, we can see that KMeans++ shows better clustering results than KMeans, OPTICS, DBSCAN, and K-Medoids on each city’s dataset (Python Notebook).
Let’s define cluster_Kmeans(), which takes the dataframe after outlier removal with DBSCAN, data, and the number of clusters to be formed by KMeans, num_clusters. The ‘lat’ and ‘lon’ columns are subsetted from the dataframe, converted to NumPy, and stored in coords.
Now, we compute KMeans for coords. We use the k-means++ initialization scheme, the number of clusters is taken as the number of clusters formed by DBSCAN, and random_state is 42 (set by config.py).
cluster_model.py
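A sketch of cluster_Kmeans() under those assumptions; random_state is inlined as a default parameter here instead of being read from config.py:

```python
# cluster_model.py - sketch of cluster_Kmeans(); random_state inlined for the sketch
from sklearn.cluster import KMeans

def cluster_Kmeans(data, num_clusters, random_state=42):
    coords = data[["lat", "lon"]].to_numpy()
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++",
                    random_state=random_state, n_init=10).fit(coords)
    y_kmeans = kmeans.predict(coords)  # cluster label for every coordinate
    return [num_clusters, coords, y_kmeans, data]
```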
cluster_Kmeans() returns a list km that contains num_clusters, coords, y_kmeans (the KMeans clustering results), and data, i.e. the dataframe. In config.py:
config.py
The ultimate aim of using KMeans clustering here is to plot each cluster as a polygon, since KMeans clusters are convex. Let’s define clusters_convex(), which categorizes the clusters into most significant and least significant clusters and returns the convex polygon coordinates of each.
clusters_convex() takes a list km_return as a parameter, containing num_clusters, coords, y_kmeans, and data (as returned from cluster_Kmeans()). Here, if the number of coords in a cluster is greater than 45, the cluster is appended, after applying apply_convex_hull(), to the most_significant list; otherwise, to the least_significant list.
Now, the most_significant list contains the convex hull points of the most significant clusters and the least_significant list contains the convex hull points of the least significant clusters.
cluster_model.py
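A sketch of clusters_convex(); the >45 threshold follows the article, while hull_fn is a stand-in parameter for apply_convex_hull() from convex_hull.py, added here only so the sketch is self-contained:

```python
# cluster_model.py - sketch of clusters_convex(); hull_fn stands in for
# apply_convex_hull() from convex_hull.py
def clusters_convex(km_return, hull_fn):
    num_clusters, coords, y_kmeans, data = km_return
    most_significant, least_significant = [], []
    for k in range(num_clusters):
        cluster_coords = coords[y_kmeans == k]   # points assigned to cluster k
        hull = hull_fn(cluster_coords.tolist())  # convex hull polygon coordinates
        if len(cluster_coords) > 45:
            most_significant.append(hull)
        else:
            least_significant.append(hull)
    return most_significant, least_significant
```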
clusters_convex() finally returns the convex hull points of the most significant and least significant clusters, as described earlier. Looking into the code, we imported functions from convex_hull.py, and a new function, apply_convex_hull(), is used.
In convex_hull.py, apply_convex_hull() takes the cluster coordinates and returns the convex hull points of the cluster. Here, Jarvis’s algorithm is used to compute the convex hull.
convex_hull.py
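A self-contained sketch of Jarvis’s (gift wrapping) algorithm as apply_convex_hull() might implement it; points are (lat, lon) pairs:

```python
# convex_hull.py - sketch of apply_convex_hull() using Jarvis's gift wrapping
def orientation(p, q, r):
    """> 0 clockwise turn, < 0 counter-clockwise turn, 0 collinear."""
    return (q[1] - p[1]) * (r[0] - q[0]) - (q[0] - p[0]) * (r[1] - q[1])

def apply_convex_hull(points):
    """Return the convex hull of `points` in boundary order."""
    points = [tuple(p) for p in points]
    n = len(points)
    if n < 3:
        return points                  # hull of <3 points is the points themselves
    leftmost = min(range(n), key=lambda i: points[i])
    hull, p = [], leftmost
    while True:
        hull.append(points[p])
        q = (p + 1) % n
        for i in range(n):
            # keep the most counter-clockwise candidate relative to p
            if orientation(points[p], points[i], points[q]) < 0:
                q = i
        p = q
        if p == leftmost:              # wrapped all the way around the boundary
            break
    return hull
```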
If you look at outlier_dbscan(), cluster_Kmeans(), and clusters_convex(), they are all used to convert the dataframe into polygons of the most significant and least significant clusters, each taking the output of the previous function as input, so let’s combine their function calls under a single function, say cluster_models().
cluster_model.py
cluster_models() takes the city dataframe after df_preprocess() and calls outlier_dbscan(), whose outputs are passed to cluster_Kmeans(), followed by clusters_convex(), which finally returns the convex hull points of the most significant and least significant clusters along with coords.
4. Plot cluster in Folium Map
Now, with the coordinates of the convex hulls for the most significant and least significant clusters of the city, let’s plot them on a map using folium. Let’s define a function mapplot() that plots the polygons on the map.
Install the package:
pip install folium
mapplot() takes the most_significant, least_significant, and coords resulting from cluster_models(). A folium map, map_osm, is created for the location given by coords. The coordinates of the city, coords, are plotted on map_osm using CircleMarker(), a folium function that plots coordinates (latitude and longitude pairs) on the map.
Polygons are plotted on map_osm using Polygon(), a folium function that takes the polygon coordinates and plots them on the map. Here, for the most significant clusters, the most_significant list is traversed and a polygon with a black border and red fill is plotted for each; similarly, for the least significant clusters, the least_significant list is traversed and a polygon with a blue border and yellow fill is plotted for each.
cluster_model.py
With the help of folium plugins and raster_layers, we added different map layouts to map_osm, along with LayerControl() to switch between them, and also added mini-map and fullscreen features. To add a legend to the map_osm map, use a macro that calls add_map_legend() from maplegend.py (check: To add a legend to a folium map). mapplot() returns the final map.
maplegend.py
Now let’s expand our app.py to call cluster_models(), which returns most_significant, least_significant, and coords, and plot them using mapplot(). The map generated by mapplot() gets displayed in Streamlit using folium_static(). To render a folium map, Streamlit has a special component, streamlit_folium, with a function folium_static() that displays the folium map in our Streamlit app.
app.py
Now, the Streamlit web app results as shown below:
5. Group and cluster the top 5 amenities
After identifying the commercial centers, let’s try to understand the city amenities by grouping them into some common categories and clustering the top 5 amenities in the city.
Let’s define a function amenity_df(), which takes the city_data dataframe as a parameter and defines a few general amenity categories, like food_list, bank_list, education_list, hospital_list, etc., so that the amenities can be grouped under them.
Traverse the city_data dataframe and append the latitudes and longitudes to the respective category based on the amenity column. Count the amenities present under each category. The function then returns a new dataframe df containing Amenity (the amenity group name), lat_lon (the latitudes and longitudes of all amenities in that category), and Count (the size of the amenity group).
cluster_model.py
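A sketch of amenity_df(); the category lists here are a small assumed subset of the article’s groups, and it uses a pandas filter per category rather than the row-by-row traversal described above:

```python
# cluster_model.py - sketch of amenity_df(); CATEGORIES is an assumed subset
import pandas as pd

CATEGORIES = {
    "food": ["restaurant", "cafe", "fast_food"],
    "bank": ["bank", "atm"],
    "education": ["school", "college", "university"],
    "hospital": ["hospital", "clinic", "pharmacy"],
}

def amenity_df(city_data):
    rows = []
    for name, members in CATEGORIES.items():
        # collect the coordinates of every amenity that falls in this group
        group = city_data[city_data["amenity"].isin(members)]
        lat_lon = list(zip(group["lat"], group["lon"]))
        rows.append({"Amenity": name, "lat_lon": lat_lon, "Count": len(lat_lon)})
    # arrange in ascending order of Count, as in the article
    return pd.DataFrame(rows).sort_values("Count").reset_index(drop=True)
```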
In amenity_df(), we defined the lists of amenities that can be grouped under a single category and then appended the lat and lon values to the matching category list. For example, food_list groups certain amenities under it, and we append the lat and lon from the city_data dataframe to the food list if the amenity category of the row falls under food_list.
Then, for the Count column, we enumerate over amenities_list, which contains the lat and lon of every amenity group, and find each group’s length; count_amenity holds the count of each amenity group. Parse amenities_str, amenities_list, and count_amenity into a dataframe and arrange it in ascending order of the Count column.
To view the amenity distribution of the city, let’s plot a simple bar chart that takes the resultant dataframe from amenity_df() and plots it using plotly.express, a high-level interface to the Plotly Python library that creates great visualization charts.
The bar plot is plotted against the Amenity and Count columns of df, and the resultant graph figure, fig, is returned.
cluster_model.py
To cluster the top 5 amenities, let’s define a function top5(), which takes the resultant dataframe from amenity_df() and clusters the amenities based on lat_lon (stored as amenity_array) for the given amenity name. We explicitly define the number of clusters for KMeans and plot the convex hull polygons of the clusters on the map.
Here, amenity_array contains the latitudes and longitudes of all the amenities of the particular category. When defining the number of clusters for KMeans, if the length of amenity_array is less than 60, the number of clusters, n_clusters, is set to 5; otherwise, n_clusters is set to 20.
After computing KMeans, we declare polygon, the list containing the convex hull polygon coordinates of each cluster after applying apply_convex_hull(), as described earlier in the blog. Then we define amenity_map_osm, the map that contains the amenity polygons, and return it.
cluster_model.py
Now let’s put it all together in our Streamlit app, app.py:
app.py
Here, city_data is passed to amenity_df(), returning a dataframe, city_amenity, with the columns 'Amenity', 'lat_lon', and 'Count', where Amenity is the name of the amenity group, lat_lon contains the list of latitude and longitude coordinates of the amenities in that group, and Count contains the number of amenities in that group.
top5name contains the list of the top 5 amenity group names (used for naming the tabs). barplot() plots the Amenity-Count bar chart of the city data and returns a figure stored in barplt.
We use Streamlit tabs to view the bar plot and the top 5 amenity clusters. The plotly chart gets displayed in tab1 using st.plotly_chart(), taking the figure barplt as a parameter.
For the other 5 tabs, we use folium_static() to display the folium maps: top5() generates a polygon cluster map of each amenity using the city_amenity dataframe and the amenity’s rank in the top 5, ilocation. For example, for the top amenity we pass top5(city_amenity, 0), as indices start at 0 in Python.
The final app.py:
app.py
Conclusion
To summarize, in this blog we built a Streamlit app that identifies the commercial centers of any city by plotting clusters on a map with the help of folium, using DBSCAN to remove outliers and KMeans++ to plot each cluster as a convex polygon, and we also plotted the top 5 amenities of the city. The web app is deployed on Streamlit share, do check it out: https://identifying-commercial-centres-using-ml.streamlit.app/
If you’ve enjoyed this article or have any questions feel free to connect with me on LinkedIn.
You can find the source code of this project here: GitHub
Reference
[1] Commercial Center Using POI by Aakashjhawar
[2] Jarvis’s Algorithm for Convex Hull
[3] Convert to radians for spatial data
Thanks to Sean Benhur for his helpful comments on this project and for reviewing the article.