Identifying Commercial Centers Using Machine Learning

A quick walk-through, from identifying commercial centers with ML to deploying the result as a Streamlit app.

Sowmya D
Geek Culture
14 min read · Dec 2, 2022


Point of Interest image by Alexsl [Source: iStock]

A commercial center, also known as a downtown, is an area with a high concentration of business, civic, and cultural activities. Knowing a city's commercial centers is essential if you want to start a business, as it helps you identify customer needs and grow your business. To identify the commercial centers of a city, we need to cluster the city's Points of Interest (POI), filtered to the right amenities of interest. A Point of Interest is generally any place a person finds useful, usually indicated by a latitude and longitude along with some attributes, say the name of the place and the category it belongs to. In this article, using the POI data of a city, we will identify its commercial centers with Machine Learning.

Machine Learning (ML) offers clustering techniques for finding insights in data points. Unsupervised ML algorithms are commonly used in geospatial analysis of this kind to identify commercial centers. Scikit-learn, a Python library for ML, contains clustering algorithms suited to such an unsupervised learning problem.

We’ll use the Python libraries Overpy to query data from OSM, Folium to plot the map and clusters, Scikit-learn to implement the ML algorithms, and a few other basic libraries like NumPy and Pandas for our project.

A Geographic Information System (GIS) provides the spatial data of a city; popular GIS data providers include OpenStreetMap (OSM), Natural Earth Data, and OpenTopography.

The spatial data of a city can be queried from OpenStreetMap (OSM) using the Python package Overpy.

Overpy is a Python wrapper around the Overpass API of OpenStreetMap (OSM), which we use to fetch the POI of the city. Overpy returns a list of nodes with the node_id, lat, lon, and other details of each POI, along with the JSON tags of the particular node.

Python provides various packages for spatial data visualization; one such package is Folium. Folium is a Python wrapper around leaflet.js that helps plot interactive geospatial maps. It provides various base maps and makes it easy to draw polygons over a map, which we'll use to show the clusters of commercial centers of a city. In this article, we will build a simple web application using Streamlit, an open-source Python app framework: a city name is taken from the user, its data is fetched from OpenStreetMap (OSM), the data is pre-processed, outliers are removed, and the clusters are plotted on a map. Along with identifying the commercial centers, the app also forms clusters of the top 5 amenities in the city.

To remove outliers we use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. For the clustering itself, we tried various algorithms on the resulting dataframe, namely KMeans, KMeans++, K-Medoids, OPTICS, and DBSCAN, and the best one is used to cluster the coordinates.

GIF Source: Author

You can find the source code of this project here: GitHub

Here is the step-by-step outline of the project:

  1. Fetch City details from OSM
  2. Remove outliers using DBSCAN
  3. Cluster using KMeans++
  4. Plot cluster in Folium Map
  5. Group the amenities and cluster the top 5 amenities

The project comprises 5 different modules:

app.py: comprises the Streamlit UI and the function calls to cluster_model.py to identify the commercial centers.

cluster_model.py: has the functions to get the city details, remove the outliers, form the clusters, and plot them on the map. It also contains functions to group the various amenities, cluster them, and plot them on the map.

config.py: contains the configuration values for a few variables.

convex_hull.py: uses Jarvis’s algorithm for creating a convex hull and defines a function, apply_convex_hull(), that returns the coordinates of the convex hull polygon.

map_legend.py: adds a legend to the folium map.

1. Fetch City details from OSM

To begin, let’s install overpy and streamlit:
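```
pip install overpy streamlit
```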

In app.py, let’s import streamlit and create a simple UI to get the city name from the user.

app.py
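A minimal sketch of what this UI might look like (the title and prompt strings here are placeholders, not necessarily the exact ones from the project):

```python
# app.py (sketch)
import streamlit as st

st.title("Identifying Commercial Centers using Machine Learning")
city_name = st.text_input("Enter a city name")  # e.g. "Chennai"
```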

Image Source: Author

Now, in cluster_model.py, import overpy and define fetch_city_data(), which takes city_name as a parameter and uses an Overpass API query to fetch the city details. The query returns the city details in JSON, which contains unnecessary nodes that don't contribute to the city's commercial centers. So we need to remove blank nodes and convert the city details to a dataframe (for easier access). For this, let's define another function, say df_preprocess(), that takes the result of the API query as input.

cluster_model.py
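A sketch of how fetch_city_data() might look; the exact Overpass query used in the project may differ, but this form (an area lookup by city name, then all amenity nodes inside it) is a common pattern:

```python
# cluster_model.py (sketch)
import overpy

api = overpy.Overpass()

def fetch_city_data(city_name):
    # Overpass QL: find the named area, then fetch every amenity node inside it
    query = f"""
    area["name"="{city_name}"]->.searchArea;
    (
      node["amenity"](area.searchArea);
    );
    out body;
    """
    res = api.query(query)
    return df_preprocess(res)  # clean up and convert to a dataframe
```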

df_preprocess(), with res as its parameter, converts the JSON to a DataFrame and subsets only the necessary columns, say ‘node_id’, ‘lat’, ‘lon’, ‘name’, and ‘amenity’. It also removes the unnecessary amenities that don't contribute to commercial centers and returns the resulting dataframe to fetch_city_data().

cluster_model.py
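A sketch of df_preprocess(); the project's actual list of dropped amenities is longer, so the one below is only illustrative:

```python
# cluster_model.py (sketch)
import pandas as pd

# illustrative subset of amenities that don't indicate commercial activity
UNWANTED_AMENITIES = ["toilets", "bench", "waste_basket", "drinking_water"]

def df_preprocess(res):
    rows = [
        {"node_id": node.id, "lat": float(node.lat), "lon": float(node.lon),
         "name": node.tags.get("name"), "amenity": node.tags.get("amenity")}
        for node in res.nodes
    ]
    df = pd.DataFrame(rows)
    df = df.dropna(subset=["amenity"])                # drop blank nodes
    df = df[~df["amenity"].isin(UNWANTED_AMENITIES)]  # drop unhelpful amenities
    return df
```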

Again in app.py, let’s make the dataframe visible to the user.

app.py
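Roughly, the app.py side might look like this (the error message is a placeholder):

```python
# app.py (sketch)
from cluster_model import fetch_city_data

if city_name:
    try:
        df = fetch_city_data(city_name)
        st.dataframe(df)  # show the fetched POI data to the user
    except Exception:
        st.error("Could not fetch data for this city from OpenStreetMap.")
```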

When fetching data from OpenStreetMap using an Overpass query, the data for a city may sometimes not be found, so it's better to call fetch_city_data() inside a try block.

In the above code snippet, if a city name is provided by the user, fetch_city_data() gets called from cluster_model.py and finally returns a dataframe that gets displayed using st.dataframe.

Image Source: Author

2. Remove outliers using DBSCAN

The resulting dataframe df from the previous section is taken, and after subsetting only the ‘lat’ and ‘lon’ fields, we apply DBSCAN to form clusters. This keeps all the coordinate points (lat, lon) of POI that fall inside a cluster, removing the outliers, and also gives us the number of clusters to form when we later apply KMeans++ for more efficient clustering.

Why DBSCAN for removal of outliers?

Density-Based Spatial Clustering of Applications with Noise, aka DBSCAN, is a density-based unsupervised Machine Learning clustering algorithm that is robust to outliers. DBSCAN uses a distance and a minimum number of points per cluster to classify a point as an outlier. It creates a circle of epsilon radius around each data point and classifies points as core points, border points, and noise points.

Core point: if a data point has at least ‘minPoints’ points within its epsilon radius, it is a core point.

Border point: if a data point has fewer than ‘minPoints’ points within its epsilon radius but lies within the epsilon radius of a core point, it is a border point.

Noise point: a data point that is neither a core point nor a border point is treated as noise.

DBSCAN takes two important parameters: epsilon and minPoints.

Epsilon is the radius of the circle created around each data point to check the density.

minPoints is the minimum number of data points required inside the epsilon radius of a data point for it to be classified as a core point.

Let’s define outlier_dbscan(), which takes a dataframe as a parameter, say data, subsets the ‘lat’ and ‘lon’ fields, converts them to a NumPy array, and stores them in the variable coords.

Next, we compute DBSCAN. The epsilon parameter is the maximum distance (0.5 km in this example) that points can be from each other to be considered part of the same cluster. The min_samples parameter is the minimum cluster size (everything else gets classified as noise). We set min_samples to 10, so every data point either gets assigned to a cluster of at least 10 points or is considered noise. We use the haversine metric and the ball tree algorithm to calculate great-circle distances between points: haversine calculates the distance between two points on Earth from their latitude and longitude, while the ball tree, a metric tree data structure, spatially partitions the data points to speed up neighbor lookups.

Notice that our epsilon and coordinates get converted to radians, because scikit-learn’s haversine metric expects radian units:

cluster_model.py
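A minimal sketch of outlier_dbscan() along the lines described above, assuming config exposes the epsilon and min_samples values under names like EPSILON and MIN_SAMPLES:

```python
# cluster_model.py (sketch)
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

import config

def outlier_dbscan(data):
    x = data[["lat", "lon"]]
    coords = x.to_numpy()
    kms_per_radian = 6371.0088             # Earth's mean radius in km
    eps = config.EPSILON / kms_per_radian  # 0.5 km expressed in radians
    db = DBSCAN(eps=eps, min_samples=config.MIN_SAMPLES,
                algorithm="ball_tree", metric="haversine").fit(np.radians(coords))
    labels = db.labels_                    # -1 marks noise points (outliers)
    num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    s = pd.Series(labels != -1)            # True for points inside some cluster
    return [x[s.values], num_clusters]
```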

In the above code snippet, outlier_dbscan() returns a list containing a dataframe with only the coordinates that fall inside clusters, subsetted from the dataframe x, i.e. x[s.values], and the number of clusters formed by DBSCAN, num_clusters.

If you notice the code, we imported something called config and used it to pass the values for epsilon and min_samples.

For modularity, let’s place all our configurations in a separate file config.py.

config.py
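config.py might look something like this (the variable names are assumptions):

```python
# config.py (sketch)
EPSILON = 0.5      # DBSCAN: neighbourhood radius in kilometres
MIN_SAMPLES = 10   # DBSCAN: minimum points required to form a cluster
```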

3. Cluster Using KMeans++

Now, let’s cluster the coordinates of the resulting dataframe from the previous section using KMeans++, a centroid initialization technique for KMeans. We do this because DBSCAN clusters can be of any shape, so we cannot plot them as polygons on the map, whereas KMeans++ generates convex-shaped clusters that can be plotted as polygons on the folium map, distinctly showing the commercial centers of the city. KMeans++ is implemented via scikit-learn’s KMeans algorithm by passing ‘k-means++’ as its initialization parameter.

What is KMeans?

K-Means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. It partitions the data points into ‘k’ clusters by selecting ‘k’ random centroids and assigning each data point to the closest cluster centroid. It then reassigns the centroids in such a way that the points within each cluster have minimum distance from their centroid. Here the number of clusters to be formed is explicitly defined. Scikit-learn offers two initialization schemes: “random” and “k-means++”, where k-means++ generally gives better results than random.

Why KMeans++?

The main drawback of the K-Means algorithm is that it depends on the initialization of the centroids. For example, if a centroid is initialized at a “far away” point, it may well end up with no data points associated with it, while more than one cluster may end up associated with a single centroid. Likewise, more than one centroid may be initialized within the same group, resulting in poor clustering. To overcome this centroid initialization problem, we use KMeans++.

K-Means++, a centroid initialization technique for KMeans clustering, initializes the centroids distant from each other and generally shows better results than random initialization. Here, the centroids are chosen before applying the KMeans algorithm to the data.

In addition, we tested 10 random cities: on the dataframe resulting from DBSCAN outlier removal, we applied KMeans with random initialization, KMeans++, K-Medoids, OPTICS, and DBSCAN. To determine the goodness of a clustering technique, we use the Silhouette coefficient, a metric whose value ranges from -1 to 1. The following are the Silhouette coefficient results for the algorithms mentioned above:

Silhouette Coefficient

From the above table, we can see that KMeans++ shows better clustering results than KMeans, OPTICS, DBSCAN, and K-Medoids on this dataset for each city (Python Notebook).

Let’s define cluster_Kmeans(), which takes the dataframe after the removal of outliers using DBSCAN, data, and the number of clusters to be formed by KMeans, num_clusters.

The ‘lat’ and ‘lon’ columns are subsetted from the dataframe, converted to a NumPy array, and stored in coords.

Now, we compute KMeans on coords. We use the k-means++ initialization scheme, a centroid initialization technique. The number of clusters to be formed is taken as the number of clusters formed using DBSCAN, and the random_state is 42 (set in config.py).

cluster_model.py
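A sketch of cluster_Kmeans(), assuming the random state lives in config.py as RANDOM_STATE:

```python
# cluster_model.py (sketch)
from sklearn.cluster import KMeans

import config

def cluster_Kmeans(data, num_clusters):
    coords = data[["lat", "lon"]].to_numpy()
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++",
                    random_state=config.RANDOM_STATE)
    y_kmeans = kmeans.fit_predict(coords)  # cluster label for each coordinate
    return [num_clusters, coords, y_kmeans, data]
```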

cluster_Kmeans() returns a list km that contains num_clusters, coords, y_kmeans (which holds the KMeans clustering results), and data, i.e. the dataframe.

In config.py:

config.py
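The new entry added to config.py would be something like:

```python
# config.py (addition)
RANDOM_STATE = 42  # fixed seed so KMeans results are reproducible
```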

The ultimate aim of using KMeans clustering here is to plot each cluster as a polygon, since KMeans clusters are convex in shape. Let’s define clusters_convex(), which categorizes the clusters into most_significant and least_significant clusters and returns the convex polygon coordinates of each.

clusters_convex() takes as a parameter a list km_return containing num_clusters, coords, y_kmeans, and data (as returned from cluster_Kmeans()). Here, if a cluster contains more than 45 coordinates, it is appended to the most_significant list after applying apply_convex_hull(); otherwise, it goes to the least_significant list.

Now, the most_significant list contains the convex hull points of the most significant clusters and the least_significant list contains the convex hull points of the least significant clusters.

cluster_model.py
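A sketch of clusters_convex(), under the assumption that apply_convex_hull() takes a cluster's coordinates and returns the hull's vertices:

```python
# cluster_model.py (sketch)
from convex_hull import apply_convex_hull

def clusters_convex(km_return):
    num_clusters, coords, y_kmeans, data = km_return
    most_significant, least_significant = [], []
    for i in range(num_clusters):
        cluster_coords = coords[y_kmeans == i]
        hull = apply_convex_hull(cluster_coords)
        if len(cluster_coords) > 45:   # large clusters are "most significant"
            most_significant.append(hull)
        else:
            least_significant.append(hull)
    return most_significant, least_significant
```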

clusters_convex() finally returns the convex hull points of the most significant and least significant clusters, as described earlier. Looking at the code, we imported functions from convex_hull.py, and a new function, apply_convex_hull(), is used.

In convex_hull.py, apply_convex_hull() takes the coordinates of a cluster and returns the convex hull points of that cluster. Here, Jarvis’s algorithm is used to compute the convex hull.

convex_hull.py
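A minimal implementation of Jarvis's gift-wrapping algorithm that apply_convex_hull() could be based on:

```python
# convex_hull.py (sketch)
def orientation(p, q, r):
    """Orientation of the ordered triplet (p, q, r):
    0 = collinear, 1 = clockwise, 2 = counter-clockwise."""
    val = (q[1] - p[1]) * (r[0] - q[0]) - (q[0] - p[0]) * (r[1] - q[1])
    if val == 0:
        return 0
    return 1 if val > 0 else 2

def apply_convex_hull(points):
    """Return the convex hull vertices of a set of (lat, lon) points."""
    n = len(points)
    if n < 3:
        return [tuple(p) for p in points]
    leftmost = min(range(n), key=lambda i: points[i][0])
    hull, p = [], leftmost
    while True:
        hull.append(tuple(points[p]))
        q = (p + 1) % n
        for i in range(n):
            # pick i if it is more counter-clockwise than the current candidate q
            if orientation(points[p], points[i], points[q]) == 2:
                q = i
        p = q
        if p == leftmost:  # wrapped all the way around the hull
            break
    return hull
```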

If you look at outlier_dbscan(), cluster_Kmeans(), and clusters_convex(), they are all used to convert the dataframe into polygons of most_significant and least_significant clusters, and each takes its input from the output of another. So let’s combine their function calls under a single function, say cluster_models().

cluster_model.py
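The wrapper could be as simple as:

```python
# cluster_model.py (sketch)
def cluster_models(df):
    clustered_df, num_clusters = outlier_dbscan(df)
    km = cluster_Kmeans(clustered_df, num_clusters)
    most_significant, least_significant = clusters_convex(km)
    return most_significant, least_significant, km[1]  # km[1] holds coords
```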

cluster_models() takes the city dataframe after df_preprocess() and calls outlier_dbscan(), whose outputs are passed to cluster_Kmeans(), followed by clusters_convex(), which finally returns the convex hull points of the most significant and least significant clusters along with the coords.

4. Plot cluster in Folium Map

Now, with the coordinates of the convex hulls for the most significant and least significant clusters of the city, let’s plot them on a map using folium. Let’s define a function mapplot() that plots the polygons on the map.

Install the packages:
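```
pip install folium streamlit-folium
```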

mapplot() takes the most_significant, least_significant, and coords resulting from cluster_models(). A folium map, map_osm, is created at the location given by coords. The coordinates of the city, coords, are plotted on map_osm using CircleMarker(), a folium function that plots coordinates (latitude and longitude pairs) on the map.

The polygons are plotted on map_osm using Polygon(), a folium function that takes the polygon coordinates and draws them on the map. Here, for the most significant clusters, the most_significant list is traversed and each polygon is drawn with a black border and a red fill. Similarly, for the least significant clusters, the least_significant list is traversed and each polygon is drawn with a blue border and a yellow fill.

cluster_model.py
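A sketch of mapplot(); the zoom level, marker radius, and fill opacities below are placeholder values:

```python
# cluster_model.py (sketch)
import folium
from folium.plugins import MiniMap, Fullscreen

def mapplot(most_significant, least_significant, coords):
    # centre the map on the mean of the city's POI coordinates
    map_osm = folium.Map(location=[coords[:, 0].mean(), coords[:, 1].mean()],
                         zoom_start=12)
    for lat, lon in coords:
        folium.CircleMarker([lat, lon], radius=1, color="blue").add_to(map_osm)
    for polygon in most_significant:   # red polygons: most significant clusters
        folium.Polygon(polygon, color="black", fill=True,
                       fill_color="red", fill_opacity=0.4).add_to(map_osm)
    for polygon in least_significant:  # yellow polygons: least significant clusters
        folium.Polygon(polygon, color="blue", fill=True,
                       fill_color="yellow", fill_opacity=0.4).add_to(map_osm)
    MiniMap().add_to(map_osm)
    Fullscreen().add_to(map_osm)
    folium.LayerControl().add_to(map_osm)
    return map_osm
```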

With the help of folium plugins and raster_layers, we add different types of map layouts to map_osm; to switch between them we add LayerControl(), and we also add mini-map and fullscreen features to the map. To add a legend to map_osm, we use a macro that calls add_map_legend() from map_legend.py (see: To add a legend to a folium map). mapplot() returns the final map.

map_legend.py
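Following the linked approach, the legend is injected as an HTML overlay via a branca macro; a minimal sketch (the legend text and styling are placeholders):

```python
# map_legend.py (sketch)
from branca.element import MacroElement, Template

LEGEND_HTML = """
{% macro html(this, kwargs) %}
<div style="position: fixed; bottom: 30px; left: 30px; z-index: 9999;
            background: white; padding: 8px; border: 2px solid grey;">
  <b>Legend</b><br>
  <span style="color: red;">&#9632;</span> Most significant cluster<br>
  <span style="color: yellow;">&#9632;</span> Least significant cluster
</div>
{% endmacro %}
"""

def add_map_legend(map_osm):
    macro = MacroElement()
    macro._template = Template(LEGEND_HTML)
    map_osm.get_root().add_child(macro)
    return map_osm
```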

Now let’s expand app.py to call cluster_models(), which returns most_significant, least_significant, and coords, and plot them using mapplot(). The map generated by mapplot() gets displayed in Streamlit using folium_static. To render a folium map, Streamlit has a special component, streamlit_folium, which provides a function folium_static() that displays the folium map in our Streamlit app.

app.py
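In app.py, this part might look roughly like:

```python
# app.py (sketch)
from streamlit_folium import folium_static
from cluster_model import cluster_models, mapplot

most_significant, least_significant, coords = cluster_models(df)
map_osm = mapplot(most_significant, least_significant, coords)
folium_static(map_osm)  # render the folium map inside the Streamlit app
```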

Now, the Streamlit web app results as shown below:

GIF Source: Author

5. Group and cluster the top 5 amenities

After identifying the commercial centers, let’s try to understand the city’s amenities by grouping them into some common categories and clustering the top 5 amenities in the city.

Let’s define a function amenity_df(), which takes the city_data dataframe as a parameter and defines a few general amenity categories like food_list, bank_list, education_list, hospital_list, etc., so that the amenities can be grouped under them.

Traverse the city_data dataframe and append the latitudes and longitudes of each row to the respective category list, based on the amenity column of city_data. Then count the amenities present under each category. The function returns a new dataframe df containing Amenity (the amenity group name), lat_lon (the latitudes and longitudes of all amenities in that category), and Count (the size of the amenity group).

cluster_model.py
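A condensed sketch of amenity_df(); the real module defines many more category lists than shown here:

```python
# cluster_model.py (sketch)
import pandas as pd

def amenity_df(city_data):
    # illustrative category lists; the project defines more (education, hospital, ...)
    categories = {
        "Food": ["restaurant", "cafe", "fast_food", "bar", "food_court"],
        "Bank": ["bank", "atm", "bureau_de_change"],
    }
    amenities_str, amenities_list, count_amenity = [], [], []
    for name, members in categories.items():
        subset = city_data[city_data["amenity"].isin(members)]
        lat_lon = subset[["lat", "lon"]].values.tolist()
        amenities_str.append(name)
        amenities_list.append(lat_lon)
        count_amenity.append(len(lat_lon))
    df = pd.DataFrame({"Amenity": amenities_str, "lat_lon": amenities_list,
                       "Count": count_amenity})
    return df.sort_values("Count")  # ascending order of Count
```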

In amenity_df(), we define the lists of amenities that can be grouped under a single category and then append the lat and lon values to a category list if the amenity falls under that category. For example, food_list groups certain amenities under it, and we append the lat and lon from the city_data dataframe to the food list if the amenity category of the row falls under food_list.

Then, for the Count column, we enumerate over amenities_list, which contains the lat and lon values of every category, and take the length of each, so that count_amenity contains the count of each amenity group.

Finally, we assemble amenities_str, amenities_list, and count_amenity into a dataframe and arrange it in ascending order of the Count column.

To view the amenity distribution of the city, let’s plot a simple bar chart that takes the resulting dataframe from amenity_df() and plots it using plotly.express, a high-level interface to the Plotly Python library that creates great visualization charts.

The bar chart is plotted against the Amenity and Count columns of df, and the resulting figure, fig, is returned.

cluster_model.py
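barplot() can be as small as (the chart title is a placeholder):

```python
# cluster_model.py (sketch)
import plotly.express as px

def barplot(df):
    # bar chart of amenity group sizes
    fig = px.bar(df, x="Amenity", y="Count", title="Amenity distribution")
    return fig
```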

To cluster the top 5 amenities, let’s define a function top5(), which takes the resulting dataframe from amenity_df() and clusters the amenities based on lat_lon, stored as amenity_array, for the given amenity name. We explicitly define the number of clusters for KMeans and plot the convex hull polygons of the clusters on a map.

Here, amenity_array contains the latitudes and longitudes of all the amenities of the particular category. To define the number of clusters for KMeans: if the length of amenity_array is less than 60, the number of clusters n_clusters is set to 5; otherwise n_clusters is set to 20.

After computing KMeans, we build polygon, the list containing the convex hull polygon coordinates of each cluster after applying apply_convex_hull(), as described earlier in the blog.

Then we define amenity_map_osm, the map that contains the polygons of the amenity, and return it.

cluster_model.py
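A sketch of top5() combining the pieces described above; the row-access pattern and map styling are assumptions:

```python
# cluster_model.py (sketch)
import numpy as np
import folium
from sklearn.cluster import KMeans

import config
from convex_hull import apply_convex_hull

def top5(city_amenity, ilocation):
    row = city_amenity.iloc[ilocation]
    amenity_array = np.array(row["lat_lon"])  # (lat, lon) pairs of this group
    n_clusters = 5 if len(amenity_array) < 60 else 20
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++",
                    random_state=config.RANDOM_STATE)
    labels = kmeans.fit_predict(amenity_array)
    amenity_map_osm = folium.Map(location=amenity_array.mean(axis=0).tolist(),
                                 zoom_start=12)
    for i in range(n_clusters):
        polygon = apply_convex_hull(amenity_array[labels == i])
        if len(polygon) >= 3:  # need at least a triangle to draw a polygon
            folium.Polygon(polygon, color="black", fill=True,
                           fill_color="red", fill_opacity=0.4).add_to(amenity_map_osm)
    return amenity_map_osm
```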

Now let’s put them all together in our Streamlit app, app.py :

app.py
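Putting the amenity pieces together in app.py might look like this; the tab labels and the assumption that the first five rows of city_amenity are the top 5 groups are mine:

```python
# app.py (sketch)
import streamlit as st
from streamlit_folium import folium_static
from cluster_model import amenity_df, barplot, top5

city_amenity = amenity_df(city_data)
# assumes city_amenity is ordered so that the first five rows are the top 5 groups
top5name = city_amenity["Amenity"].tolist()[:5]

tabs = st.tabs(["Amenity distribution"] + top5name)
with tabs[0]:
    st.plotly_chart(barplot(city_amenity))    # the bar chart, barplt
for i in range(5):
    with tabs[i + 1]:
        folium_static(top5(city_amenity, i))  # cluster map of the i-th top amenity
```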

Here, city_data is passed to amenity_df(), returning a dataframe city_amenity with the columns ‘Amenity’, ‘lat_lon’, and ‘Count’, where Amenity is the name of the amenity group, lat_lon contains the list of latitude and longitude coordinates of the amenities in that group, and Count contains the number of amenities in that group.

top5name contains the list of the top 5 amenity group names (used for naming the tabs).

barplot() builds the bar chart of Amenity versus Count for the city data and returns a figure stored in barplt.

We use Streamlit tabs to view the bar plot and the top 5 amenity clusters respectively.

In Streamlit, the plotly chart gets displayed in tab1 using st.plotly_chart(), taking the figure barplt as a parameter.

For the other 5 tabs, we use folium_static() to display the folium map returned by top5(), which generates a polygon cluster map of the amenity using the city_amenity dataframe and the position of the amenity in the top 5, ilocation. For example, for the top 1 amenity, we pass top5(city_amenity, 0), as indices start at 0 in Python.

The final app.py:

app.py

Conclusion

To summarize, in this blog we built a Streamlit app that identifies the commercial centers of any city by plotting the clusters on a map with the help of folium, and also plotted the top 5 amenities of the city, using DBSCAN to remove the outliers and KMeans++ to plot each cluster as a convex-shaped polygon on the map. The web app is deployed on Streamlit share, do check it out: https://identifying-commercial-centres-using-ml.streamlit.app/

If you’ve enjoyed this article or have any questions feel free to connect with me on LinkedIn.

You can find the source code of this project here: GitHub

Reference

[1] Commercial Center Using POI by Aakashjhawar

[2] DBSCAN for outliers

[3] Add a draggable legend to a folium map

[4] Jarvis’s Algorithm for Convex Hull

[5] Convert to radians for spatial data

Thanks to Sean Benhur for his helpful comments on this project and for reviewing the article.
