Clustering geographic data on an interactive map in python

Tanakan Chua
CodeX
Published in
7 min readMar 31, 2022

--

Covid-19 has stayed with us for about 3 years, people have to change their behavior and there are many opportunities for new jobs. However, data science is still one of the most popular jobs. Certainly, Data visualization and Machine learning are crucial skills for this role. Imagine two data scientists have to bring their machine learning output to present in the business term. The first one just presents output in a spreadsheet and the other one presents output in an interactive graphical view. The latter will have a chance to draw more attention from listeners.

Example output from the first data scientist
Example output from the latter data scientist

How cool the latter data scientist!

Therefore, this story will give an example that integrates clustering geographic data (latitude and longitude) by using the K-mean method and draws the result on an interactive map in python. (Alert, this article will not explain each theory we will use but focusing integrating them to get results for a business case). OK, let’s code.

Business case: A company has many stores to serve our customers. However, this company wants to cluster them into some groups to reinforce the stores by sharing resources together in each group because some stores sometimes have some problems and cannot serve customers. Your role as a data scientist has to suggest to this company how many clusters should we have and how the boundary of each group is

The python packages we will use

  1. pandas [For data manipulation]
  2. sklearn [For K-mean clustering]
  3. matplotlib [For K-mean clustering]
  4. scipy [For clustering boundary creation]
  5. folium [For map visualiation] if you want some more background of folium ,you can read my previous article for more details

first, I created random data for this business case shown as follows.

Example input data
  1. name= store name,
  2. lat= Latitude,
  3. lon=longtiude,
  4. business_condition=indicator that causes store cannot operate which I assumed more than 500

We have to import this data from pandas and find out how many clusters should we use by using the elbow method to find the optimal number of clusters. (However, in real life, we may have business constraints that already specify how many groups we will cluster, so in that case, we do not need to use the elbow method.)

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial import ConvexHull
import folium
df_map = pd.read_excel('input_clustering.xlsx')
# 1. Clustering your data into KMeans clustering one of the unsupervise clsutering method
#1.1 data preparation
X = df_map .iloc[:, 1:3].values
# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

This time we will use 4 groups which is the elbow point of the graph. We cluster our data again by K-mean at 4 clusters.

# 1.2 Training the K-Means model regarding to your elbow method or business logic groups
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
# 1.3 map data back to df
df_map['cluster'] = y_kmeans +1 # to step up to group 1 to 4

Up to now, we have the output like the first picture above which is the example of the first data scientist. Next, we draw this result in an interactive map by using folium to become the later data scientist.

#2.plot data to map
# Create the map object called m which is the base layer of the map
m = folium.Map(location=[df_map['lat'].mean(), df_map['lon'].mean()],
tiles='CartoDB positron',
zoom_start=7)
# create layers based on your clustering groups
layer1 = folium.FeatureGroup(name= '<u><b>group1</b></u>',show= True)
m.add_child(layer1)
layer2 = folium.FeatureGroup(name= '<u><b>group2</b></u>',show= True)
m.add_child(layer2)
layer3 = folium.FeatureGroup(name= '<u><b>group3</b></u>',show= True)
m.add_child(layer3)
layer4 = folium.FeatureGroup(name= '<u><b>group4</b></u>',show= True)
m.add_child(layer4)
#draw marker class for each group by adding CSS class
my_symbol_css_class= """ <style>
.fa-g1:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: black;
background-color:white;
border-radius: 10px;
white-space: pre;
content: ' g1 ';
}
.fa-g2:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: black;
background-color:white;
border-radius: 10px;
white-space: pre;
content: ' g2 ';
}
.fa-g3:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: black;
background-color:white;
border-radius: 10px;
white-space: pre;
content: ' g3 ';
}
.fa-g4:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: black;
background-color:white;
border-radius: 10px;
white-space: pre;
content: ' g4 ';
}
.fa-g1bad:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: white;
background-color:red;
border-radius: 10px;
white-space: pre;
content: ' g1 ';
}
.fa-g2bad:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: white;
background-color:red;
border-radius: 10px;
white-space: pre;
content: ' g2 ';
}
.fa-g3bad:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: white;
background-color:red;
border-radius: 10px;
white-space: pre;
content: ' g3 ';
}
.fa-g4bad:before {
font-family: Arial;
font-weight: bold;
font-size: 12px;
color: white;
background-color:red;
border-radius: 10px;
white-space: pre;
content: ' g4 ';
}
</style>
"""
# the below is just add above CSS class to folium root map
m.get_root().html.add_child(folium.Element(my_symbol_css_class))
# then we just create marker and specific your css class in icon like below
for index, row in df_map.iterrows():
if row['cluster'] == 1 and row['business_condition1'] < 500 :
color='black'
fa_symbol = 'fa-g1'
lay = layer1
elif row['cluster'] == 1 and row['business_condition1'] >= 500 :
color='black'
fa_symbol = 'fa-g1bad'
lay = layer1
elif row['cluster'] == 2 and row['business_condition1'] < 500:
color='purple'
fa_symbol = 'fa-g2'
lay = layer2
elif row['cluster'] == 2 and row['business_condition1'] >= 500:
color='purple'
fa_symbol = 'fa-g2bad'
lay = layer2
elif row['cluster'] == 3 and row['business_condition1'] < 500:
color='orange'
fa_symbol = 'fa-g3'
lay = layer3
elif row['cluster'] == 3 and row['business_condition1'] >= 500:
color='orange'
fa_symbol = 'fa-g3bad'
lay = layer3
elif row['cluster'] == 4 and row['business_condition1'] < 500:
color='blue'
fa_symbol = 'fa-g4'
lay = layer4
else:
color='blue'
fa_symbol = 'fa-g4bad'
lay = layer4

folium.Marker(
location=[row['lat'], row['lon']],
title = row['name']+ 'group:{}'.format(str(row['name'])),
popup = row['name']+ 'group:{}'.format(str(row['name'])),
icon= folium.Icon(color=color, icon=fa_symbol, prefix='fa')).add_to(lay)

The code above shows markers with their name representing each store on the map which separates by 4 colors of 4 groups. Moreover, if some stores have business_condition1 ≥ 500 the marker name is highlighted by a red background to notice that this store cannot operate. The final thing to do is to draw boundaries for each group. We apply ConvexHull to solve this issue which scipy package can help you.

#draw cluster each group
#flat line to group path
#prepare layer and color for each group
layer_list = [layer1,layer2,layer3,layer4]
color_list = ['black','purple','orange','blue']
for g in df_map['cluster'].unique():
# this part we apply ConvexHull theory to find the boundary of each group
# first, we have to cut the lat lon in each group
latlon_cut =df_map[df_map['cluster']==g].iloc[:, 1:3]
# second, scipy already provides the great function for ConvexHull
# we just throw our dataframe with lat lon in this function
hull = ConvexHull(latlon_cut.values)
# and with magic, we can have new lat lon boundary of each group
Lat = latlon_cut.values[hull.vertices,0]
Long = latlon_cut.values[hull.vertices,1]
# the we create dataframe boundary and convert it to list of lat lon
# for plotting polygon in folium
cluster = pd.DataFrame({'lat':Lat,'lon':Long })
area = list(zip(cluster['lat'],cluster['lon']))
# plot polygon
list_index = g-1 # minus 1 to get the same index
lay_cluster = layer_list[list_index ]
folium.Polygon(locations=area,
color=color_list[list_index],
weight=2,
fill=True,
fill_opacity=0.1,
opacity=0.8).add_to(lay_cluster)

# to let the map have selectd layer1 layer2 you created
folium.LayerControl(collapsed=False,position= 'bottomright').add_to(m)
# save it to html then we can send the file to our colleagues
m.save('mymap_clustering.html')

Following through with the code we have the output like the picture of the later data scientist.

The output following through with the code

This example is just a basic use case. I hope it helps all of the readers can get the picture and may adapt it to real business cases. For example, you can use the boundary for geo-fencing to notify when your partners come into each boundary and make some actions to respond to them. Finally, thank you to StackOverflow which helps me create superb interactive maps.

--

--