Published in

Geek Culture

# Introduction

What happens in most companies, is you start of with a GREAT spread of customers over your sales force. Then, over time, the waters get muddied and a few years down the line you probably have something similar to figure 1. Note the different colors, denotes different rep’s and see how much they overlap. This article will deal with sorting this mess out.

If you have not worked with KMEAN’s, then here is a good article.

# Required packages

We will be using KMEAN’s as our initial algorithm to group the customers according to their locations. However, usual KMEAN’s doesn’t quite do the trick in our scenario, since KMEAN’s can end up with very unbalanced area’s. One rep could end up with 20 customers and another could end up with 150. So k-means-constrained to the rescue. We can give minimum and maximum for each area. So, first up, install using pip.

pip install k-means-constrained

Now, lets import the packages we need

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
from k_means_constrained import KMeansConstrained

Now import your data into a data frame as I have below in figure 2. I have blocked out some data, but you get the point, you need the following fields:

• AccountID — the customers account numer
• RepID — The sales representative linked to the customer
• Latitude
• Longitude
• Location — A simple concat of latitude,longitude

# Plotting the initial data

You can use below to plot the existing customer base with each representative as a different colour.

lst_reps = sorted(list(dfm.RepID.unique()))
k = len(lst_reps) - 1
# Get our list of unique clusters
lst_elements = sorted(list(dfm.RepID.unique()))
# Now make up random colors for each unique RepID
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF) for i in range(len(lst_elements))]
# Add color column to dataframe and apply colors
dfm["color"] = dfm["RepID"]
dfm["color"] = dfm["color"].apply(lambda x:
lst_colors[lst_elements.index(x)])
m = folium.Map(tiles="cartodbpositron", zoom_start=11)
for index, row in dfm2.iterrows():
folium.CircleMarker(
location=[float(row['Latitude']),
popup=str(row['AccountID']+'|'+str(row['RepID'])+'|'+row['Name']+'|'+row['UserField01']),
color=row['color'],
fill=True,
fill_color=row['color'],
m

# Training parameters

Let’s set up out training parameters.

• K indicates the number of representatives
• We want a minimum of 45 customers per representative
• A maximum of 92 customers per representative
lst_reps = sorted(list(dfm.RepID.unique()))
k = len(lst_reps) -2
min_customers = 45
max_customers = 92

# KMEAN’s to cluster our customers

Now, lets set up our KMEAN’s parameters and run fit_predict. Importantly, we will want to save the centroids for each cluster.

In our case, each centroid will be a central location for each representatives area or territory.

## Create a new dataframe with latitude/longitude
X = dfm[['Latitude','Longitude']]
df_X = X.copy()
### KMeansConstrained parameters
clf = KMeansConstrained(
n_clusters=k,
size_min=min_customers,
size_max=max_customers,
random_state=0
)
clf.fit_predict(df_X)
# save results
th_centroids = clf.cluster_centers_
df_clustercentroids = pd.DataFrame(th_centroids)
df_clustercentroids.columns = ['Latitude','Longitude']
print(clf.inertia_)
# send back into dataframe and display it
dfm['cluster1'] = clf.labels_

Your result is now as below, take note of the new fields in our dataframe

• cluster1 — The cluster

# Assign each cluster to the most common representative

A little like the highest common denominator, lets use an APPLY to find the most common representative in a cluster and assign them to the whole cluster.

# get a copy of the dataframe
dfn = dfm.copy()
clusters = dfn.cluster1.unique()
# Remove unwanted reps
dfn = dfn.loc[dfn['RepID'] != 2195]
dfn = dfn.loc[dfn['RepID'] != -1]
# Iterate through the unique clusters
for cluster in clusters:
# find the most common rep in the cluster
repid = dfn.loc[dfn.cluster1 == cluster].RepID.mode() #.loc[0]
if repid.size == 1:
repid = int(repid.loc[0])
else:
repid = 9999
# assign the cluster to its repid
print(str(repid) + ':' + str(cluster))
dfm.loc[dfm.cluster1 == cluster,'RepID1'] = repid
# Remove rep from dataframe copy so that its not used again
dfn = dfn.loc[dfn.RepID != repid]
dfm = dfm.astype({"RepID1": int})

You can also see a count of how many customers have been clustered per representative:

dfm.groupby(['RepID1','cluster1'])['AccountID'].count()

# Update centroids

Let’s update the repID’s to the centroid as well, just for info purposes

df_clustercentroids['cluster'] = df_clustercentroids.indexdf_clustercentroids['RepID'] = df_clustercentroids.apply(lambda row: -1 if dfm.loc[dfm.cluster1==row.cluster].RepID1.mode().size == 0 else dfm.loc[dfm.cluster1==row.cluster].RepID1.mode()[0],axis=1)df_clustercentroids.head(20)

# Plot the results

Now, lets plot the results and see how we are fairing

# Get our list of unique clusters
lst_elements = sorted(list(dfm.cluster1.unique()))
# Now make up random colors for each unique cluster
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF) for i in
range(len(lst_elements))]
# Add color column to dataframe and apply colors
dfm["color"] = dfm["cluster1"]
dfm["color"] = dfm["color"].apply(lambda x:
lst_colors[lst_elements.index(x)])
m = folium.Map(tiles="cartodbpositron", zoom_start=11)for index, row in dfm.iterrows():
folium.CircleMarker(
location=[float(row['Latitude']), float(row['Longitude'])],
popup=str(row['cluster1']),
color=row['color'],
fill=True,
fill_color=row['color'],
tooltip=str(row['cluster1']) + '|' + row.AccountID
# Plot centroids
for index, row in df_clustercentroids.iterrows():
folium.Marker(
location=row,
popup=str(index),
tooltip=str(index) + '|#=' +
str(dfm.loc[dfm.cluster1==index].groupby(['cluster1']
['AccountID'].count().iloc[0])
m

As you can see, we are already doing much better and plotting the centroids also gives us the central location for each area.

# So, whats the problem?

Well, KMEANS-constrained is great, but the constraints now cause a few outliers which we can use KNN to resolve. Figure 5 highlights the issue for you. Can you see one customer assigned to the brown representative, BUT it doesn’t make sense…. It should actually be with the purple representative.

Easy to spot on a map and lets use KNN to resolve this. If you do not know the KNN algorithm, then have a look at this article.

# KNN to the rescue

First up, lets setup a train and test dataset.

# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(dfm[['Latitude','Longitude']], dfm[['cluster1']], test_size=0.3) # 70% training and 30% test

Now that we classified our data using KMEAN’s, lets train a KNN model to predict based on the KMEAN’s results. This will help us with the following

1. Identify the outliers from KMEAN’s due to its min/max constraints.
2. Have a model we can use going forward to predict which representative an account should be assigned to.

Now that we trained, lets run a prediction on our test set and then measure accuracy. We should see above 92% accuracy of our model

#Import knearest neighbors Classifier model
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=k)
#Train the model using the training sets
knn.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = knn.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# Predict on ALL data

We have a KNN model, so lets predict for latitude/longitude on our entire dataset. This will help show us some outliers where KMEAN’s made an odd choice due to the constraint of min/max customers per rep.

y_pred = knn.predict(dfm[['Latitude','Longitude']])

Update the predictions to our dataframe and mark the failed knnfailer=true when KNN did not agree with KMEANS. New clusters will be in the cluster2 column.

dfm['cluster2'] = y_preddfm['outlier'] = dfm.apply(lambda row: False if row.cluster1 == row.cluster2 else True, axis=1)dfm[dfm.cluster1 != dfm.cluster2]

Note the results in figure 6 with our 2 new columns. The handy outlier field tells us if KNN changed the customer to a new area.

# Update RepID2 from most common RepID1 in cluster

In this step, lets create a repid2 in the cluster are all assigned to the most common RepID1 in the cluster 1. This ensures the outliers get the correct rep code into RepID2.

def getMostCommonRep(row):
global dfm
repid = dfm.loc[dfm.cluster1 == row.cluster2].RepID1.mode()
if repid.size == 1:
return repid.loc[0]
else:
return -1
dfm['RepID2'] = dfm.apply(lambda row: getMostCommonRep(row), axis=1)dfm = dfm.astype({"RepID": 'int32'})
dfm = dfm.astype({"RepID1": 'int32'})
dfm = dfm.astype({"RepID2": 'int32'})
dfm = dfm.astype({"cluster1": 'int32'})
dfm = dfm.astype({"cluster2": 'int32'})

# Lets plot the new results

Now lets plot our results to see if KNN solved our outlier problem:

# Get our list of unique clusters
lst_elements = sorted(list(dfm.cluster2.unique()))
# Now make up random colors for each unique cluster
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF) for i in
range(len(lst_elements))]
# Add color column to dataframe and apply colors
dfm["color"] = dfm["cluster2"]
dfm["color"] = dfm["color"].apply(lambda x:lst_colors[lst_elements.index(x)])
m = folium.Map(tiles="cartodbpositron", zoom_start=11)for index, row in dfm[dfm.outlier == False].iterrows():
folium.CircleMarker(location=[float(row['Latitude']), float(row['Longitude'])],
popup=str(row['cluster2']),
color=row['color'],
fill=True,
fill_color=row['color'],
tooltip=str(row['cluster2']) + '|' + str(row['RepID2']) + '|' + row.AccountID
# Plot centroids
for index, row in dfm[dfm.outlier == True].iterrows():
folium.CircleMarker(location=[float(row['Latitude']), float(row['Longitude'])],
popup=str(row['cluster2']),
color='black',
fill=True,
fill_color='black',
tooltip= 'old=' + str(row['cluster1']) + '| new=' + str(row['cluster2']) + '|' + row.AccountID
m

Great, as you can see in figure 7, our outlier has been reassigned to the purple representative. Note, we do give it a black border so we can easily pick the outliers out.

# Conclusion

I hope this article will help you on using machine learning to cluster your customers geographically. As with most machine learning, you may need more than just one algorithm to get thee right result.

# Sources

A big thanks to this article that got me onto kmeans-constrained. It was a big help.

--

--

--

## More from Geek Culture

A new tech publication by Start it up (https://medium.com/swlh).

## Shaun Enslin

Coding, technology, data, crypto & lots of cycling are my passions. https://twitter.com/Shaunenslin https://www.linkedin.com/in/shaun-enslin-4984bb14b/