Clustering Algorithm for Driver Segmentation
Grouping drivers based on mean distance driven per day and mean over-speed percentage
Hello and welcome to my article! In this post, you'll learn how to apply clustering techniques when you have unlabeled data. Using KMeans clustering, you'll group drivers by taking multiple characteristics into account. Let's get started.
Dataset Description
Find the dataset here. For the sake of simplicity, take only two features:
- mean distance driven per day
- mean percentage of time a driver was >5 mph over the speed limit
Here are what the data represent:
- id: Unique Id of the driver
- mean_dist_day: Mean distance driven by driver per day
- mean_over_speed_perc: Mean percentage of time a driver was > 5 mph over the speed limit
(1) Import required libraries and dataset —
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
First, import the required libraries and then import the dataset.
df=pd.read_csv('driver-data.csv')
df.head()
(2) Check information and essential data distribution—
# Check the no. of records - it should be 4000
df.info()

# Check the basic distribution of data
df.describe()
We're ready to apply clustering, but there's an issue: how many clusters should we use? We'll use the elbow method to determine the number of clusters.
(3) Determine the number of clusters using the elbow method—
WCSS (Within-Cluster Sum of Squares) is the sum of the squared distances between each point in a cluster and its centroid. When we plot WCSS against the value of K, the curve forms an elbow.
The WCSS value decreases as the number of clusters grows; it is highest at K = 1.
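The elbow idea can be sketched by hand with scikit-learn's `inertia_` attribute, which is exactly the WCSS for a fitted model. The snippet below is a minimal sketch on synthetic two-feature data (made up to stand in for the driver CSV), not the article's actual dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# synthetic stand-in for the driver data: three 2-D blobs
X = np.vstack([
    rng.normal(loc=(50, 9), scale=5, size=(300, 2)),
    rng.normal(loc=(180, 18), scale=5, size=(70, 2)),
    rng.normal(loc=(178, 70), scale=5, size=(30, 2)),
])

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this k

# WCSS keeps shrinking as k grows; the "elbow" is where the drop flattens
print([round(w) for w in wcss])
```

Plotting `wcss` against `range(1, 8)` reproduces the elbow curve that Yellowbrick draws for you in the next step.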
We will use Yellowbrick, a Python library that combines scikit-learn and Matplotlib. Yellowbrick extends the scikit-learn API to make model selection and hyperparameter tuning easier, with Matplotlib doing the drawing in the background.
!pip install yellowbrick  # if you haven't installed yellowbrick yet
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
# fit on the two features only; id is just an identifier
visualizer = KElbowVisualizer(model, k=(1,12)).fit(df.drop('id', axis=1))
visualizer.show()
Here, the optimal number of clusters is 3, i.e. k = 3.
(4) Run the algorithm with K=3 —
# use KMeans from sklearn
from sklearn.cluster import KMeans

# create an instance of a k-means model with 3 clusters
kmeans = KMeans(n_clusters = 3)

# fit the model to all the data, except for the id label
df_analyze = df.drop('id', axis=1)
kmeans.fit(df_analyze)

# get the cluster center vectors (one row per cluster)
kmeans.cluster_centers_

# Output (truncated; with k = 3 there are three center rows)
array([[ 50.04763438,   8.82875   ],
       [180.017075  ,  18.29      ],
       ...])
Let's check the labels of the data points. The number of labels should match the number of records in the dataset.
# check the labels of the data points
kmeans.labels_

# how many drivers are in the 1st, 2nd and 3rd cluster
from collections import Counter
Counter(kmeans.labels_)

# Output
Counter({0: 3200, 2: 104, 1: 696})
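A fitted model can also assign a brand-new driver to one of the existing clusters via `predict`. Here is a minimal sketch; the training blobs and the new driver's feature values (55 miles/day, 10% over-speed time) are made up for illustration, since the CSV may not be at hand:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic stand-in shaped like (mean_dist_day, mean_over_speed_perc)
X = np.vstack([
    rng.normal(loc=(50, 9), scale=5, size=(300, 2)),
    rng.normal(loc=(180, 18), scale=5, size=(70, 2)),
    rng.normal(loc=(178, 70), scale=5, size=(30, 2)),
])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# hypothetical new driver: 55 miles/day, 10% of time over-speeding
new_driver = np.array([[55.0, 10.0]])
label = kmeans.predict(new_driver)[0]
print(label)  # index of the nearest cluster centroid
```

`predict` simply returns the index of the centroid closest to the new point, so you could use the same call to route incoming drivers into the segments found above.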
Now we’re ready to plot, but first, let’s make a column that indicates the cluster number of each data point.
# create a column for the cluster label
df_analyze['cluster'] = kmeans.labels_

# plot the data (note: lmplot's old `size` argument is now `height`)
sns.set_style('whitegrid')
sns.lmplot(x = 'mean_dist_day', y = 'mean_over_speed_perc',
           data = df_analyze, hue = 'cluster', palette = 'cubehelix',
           height = 5, aspect = 2, fit_reg = False)
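One caveat worth noting: the two features live on different scales (daily distance can exceed 200 miles, while the percentage is bounded by 100), and k-means is distance-based, so standardizing before clustering can change the grouping. The original walkthrough fits on the raw features; the sketch below shows the scaled variant on the same kind of synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# synthetic stand-in for (mean_dist_day, mean_over_speed_perc)
X = np.vstack([
    rng.normal(loc=(50, 9), scale=5, size=(300, 2)),
    rng.normal(loc=(180, 18), scale=5, size=(70, 2)),
    rng.normal(loc=(178, 70), scale=5, size=(30, 2)),
])

# zero mean, unit variance per feature, so both carry equal weight
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(kmeans.labels_))  # cluster sizes on the scaled features
```

Whether to scale depends on whether you want raw mileage to dominate the distance metric; it is worth trying both and comparing the resulting segments.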
Insights
Some insights drawn from the clusters:
- 696 drivers travel more than 110 miles per day on average, yet spend less than 40% of their time over the speed limit.
- Out of 4000 drivers, 104 both over-speed heavily (>40% of the time) and travel more than 110 miles per day on average.
- 3200 drivers drive fewer than 90 miles per day, with over-speed percentages of at most 60%.
Thank you for reading! I would appreciate it if you follow me or share this article with someone. Best wishes.