Clustering Algorithm for Driver Segmentation

Grouping drivers based on mean distance driven per day and mean over-speed percentage

Dhruval Patel
CodeX
3 min read · Jun 14, 2022


Photo by Nicolas Peyrol on Unsplash

Hello and welcome to my article! In this post, you’ll learn how to use clustering techniques when you have unlabeled data. Using KMeans clustering, you’ll group drivers into segments based on their driving characteristics. Let’s get started.

Dataset Description

Find the dataset here. For the sake of simplicity, take only two features:

  1. the mean distance driven per day
  2. the mean percentage of time a driver was >5 mph over the speed limit

Here is what each column represents:

  • id: Unique Id of the driver
  • mean_dist_day: Mean distance driven by driver per day
  • mean_over_speed_perc: Mean percentage of time a driver was > 5 mph over the speed limit

(1) Import required libraries and dataset —

First, import the required libraries and then import the dataset.
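The import step might look roughly like the sketch below. It assumes the dataset is a CSV file with the three columns described above; the helper name `load_driver_data` and the sample rows are made up for illustration, so swap in the path to your own copy of the data.

```python
import io

import pandas as pd

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["id", "mean_dist_day", "mean_over_speed_perc"]


def load_driver_data(source):
    """Read driver data from a CSV path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df[EXPECTED_COLUMNS]


# Made-up rows standing in for the real file; pass a file path instead.
sample_csv = io.StringIO(
    "id,mean_dist_day,mean_over_speed_perc\n"
    "1,71.2,28\n"
    "2,52.5,25\n"
    "3,170.9,12\n"
)
df = load_driver_data(sample_csv)
print(df.head())
```

Checking the column names right after loading makes a renamed or missing column fail early rather than halfway through the clustering.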

(2) Check information and essential data distribution—
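A quick look at the data could be done along these lines. This is only a sketch: the DataFrame below is randomly generated as a stand-in for the real dataset, so the printed numbers will not match the actual data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the driver dataset (assumption, not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(1, 101),
    "mean_dist_day": rng.normal(60, 15, 100).round(2),
    "mean_over_speed_perc": rng.integers(0, 60, 100),
})

df.info()                              # column dtypes and non-null counts
summary = df.describe()                # count, mean, std, min, quartiles, max
print(summary)

missing_per_column = df.isna().sum()   # number of missing values per column
print(missing_per_column)
```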

We’re ready to apply clustering, but one question remains: how many clusters should we use? We’ll use the elbow method to decide.

(3) Determine the number of clusters using the elbow method—

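WCSS (Within-Cluster Sum of Squares) is the sum of the squared distances between each point and the centroid of the cluster it belongs to. When we plot WCSS against the number of clusters K, the curve bends like an elbow.

The WCSS value decreases as the number of clusters grows, and it is highest when K = 1. The elbow is the point after which adding more clusters reduces WCSS only slightly, so it marks a reasonable choice for K.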
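To make the definition concrete, here is a minimal sketch that computes WCSS for K = 1 to 10 with scikit-learn, where the fitted model exposes it as `inertia_`. The data is a synthetic stand-in with three blobs, and `wcss_manual` is a hypothetical helper that recomputes the same quantity directly from the definition.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: three blobs in (mean_dist_day, mean_over_speed_perc).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([50, 8], [8, 4], size=(300, 2)),
    rng.normal([50, 40], [8, 6], size=(100, 2)),
    rng.normal([180, 15], [15, 6], size=(100, 2)),
])


def wcss_manual(X, labels, centers):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centers[labels]) ** 2).sum())


# WCSS for each candidate number of clusters.
k_values = list(range(1, 11))
wcss = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

for k, value in zip(k_values, wcss):
    print(f"K={k:2d}  WCSS={value:12.1f}")

# Cross-check the definition against scikit-learn for one value of K.
km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
manual_k3 = wcss_manual(X, km3.labels_, km3.cluster_centers_)
print(manual_k3, km3.inertia_)
```

Plotting `wcss` against `k_values` gives the elbow curve; on data like this the drop should flatten noticeably after K = 3.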

We will use Yellowbrick, a Python visualization library built on scikit-learn and Matplotlib. Yellowbrick extends the scikit-learn API to make model selection and hyperparameter tuning easier, and it uses Matplotlib in the background to draw the plots.

Elbow for KMeans Clustering

Here, the elbow in the plot suggests that the optimal number of clusters is 3 (K = 3).

(4) Run the algorithm with K=3 —

Let’s check the labels of the data points. The number of labels should match the number of rows in the dataset.
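A sketch of this step with scikit-learn is shown below. The data is again a synthetic stand-in, and `n_init=10` and `random_state=42` are my own choices for reproducibility rather than settings taken from the original run.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the driver dataset (assumption, not the real data).
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal([50, 8], [8, 4], size=(300, 2)),
    rng.normal([50, 40], [8, 6], size=(100, 2)),
    rng.normal([180, 15], [15, 6], size=(100, 2)),
])
df = pd.DataFrame(points, columns=["mean_dist_day", "mean_over_speed_perc"])
df.insert(0, "id", np.arange(1, len(df) + 1))

# Cluster on the two features only; the id column carries no signal.
features = df[["mean_dist_day", "mean_over_speed_perc"]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(features)

labels = kmeans.labels_          # one cluster index per row
print(labels[:10])
print(np.unique(labels))         # the distinct cluster indices
print(len(labels) == len(df))    # label count should equal row count
```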

Now we’re ready to plot, but first, let’s make a column that indicates the cluster number of each data point.
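One way to add the cluster column and draw the scatter plot is sketched below with plain Matplotlib. The data is synthetic, the output file name `driver_clusters.png` is hypothetical, and the `Agg` backend is selected only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the driver dataset (assumption, not the real data).
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal([50, 8], [8, 4], size=(300, 2)),
    rng.normal([50, 40], [8, 6], size=(100, 2)),
    rng.normal([180, 15], [15, 6], size=(100, 2)),
])
df = pd.DataFrame(points, columns=["mean_dist_day", "mean_over_speed_perc"])

features = df[["mean_dist_day", "mean_over_speed_perc"]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# New column holding the cluster number of each data point.
df["cluster"] = kmeans.fit_predict(features)

# One scatter series per cluster, plus the centroids.
fig, ax = plt.subplots(figsize=(8, 5))
for cluster_id, group in df.groupby("cluster"):
    ax.scatter(group["mean_dist_day"], group["mean_over_speed_perc"],
               s=10, label=f"Cluster {cluster_id}")
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], marker="x", color="black",
           s=100, label="Centroids")
ax.set_xlabel("mean_dist_day")
ax.set_ylabel("mean_over_speed_perc")
ax.set_title("KMeans Cluster")
ax.legend()
fig.savefig("driver_clusters.png", dpi=150)
```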

KMeans Cluster

Insights

Here are some insights I drew from the clusters:

  1. There are 696 drivers who travel more than 110 miles per day on average yet have an average over-speed percentage of less than 40%.
  2. Out of 4000 drivers, 104 over-speed frequently (more than 40% of the time) and travel more than 110 miles per day on average.
  3. 3200 drivers drive fewer than 90 miles per day on average and have an over-speed percentage of at most 60%.
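Counts like these can be obtained with boolean filters in pandas. The sketch below uses six made-up rows, so it only demonstrates the filtering logic and does not reproduce the numbers above.

```python
import pandas as pd

# Six made-up drivers; the real dataset has 4000 rows.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "mean_dist_day": [45.0, 60.5, 85.0, 150.0, 175.0, 190.0],
    "mean_over_speed_perc": [5, 30, 55, 10, 35, 80],
})

long_distance = df["mean_dist_day"] > 110
short_distance = df["mean_dist_day"] < 90

# Insight 1: long distance, over-speed percentage below 40%.
long_and_calm = int((long_distance & (df["mean_over_speed_perc"] < 40)).sum())

# Insight 2: long distance, over-speed percentage above 40%.
long_and_fast = int((long_distance & (df["mean_over_speed_perc"] > 40)).sum())

# Insight 3: short distance, over-speed percentage of at most 60%.
short_up_to_60 = int((short_distance & (df["mean_over_speed_perc"] <= 60)).sum())

print(long_and_calm, long_and_fast, short_up_to_60)
```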

Thank you for reading! I would appreciate it if you follow me or share this article with someone. Best wishes.

Your support would be awesome❤️
