Published in Analytics Vidhya

Clustering and profiling customers using k-Means

Photo by Anthony Intraversato on Unsplash
This article walks through:
  • Converting input sales data into a feature dataset that can be used for clustering
  • Performing the clustering exercise
  • Profiling the clusters, and
  • Setting up a regular scoring process to assign cluster labels based on new data.

What is Clustering?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. These groups are called clusters and the similarity measure of objects can be determined in multiple ways.
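
To make "similarity" concrete, here is a minimal sketch that uses Euclidean distance, which is one common choice and the one k-means relies on. The three customers and their two features are made-up values for illustration only.

```python
import numpy as np

# Three hypothetical customers described by (total sales, number of orders).
customers = np.array([
    [100.0, 2.0],
    [110.0, 3.0],
    [900.0, 40.0],
])


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


d_01 = euclidean_distance(customers[0], customers[1])
d_02 = euclidean_distance(customers[0], customers[2])

# Customer 0 is far closer to customer 1 than to customer 2, so a
# distance-based method would tend to put 0 and 1 in the same cluster.
print(d_01 < d_02)
```

In practice the features are scaled first, otherwise the feature with the largest numeric range (here, sales) dominates the distance.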

Photo by Jeremy Zero on Unsplash
Common families of clustering algorithms include:
  • Partition based, e.g. k-means
  • Hierarchical
  • Density based, e.g. DBSCAN, OPTICS
  • Grid based, e.g. WaveCluster
  • Model based, e.g. self-organizing maps (SOM)

Sales data

We have sales data that captures date, customer id, product, quantity, dollar amount and payment type at the order-item level, i.e. one row per item within each order.

  • order_item_id refers to each unique product within each order
  • We have corresponding dimension tables for customer info (customer_id), product info (product_id), and payment tender info (payment_type_id)
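
As a rough illustration of this layout, the sketch below builds a tiny sales table and a product dimension table in pandas and joins them. The rows are invented, and the column names order_date, order_id, quantity, sales_amount and category are assumptions; only customer_id, product_id, payment_type_id and order_item_id come from the description above.

```python
import pandas as pd

# Hypothetical order-item level sales rows (values invented for illustration).
sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-01-05", "2021-01-05", "2021-02-10"]),
    "order_id": [1001, 1001, 1002],
    "order_item_id": [1, 2, 1],
    "customer_id": [1, 1, 2],
    "product_id": [501, 502, 501],
    "quantity": [2, 1, 3],
    "sales_amount": [40.0, 15.0, 60.0],
    "payment_type_id": [1, 1, 2],
})

# Hypothetical product dimension table keyed on product_id.
products = pd.DataFrame({
    "product_id": [501, 502],
    "category": ["Apparel", "Footwear"],
})

# Attach product attributes to every order item via the dimension table.
sales_enriched = sales.merge(products, on="product_id", how="left")
print(sales_enriched[["order_id", "order_item_id", "product_id", "category"]])
```

The customer and payment tender dimensions can be joined in the same way on customer_id and payment_type_id.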

Clustering Exercise

To cluster customers based on their transaction data, we first need to get the data into the format required for the clustering exercise.

Features data set

We need a dataset that summarizes customer activity over time at the customer_id level, i.e. each customer is represented by one row of data that covers everything we know about that customer.

  1. Sales
  2. Quantity
  3. No. of orders
  4. Avg. order value
  5. Units per transaction
  6. Avg. unit revenue
  7. No. of different products bought
  8. No. of different product categories bought
  9. No. of different payment types used
  10. Split of category level sales as % of total sales
  11. Split of category level units as % of total units
  12. Split of tender type level sales as % of total sales
  13. Omni shopper flag
  14. Email subscription flag
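
Several of the features listed above can be derived with a single groupby. The following is a minimal sketch on invented data; the column names (order_id, sales_amount, category, and so on) are assumptions rather than the article's actual schema, and the flag features are omitted because they would come from the customer dimension table.

```python
import pandas as pd

# Hypothetical order-item level data, already joined to the product dimension.
sales = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_id": [10, 10, 11, 20, 21],
    "product_id": [501, 502, 501, 503, 503],
    "category": ["Apparel", "Footwear", "Apparel", "Home", "Home"],
    "payment_type_id": [1, 1, 2, 1, 1],
    "quantity": [2, 1, 1, 4, 1],
    "sales_amount": [40.0, 60.0, 20.0, 80.0, 20.0],
})

# One row per customer with totals and distinct counts.
features = sales.groupby("customer_id").agg(
    sales=("sales_amount", "sum"),
    quantity=("quantity", "sum"),
    orders=("order_id", "nunique"),
    products=("product_id", "nunique"),
    categories=("category", "nunique"),
    payment_types=("payment_type_id", "nunique"),
)

# Ratio features derived from the totals.
features["avg_order_value"] = features["sales"] / features["orders"]
features["units_per_transaction"] = features["quantity"] / features["orders"]
features["avg_unit_revenue"] = features["sales"] / features["quantity"]

# Category-level sales as a share of each customer's total sales.
cat_sales = sales.pivot_table(
    index="customer_id", columns="category",
    values="sales_amount", aggfunc="sum", fill_value=0,
)
cat_share = cat_sales.div(cat_sales.sum(axis=1), axis=0).add_prefix("sales_pct_")
features = features.join(cat_share)

print(features.round(2))
```

The unit and tender type splits follow the same pivot pattern, using quantity as the value or payment_type_id as the column.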

k-Means clustering

Once the features dataset is ready, we follow the steps below to get clusters from this data.

  1. Null treatment
  2. Feature scaling
  3. Running multiple iterations of k-means with varying k
  4. Using the elbow plot to determine the optimum value of k
  5. Getting k clusters
Elbow plot
Final clustering iteration
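
The five steps above could be sketched with scikit-learn roughly as follows. This is not the article's original code: the data is synthetic, median imputation is just one possible null treatment, and the final k is picked by eye for this toy dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customer features with three loose groups (illustration only).
rng = np.random.default_rng(42)
centers = np.array([[100.0, 2.0], [500.0, 10.0], [1500.0, 30.0]])
X = np.vstack([c + rng.normal(0, [20.0, 1.0], size=(50, 2)) for c in centers])
features = pd.DataFrame(X, columns=["sales", "orders"])
features.iloc[0, 0] = np.nan  # inject a missing value to demonstrate step 1

# 1. Null treatment: fill missing values with the column median (an assumption).
features = features.fillna(features.median())

# 2. Feature scaling: zero mean and unit variance per feature.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# 3. Run k-means for a range of k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias[k] = km.inertia_

# 4. Elbow: look for the k after which inertia stops dropping sharply.
for k, inertia in inertias.items():
    print(k, round(inertia, 1))

# 5. Final iteration with the chosen k (3 here, chosen by eye for this toy data).
final_k = 3
final_model = KMeans(n_clusters=final_k, n_init=10, random_state=42)
features["cluster"] = final_model.fit_predict(X_scaled)
```

Plotting inertias against k (for example with matplotlib) gives the elbow plot. Keeping the fitted scaler and final_model objects around matters later, because new data has to be scaled and labelled with the same objects.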


Profiling is the most important part of the clustering exercise, as it helps us understand what the clusters actually represent in terms of customer behavior and define them in a business-usable fashion.

Cluster level data summary

To help with cluster profiling, we will use the pandas describe() method. Other approaches work as well, as long as they summarize every feature across all clusters so that the profile of each cluster can be determined. Visualization tools such as Tableau can also be used for this purpose by loading in the labelled clustering output.
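
A minimal sketch of such a summary is shown here, assuming a labelled output with one row per customer and a cluster column. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical labelled clustering output: one row per customer.
labelled = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "sales": [100.0, 120.0, 500.0, 540.0, 1500.0, 1700.0],
    "orders": [2, 3, 10, 12, 30, 34],
})

# Full describe() output (count, mean, std, quartiles, ...) per cluster.
profile = labelled.groupby("cluster").describe()

# A more compact view: cluster size plus the mean of every feature.
summary = labelled.groupby("cluster").mean()
summary.insert(0, "customers", labelled.groupby("cluster").size())

print(summary)
```

Writing summary out with to_csv() is one way to get it into Excel or Tableau for color-scale comparisons across clusters.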

Using color scales from Excel to Analyze variability across clusters


Now that we have completed this exercise and have profiled and understood what our customers look like, we may want to understand what our new customers look like, or how existing customers change over time as they continue to shop with us.

  1. Take the latest transaction data
  2. Generate the same clustering features that were built earlier in this exercise
  3. Use the saved model objects to scale the feature dataset and then assign cluster labels to it, and
  4. Save the cluster labels back to the database to be used for reporting and monitoring purposes.
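
One way to sketch this scoring loop is with joblib for persisting the fitted scaler and k-means model. The file names, the temporary directory and the toy data are assumptions for illustration, not the article's actual setup.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# --- At training time: fit and save the model objects (toy data) ---
train = pd.DataFrame({
    "sales": [100.0, 120.0, 1500.0, 1700.0],
    "orders": [2, 3, 30, 34],
})
scaler = StandardScaler().fit(train)
model = KMeans(n_clusters=2, n_init=10, random_state=42)
model.fit(scaler.transform(train))

model_dir = tempfile.mkdtemp()  # stand-in for a real model store
joblib.dump(scaler, os.path.join(model_dir, "scaler.joblib"))
joblib.dump(model, os.path.join(model_dir, "kmeans.joblib"))

# --- At scoring time: load the objects and label fresh features ---
new_features = pd.DataFrame(
    {"sales": [110.0, 1600.0], "orders": [2, 32]},
    index=pd.Index([101, 102], name="customer_id"),
)
scaler_loaded = joblib.load(os.path.join(model_dir, "scaler.joblib"))
model_loaded = joblib.load(os.path.join(model_dir, "kmeans.joblib"))

# Only transform and predict here; refitting would change what the clusters mean.
labels = model_loaded.predict(scaler_loaded.transform(new_features))
scored = pd.DataFrame({"cluster": labels}, index=new_features.index)
print(scored)
```

The scored frame could then be written back to a database table, for example with pandas DataFrame.to_sql, so that reports can track how customers move between clusters over time.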



