Clustering and profiling customers using k-Means
The following article walks through the flow of a clustering exercise using customer sales data.
It covers the following steps:
- Conversion of input sales data to a feature dataset that can be used for clustering
- Performing clustering exercise
- Profiling the clusters, and
- Setting up a regular scoring process to assign cluster labels based on new data.
I have also created a Jupyter Notebook with this process and posted it on GitHub. This notebook has been referenced in the article below.
What is Clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. These groups are called clusters and the similarity measure of objects can be determined in multiple ways.
It is an unsupervised learning method that attempts to determine the underlying structure in the data without the help of any labels.
It could also be considered as an Exploratory process which helps us in discovering hidden patterns of interest or structure in data.
Clustering is extensively used in industry applications like customer segmentation. Customer segmentation has various business applications and hence is a very important skill for a data scientist/analyst.
There are various algorithms available to perform clustering, which can be divided into the following groups:
- Partition based, e.g. k-means
- Hierarchical
- Density based, e.g. DBScan, Optics
- Grid based, e.g. Wave-cluster
- Model based, e.g. SOM
You can find implementations of some or all of the above algorithms in scikit-learn and Spark MLlib.
We will be using the k-Means implementation from scikit-learn 0.24.2 on Python 3.7.9 for the following walkthrough.
A walkthrough for k-Means using PySpark is also available on my git repo here.
We are taking the example of an e-commerce business that sells fashion merchandise on their website.
For the sake of simplicity, we are only going to look at sales data, although an actual e-com business will also have huge amounts of clickstream data from its website and app platforms. Clickstream data is extremely useful in understanding the performance of your website and app and determining what drives your customers to visit your platform and buy.
Sales data
We have sales data that captures date, customer id, product, quantity, dollar amount & payment type at order x item level.
- order_item_id refers to each unique product within each order
- We have corresponding dimension tables for customer info (customer_id), product info (product_id), and payment tender info (payment_type_id)
Below are sample rows from each of the tables.
Transactions table:
Customer information:
Product information:
Payment information:
Please note that the data used is synthetic and was created in Excel. The files were saved as csv to be imported into Python as pandas dataframes.
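As a rough sketch of that import step (the file and column names here are placeholders, not the actual files from the repo):

```python
import pandas as pd

# Hypothetical file names; adjust to match your own exports
transactions = pd.read_csv("transactions.csv", parse_dates=["order_date"])
customers = pd.read_csv("customer_info.csv")
products = pd.read_csv("product_info.csv")
payments = pd.read_csv("payment_info.csv")

print(transactions.head())
```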
Clustering Exercise
In order to cluster customers based on their transactions data, we need to get the data into the format required for the clustering exercise.
Features data set
We will require a dataset that summarizes customer activity over time into a customer_id level dataset, i.e. we need to depict each customer using one row of data that covers everything we know about the customer.
The required output table should have 1 row per customer with relevant information as feature columns. There are many possible features that you can create here, some as simple as sales $ for 1 year, and some complicated or derived such as quarterly change in sales of a customer for a particular product category.
This step of the process will involve conversations with the business stakeholders to determine what features matter to the business.
Here is a sample of what the output data will look like:
Features list
The synthetic data we have is for the calendar year 2020 and we will take the whole timeframe. In your case, you might have to add relevant filters and join the sales dataset with some other datasets to get the required data, which can then be aggregated to get the features below.
Overall level:
- Sales
- Quantity
- No. of orders
- Avg. order value
- Units per transaction
- Avg. unit revenue
- No. of different products bought
- No. of different product categories bought
- No. of different payment types used
Category level:
- Split of category level sales as % of total sales
- Split of category level units as % of total units
Tender type level:
- Split of tender type level sales as % of total sales
Customer information:
- Omni shopper flag
- Email subscription flag
Other demographics data will be kept only for profiling.
Below is a code sketch that generates the required features from the sales dataset. The complete code flow can be found on GitHub here.
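This is a minimal sketch of the customer-level feature build, assuming the transactions table has been joined with the product dimension so that a category column is available, and assuming column names like order_id, sales_amount and payment_type_id (adjust these to match your own data):

```python
import pandas as pd

# Overall-level features per customer
overall = (
    transactions
    .groupby("customer_id")
    .agg(
        sales=("sales_amount", "sum"),
        quantity=("quantity", "sum"),
        num_orders=("order_id", "nunique"),
        num_products=("product_id", "nunique"),
        num_categories=("category", "nunique"),       # category assumed joined in from product info
        num_payment_types=("payment_type_id", "nunique"),
    )
)

# Derived ratios
overall["avg_order_value"] = overall["sales"] / overall["num_orders"]
overall["units_per_transaction"] = overall["quantity"] / overall["num_orders"]
overall["avg_unit_revenue"] = overall["sales"] / overall["quantity"]

# Category-level sales split as % of each customer's total sales
category_sales = transactions.pivot_table(
    index="customer_id",
    columns="category",
    values="sales_amount",
    aggfunc="sum",
)
category_pct = category_sales.div(category_sales.sum(axis=1), axis=0)
category_pct.columns = [f"pct_sales_{c}" for c in category_pct.columns]

# The tender-type split follows the same pivot/percentage pattern
features = overall.join(category_pct)
```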
k-Means clustering
Once we have the features dataset ready, we will follow the steps below to get clusters from this data.
- Null treatment
- Feature scaling
- Running multiple iterations of k-means with varying k
- Using the elbow plot to determine the optimum value of k
- Getting k clusters
Null treatment
We need to apply appropriate null treatment to all of our features as the k-means algorithm doesn’t work with nulls in the features.
Different features may require different null treatments, which can be determined from what it means when the feature is null. For example, in our features dataset, there are nulls in the category and payment percentage split features. Nulls here simply mean that the customer has not bought that particular category or not used that particular payment method. Hence, we will fill them with 0.
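A short sketch of that fill, assuming the features dataframe and the pct_ column prefix from the earlier sketch:

```python
# Percentage-split columns are null when the customer never bought that
# category or used that payment type, so fill them with 0
split_cols = [c for c in features.columns if c.startswith("pct_")]
features[split_cols] = features[split_cols].fillna(0)
```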
Feature Scaling
All our features lie on different scales. For example, the sales feature could range from a few hundred to thousands of dollars, whereas units will range from tens to hundreds. Percentage split features will lie in [0,1].
In order to ensure that no one feature over-shadows all other features in the distance calculations, we need to get all features on the same scale.
We will take the percentage features directly without scaling as they already lie in [0,1] but we will scale other features to [0,1] using MinMaxScaler().
We will also save the scaler object as a pickle dump in order to use it later while scoring new data.
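A sketch of the scaling step, again assuming the pct_ prefix convention for the percentage-split columns:

```python
import pickle
from sklearn.preprocessing import MinMaxScaler

# Scale only the non-percentage features; percentage splits are already in [0, 1]
scale_cols = [c for c in features.columns if not c.startswith("pct_")]

scaler = MinMaxScaler()
features_scaled = features.copy()
features_scaled[scale_cols] = scaler.fit_transform(features[scale_cols])

# Persist the fitted scaler so the same transformation can be applied when scoring new data
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
```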
k-means iterations
The k-means algorithm requires user input on how many clusters to generate, denoted by the k parameter. Determining the number of clusters can be difficult unless there is a specific business requirement for a certain number of clusters.
The elbow plot is one method of determining the optimum number of clusters from the data. To do this, we iterate through the data and generate clusters for different values of k, starting at 2. The k-means algorithm gives the sum of squared distances of each data point from the centroid of its assigned cluster, known as the inertia score. We then plot the inertia score from each iteration against the k value. The generated graph generally has an elbow shape, hence the name elbow plot.
The elbow point represents the k-value beyond which the reduction in inertia achieved by increasing k is negligible and hence it is optimum to stop at that point.
We will run through our features dataset and get inertia scores for k in [2,20].
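A sketch of that loop, using the features_scaled dataframe from the previous step:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit k-means for each candidate k and record the inertia score
inertias = {}
for k in range(2, 21):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(features_scaled)
    inertias[k] = km.inertia_

plt.plot(list(inertias.keys()), list(inertias.values()), marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow plot")
plt.show()
```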
Elbow plot:
We do not have a very distinct elbow point here, and distinct elbows rarely come out in actual data. The optimum value of k looks to be around 4–6 from the above plot, as inertia continues to drop steeply at least until k=4.
We can use the silhouette score, another cluster quality measure, to choose the best k among 4–6. We can also take business inputs here to determine what would be a practical value of k.
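A quick sketch of that comparison using scikit-learn's silhouette_score (higher is better):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare candidate values of k using the silhouette score
for k in [4, 5, 6]:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(features_scaled)
    print(k, silhouette_score(features_scaled, labels))
```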
Once we have determined the value of k, we generate the final clustering object and save it for the scoring process. We will also get the cluster labels for all the records and save them.
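A sketch of that final fit; k=6 here is an assumption based on the six clusters profiled later, and the file names are placeholders:

```python
import pickle
from sklearn.cluster import KMeans

K = 6  # assumed choice from the elbow plot / silhouette scores

kmeans = KMeans(n_clusters=K, random_state=42)
features["cluster"] = kmeans.fit_predict(features_scaled)

# Persist the fitted model for the scoring process
with open("kmeans_model.pkl", "wb") as f:
    pickle.dump(kmeans, f)

# Save the labelled feature set for profiling
features.to_csv("clustered_customers.csv")
```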
Profiling
Profiling is the most important part of the clustering exercise as it helps us in understanding what the clusters actually are in terms of customer behavior and define them in a business-usable fashion.
Cluster level data summary
To help with cluster profiling, we will be using pandas describe() method. You can use other ways as well to help summarize all features across all clusters and determine the profiles of each cluster. Visualization tools such as Tableau can also be used for this purpose by loading in the labelled clustering output.
The describe() method can be used directly to get an overall-level summary of the dataset for all features. It can then be combined with groupby() to get a summary of all features for each cluster.
Combining these two together, we get a picture of how each feature varies across the clusters and how it behaves at the overall level. This will help us in drawing conclusions about how the clusters differ from each other and from the average customer overall.
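A sketch of these summaries, assuming the labelled features dataframe from the clustering step:

```python
# Overall distribution of all features
overall_summary = features.describe()

# Distribution of all features within each cluster
cluster_summary = features.groupby("cluster").describe()

# A compact view: cluster means side by side with the overall mean
profile = features.groupby("cluster").mean().T
profile["overall"] = features.drop(columns="cluster").mean()

# Export for further analysis, e.g. colour scales in Excel
profile.to_excel("cluster_profiles.xlsx")
```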
Please note that mean/median are good measures of central tendency and help in easy comparison of multiple groups, but they are not the complete picture. The describe() method gives all major distribution points for all the features, and we can tweak the above code to compare not just mean/median but the whole distribution of features across clusters. These can also be better analyzed by plotting histograms of each feature split by cluster.
Now that we have the dataframe containing all the features across all clusters, we can export this to Excel and analyze it. Below is a snapshot of the Excel analysis performed using color scales to gauge the variability of the features.
Clusters 1/3/5 are all high-sales and clusters 0/2/4 are all low-sales. There isn't much variation in category share across clusters, although in practice a good number of customers would be expected to buy from only one or two categories.
The data used above does not have any strong differentiating markers except the email subscription and omni-shopper flags and the high-low sales split between clusters 1/3/5 and 0/2/4.
In real data you could have clusters such as high value shoppers, low frequency shoppers, or clusters showing affinity towards a single product category: category B shoppers, etc.
Note: You could also go back and change the number of clusters in case the profiles aren’t satisfactory.
Scoring
Now that we have done this exercise and have profiled and understood what our customers look like, we might want to understand what our new customers look like, or how existing customers change over time as they continue to shop with us.
For this, we will set up a scoring process that will take the latest transactions data and assign cluster ids to customers based on the above analysis.
I have split the scoring part into two steps: feature generation and cluster assignment. Two Python scripts have been generated and uploaded to git here that can be used as a reference.
Overall, the process should (a rough sketch of these steps follows the list below):
- Take the latest transactions data
- Generate the clustering features we generated earlier in this exercise
- Use the saved model objects to scale the feature dataset and then assign cluster labels to them, and
- Save the cluster labels back to the database to be used for reporting and monitoring purposes.
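Here is a rough sketch of what such a scoring script could look like; build_features is a hypothetical helper standing in for the feature-generation logic shown earlier, and the file names are placeholders:

```python
import pickle
import pandas as pd

# Load the objects saved during the clustering exercise
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
with open("kmeans_model.pkl", "rb") as f:
    kmeans = pickle.load(f)

# build_features is a hypothetical helper wrapping the earlier feature-generation logic
new_transactions = pd.read_csv("latest_transactions.csv", parse_dates=["order_date"])
new_features = build_features(new_transactions)

# Apply the saved scaler to the non-percentage columns, then assign cluster labels
scale_cols = [c for c in new_features.columns if not c.startswith("pct_")]
new_scaled = new_features.copy()
new_scaled[scale_cols] = scaler.transform(new_features[scale_cols])
new_features["cluster"] = kmeans.predict(new_scaled)

# Persist the labels for reporting and monitoring
new_features[["cluster"]].to_csv("scored_customers.csv")
```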
The scoring code consists of Python scripts rather than Jupyter notebooks, as it is easier to work with scripts when setting up production processes like scoring transactions data to assign clusters.
Over time the customer base may shift in behavior and new sub-groups of customer activity might emerge. This could be due to a change in product offerings, a change in the way of doing business (such as a re-design of the app or website), a change in the target audience for the business, or a general shift in the market landscape. At that point, we may have to re-do the clustering exercise to understand the new customer sub-groups and their profiles.
Thank you for reading this article. Please reach out to me via comments in case you have any questions or any inputs.
You can find more python related reference material on my git repo here.