Not another Segmentation project!

Rohit Tolawat
Published in Analytics Vidhya
5 min read, Nov 1, 2020

Imagine we are asked to develop insights and strategies for our customer base, with the intention of increasing the profitability of the company. We are given very little time to come up with a plan. Given that our organization has many customers, developing a strategy for each one of them would be redundant, exhausting, and sometimes counter-productive. Clustering is an effective way around this.

Clustering is an unsupervised machine learning technique that groups data points based on similarities. We will be focusing on perhaps the most used (or abused) technique, K-means clustering, where K refers to the number of segments you desire (yep, you have the power!). I refrain from going into technicalities (you will find enough content explaining these concepts on the web) and focus more on implementation in my articles. Without further ado, let’s get started!

I am using Tableau’s sample superstore dataset (how convenient, and lazy) for this article. The dataset provides information on revenues (sales) and profit made across the customer base. Tableau has an inbuilt clustering functionality, but tuning the hyperparameter (the number of clusters, in simple words) is messy. I therefore use R to find the optimum number of segments in my dataset.

You can access the project’s code on my GitHub.

https://github.com/rohitTheSupplyChainGuy/Customer-segmentation-using-K-means-clustering/blob/main/customerSegmentation.Rmd

A plot of sales vs. profit is as below:

Now, let’s take a step back and try to eyeball segments in the data. A liberal (it’s election season, and I have been watching a ton of late-night shows!) eye would create clusters as below:

Liberal eye

Under the above segmentation, customers who bring in a profit and customers who cause a loss are treated alike. But hey, imagine you are a premium customer adding a lot of sales to the company; wouldn’t you want to be treated as special?

Here comes the conservative eye, which could segment the data as below:

Conservative eye

The conservative eye adds nuance to the segmentation by also separating customers according to the profit they generate for the company.

Who’s right, who’s wrong? Who’s to judge!

This is where parameter tuning (selecting the right number of segments in the data) using Machine Learning kicks in. Remember that we haven’t standardized the data yet (brought sales and profit onto the same scale, so that we compare apples to apples); the impact of this will be explained later in the article. A useful tool for deciding on the number of segments is the Elbow Plot. Let us implement the plot and try to make sense of it:

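For each candidate number of segments (1 to 10 in this case), the technique runs K-means and measures the within-segment sum of squared Euclidean distances from each point to the centroid of its segment. The centroids start at random positions in the dataset and are then moved iteratively by the algorithm until the assignments settle (too heavy).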

Source: Medium
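
To make the idea concrete, here is a minimal sketch of that loop in plain Python. The project itself uses R, so treat this as an illustration and not the code behind the plots. The toy data, the function names `kmeans_wcss` and `elbow_curve`, and the number of restarts are all assumptions of mine.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, n_iter=50, seed=0):
    """Run a basic K-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so we have converged
            break
        centroids = updated
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def elbow_curve(points, max_k=10, restarts=5):
    """Best WCSS found for each k from 1 to max_k."""
    return [
        min(kmeans_wcss(points, k, seed=s) for s in range(restarts))
        for k in range(1, max_k + 1)
    ]


# Hypothetical toy data: three blobs of 30 points each.
rng = random.Random(42)
toy_points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in [(0, 0), (10, 10), (20, 0)]
    for _ in range(30)
]

for k, wcss in enumerate(elbow_curve(toy_points), start=1):
    print(f"k={k:2d}  WCSS={wcss:.1f}")
```

Plotting these values against k gives the elbow plot. Because the starting centroids are random, a single run can get stuck in a poor solution, which is why the sketch keeps the best of a few restarts for each k.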

The sum of squared distances keeps falling as we add segments, so the goal is not simply the smallest value but the point beyond which the decrease flattens out. As noticed, after the 4th segment the decrease in the sum of squared distances isn’t that encouraging, so we are sticking with 4 here. The segments created by the algorithm are as below:

We have not scaled yet, and we have an issue

If we look at the segments created above, sales seems to be the dominant feature, while profit is neglected. This is because we did not scale our features before running the algorithm. K-means relies on Euclidean distance, so the feature with more variation in the data gets more importance, which in our case is sales, as shown below:

Comparing apples to oranges, are we?
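
A quick check on made-up numbers shows how lopsided this can get. The sketch below uses hypothetical (sales, profit) pairs, not the superstore data, and the helper `sales_share_of_distance` is a name I invented for illustration. It compares the variance of the two features and then asks how much of the squared distance between two customers comes from sales alone.

```python
import statistics

# Hypothetical (sales, profit) pairs; not taken from the superstore dataset.
customers = [
    (12000.0, 300.0),
    (500.0, -150.0),
    (7000.0, 900.0),
    (2500.0, 50.0),
    (9000.0, -400.0),
]
sales = [s for s, _ in customers]
profit = [p for _, p in customers]


def sales_share_of_distance(a, b):
    """Fraction of the squared Euclidean distance that comes from sales."""
    d_sales = (a[0] - b[0]) ** 2
    d_profit = (a[1] - b[1]) ** 2
    return d_sales / (d_sales + d_profit)


print("variance of sales :", statistics.pvariance(sales))
print("variance of profit:", statistics.pvariance(profit))
print("sales share of distance between first two customers:",
      round(sales_share_of_distance(customers[0], customers[1]), 4))
```

When one feature accounts for nearly all of the distance, the algorithm effectively clusters on that feature alone.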

Let’s quickly scale the data, and rerun the box plots:
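
The scaling applied here is presumably the usual z-score standardization (an assumption on my part, based on the symbols named in the caption that follows), which can be written as:

```latex
z = \frac{x - \mu}{\sigma}
```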

Where x is the data point, μ (mu) is the feature mean, and σ (sigma) is the feature’s standard deviation
Apples to apples it is!
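
For readers who prefer code to formulas, a minimal sketch of this scaling step in plain Python could look like the following. The figures are hypothetical and `standardize` is my own helper, not a function from the project. Note that it divides by the population standard deviation, whereas R’s `scale()` uses the sample standard deviation, so the numbers would differ slightly from an R run.

```python
import statistics


def standardize(values):
    """Z-score each value: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD; R's scale() uses the sample SD
    return [(x - mu) / sigma for x in values]


# Hypothetical figures, only to show the effect of scaling.
sales = [12000.0, 500.0, 7000.0, 2500.0, 9000.0]
profit = [300.0, -150.0, 900.0, 50.0, -400.0]

for name, column in (("sales", sales), ("profit", profit)):
    scaled = standardize(column)
    print(name, [round(z, 2) for z in scaled])
```

After this step both features have mean 0 and standard deviation 1, so neither dominates the distance calculation.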

Elbow plot and segmented visualization below:

The magical K is 5
Now this makes sense

So, we have segments. Now what?

I must emphasize that Clustering is an effective exploratory data analysis technique. Once you have the segments, it becomes much easier for the analyst and the business to give customers the treatment they deserve. These segments must now serve as a base to derive insights. I chose to build a dashboard using Tableau that could help the management of the hypothetical organization we are working for.

The link to the dashboard is as below:

https://public.tableau.com/profile/rohit.tolawat#!/vizhome/Customersegmentation_16044225689930/CustomerAnalysis

Based on the dashboard, our insights and the strategy going forward could be as follows:

We make an overall profit of 12% of revenues, with segment 3 being the most profitable in percentage terms (42%). Could the practices followed for these 6 customers be replicated across the other segments?

Segment 2 is highly unprofitable, losing 30% with respect to sales. It contributes to an overall loss of 15% to the organization. We must analyze the 21 customers that form this segment: check whether shipments to these customers are inefficient, or renegotiate the contracts with better pricing.

The product subcategories bookcases, supplies, and tables are loss-making. The cause could well be incorrect pricing. Must we continue with these products, or eliminate the SKUs belonging to these subcategories?
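
The percentages above are profit-to-sales ratios per segment. As a rough sketch of that arithmetic outside Tableau, assuming each row carries a segment label, a sales figure, and a profit figure (the rows and the helper `margin_by_segment` below are hypothetical and do not reproduce the dashboard’s numbers):

```python
from collections import defaultdict

# Hypothetical rows of (segment, sales, profit); not the dashboard's real figures.
rows = [
    (1, 1000.0, 100.0),
    (1, 3000.0, 500.0),
    (2, 2000.0, -600.0),
    (3, 500.0, 200.0),
]


def margin_by_segment(records):
    """Return total profit divided by total sales for each segment."""
    sales_total = defaultdict(float)
    profit_total = defaultdict(float)
    for segment, sales, profit in records:
        sales_total[segment] += sales
        profit_total[segment] += profit
    return {seg: profit_total[seg] / sales_total[seg] for seg in sales_total}


for segment, margin in sorted(margin_by_segment(rows).items()):
    print(f"segment {segment}: {margin:.0%}")
```

Summing profit and sales first and dividing afterwards weights each customer by revenue, which is usually what a segment-level margin is meant to express.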

Summary:

Our intention was to segment the customers. We attempted to do so without scaling profit and sales, only to see that the segments created by the algorithm did not make sense.

After scaling the features, we used the Elbow plot and settled on 5 segments.

We visually analyzed the segments (possible since we only had 2 features) and confirmed that they made sense.

We developed an interactive dashboard to provide insights and recommendations thereafter.
