Exploring Customer Segmentation With RFM Analysis and K-Means Clustering

Divya Chandana
Published in The Deep Hub
May 6, 2021

Introduction

Customer Segmentation

Marketers spend far more on attracting new customers than on retaining current ones. To maintain and grow a business, one has to realize that holding on to existing customers is as crucial as finding new ones. If customers leave faster than new ones arrive, the customer base as a whole shrinks, and fewer customers means fewer transactions. In that sense, retaining current customers takes priority over looking for new ones.

In business, not every deal is profitable and not every customer is interested in spending. It is vital to make sure the resources allocated to a customer are in line with the value that customer brings. The marketer’s goal is to maximize the impact of customized plans aimed at targeted customers.

In this post, I analyze how to segment customers using Recency, Frequency, and Monetary (RFM) values and group them accordingly. This analysis can help customer-focused marketers group their customers and tailor deals to each group.

RFM: Recency, Frequency, Monetary Analysis

RFM is an effective customer segmentation technique that helps marketers make strategic business decisions. It lets marketers quickly identify customers, segment them into similar clusters, and target each cluster with differentiated, personalized marketing strategies. This in turn improves customer engagement and retention.

Using RFM segmentation, marketers can target particular clusters of customers according to their behavior, which produces much higher customer response rates and increases loyalty and customer lifetime value. Marketers generally hold a lot of information on their existing customers, such as purchase history, browsing history, responses to earlier campaigns, and demographics, which can be used to identify a particular cluster of customers and offer them suitable discounts and deals. [1]

Libraries

NumPy provides a high-performance multidimensional array object and tools for working with these arrays.

pandas is an open-source library built on top of NumPy that provides high-performance, easy-to-use data structures and data analysis tools.

StandardScaler from sklearn is used to standardize the data.

seaborn is used for plotting graphs.

sklearn contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.

Data Collection

I used Kaggle’s Online Retail data, which has columns such as invoice number, invoice date, customer ID, unit price, and quantity that are very useful for this analysis.

Data Cleaning

The dataset has a lot of null values, which would affect the analysis, so I removed them using the dropna method.
The dataset does not give the total price of a purchase; it is split into quantity and unit price. I added a new column named Amount, which holds Quantity times UnitPrice, to get the total price.
I then looked at how the amount ranges by grouping amounts by CustomerID. The variation is very high, so I decided to normalize the data to get proper results and a better visualization of the results in a graph.
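The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny hand-made frame, not the original notebook: the sample rows and the variable names retail and amount_per_customer are assumptions, and only the column names follow the ones mentioned in the text.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics the columns named in the post (values are illustrative).
retail = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "CustomerID": [17850.0, np.nan, 13047.0, 13047.0],
    "Quantity": [6, 2, 3, 4],
    "UnitPrice": [2.55, 3.39, 4.25, 1.50],
})

# Drop rows that contain null values (here, the row with a missing CustomerID).
retail = retail.dropna()

# Total price per line item = Quantity * UnitPrice.
retail["Amount"] = retail["Quantity"] * retail["UnitPrice"]

# Total amount per customer, to inspect how widely spending varies.
amount_per_customer = retail.groupby("CustomerID")["Amount"].sum()
print(amount_per_customer)
```

Note that dropna() with no arguments drops a row if any column is null; passing subset=["CustomerID"] would restrict the check to one column if that is all the analysis needs.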

Exploratory Data Analysis

The info() function from pandas provides an overview of the data, such as the number of records, the number of columns, and the data type of each column. It shows what kind of data I’m dealing with.

The describe() function generates descriptive statistics, including those that summarize the central tendency, dispersion, and shape of a dataset’s distribution.
Looking at its output, I found a flaw in the dataset: Quantity has negative values, which does not make sense, so I dropped all of those rows.
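As a rough sketch of this step, the snippet below runs info() and describe() on a few made-up rows and then filters out the negative quantities. The data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows; one of them has a negative Quantity.
retail = pd.DataFrame({
    "CustomerID": [12346, 12346, 12347, 12348],
    "Quantity": [10, -10, 5, 3],
    "UnitPrice": [1.0, 1.0, 2.5, 4.0],
})

retail.info()                # row count, column names, dtypes, non-null counts
summary = retail.describe()  # count, mean, std, min, quartiles, max
print(summary)

# The "min" row of the summary is where the negative Quantity shows up.
min_quantity = summary.loc["min", "Quantity"]
print("Minimum quantity:", min_quantity)

# Keep only the rows whose Quantity is not negative.
retail = retail[retail["Quantity"] >= 0]
```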

RFM + K-Means extended

RFM metrics play a vital role in understanding customer behavior: frequency and monetary value affect a customer’s lifetime value, and recency affects retention, a measure of engagement. Here I do an in-depth analysis of RFM + K-Means, as it answers vital questions such as who the best customers are and who contributes to the churn rate.
Frequency is calculated by counting the invoice numbers of each customer; the higher the count, the more often the customer buys from the store. [2]

Recency is calculated by subtracting each customer’s last transaction date from the most recent date in the dataset.

The fewer the days, the more recently the customer purchased from the store.

Monetary is calculated by summing all the amounts for each customer. Finally, all the columns are merged into one data frame.
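The three calculations can be sketched with pandas groupby as shown below. This is a minimal example on made-up transactions, assuming the columns CustomerID, InvoiceNo, InvoiceDate, and Amount; the variable names are hypothetical.

```python
import pandas as pd

# Made-up transactions, one row per invoice line.
retail = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-01-20", "2011-02-25", "2011-03-10",
    ]),
    "Amount": [20.0, 30.0, 15.0, 50.0, 60.0, 70.0],
})

# Frequency: count of invoice rows per customer
# (nunique() instead of count() would count distinct invoices).
frequency = retail.groupby("CustomerID")["InvoiceNo"].count().rename("Frequency")

# Monetary: total amount spent per customer.
monetary = retail.groupby("CustomerID")["Amount"].sum().rename("Monetary")

# Recency: days between the latest date in the data and each customer's last purchase.
latest_date = retail["InvoiceDate"].max()
last_purchase = retail.groupby("CustomerID")["InvoiceDate"].max()
recency = (latest_date - last_purchase).dt.days.rename("Recency")

# Merge the three metrics into one data frame, one row per customer.
rfm = pd.concat([recency, frequency, monetary], axis=1).reset_index()
print(rfm)
```

In this toy data, customer 3 bought on the latest date and so has a recency of 0 days, while customer 2 last bought 28 days earlier.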

Outlier detection

I calculated the Monetary, Frequency, and Recency values grouped by customer ID and plotted them in a box plot. The data suffers from outliers, which might cause inaccurate predictions, so I used StandardScaler to normalize the data.

Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation to shift the distribution to have a mean of zero and a standard deviation of one.

The standard score of a sample x is calculated as:

z = (x - u) / s
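To make the formula concrete, here is a small sketch that standardizes a made-up RFM table with StandardScaler and compares the result with the formula applied by hand. The values and variable names are assumptions; StandardScaler uses the population standard deviation, which is why ddof=0 appears in the manual version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up RFM values on very different scales.
rfm = pd.DataFrame({
    "Recency": [9, 28, 0, 120],
    "Frequency": [2, 1, 3, 1],
    "Monetary": [50.0, 15.0, 180.0, 5.0],
})

# Fit the scaler and transform each column independently.
scaler = StandardScaler()
rfm_scaled = pd.DataFrame(scaler.fit_transform(rfm), columns=rfm.columns)

# Same result by hand: z = (x - u) / s, with the population standard deviation.
manual = (rfm - rfm.mean()) / rfm.std(ddof=0)

print(rfm_scaled.round(3))
print("Matches manual formula:", np.allclose(rfm_scaled.values, manual.values))
```

After scaling, every column has a mean of zero and a standard deviation of one, so no single metric dominates the distance calculations used later by K-Means.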

A peek at the data after applying the normalizing technique. [3]

Now, after cleaning the data, we can clearly see how it is distributed across the Monetary, Frequency, and Recency columns.

Normalized data

K-Means Clustering

K-Means clustering illustration

The K-Means algorithm is an iterative algorithm that tries to partition the dataset into K distinct clusters. It tries to make the data points within a cluster as similar as possible while keeping the clusters as far apart as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid is at a minimum. The less variation there is within a cluster, the more homogeneous its data points are.
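The quantity being minimized can be checked on a toy example. The sketch below fits sklearn's KMeans to six made-up 2D points and recomputes the sum of squared distances to the assigned centroids by hand. The points and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of made-up 2D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Sum of squared distances from each point to the centroid of its assigned cluster.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
within_cluster_ss = ((X - assigned_centroids) ** 2).sum()

print(kmeans.labels_)
print(within_cluster_ss, kmeans.inertia_)  # the two values should agree
```

sklearn exposes this sum as the inertia_ attribute of a fitted model.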

I used two methods to decide the value of K for K-Means clustering: the elbow method and the silhouette score. [5]

Elbow Curve

A fundamental step for any unsupervised clustering algorithm is to determine the optimal number of clusters into which the data may be grouped. The elbow method is one of the most popular ways to determine this optimal value of k.
Here, I plotted the number of clusters on the x-axis and the respective score on the y-axis. Judging by the bend in the graph, dividing the data frame into 3 clusters gives proper results.
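A minimal sketch of the elbow method follows. It assumes the score is the K-Means inertia (within-cluster sum of squared distances), which the post does not state explicitly, and it uses synthetic blobs from make_blobs as a stand-in for the scaled RFM table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled RFM table: 3 blobs in 3 dimensions.
centers = np.array([[-5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 8.0, 0.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)

# Inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
# Plotting k (x-axis) against inertia (y-axis) shows the bend, or "elbow".
```

On this synthetic data the inertia drops sharply up to k = 3 and only slightly afterwards, which is the bend the method looks for.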

Elbow Method

Silhouette Analysis

silhouette score = (p - q) / max(p, q)

p is the mean distance to the points in the nearest cluster that the data point is not a part of. [6]

q is the mean intra-cluster distance to all the points in its own cluster.

  • The silhouette score ranges from -1 to 1.
  • A score closer to 1 indicates that the data point is very similar to the other data points in its cluster.
  • A score closer to -1 indicates that the data point is not similar to the data points in its cluster.
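The silhouette analysis can be sketched with sklearn's silhouette_score, which averages the per-point score defined above over all data points. The synthetic blobs below are an assumption standing in for the scaled RFM data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM table: 3 blobs in 3 dimensions.
centers = np.array([[-5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 8.0, 0.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)

# Mean silhouette score for each candidate k (the score needs at least 2 clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))

# The k with the highest mean silhouette score is the preferred choice.
best_k = max(scores, key=scores.get)
print("Best k:", best_k)
```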

Finalized model with n_clusters = 3 based on the above analysis

Fitting the model again with the finalized 3 clusters.
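A minimal sketch of the final fit, assuming a made-up RFM table and a hypothetical Cluster column name for the labels:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up RFM table; the real one has one row per customer.
rfm = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Recency": [2, 5, 40, 45, 300, 320],
    "Frequency": [50, 45, 10, 12, 1, 2],
    "Monetary": [5000.0, 4500.0, 800.0, 900.0, 40.0, 60.0],
})

# Standardize the three metrics before clustering.
features = ["Recency", "Frequency", "Monetary"]
rfm_scaled = StandardScaler().fit_transform(rfm[features])

# Refit with the chosen number of clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Append the cluster label of each customer as the last column.
rfm["Cluster"] = kmeans.fit_predict(rfm_scaled)
print(rfm)
```

The label numbers themselves are arbitrary: which group ends up as 0, 1, or 2 depends on the initialization, so the clusters have to be interpreted from their RFM values.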

Overview of the current data frame

A view of the data frame after appending the resulting cluster labels as the last column.

The customers are now divided into 3 groups. Customers in the last cluster are the ones who spent the most.

The last cluster also contains the most frequent customers.

The lower the recency value, the more recently the customer purchased products. Here too, the last group has the lowest values.
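One way to back these observations with numbers is to average each metric per cluster. The sketch below does this on made-up rows with illustrative labels; it is an assumption about how such a summary could be produced, not the plots used in the post.

```python
import pandas as pd

# Made-up RFM rows with cluster labels already attached (labels are illustrative).
rfm = pd.DataFrame({
    "Recency": [300, 320, 40, 45, 2, 5],
    "Frequency": [1, 2, 10, 12, 50, 45],
    "Monetary": [40.0, 60.0, 800.0, 900.0, 5000.0, 4500.0],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Average of each metric per cluster.
profile = rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
print(profile)
```

In this toy data, cluster 2 has the highest average Monetary and Frequency and the lowest average Recency, the same pattern described for the last cluster above.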

Visualizing each cluster

Plotting Recency and Monetary: from the graph we can say that the yellow group contains the customers who spend the most and who purchased most recently. [4]

Plotting Frequency and Monetary: here too, the yellow group purchases more and more often, whereas the blue group purchases very rarely and spends very little.

Plotting Frequency and Recency: again, the yellow group purchases frequently and most recently. The green group purchased recently but not frequently, from which we can infer that they are new customers.

So far we have visualized the plots in 2D; it is better to visualize all three metrics in one plot before drawing conclusions.

This is a 3D representation of all the segmented customers.

From this plot, we can see that some customers who have not spent a lot of money, but who visit the site frequently and have purchased recently, also fall into the high-value category.

Bugs encountered

While creating the 3D plot, I tried to pass a color value to the scatter call on the 3D axes (ax.scatter), but it threw an error saying that ‘c’ and ‘color’ cannot be used together and only one should be supplied. I later realized that c and color serve the same purpose, so only one should be given; when c holds the cluster values, the colors are assigned automatically.
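The behavior can be reproduced with a short matplotlib sketch. The random data, the viridis colormap, and the variable names below are assumptions; the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Random stand-ins for the three metrics and the K-Means cluster labels.
rng = np.random.default_rng(0)
recency, frequency, monetary = rng.random((3, 30))
labels = rng.integers(0, 3, size=30)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# Pass the labels through `c` only; the colormap turns them into colors.
points = ax.scatter(recency, frequency, monetary, c=labels, cmap="viridis")
ax.set_xlabel("Recency")
ax.set_ylabel("Frequency")
ax.set_zlabel("Monetary")

# Supplying both `c` and `color` raises a ValueError.
try:
    ax.scatter(recency, frequency, monetary, c=labels, color="red")
    both_allowed = True
except ValueError:
    both_allowed = False
print("c and color accepted together:", both_allowed)
```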

Limitations

This customer segmentation can be analyzed much more extensively, as in the chart below. Because of limited time and data, I could only do the basic segmentation. The same project can be extended and applied to real-time data, which would be very helpful for marketers and commerce companies.

Conclusion

Smart marketers understand the importance of “know thy customer.” Rather than simply focusing on generating more clicks, marketers should change their outlook. Instead of analyzing the customer base as a whole, it is smarter to segment customers into clusters, understand the characteristics of each group, and engage them with relevant deals. One of the most popular, easy-to-use, and effective segmentation methods for analyzing customer behavior is RFM with K-Means.

References

[1]

[2] https://clevertap.com/blog/rfm-analysis/#:~:text=RFM%20is%20a%20data%2Ddriven,improves%20user%20engagement%20and%20retention.

[3]

[4]

[5]

[6]
