Exploring Customer Segmentation with RFM Analysis and K-Means Clustering in Python

Which of your customers behave alike? How can you answer the questions your business keeps asking? Customer segmentation is one solution, but how do we do it, and how can you use it for decision making? Here RFM analysis is applied with Python, showing how simple it is and how it needs only the most basic information available in purchasing records.

Hshan.T
The Startup
8 min read · Nov 1, 2020

--

Introduction

How much do you spend to attract new customers compared with what you spend on retaining existing ones? To sustain and expand a business, one should realize that retaining existing customers is as important as finding new ones. If customers leave at a greater rate than new customers arrive, the customer base is actually shrinking. To a certain extent, the effort to retain customers outweighs the search for potential new ones.

Not every deal is profitable, and not every customer is financially attractive to the business. It is crucial to ensure that the resources allocated are in line with the profit or value a customer carries. The marketing goal is to maximize the influence of customized plans on targeted customers.

RFM Analysis

When we are provided with raw data extracted from a database, individual records can be messy and uninformative to look at. RFM analysis presents the data at an aggregate level and is used to segment customers into homogeneous groups. It has long been adopted in business, especially as part of marketing efforts. Three main variables, as suggested by the name of the analysis, are defined and computed: R (recency), F (frequency), and M (monetary). These three values are important because F and M indicate the value of a customer, while R indicates the customer’s engagement and satisfaction. The values are easy to obtain from the basic information in each customer’s purchasing history.

RFM variables.

Typically RFM helps to answer the following questions:

  1. Who are the best / most valuable customers?
  2. Who should be targeted in upcoming marketing events? Who is likely to churn? (Whom to retain, and how?)
  3. Who has churned? What is the churn rate?
  4. Who are the possible customers for the launch of new products?

This article shares a rather general, commonly used implementation of RFM scoring. Analysts applying RFM analysis in different sectors define these variables differently.

R — A higher score implies that the customer has made a purchase recently and will most likely respond to a current promotion. A low R score reveals a possibility of churn.

F — A higher score implies that the customer makes repeat purchases at a higher frequency. (High demand / loyalty)

M — A higher score implies that the customer purchases in larger amounts. (High-value customer)

There are advantages and drawbacks associated with this method of analysis.

Advantages:

  • Can be done with a minimal set of variables, and is hence cost-effective in data storage and collection.
  • Fast and easy to implement and understand.
  • Useful for short term marketing plans.

Drawbacks:

  • Some of these variables might not be very helpful for decision making. For example, we do not expect a door-making business to record high R and F; most customers will record similarly low R and F but high M.
  • It assumes the variables are independent. The analysis may be distorted by collinearity between variables such as F and M, although one suggested remedy is to average the monetary value over frequency.
  • Results are drawn from historical data, so the impact is limited to existing customers and the method is less useful for new customers.

The most important decision in this analysis is the period of time to be investigated. The analysis might be biased and inaccurate if it spans an extremely long duration, so it is essential to make a reasonable judgement about a suitable period. In the most basic RFM analysis, each variable is scored from 1 to n by dividing the data into n equal-sized groups, so 1/n of the samples share each score. (For example, the 20% of customers with the lowest recency values, that is, the most recent purchases, receive an R score of 5, the next 20% receive a score of 4, and so on.)

Illustration of RFM scoring method.
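The equal-sized grouping described above can be sketched with pandas. This is only a minimal illustration, not the exact code behind the article: the data values, the column names and the helper `quintile_score` are all hypothetical, and ranking before binning is one possible way to handle ties.

```python
import pandas as pd

# Hypothetical per-customer RFM table (values invented for illustration).
rfm = pd.DataFrame(
    {
        "recency": [3, 10, 25, 40, 60, 90, 120, 200, 300, 400],  # days since last purchase
        "frequency": [50, 42, 35, 30, 22, 15, 9, 6, 3, 1],       # number of invoices
        "monetary": [9000, 7500, 6100, 5000, 4200, 3000, 1800, 900, 400, 100],
    },
    index=pd.Index([f"C{i:02d}" for i in range(10)], name="customer_id"),
)


def quintile_score(values: pd.Series, higher_is_better: bool, n: int = 5) -> pd.Series:
    """Split customers into n equal-sized groups and return scores 1..n."""
    # Ranking with method="first" breaks ties, so qcut always finds n distinct bins.
    ranks = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, q=n, labels=list(range(1, n + 1))).astype(int)


# Low recency (a recent purchase) is good, so the ranking is reversed for R.
rfm["R"] = quintile_score(rfm["recency"], higher_is_better=False)
rfm["F"] = quintile_score(rfm["frequency"], higher_is_better=True)
rfm["M"] = quintile_score(rfm["monetary"], higher_is_better=True)

print(rfm[["R", "F", "M"]])
```

With ten customers and n=5, each score is shared by exactly two customers, which is the 1/n property mentioned in the text.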

In some cases a composite score is computed from the assigned RFM scores, but this needs further consideration. Do the three variables carry the same weight? Under most circumstances they do not, so we might adjust the relative weight of each variable when computing the composite score. Generally, the type of product and business influences the choice of weights. For example, electrical appliances are expected to record high M, low F and low R, so an equally weighted composite score would bias the result towards M. The composite score approach is not illustrated here; instead, we look at the R, F and M scores separately.

The steps are as follows:

  1. Aggregating and computing RFM variables for each ID.
  2. Assigning RFM score.
  3. Segmenting customers according to their scores.
  4. Analyzing the characteristics/traits of the members of the targeted clusters.

Implementation on Real Data in Python

This is unsupervised learning: no target variable is provided, and customers are grouped according to their similarities. The dataset used in this section covers 2 years and is modified data extracted from a beverage distributor. List of variables:

Table 1.

Excluding the ID variables, we are left with a small set of variables to analyze. The RFM variables are generated from ‘amount’, ‘date’ and ‘invoice no’. This dataset demonstrates the main advantage of the RFM approach: it needs only a minimal set of variables.

R — Duration between analysis date and latest purchase date.

F — Count the number of invoices for each customer ID.

M — Sum of purchasing amount over the period of time for each customer ID.
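As a minimal sketch of these three definitions, the aggregation could look like the following in pandas. The transaction rows are invented, the column name `customer id` is an assumption (only ‘amount’, ‘date’ and ‘invoice no’ are named in the article), and the analysis date is assumed to be one day after the last purchase in the data.

```python
import pandas as pd

# Hypothetical transaction records; 'customer id' is an assumed column name.
tx = pd.DataFrame(
    {
        "customer id": ["A", "A", "A", "B", "B", "C"],
        "invoice no": ["I1", "I2", "I2", "I3", "I4", "I5"],
        "date": pd.to_datetime(
            ["2020-01-05", "2020-03-10", "2020-03-10",
             "2019-11-20", "2020-02-01", "2019-06-15"]
        ),
        "amount": [100.0, 40.0, 60.0, 250.0, 50.0, 30.0],
    }
)

# Assumption: analyze as of the day after the most recent purchase.
analysis_date = tx["date"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("customer id").agg(
    # R: days between the analysis date and the customer's latest purchase.
    recency=("date", lambda d: (analysis_date - d.max()).days),
    # F: number of distinct invoices (an invoice may span several rows).
    frequency=("invoice no", "nunique"),
    # M: total purchase amount over the period.
    monetary=("amount", "sum"),
)

print(rfm)
```

Counting distinct invoice numbers rather than rows keeps a multi-line invoice from inflating F; whether that matches the original dataset's layout is an assumption.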

Here, we would like to explore two types of analysis.

  1. Raw calculated RFM variables + K-Means Clustering
  2. RFM scoring + K-Means Clustering

Raw calculated RFM variables + K-Means Clustering

Step 1: Checking and preprocessing data.
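The article does not list its preprocessing steps, so the following is only a sketch of checks that are commonly useful before computing RFM. The sample rows and the `customer id` column name are hypothetical, and dropping non-positive amounts (for example returns) is an assumption that may not suit every dataset.

```python
import pandas as pd

# Hypothetical raw extract with typical problems: a missing customer ID,
# a duplicated row, a negative amount, and dates stored as strings.
raw = pd.DataFrame(
    {
        "customer id": ["A", "A", None, "B", "B"],
        "invoice no": ["I1", "I1", "I2", "I3", "I4"],
        "date": ["2020-01-05", "2020-01-05", "2020-01-07", "2020-02-01", "2020-02-03"],
        "amount": [100.0, 100.0, 20.0, 50.0, -50.0],
    }
)

# Quick checks: missing values per column and number of fully duplicated rows.
print(raw.isna().sum())
print("duplicated rows:", raw.duplicated().sum())

clean = (
    raw.dropna(subset=["customer id"])   # rows without an ID cannot be assigned
       .drop_duplicates()                # remove exact duplicate rows
       .assign(date=lambda df: pd.to_datetime(df["date"]))  # parse dates
)
# Assumption: keep only positive amounts so that M reflects purchases.
clean = clean[clean["amount"] > 0]

print(clean)
```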

Step 2: Computing and visualizing RFM variables.

Figure 1: The ranges of the RFM variables vary widely.
Table 2: Simple Descriptive Statistics for RFM.
Figure 2: Correlation Heatmap.

Step 3: Normalizing the data. The ranges of the variables differ greatly. K-Means is distance based, so rescaling the variables to a common range is required to avoid building a biased model.

Table 3: Normalized data Statistics.
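The article does not say which normalization was used. As one possible sketch, scikit-learn's `StandardScaler` rescales each column to zero mean and unit variance; the RFM values below are hypothetical, and other choices (min-max scaling, or a log transform first for skewed data) would be equally reasonable assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM values with very different ranges per column.
rfm = pd.DataFrame(
    {
        "recency": [1, 39, 270, 15, 120],
        "frequency": [24, 10, 1, 30, 4],
        "monetary": [12000.0, 3100.0, 150.0, 18500.0, 900.0],
    },
    index=pd.Index(["A", "B", "C", "D", "E"], name="customer id"),
)

# Standardize each column so no single variable dominates the distances.
scaler = StandardScaler()
rfm_scaled = pd.DataFrame(
    scaler.fit_transform(rfm), index=rfm.index, columns=rfm.columns
)

print(rfm_scaled.describe().round(2))
```

After scaling, a difference of one unit means one standard deviation in every column, so monetary values in the thousands no longer swamp recency and frequency in the Euclidean distance used by K-Means.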

Step 4: Segmenting with K-Means. Identify the optimal k.

Figure 3: Plot of inertia against k. The ‘elbow’ is at k=5; the decrease in inertia after k=6 is insignificant, so it is not worth complicating the model further.
Figure 4: Silhouette plot, further visualizing the selected optimal k=5.
Figure 5: Scatter Plot.
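A sketch of how inertia and silhouette values could be collected for a range of k with scikit-learn is shown below. The data here is synthetic (three well-separated blobs standing in for scaled RFM values), so the numbers have nothing to do with the article's figures; the range of k and the random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled RFM data: three blobs in three dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

inertia, silhouette = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia[k] = model.inertia_                         # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, model.labels_)  # mean silhouette over samples

for k in inertia:
    print(f"k={k}  inertia={inertia[k]:.1f}  silhouette={silhouette[k]:.3f}")

# One simple heuristic: the k with the highest mean silhouette.
best_k = max(silhouette, key=silhouette.get)
print("k with highest mean silhouette:", best_k)
```

Plotting `inertia` against k gives the elbow plot. The elbow is read by eye, so the mean silhouette is used here only as a complementary check, not as a replacement for the judgement made in the article.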

Step 5: Identify the clusters for further analysis.

Figure 6: Line plot for each cluster.
Table 4: Statistics summary for each cluster.

From the line plots and statistics summary, cluster 3 is the most valuable group of customers, with the highest mean F (purchases most often), the second-lowest mean R, which is not much higher than the lowest R in cluster 0 (has purchased from the company recently), and the highest mean M (high purchase amounts). Cluster 2 is the worst group, with the lowest F and M and the highest R.
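A per-cluster summary of this kind can be produced with a pandas `groupby`. The sketch below uses a handful of invented customers and cluster labels, so the numbers are illustrative only and do not reproduce Table 4.

```python
import pandas as pd

# Hypothetical raw RFM values with cluster labels already assigned by K-Means.
rfm = pd.DataFrame(
    {
        "recency": [5, 8, 200, 250, 30, 40],
        "frequency": [40, 44, 2, 4, 15, 17],
        "monetary": [8000.0, 9000.0, 300.0, 500.0, 2500.0, 3500.0],
        "cluster": [0, 0, 1, 1, 2, 2],
    }
)

# Cluster size plus the mean of each raw RFM variable.
summary = rfm.groupby("cluster").agg(
    count=("recency", "size"),
    mean_recency=("recency", "mean"),
    mean_frequency=("frequency", "mean"),
    mean_monetary=("monetary", "mean"),
)
print(summary)

# Rank clusters by spend, then frequency, to find a candidate "best" group.
best = summary.sort_values(
    ["mean_monetary", "mean_frequency"], ascending=False
).index[0]
print("highest-value cluster:", best)
```

Summarizing on the raw values rather than the scaled ones keeps the table readable in business units (days, invoices, currency).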

RFM Scoring + K-Means

We reuse the RFM dataset computed previously. Each variable is converted to a score from 1 to 5, so scaling is not required. (5–5–5 is the best customer, 1–1–1 is the least valuable.)

Step 1: Assigning RFM scores.

Figure 7: Correlation Heatmap for RFM Scores.

Step 2: Segmenting with K-Means. Identify the optimal k.

Figure 8: Plot of inertia against k. The ‘elbow’ is at k=4; the decrease in inertia after k=4 is insignificant, so it is not worth complicating the model further.
Figure 9: This silhouette result appears to be better than in the analysis using the raw RFM values. The widths of the ‘knife’-shaped bars are more consistent.
Figure 10: Scatter Plot.

Step 3: Identify the clusters for further analysis.

Figure 11: Line plots for each cluster by RFM scores.
Table 5: Statistics summary for the RFM variables of each cluster. (Not RFM scores. The RFM scores of each cluster are visualized in Figure 11.)
Table 6: Simple Interpretation of Result.

As stated earlier, this analysis illustrates the strong correlation between ‘Frequency’ and ‘Monetary’. It also yields a better K-Means result on the silhouette plot, with more consistent bar widths at k=4. This shows that the cluster sizes are more even, as can be seen in the ‘count’ column of Table 5, where the maximum difference is only 23.

In Figure 11, cluster 0 and cluster 2 have higher F and M scores than the remaining clusters, but they differ greatly in R score. Cluster 2 has a much lower R score than cluster 0, hence cluster 0 is the better option. In contrast, cluster 1 is the worst group in terms of F and M scores but has the highest R score, which means its members purchase less often and in smaller amounts, although their last purchase was recent. Depending on the analysis duration defined, it could still turn out to be a churned group. There is a chance to increase the value of cluster 1 by boosting its purchases: its members have purchased recently, and the lower F and M may suggest that they are new customers. A more in-depth analysis of that particular cluster is suggested.

Summary

RFM analysis can quickly segment customers into homogeneous groups with a minimal set of variables. The scoring system can be defined and ranged differently. We get a better clustering result by applying scores than by using the raw calculated RFM values. Therefore, segmentation should be done on RFM scores, and further analysis of spending behavior should be done on the raw values of the targeted cluster to expose more insights and characteristics. RFM analysis depends solely on purchasing behavior and history; it can be further improved by exploring weighted composite scoring or by including customer demographic and product information. A good analysis can increase the effectiveness and efficiency of marketing plans, and hence increase profitability at minimum cost.

Note: Here is a post about segmentation with a DBSCAN model using the same set of data; you may have a look if interested.
