Customer Segmentation using RFM Analysis and K-Means Clustering in Python

Omotolani Kehinde
8 min read · Jan 3, 2023


Table of Contents

  • Introduction to Customer Segmentation
  • Data Preparation and Cleaning
  • Exploratory Data Analysis
  • Create Recency Frequency Monetary (RFM) Analysis
  • Standardization and K-Means for Segmentation
  • Conclusion

Customer segmentation is the process of dividing your users or customers into segments that share characteristics such as purchase frequency, amount spent, demographics, and behavioral patterns.

This allows organizations to develop more targeted sales, retention, and marketing strategies for customer segments.

Customer segmentation can be broken down into two types:

  1. Segmenting customers based on their personas: this groups customers by relevant user data such as demographic and geographic information.
  2. Segmenting customers based on their behavior: this groups customers by their spending behavior, how often they buy, and what products they buy.

This analysis focuses on segmenting our customers using the behavioral patterns in their purchase history. We also analyzed the trends in each demographic to understand the market performance of each product in the operating countries, but we chose behavioral segmentation because understanding your customers' patterns makes it easier to create a personalized experience for each segment. For this project we created an RFM (Recency, Frequency, Monetary) pie chart and a cluster pie chart (using K-means clustering).

We got the data from the UCI Machine Learning website. We are working with the Online Retail dataset. The dataset is an e-commerce dataset that contains transactions from December 1st, 2010 until December 9th, 2011 for a UK-based online retail store. You can access the dataset from here.

Imports

We imported a few libraries to help with data processing, visualizing our insights, and a few other tasks along the way.

Here is some basic information about our dataset (I extracted the month and year columns myself).

There are 541909 observations and 8 variables.
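As a rough sketch of this step, the snippet below loads the file with pandas and summarizes its shape and missing values. The file name `Online Retail.xlsx` and the helper names `load_online_retail` and `summarize` are assumptions for illustration; they are not taken from the notebook.

```python
import pandas as pd

# Assumed local file name for the UCI download; adjust to your own path.
DATA_PATH = "Online Retail.xlsx"


def load_online_retail(path: str = DATA_PATH) -> pd.DataFrame:
    """Read the raw workbook (reading .xlsx files needs the openpyxl package)."""
    return pd.read_excel(path)


def summarize(df: pd.DataFrame) -> dict:
    """Return the row count, column count, and missing values per column."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing": df.isna().sum().to_dict(),
    }
```

Calling `summarize(load_online_retail())` on the downloaded file should report the observation and variable counts quoted above.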

Data Preparation and Cleaning

  • Converting InvoiceDate to a date format and extracting the year and month
  • Checking for and removing negative values in our dataset: some orders have a negative quantity or unit price, most likely returns. While returns can be analyzed to gather insights and trends, this study won't explore that.
  • Dropping Duplicates

We must drop repeated information to avoid contaminating our dataset.

  • Missing Values

We had a total of 135080 missing Customer IDs, but I decided not to drop these rows because the information is not completely missing. Although the IDs are missing, we still have all the major sales data for each transaction. These sales still translated to profit for the online retail store, and they can help our analysis.
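The cleaning steps above can be sketched as a single function. The column names (`InvoiceDate`, `Quantity`, `UnitPrice`, `CustomerID`) follow the public Online Retail schema and are an assumption here, as is the helper name `clean_transactions`; the notebook may organize these steps differently.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Parse dates, drop negative rows and exact duplicates, keep missing IDs."""
    out = df.copy()

    # Convert InvoiceDate to datetime and extract year and month.
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["Year"] = out["InvoiceDate"].dt.year
    out["Month"] = out["InvoiceDate"].dt.month

    # Remove rows with a negative quantity or unit price (likely returns).
    out = out[(out["Quantity"] >= 0) & (out["UnitPrice"] >= 0)]

    # Drop exact duplicate rows. Rows with a missing CustomerID are kept
    # on purpose, because they still carry valid sales data.
    out = out.drop_duplicates()
    return out.reset_index(drop=True)
```

Counting the missing IDs afterwards is a one-liner: `cleaned["CustomerID"].isna().sum()`.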

Exploratory Data Analysis

  • Stock Code Distribution across the available countries

Stock Code is the product (item) code uniquely assigned to each distinct product. This analysis gave us a general idea of the number of unique products ordered per country.
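One way to get this count, assuming the columns are named `Country` and `StockCode` as in the public dataset (the helper name is illustrative):

```python
import pandas as pd


def unique_products_per_country(df: pd.DataFrame) -> pd.Series:
    """Count distinct stock codes ordered in each country, largest first."""
    return (
        df.groupby("Country")["StockCode"]
        .nunique()
        .sort_values(ascending=False)
    )
```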

  • Top 10 Countries with the highest sales.

Our data shows that the majority of sales come from the United Kingdom, with a total of 9,001,744.094 over the 12 months.
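A minimal sketch of this ranking, assuming sales per row are computed as `Quantity * UnitPrice` (the source does not show the exact formula, and the helper name is hypothetical):

```python
import pandas as pd


def top_countries_by_sales(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Sum sales per country and return the n largest totals."""
    # Assumption: the sales amount of a row is quantity times unit price.
    sales = df["Quantity"] * df["UnitPrice"]
    return sales.groupby(df["Country"]).sum().nlargest(n)
```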

  • Which item was purchased most often?

From the results, we observe that customers ordered WHITE HANGING HEART T-LIGHT HOLDER 2028 times. This is the number of orders, not the quantity or total amount purchased.

  • Top twenty most sold items (in terms of quantity)

While WHITE HANGING HEART T-LIGHT HOLDER was purchased most often, the online retail store sold more units of Paper Craft Little Birdie than of any other product.
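The difference between "ordered most often" and "sold the most units" comes down to counting rows versus summing quantities. A small sketch, assuming `Description` and `Quantity` columns and a hypothetical helper name:

```python
import pandas as pd


def most_ordered_vs_most_sold(df: pd.DataFrame):
    """Return (order counts per item, total units sold per item)."""
    # How many order lines mention each item.
    order_counts = df["Description"].value_counts()
    # How many units of each item were sold in total.
    units_sold = (
        df.groupby("Description")["Quantity"]
        .sum()
        .sort_values(ascending=False)
    )
    return order_counts, units_sold
```

An item that appears on many small orders tops the first ranking, while an item bought in a few very large orders can top the second.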

  • What month did we have our best sales?

From our results, we see that the company made the highest sales in November 2011.
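A sketch of how the best month can be found, again assuming sales are `Quantity * UnitPrice` and that the date column is named `InvoiceDate`:

```python
import pandas as pd


def monthly_sales(df: pd.DataFrame) -> pd.Series:
    """Total sales per calendar month (year and month together)."""
    sales = df["Quantity"] * df["UnitPrice"]
    months = pd.to_datetime(df["InvoiceDate"]).dt.to_period("M")
    return sales.groupby(months).sum()
```

`monthly_sales(df).idxmax()` then returns the month with the highest total.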

  • When do customers tend to purchase products?

There are no orders between 7:00 pm and 6:00 am on the online retail store. We can see that 12:00 pm is the most active hour, and the highest number of orders falls between 11 am and 2 pm each day.
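One way to reproduce this hourly view is to count distinct invoices per hour of the day. Treating each `InvoiceNo` as one order is an assumption of this sketch:

```python
import pandas as pd


def orders_per_hour(df: pd.DataFrame) -> pd.Series:
    """Count distinct invoices for each hour of the day (0 to 23)."""
    hours = pd.to_datetime(df["InvoiceDate"]).dt.hour
    return df["InvoiceNo"].groupby(hours).nunique()
```

Hours with no orders simply do not appear in the result.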

  • Best Selling Product for each Country

This shows the best-selling product for each country and its percentage contribution to the total amount of sales for the country.
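A possible implementation of this table, assuming sales are `Quantity * UnitPrice` and products are identified by `Description` (the helper and output column names are illustrative):

```python
import pandas as pd


def best_seller_per_country(df: pd.DataFrame) -> pd.DataFrame:
    """Top product by sales in each country and its share of country sales."""
    sales = df.assign(Sales=df["Quantity"] * df["UnitPrice"])
    per_product = sales.groupby(
        ["Country", "Description"], as_index=False
    )["Sales"].sum()

    # Share of each product in its country's total sales, in percent.
    country_total = per_product.groupby("Country")["Sales"].transform("sum")
    per_product["SharePct"] = 100 * per_product["Sales"] / country_total

    # Keep the row with the highest sales in each country.
    top_rows = per_product.groupby("Country")["Sales"].idxmax()
    return per_product.loc[top_rows].set_index("Country")
```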

Recency Frequency Monetary (RFM) Analysis

RFM (Recency, Frequency, Monetary) analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups.

RECENCY (R): Days since last purchase

FREQUENCY (F): Total number of purchases

MONETARY VALUE (M): Total money this customer spent.

These RFM metrics are important indicators of a customer’s behavior because the frequency and monetary value affect a customer’s lifetime value, and recency affects retention, a measure of engagement.

The RFM value was assigned to each customer as follows:

  • Recency = Number of days between the customer's latest invoice date and the latest invoice date among all customers.
  • Frequency = Number of transactions made by the customer.
  • Monetary = Total spend of the customer.
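The three definitions above translate into one grouped aggregation. This is a sketch under a few assumptions: column names follow the public dataset, a transaction is a distinct `InvoiceNo`, and spend is `Quantity * UnitPrice`. Rows without a `CustomerID` drop out of the grouping automatically, since pandas ignores missing group keys by default.

```python
import pandas as pd


def compute_rfm(df: pd.DataFrame) -> pd.DataFrame:
    """Build a Recency / Frequency / Monetary table indexed by CustomerID."""
    data = df.copy()
    data["InvoiceDate"] = pd.to_datetime(data["InvoiceDate"])
    data["Sales"] = data["Quantity"] * data["UnitPrice"]

    # Reference point: the latest invoice date in the whole dataset.
    snapshot = data["InvoiceDate"].max()

    return data.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("Sales", "sum"),
    )
```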

Each metric was then scored using qcut (the pandas quantile-based discretization function) with q equal to 5. This helps create an unbiased segmentation by letting pandas decide how to divide the data based on its distribution.

Higher frequency and monetary values receive higher scores, while lower recency (a more recent purchase) also receives a higher score. An RFM score is then determined for each customer by combining the scores of the RF (Recency and Frequency) metrics, after which we grouped the RFM scores into segments for better storytelling.
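A minimal sketch of the scoring step with `pd.qcut` and `q=5`. Two details here are assumptions rather than the notebook's confirmed approach: values are ranked with `rank(method="first")` before binning so that ties cannot produce duplicate bin edges, and the combined score is formed by concatenating the R and F digits. The mapping from scores to named segments is not reproduced.

```python
import pandas as pd


def score_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Add 1-5 quintile scores for R, F, M and a combined RF score."""
    scored = rfm.copy()

    def quintile(series: pd.Series, labels: list) -> pd.Series:
        # Ranking first breaks ties, so qcut always gets 5 distinct bin edges.
        ranked = series.rank(method="first")
        return pd.qcut(ranked, q=5, labels=labels).astype(int)

    # Lower recency (more recent purchase) gets the higher score.
    scored["R"] = quintile(scored["Recency"], [5, 4, 3, 2, 1])
    # Higher frequency and monetary value get the higher score.
    scored["F"] = quintile(scored["Frequency"], [1, 2, 3, 4, 5])
    scored["M"] = quintile(scored["Monetary"], [1, 2, 3, 4, 5])

    # Assumed combination: R digit followed by F digit, e.g. "54".
    scored["RF_Score"] = scored["R"].astype(str) + scored["F"].astype(str)
    return scored
```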

15% of our customers are considered Retained Customers. We can put a lot of effort into improving the experience of these customers since they account for a large portion of our revenue.

38% of our customers are considered Lost Customers because of an extremely low RFM score. We can also put in the effort to bring a percentage of this group back on board.

In summary, this result can help the product or marketing team design strategies to fit each customer segment.

K-Means Clustering

In addition to RFM analysis, K-means clustering can also be used to segment customers. Standardization is useful when your data has varying scales and the algorithm you are using assumes that the data follows a Gaussian distribution.

K-means clustering is strongly influenced by the scale of the data because it groups points by distance, and it assumes that each cluster follows a unimodal distribution such as a Gaussian. Our RFM metrics have varying scales, so we standardized the data before clustering.
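Standardization can be sketched with scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance. Whether the notebook scales the raw metrics or the 1-5 scores is not shown, so applying it to `Recency`, `Frequency`, and `Monetary` columns is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

RFM_COLUMNS = ["Recency", "Frequency", "Monetary"]


def standardize_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Rescale each RFM column to mean 0 and standard deviation 1."""
    scaled = StandardScaler().fit_transform(rfm[RFM_COLUMNS])
    return pd.DataFrame(scaled, index=rfm.index, columns=RFM_COLUMNS)
```

Without this step, the Monetary column (often in the thousands) would dominate the distance calculation over Frequency (often in single digits).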

With K-means clustering, we can sort every customer into 3 different clusters based on the similarity of their RFM results and calculate the mean of each metric per cluster.
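A sketch of the clustering step with scikit-learn's `KMeans` and three clusters. The `random_state` and `n_init` values, the column names, and the helper name are assumptions made so the example is reproducible; cluster numbering is arbitrary and may differ from the labels discussed in this article.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

RFM_COLUMNS = ["Recency", "Frequency", "Monetary"]


def cluster_customers(rfm: pd.DataFrame, n_clusters: int = 3, random_state: int = 42):
    """Assign each customer to a cluster and return per-cluster RFM means."""
    # Standardize first so no single metric dominates the distances.
    scaled = StandardScaler().fit_transform(rfm[RFM_COLUMNS])

    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labeled = rfm.copy()
    labeled["Cluster"] = model.fit_predict(scaled)

    # Mean of the original (unscaled) metrics in each cluster.
    summary = labeled.groupby("Cluster")[RFM_COLUMNS].mean()
    return labeled, summary
```

Reading the `summary` table row by row is what allows each cluster to be described in terms of recency, frequency, and spend.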

The final output should look like this:

Besides the output, we can analyze the segments using a 3D scatter plot. This gives us a good visualization of how the clusters differ from each other and how they were segmented using the RFM metrics.

We can also create a pie chart to calculate and assign a percentage to each cluster. This helps the relevant teams understand the store's sales data.
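The percentages behind such a pie chart can be computed directly with pandas. This sketch assumes the cluster labels are stored in a column named `Cluster`, and the helper name is hypothetical:

```python
import pandas as pd


def cluster_shares(labeled: pd.DataFrame, column: str = "Cluster") -> pd.Series:
    """Percentage of customers in each cluster, rounded to one decimal."""
    return (
        labeled[column]
        .value_counts(normalize=True)
        .mul(100)
        .round(1)
        .sort_index()
    )
```

The resulting Series can be passed to `Series.plot.pie()` (which requires matplotlib) to draw the chart.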

In summary, Cluster 0 (74% of our customers) is most likely a cluster of new customers. They have very low recency and a good level of frequency, which means they have been active recently, so it would be a good idea to design retention strategies to keep them active.

Cluster 1 (26% of our customers) is most likely a cluster of churned customers. Our analysis shows that they buy at the lowest frequency, spend the least money, and have not purchased anything in a long time.

Cluster 2 (1% of our customers) is most likely our exceptional customers. They are the most active set of customers: they buy at the highest frequency and spend the most money.

Conclusion

We explored the dataset to get a detailed understanding of our customer personas before trying to segment these customers. This is important because it helps you make data-driven decisions throughout your implementation process. We segmented our customers into 9 groups in our RFM analysis and 3 clusters in our clustering analysis. This type of analysis helps marketing and product teams design more efficient and personalized strategies while reducing costs.

You can find the Jupyter notebook on my GitHub.

Thank you for reading!
