Marketing Analytics: RFM Modeling

Published in Analytics Vidhya · 10 min read · Jun 15, 2020



When it comes to marketing, if you’re trying to talk to everybody, you’re going to have a difficult time reaching anybody. Vague and generic messages are far less likely to resonate with audiences than specific, direct communication — which is why targeting in marketing is so important.

Smart marketers understand the importance of “know thy customer”. Instead of simply focusing on generating more clicks, marketers must follow the paradigm shift from increased CTRs (Click-Through Rates) to retention, loyalty, and building customer relationships.

Instead of analyzing the entire customer base as a whole, it’s better to segment them into homogeneous groups, understand the traits/behavior of each group, and engage them with relevant targeted campaigns.

One of the most popular, easy-to-use, and effective segmentation methods for analyzing customer behavior is RFM segmentation.

Table of Contents:

  1. Introduction
  2. Data Preprocessing
  3. Exploratory Analysis
  4. RFM Modeling
  5. Conclusion

1. Introduction

RFM stands for Recency, Frequency, and Monetary value, each corresponding to some key customer trait. These RFM metrics are important indicators of a customer’s behavior because the frequency and monetary value affect a customer’s lifetime value, and recency affects retention, a measure of engagement.

Businesses that lack the monetary aspect, like viewership, readership, or surfing-oriented products, could use Engagement parameters instead of Monetary ones. This results in using RFE (Recency, Frequency, Engagement) — a variation of RFM. Further, this Engagement parameter could be defined as a composite value based on metrics such as bounce rate, visit duration, number of pages visited, time spent per page, etc.

In this article, we are going to work with the online-retail dataset from the UCI Machine Learning repository.

Dataset Information:

This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Attribute Information:

  1. InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation.
  2. StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  3. Description: Product (item) name. Nominal.
  4. Quantity: The quantities of each product (item) per transaction. Numeric.
  5. InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
  6. UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  7. CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  8. Country: Country name. Nominal, the name of the country where each customer resides.

In this article, I will guide you through the complete process of RFM modeling and analysis to segment customers based on their transactional behavior.

2. Data Preprocessing

The primary step in any modeling or analysis task is data preprocessing.

To do the analysis, we need the data. Here is the link for the dataset.

First, import all the necessary libraries, load the data and check for data types and missing values.
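As a rough sketch of this step, the snippet here loads a few rows and checks the data types and missing values. The rows are hypothetical stand-ins that only mimic the dataset's columns; with the real file you would point pandas at your local copy instead.

```python
import io
import pandas as pd

# Hypothetical rows that mimic the dataset's columns; with the real file you
# would call pd.read_excel or pd.read_csv on your local copy instead.
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "100001,10001,SAMPLE ITEM A,6,2010-12-01 08:26:00,2.50,17850,United Kingdom\n"
    "100002,10002,SAMPLE ITEM B,8,2010-12-01 08:28:00,1.25,13047,United Kingdom\n"
    "100003,10003,,32,2010-12-01 08:34:00,1.75,,United Kingdom\n"
    "C100004,D,Discount,-1,2010-12-01 09:41:00,27.50,14527,United Kingdom\n"
)
df = pd.read_csv(
    sample_csv,
    parse_dates=["InvoiceDate"],                 # read the invoice date as a datetime
    dtype={"InvoiceNo": str, "StockCode": str},  # codes can contain letters
)

print(df.dtypes)              # data type of each column
missing = df.isnull().sum()   # number of missing values per column
print(missing)
```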

There seem to be missing values in the Description and CustomerID columns.

Since our objective is to identify customer groups, every row needs a CustomerID that identifies the customer. Hence, we can drop the rows with a missing CustomerID, since they are not helpful for our task.
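A minimal sketch of dropping those rows, using a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data only: one row has no CustomerID.
df = pd.DataFrame({
    "InvoiceNo": ["100001", "100002", "100003", "100004"],
    "Quantity": [6, -2, 32, 10],
    "UnitPrice": [2.5, 4.0, 1.75, 1.25],
    "CustomerID": [17850.0, 13047.0, np.nan, 14527.0],
})

# Keep only rows that can be attributed to a customer.
df = df.dropna(subset=["CustomerID"]).copy()
df["CustomerID"] = df["CustomerID"].astype(int)

# Descriptive statistics of the numeric columns after the drop.
print(df[["Quantity", "UnitPrice"]].describe())
```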

Here are the descriptive statistics after dropping the NA values in the CustomerID column.

From the descriptive statistics, we can see that Quantity has negative values. These most likely correspond to products that were returned or refunded.

Also, the maximum quantity bought is 80.9K, yet 75% of Quantity values are less than or equal to 10. Such large values in Quantity are possible here because many customers of the e-commerce platform are wholesalers.

The maximum value of UnitPrice is 13.5K but 75% of prices are below 5. Let’s explore further the reason for these high values.

There are some odd descriptions like Manual, POSTAGE, DOTCOM POSTAGE, CRUK Commission, and Discount. Let us check what these mean:

  1. POSTAGE/DOTCOM POSTAGE: The amount spent by the user on postage.
  2. CRUK Commission: An initiative to pay some part of the sales to the Cancer Research UK (CRUK).
  3. Manual: Since there is no proper definition, we can think of this as a manual service provided for the purchase of an item.
  4. Discount: This explains the discount provided for a product.

Except for Discount, none of these categories directly affects sales. Hence, we can remove them from the data.
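One way to do this filtering, sketched on made-up rows. The list name `non_product` is just an illustrative choice, and matching on the exact Description strings is an assumption about how these rows are labeled:

```python
import pandas as pd

# Illustrative rows covering the odd descriptions discussed above.
df = pd.DataFrame({
    "Description": ["SAMPLE ITEM A", "POSTAGE", "DOTCOM POSTAGE",
                    "CRUK Commission", "Manual", "Discount"],
    "Quantity": [6, 1, 1, -1, 1, -1],
    "UnitPrice": [2.5, 18.0, 250.0, 1.5, 40.0, 27.5],
})

# Categories that do not directly affect sales; Discount is kept on purpose.
non_product = ["POSTAGE", "DOTCOM POSTAGE", "CRUK Commission", "Manual"]
df = df[~df["Description"].isin(non_product)].reset_index(drop=True)
print(df)
```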

Let’s calculate the total sales value for each transaction from the quantity and unit price.
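A small sketch of that calculation on illustrative values. The column name `TotalSales` is an assumed name for the new feature:

```python
import pandas as pd

# Illustrative line items; a negative quantity represents a return.
df = pd.DataFrame({
    "Quantity": [6, -2, 10],
    "UnitPrice": [2.5, 4.0, 1.25],
})

# Sales value of each line item = quantity x unit price.
df["TotalSales"] = df["Quantity"] * df["UnitPrice"]
print(df)
```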

Now, the data has been completely processed for further analysis. Let us take a look at the summary of the data.

3. Exploratory Analysis

Before jumping into the modeling, it is always important to explore the data. This helps us better understand the data and the business problem we are trying to solve.

Exploratory data analysis helps us to answer questions such as:

1. What are the most purchased products on the platform?

2. Which countries account for the most transactions?

3. At which hours of the day and on which days of the week do most transactions happen?

Transactions on the website start to increase around 7 in the morning and peak at noon. The volume then slowly decreases until activity ends around 6 PM.
People tend to purchase more from Monday to Thursday. Surprisingly, no transactions took place on Saturdays during the period covered by the data.
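A hedged sketch of how such hourly and weekday counts could be computed, using made-up invoices and counting each invoice number once:

```python
import pandas as pd

# Illustrative invoices; invoice 100001 has two line items.
df = pd.DataFrame({
    "InvoiceNo": ["100001", "100001", "100002", "100003", "100004"],
    "InvoiceDate": pd.to_datetime([
        "2011-03-07 09:15", "2011-03-07 09:15", "2011-03-07 12:30",
        "2011-03-08 12:05", "2011-03-10 15:45",
    ]),
})

# Number of distinct invoices per hour of the day and per day of the week.
by_hour = df.groupby(df["InvoiceDate"].dt.hour)["InvoiceNo"].nunique()
by_day = df.groupby(df["InvoiceDate"].dt.day_name())["InvoiceNo"].nunique()

print(by_hour)
print(by_day)
```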

4. What is the trend of transactions for the given period?

The monthly trend reveals that the number of people using the platform is increasing. The rate of increase stayed flat until August 2011 and then rose rapidly from September 2011.

The sudden dip in December is because the data only runs until December 9th.
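A monthly trend like this can be sketched by counting distinct customers per calendar month. The data here is made up, and counting unique CustomerID values is an assumption about how "people using the platform" is measured:

```python
import pandas as pd

# Illustrative transactions spread over three months.
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 1, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-10-03", "2011-10-20", "2011-11-05",
        "2011-11-18", "2011-12-02", "2011-12-08",
    ]),
})

# Distinct customers who transacted in each year-month.
month_key = df["InvoiceDate"].dt.strftime("%Y-%m")
monthly_customers = df.groupby(month_key)["CustomerID"].nunique()
print(monthly_customers)
```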

4. RFM Modeling

To do the RFM analysis, we need to create 3 features from the data:

  1. Recency: Latest date minus the customer's last invoice date. (Number of days since the last purchase date)
  2. Frequency: Count of invoice numbers. (Total number of transactions made by a unique customer)
  3. Monetary: Sum of Total sales. (Total value of transacted sales by each customer)

Now, let’s create a function that can be used to generate the RFM features.
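A minimal sketch of such a function, following the three definitions above. The function name `rfm_features`, its parameters, and the choice of the latest invoice date in the data as the default reference date are all assumptions made for illustration:

```python
import pandas as pd

def rfm_features(data, customer_col, invoice_col, date_col, sales_col,
                 reference_date=None):
    """Build Recency, Frequency, and Monetary features per customer."""
    if reference_date is None:
        # Assumption: measure recency against the latest date in the data.
        reference_date = data[date_col].max()
    rfm = data.groupby(customer_col).agg(
        # Days between the reference date and the customer's last invoice.
        Recency=(date_col, lambda d: (reference_date - d.max()).days),
        # Count of invoice numbers (one per row, as defined above).
        Frequency=(invoice_col, "count"),
        # Sum of total sales.
        Monetary=(sales_col, "sum"),
    )
    return rfm.reset_index()

# Illustrative data, not taken from the dataset.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A", "A", "B", "C"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-09", "2011-11-09"]),
    "TotalSales": [10.0, 5.0, 20.0, 7.5],
})
rfm = rfm_features(df, "CustomerID", "InvoiceNo", "InvoiceDate", "TotalSales")
print(rfm)
```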

The above function can be used to create RFM features for any dataset by specifying the actual names of the respective columns from the dataset you are working on. Now, let's create the RFM features using this function.

To conduct RFM analysis, we need to rank the customers based on each RFM attribute separately.

Assume that we rank these customers from 1 to 4 using their RFM values (1 = lowest score, 4 = highest score).

Steps to be followed for RFM ranking:

  1. Sort the Recency column with the most recent purchases at the top. For the Frequency and Monetary features, sort with the most frequent and most valuable purchases at the top.
  2. If you are using an N-scale ranking, divide the sorted values of each feature into N groups, each holding roughly 1/N of the customers. Here we are using a 4-scale ranking, so we need to divide the values into 4 groups.

We can handle both the sorting and the grouping with the pandas df.quantile method by passing the quantile levels (for example 0.25, 0.5, and 0.75) as a list; the returned cut points split each feature into the 4 groups.
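For instance, on a small illustrative RFM table, the three cut points for a 4-scale ranking can be obtained like this:

```python
import pandas as pd

# Illustrative RFM values for five customers.
rfm = pd.DataFrame({
    "Recency": [10, 20, 30, 40, 50],
    "Frequency": [1, 2, 3, 4, 5],
    "Monetary": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Quantile levels passed as a list; each row of the result holds the cut
# points of every feature at that level.
quantiles = rfm.quantile([0.25, 0.5, 0.75])
print(quantiles)
```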

NOTE: The value of N decides the number of different RFM rank groups you want to create. All possible combinations of ranks from 1 to N for the three RFM features result in N³ rank groups ranging from 111 (lowest) to NNN (highest).

In our case N=4, hence we could have a maximum of 4³ = 64 rank groups with scores from 111 to 444.

Now, we will create a function to give the ranks for each attribute.
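One possible shape for such a function is sketched here. The name `rank_value` and the `higher_is_better` flag are hypothetical; the flag flips the scale for Recency, where a smaller value (a more recent purchase) should get the higher rank:

```python
import pandas as pd

def rank_value(value, cuts, higher_is_better=True):
    """Map a value to a rank from 1 to 4 using 25/50/75% cut points."""
    if value <= cuts.loc[0.25]:
        rank = 1
    elif value <= cuts.loc[0.50]:
        rank = 2
    elif value <= cuts.loc[0.75]:
        rank = 3
    else:
        rank = 4
    # For Recency, fewer days is better, so the scale is reversed.
    return rank if higher_is_better else 5 - rank

# Illustrative RFM values for eight customers.
rfm = pd.DataFrame({
    "Recency": [5, 12, 30, 45, 80, 120, 200, 300],
    "Frequency": [50, 40, 30, 20, 12, 8, 4, 1],
    "Monetary": [900.0, 700.0, 500.0, 300.0, 200.0, 100.0, 50.0, 10.0],
})
quantiles = rfm.quantile([0.25, 0.5, 0.75])

rfm["R"] = rfm["Recency"].apply(
    lambda v: rank_value(v, quantiles["Recency"], higher_is_better=False))
rfm["F"] = rfm["Frequency"].apply(
    lambda v: rank_value(v, quantiles["Frequency"]))
rfm["M"] = rfm["Monetary"].apply(
    lambda v: rank_value(v, quantiles["Monetary"]))
print(rfm)
```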

We have calculated the ranks for each attribute of RFM at the customer level. We can use this to find the total number of rank groups that are created based on our ranking scale.

To do this you can simply combine all the individual R, F, and M ranks to check how many groups are created and the share of customers in each group.
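As a sketch, concatenating the three ranks into a string gives the rank group, and `value_counts` gives each group's share. The column name `RFM_Group` is an assumed name, and the ranks are illustrative:

```python
import pandas as pd

# Illustrative R, F, M ranks for five customers.
rfm = pd.DataFrame({
    "R": [4, 4, 3, 1, 1],
    "F": [4, 4, 2, 1, 1],
    "M": [4, 4, 3, 1, 2],
})

# Combine the individual ranks into one group label such as "444".
rfm["RFM_Group"] = (
    rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
)

n_groups = rfm["RFM_Group"].nunique()                        # groups present
group_share = rfm["RFM_Group"].value_counts(normalize=True)  # customer share
max_groups = 4 ** 3                                          # upper bound, N = 4

print(n_groups, "of", max_groups, "possible groups")
print(group_share)
```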

For our case, the maximum number of groups should be 4³ = 64.

Note that 62 rank groups were created in total, which is consistent with the maximum of 64 groups allowed by our ranking scale.

We get 62 rather than 64 rank groups because some combinations of R, F, and M ranks do not occur in the data.

Finally, we can create a composite score for these customers by combining their R, F, and M ranks to arrive at an aggregated RFM score. This RFM score, displayed in the table below, is simply the average of the individual R, F, and M ranks, obtained by giving equal weights to each RFM attribute.
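With equal weights, the score is just the row-wise mean of the three ranks. A small sketch on illustrative ranks, where `RFM_Score` is an assumed column name:

```python
import pandas as pd

# Illustrative R, F, M ranks for three customers.
rfm = pd.DataFrame({
    "R": [4, 3, 1],
    "F": [4, 2, 1],
    "M": [4, 1, 1],
})

# Equal-weight composite score: the average of the three ranks.
rfm["RFM_Score"] = rfm[["R", "F", "M"]].mean(axis=1)
print(rfm)
```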

We can now use this score to assign Loyalty levels to each customer instead of handling the N³ rank groups. The Loyalty level captures different customer behaviors and helps in analyzing and targeting each customer group based on its behavior.
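How the score is cut into levels is a design choice. The sketch here assumes four equal-sized bins of the score via `pd.qcut`, with the level names used in this article; the scores and the `Loyalty_Level` column name are illustrative:

```python
import pandas as pd

# Illustrative composite scores for eight customers.
rfm = pd.DataFrame({
    "RFM_Score": [1.0, 1.33, 1.67, 2.0, 2.67, 3.0, 3.67, 4.0],
})

# Assumption: split customers into four equal-sized bins by score,
# from the lowest (Bronze) to the highest (Platinum).
levels = ["Bronze", "Silver", "Gold", "Platinum"]
rfm["Loyalty_Level"] = pd.qcut(rfm["RFM_Score"], q=4, labels=levels)
print(rfm)
```

Note that `pd.qcut` raises an error when many customers share the same score and the bin edges coincide; in that case fixed score thresholds with `pd.cut` are an alternative.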

Thus we have successfully grouped the 62 segments based on individual R, F, and M scores into 4 broad loyalty levels. Let's explore the characteristics of each loyalty level.

Customer Behaviour and potential targeting techniques for each Loyalty Level:

  1. Platinum: People in this group are the most frequent buyers. On average, their last purchase was 13 days ago, they transacted on the platform about 292 times over the past year, and their sales value is 6.5K pounds.
    These are your most loyal customers, who bought recently, buy most often, and are heavy spenders. Reward these customers so that they become early adopters of your future products and help promote your brand.
  2. Gold: This group has an average frequency of 83 transactions and an average recency of 46 days. They are also high spenders, with average sales of about 1.3K pounds.
    These are fairly recent customers who buy with average frequency and spend a good amount. Offer membership or loyalty programs, or recommend related products, to upsell them and help them become your Platinum members.
  3. Silver: On average, people in this group last transacted on the platform about 87 days ago. Their average frequency and monetary values are 34 transactions and 644 pounds respectively.
    These are customers who purchased a decent number of times and spent good amounts but haven't purchased recently. Sending them personalized campaigns, offers, and product recommendations will help you reconnect with them.
  4. Bronze: This is the dormant group, with an average of 193 days since their last purchase. They have transacted around 15 times on the platform, with average sales of 245 pounds.
    These are customers who used to visit and purchase on your platform but haven't been visiting recently. Bring them back with relevant promotions, and run surveys to find out what went wrong and avoid losing them to a competitor.
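Group profiles like the ones above can be derived by averaging the RFM features within each loyalty level. A sketch on made-up numbers (these are not the article's figures):

```python
import pandas as pd

# Illustrative customers with an already assigned loyalty level.
rfm = pd.DataFrame({
    "Loyalty_Level": ["Platinum", "Platinum", "Gold", "Gold", "Silver", "Bronze"],
    "Recency": [5, 15, 40, 50, 90, 200],
    "Frequency": [300, 280, 90, 70, 30, 10],
    "Monetary": [7000.0, 6000.0, 1400.0, 1200.0, 600.0, 250.0],
})

# Average recency, frequency, and monetary value per loyalty level.
profile = rfm.groupby("Loyalty_Level")[["Recency", "Frequency", "Monetary"]].mean()
print(profile)
```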

5. Conclusion

RFM is a data-driven customer segmentation technique that allows marketers to make informed decisions. It empowers marketers to quickly identify and segment users into homogeneous groups and target them with differentiated and personalized marketing strategies. This in turn improves user engagement and retention.

I have used an equal weighting scheme for the RFM variables in this analysis. But depending on the nature of your business, you can increase or decrease the relative importance of each RFM variable to arrive at the final score. For example:

  1. In a Consumer durables business, the monetary value per transaction is normally high but frequency and recency are low. For example, you can’t expect a customer to purchase a refrigerator or air conditioner every month. In this case, a marketer could give more weight to monetary and recency aspects rather than the frequency aspect.
  2. In a Retail business, customers purchase products every month or every week, so their recency and frequency scores tend to be higher than their monetary scores. Accordingly, the RFM score could be calculated by giving more weight to the R and F scores than to M.
  3. For a Streaming business like Hotstar or Netflix, a binge-watcher will have a longer session length than a mainstream consumer watching at regular intervals. For bingers, engagement and frequency could be given more importance than recency, and for mainstreamers, recency and frequency can be given higher weights than engagement to arrive at the RFE score.
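A weighted score of this kind can be sketched as a weighted sum of the ranks. The weights here are purely hypothetical and loosely follow the consumer durables example, favoring recency and monetary value over frequency:

```python
import pandas as pd

# Illustrative R, F, M ranks for three customers.
rfm = pd.DataFrame({
    "R": [4, 2, 1],
    "F": [1, 3, 1],
    "M": [4, 2, 1],
})

# Hypothetical weights that sum to 1; tune them to your own business.
weights = {"R": 0.4, "F": 0.2, "M": 0.4}

# Weighted composite score instead of the plain average.
rfm["Weighted_Score"] = sum(rfm[col] * w for col, w in weights.items())
print(rfm)
```

Keeping the weights summing to 1 keeps the weighted score on the same 1 to 4 scale as the individual ranks.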

Thanks for reading this far. I hope this article helps you to understand the concept and process behind creating the RFM Model. You can follow these steps to create your RFM model to segment customers.

Here is the link to the full code.

Happy Learning!!