RFM Analysis for Customer Segmentation

Harshal Kakaiya
9 min read · Nov 28, 2022


A step-by-step approach to building an RFM model for customer segmentation in Python.

RFM Metrics for Customer Segmentation (Photo by Odorite.com)

Companies spend heavily on market research, but because technology keeps changing both customer behavior and research methods, their approach needs constant refinement. Marketing teams have long recognized the significance of customer orientation, since knowing, serving, and influencing consumers is critical to accomplishing marketing goals and objectives.

What is RFM Segmentation?

RFM segmentation is a method that helps you identify your most important types of customers by grouping them and scoring each one on Recency, Frequency, and Monetary value.

  1. Recency: How much time has elapsed since a customer’s last activity or transaction with the brand?
  2. Frequency: How often has the customer transacted or interacted with the brand during a particular period of time?
  3. Monetary: How much has the customer spent with the brand during a particular period of time?

RFM segmentation enables marketers to target specific groups of consumers with communications that are far more relevant to their individual behaviors, which can lead to higher response rates, improved loyalty, and greater customer lifetime value. Like other segmentation approaches, it is an effective tool for identifying groups of consumers who should be treated differently.

There are several approaches to segmentation. However, I chose the RFM model for the following reasons:

  1. It employs objective numerical scales to produce a high-level picture of consumers that is both succinct and instructive.
  2. It’s accessible: marketers can use it without expensive tools.
  3. It’s interpretable: the output of the segmentation is easy to comprehend and analyze.

Basis

For this project, I will build an RFM (Recency, Frequency, Monetary) model using a Customer Invoices dataset that I downloaded from Kaggle (a big thank you to the anonymous contributor who made it available for free use). There are countless other free datasets on Kaggle that you can use for practice as well.

Intended Outcome

The purpose of this project is to build an RFM model that places customers into the segments listed below, ordered from the highest contribution to the lowest:

  • Champion Customer: bought recently, buys often, and spends the most
  • Loyal/Committed: spends good money and often; responsive to promotions
  • Potential: recent customers who spent a good amount and bought more than once
  • Promising: recent shoppers who haven’t spent much
  • Requires Attention: above-average recency, frequency, and monetary values, though they may not have bought very recently
  • Demands Activation: below-average recency, frequency, and monetary values; will be lost if not reactivated
  • Can’t Lose Them: made the biggest purchases, and often, but haven’t returned for a long time

So, let’s get started right away:

Step 1: Importing Required Libraries for RFM Segmentation

# Importing Required Libraries

import pandas as pd
import numpy as np
from datetime import timedelta
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Step 2: Exploratory Data Analysis (EDA)

I consider this step sacred in all data science projects. A detailed EDA helps you understand your data and choose the best approach to the project: you find the missing values, spot correlated features, and identify other trends present in the dataset. Below is what the Invoices dataset looks like:

Invoices Dataset
Pandas DataFrame Showing Invoices Dataset
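
As a rough illustration of the checks described above, the sketch below runs two of them on a small made-up table. The `toy` DataFrame and its values are assumptions for illustration only; the column names simply mirror the ones used in this article.

```python
import pandas as pd

# Toy stand-in for the Invoices dataset (all values are made up).
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "C1003"],
    "Quantity": [6, 2, 1, -1],
    "Price": [2.5, 4.0, 10.0, 10.0],
    "Customer ID": [17850.0, 17850.0, None, 13047.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows whose quantity is zero or negative (worth a closer look).
non_positive_quantity_rows = int((toy["Quantity"] <= 0).sum())

print(toy.dtypes)
print(missing_per_column)
print(non_positive_quantity_rows)
```

Checks like these are what motivate the cleaning steps in Step 3, such as dropping missing values and filtering out non-positive quantities.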

Now that we have our data in a suitable environment, it’s often a good idea to look at the first few rows to get a sense of what the data looks like. This dataset is used to analyze merchant behavior. Here are a few details about the features:

  • Invoice: This is a unique number generated by this FMCG store to help trace payment details.
  • StockCode: This is a unique number assigned to each product in a particular category for stock-keeping and tracking purposes.
  • Description: This is a short text description that provides information about the product.
  • Quantity: This gives the number of products purchased.
  • InvoiceDate: This represents the time stamp (time and date) on which the invoice has been billed and the transaction officially recorded.
  • Price: This refers to the price of each product.
  • CustomerID: This refers to the unique number assigned to each customer.
  • Country: This refers to the country in which the purchase was made.

One question that should come to mind is “What is the unique identifier of each row in the data?” A unique identifier is a column, or set of columns, that is guaranteed to be unique across rows in your dataset. This is key for differentiating rows and referencing them in our EDA. For this Invoices dataset, we will use CustomerID as the key on which customers are grouped in this project.
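
One way to test whether a column, or set of columns, really is unique across rows is to look for duplicates. The sketch below does this on a made-up table; `is_unique_key` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Toy invoice lines (made-up values): one customer appears on several rows.
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002"],
    "StockCode": ["A1", "B2", "A1"],
    "Customer ID": [17850, 17850, 13047],
})

def is_unique_key(df, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()

customer_is_row_key = is_unique_key(toy, ["Customer ID"])
line_item_is_row_key = is_unique_key(toy, ["Invoice", "StockCode"])

print(customer_is_row_key, line_item_is_row_key)
```

In a line-item table like this toy one, the customer column repeats across rows, so it works as a grouping key for the RFM aggregation rather than as a row-level identifier.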

Step 3: Data Preprocessing

With our data ready, we will perform some basic preprocessing on the dataset:

  • First, we convert the InvoiceDate feature from object format to datetime format using the Python script below:
# Converting InvoiceDate from object to datetime format

invoices_data['InvoiceDate'] = pd.to_datetime(invoices_data['InvoiceDate'])
  • Next, we drop the rows with missing (NA) values from the DataFrame using the Python script below:
# Drop NA Values

invoices_data.dropna(inplace=True)
  • Generating descriptive statistics for the dataset gives the following information:
# Generate descriptive stats of dataset

invoices_data.describe()
Descriptive Stats for Invoices DataFrame
  • In the picture above, we can see negative values in the Quantity column. A customer cannot order a negative quantity, so we keep only the rows with Quantity > 0 using the Python script below:
# Keep only rows with a positive quantity

positive_quantity = invoices_data['Quantity'] > 0
invoices_data = invoices_data[positive_quantity]
Descriptive Stats after Filtering Quantity
  • We create a new TotalSum column (quantity multiplied by unit price) with the Python script below:
# Creating TotalSum column for Invoices dataset

invoices_data['TotalSum']= invoices_data['Quantity']*invoices_data['Price']
  • We then create a snapshot date, one day after the most recent invoice, with the Python script below:
# Create snapshot date

snapshot_date = invoices_data['InvoiceDate'].max() + timedelta(days=1)
print(snapshot_date)
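  • Now, we drop the records for returned items, which are indicated with a C, by filtering:
# Drop the returned-item records
# (assumption: returns are flagged by a 'C' prefix on the invoice number, not on the stock code)

invoices_data = invoices_data[~invoices_data['Invoice'].astype(str).str.startswith('C')]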
  • After creating the snapshot date, we can group the data by customer using the Python script below:
# Grouping by CustomerID

invoices_rfm = invoices_data.groupby(['Customer ID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'Invoice': 'count',
    'TotalSum': 'sum'})
Invoices Dataset after Aggregating Fields

We proceed to rename our columns (InvoiceDate, Invoice, TotalSum) to Recency, Frequency, and Monetary respectively, but just before that, let's define some terms:

  1. Recency: How long has it been since a customer engaged in an activity or made a purchase with the brand? The more recently a customer has interacted or transacted with the brand, the lower this value. For an FMCG store the most common activity is a purchase, though in other scenarios or industries it could be the most recent visit to a website or use of a mobile app.
  2. Frequency: During a given time period, how many times has a consumer transacted or interacted with the brand? Customers who participate in activities regularly are clearly more involved and loyal than those who do so infrequently. It answers the question: how often?
  3. Monetary: This factor, also known as “monetary value,” reflects how much a customer has spent with the brand over a given period of time. Those who spend a lot of money should be handled differently from customers who spend little. The average purchase amount, calculated by dividing Monetary by Frequency, is a significant secondary element to consider when segmenting customers.
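
The average purchase amount mentioned above is a one-line calculation once the three metrics exist. The sketch below uses a small made-up RFM table; `AvgPurchase` is an illustrative column name that does not appear in the original notebook.

```python
import pandas as pd

# Toy RFM table (made-up values), indexed by customer.
rfm = pd.DataFrame(
    {
        "Recency": [3, 40, 200],
        "Frequency": [10, 4, 1],
        "Monetary": [500.0, 100.0, 20.0],
    },
    index=pd.Index([101, 102, 103], name="Customer ID"),
)

# Average purchase amount = total spend divided by number of transactions.
rfm["AvgPurchase"] = rfm["Monetary"] / rfm["Frequency"]

print(rfm)
```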

Now we can see how the columns map: InvoiceDate to Recency, Invoice to Frequency, and TotalSum to Monetary.

  • Here is a Python script to rename the columns:
invoices_rfm.columns = ['Recency', 'Frequency', 'Monetary']
invoices_rfm.head()
RFM Dataframe after Renaming Field
  • We can plot the distribution using the Python Script below:
# Plot RFM distributions
# (sns.histplot with kde=True is used in place of the deprecated sns.distplot)
plt.figure(figsize=(12,10))

# Plot distribution of R
plt.subplot(3, 1, 1); sns.histplot(invoices_rfm['Recency'], kde=True)

# Plot distribution of F
plt.subplot(3, 1, 2); sns.histplot(invoices_rfm['Frequency'], kde=True)

# Plot distribution of M
plt.subplot(3, 1, 3); sns.histplot(invoices_rfm['Monetary'], kde=True)

# Show the plot
plt.show()
Plot Distribution of Recency, Frequency, Monetary Value
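
The grouping and renaming steps above can also be combined with pandas named aggregation, which names the output columns directly. The sketch below is a variant on made-up data, not the original notebook's code: it counts distinct invoices with `nunique` rather than rows with `count`, which is an assumption about how Frequency should be measured.

```python
import pandas as pd
from datetime import timedelta

# Toy invoice lines (made-up values): customer 1 has two invoices, one with two lines.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["1001", "1001", "1002", "1003"],
    "InvoiceDate": pd.to_datetime(
        ["2022-01-01", "2022-01-01", "2022-01-10", "2022-01-05"]
    ),
    "TotalSum": [10.0, 5.0, 20.0, 7.5],
})

# Snapshot date: one day after the most recent invoice in the data.
snapshot_date = toy["InvoiceDate"].max() + timedelta(days=1)

toy_rfm = toy.groupby("Customer ID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot_date - d.max()).days),
    Frequency=("Invoice", "nunique"),  # distinct invoices, not line items
    Monetary=("TotalSum", "sum"),
)

print(toy_rfm)
```

With `count`, customer 1 would get a Frequency of 3 (one per invoice line); with `nunique` it is 2 (one per invoice). Which one is appropriate depends on whether a "transaction" means an invoice or a line item.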

Step 4: Building the RFM Groups

  • We’ll calculate the R, F, and M groups by:
  • Creating labels (1 to 5) for Recency, Frequency, and Monetary value,
  • Assigning the labels to 5 equal percentile groups (quintiles),
  • Then creating new columns R, F, and M.

Here is the Python script to create the RFM groups:

R_labels, F_labels, M_labels = range(5,0,-1),range(1,6),range(1,6)

invoices_rfm['R'] = pd.qcut(invoices_rfm['Recency'],q=5,labels=R_labels)
invoices_rfm['F'] = pd.qcut(invoices_rfm['Frequency'],q=5,labels=F_labels)
invoices_rfm['M'] = pd.qcut(invoices_rfm['Monetary'],q=5,labels=M_labels)

invoices_rfm.head()
Pandas Data frame Showing the Calculated R, F, and M groups of the data frame
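
To see what `pd.qcut` does with reversed labels, the sketch below scores a small made-up series. It also shows one common workaround, assumed here rather than taken from the original notebook, for columns with many tied values: `pd.qcut` raises a "Bin edges must be unique" error when several quantile edges coincide, and ranking with `method="first"` before binning avoids that.

```python
import pandas as pd

# Made-up values: days since last purchase, and number of purchases.
recency = pd.Series([1, 5, 10, 20, 40, 80, 100, 150, 200, 365])
frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 5, 9])

# Reversed labels: the smallest recency (most recent) gets the highest score.
r_score = pd.qcut(recency, q=5, labels=[5, 4, 3, 2, 1]).astype(int)

# Frequency has many ties at 1, so qcut on the raw values would fail.
# Ranking first gives every row a distinct position to bin on.
f_score = pd.qcut(
    frequency.rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]
).astype(int)

print(pd.DataFrame({"Recency": recency, "R": r_score,
                    "Frequency": frequency, "F": f_score}))
```

The trade-off of ranking first is that customers with identical values can land in different groups, so it is worth checking how many ties a column has before relying on it.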

Step 5: Building the RFM Model

  • We have to concatenate the R, F, and M group values to create RFM segments using the Python script below:
# Concatenating the R, F, and M group values to create RFM segments

def concat_rfm(x): return str(x['R']) + str(x['F']) + str(x['M'])

invoices_rfm['RFM_Concat'] = invoices_rfm.apply(concat_rfm, axis=1)
invoices_rfm.head()
Pandas Data frame Showing the Created RFM Segments of the data frame
  • Now let’s count the number of unique segments.
  • Then we calculate the RFM score with the Python scripts below.
# Count the number of unique segments
rfm_count_unique = invoices_rfm['RFM_Concat'].nunique()
rfm_count_unique

# Calculate RFM_Score (R, F, and M are categorical, so cast them to int before summing)
invoices_rfm['RFM_Score'] = invoices_rfm[['R','F','M']].astype(int).sum(axis=1)
invoices_rfm.head()
Pandas Data frame Showing the Calculated RFM score of each customer in the data frame
  • Then we create a conditional statement using the Python script below to assign each customer (by CustomerID) to one of the segments “Can’t Lose Them”, “Champions”, “Loyal/Committed”, “Potential”, “Promising”, “Requires Attention”, or “Demands Activation”:
# Define invoices_rfm_level function: map a row's RFM score to a segment name

def invoices_rfm_level(df):
    score = df['RFM_Score']
    if score >= 9:
        return "Can't Lose Them"
    elif score >= 8:
        return 'Champions'
    elif score >= 7:
        return 'Loyal/Committed'
    elif score >= 6:
        return 'Potential'
    elif score >= 5:
        return 'Promising'
    elif score >= 4:
        return 'Requires Attention'
    else:
        return 'Demands Activation'

# Create a new column RFM_Segment
invoices_rfm['RFM_Segment'] = invoices_rfm.apply(invoices_rfm_level, axis=1)

# Print the first 15 rows
invoices_rfm.head(15)
Pandas DataFrame showing the calculated RFM segment of each customer
  • We calculate the average values for each RFM segment and return the size of each segment using the Python script below:
# Calculate average values for each RFM_Segment,
# and return the size of each segment

rfm_segment_agg = invoices_rfm.groupby('RFM_Segment').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': ['mean', 'count']
}).round(1)

# Print the aggregated dataset
rfm_segment_agg
Pandas DataFrame showing the average values and size of each RFM segment
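
The score-to-segment mapping in this step can also be written without a row-wise function by binning the scores with `pd.cut`. This is a sketch under the assumption that scores are whole numbers; the bin edges below reproduce the thresholds used above only for integer scores, and the `scores` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up RFM scores. With five-level R, F, and M groups the sum ranges from 3 to 15.
scores = pd.Series([3, 4, 5, 6, 7, 8, 9, 15], name="RFM_Score")

# Right-closed bins: (-inf, 3], (3, 4], (4, 5], (5, 6], (6, 7], (7, 8], (8, inf)
bins = [-np.inf, 3, 4, 5, 6, 7, 8, np.inf]
labels = [
    "Demands Activation",
    "Requires Attention",
    "Promising",
    "Potential",
    "Loyal/Committed",
    "Champions",
    "Can't Lose Them",
]

segments = pd.cut(scores, bins=bins, labels=labels).astype(str)

print(pd.DataFrame({"RFM_Score": scores, "RFM_Segment": segments}))
```

Because the top segment starts at a score of 9 while scores can reach 15, a large share of customers may fall into it; printing the segment sizes is a quick way to judge whether the thresholds suit your data.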

Step 6: Data Visualization of Customers Segmented Using the RFM Model

  • We plot the size of each RFM segment on a bar plot using the Python script below:
plt.bar(x=rfm_segment_agg.index, height=rfm_segment_agg["Monetary"]["count"])
plt.xticks(rotation=90)
Bar plot Representing the count of each Segment
  • Squarify library: I chose squarify because it is built on top of Matplotlib and uses space efficiently.
  • We plot the RFM segments on a squarify plot using the Python script below:
import squarify  # third-party package (pip install squarify)

rfm_segment_agg.columns = ['RecencyMean', 'FrequencyMean',
                           'MonetaryMean', 'Count']

# Create our plot and resize it.
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 9)

# Take the labels from the index so each tile's label matches its count:
# groupby sorts the segments alphabetically, so a hand-typed list can mislabel tiles.
squarify.plot(sizes=rfm_segment_agg['Count'],
              label=list(rfm_segment_agg.index),
              alpha=.6)

plt.title("RFM Segments by Count")
plt.axis('off')
plt.show()
A Squarify Plot of Customer RFM Segmentation

Conclusion:

We have segmented our customers based on their RFM scores, which combine Recency, Frequency, and Monetary value. With the help of RFM analysis, we can create a different marketing approach for each group of customers.

RFM analysis helps us find answers to the following questions:

  • Who are your best customers?
  • Which of your customers could contribute to your churn rate?
  • Who has the potential to become valuable customers?
  • Which of your customers can be retained?
  • Which of your customers are most likely to respond to engagement campaigns?

Remarks

Thank you for reading my article. To see the complete Python code in a Jupyter Notebook, along with my GitHub, portfolio, and social media pages, kindly use the links below:
