Segmentation using the RFM framework (Recency, Frequency, Monetary value)

Gaurang Mehra
4 min read · Mar 23, 2021


Taking a large customer transaction dataset and segmenting based on recency, frequency and monetary value.

Project Outline-: Using a transaction dataset from an online retailer, segment customers based on recency (how recently they ordered), frequency (how often they ordered) and monetary value (the revenue they generate). This project walks through the entire workflow, from data cleaning to interpreting the results.

Business Need-: Segment customers to identify the most consistently valuable ones.

Technique Used-: K-Means clustering. Rather than the traditional analysis of cutting each metric into deciles, we let the algorithm find the segments. K-Means produces the most distinct clusters/segments, with maximum similarity within a cluster and maximum difference between clusters.
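As a quick illustration of that idea before touching the retail data (a minimal sketch on synthetic 2-D points, not the transaction dataset), K-Means assigns every point to its nearest cluster center, so tight groups end up sharing a label:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # group around (0, 0)
               rng.normal(5, 0.3, (20, 2))])  # group around (5, 5)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Points in the same group get the same label
print(model.labels_)
```

With groups this far apart, each group comes back as its own cluster; the RFM features below play the role of these coordinates.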

Step 1-: Exploratory Data Analysis and data cleanup

Importing Modules and loading data

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import datetime
import seaborn as sns

# Load the transactions and check the dimensions
data = pd.read_excel('onlineretail.xlsx')
data.shape

Loading the data and running the shape method shows ~541K transactions with 8 columns of data

data.head()
fig 1-: sample 5 rows of the data frame

Handle missing values

data.dropna(inplace=True)
data.info()
fig 2-: Dropping missing values

Handling Obvious Outliers

#Looking at the distribution for the sales quantity column
plt.hist(data.Quantity)
plt.show()
fig 3-: distribution of sales Quantity
# 5th and 95th percentiles of Quantity
# (not true quartiles, despite the Q1/Q3 names)
Q3 = data.Quantity.quantile(0.95)
Q1 = data.Quantity.quantile(0.05)
print(Q1, " ", Q3)
# 1.0 36.0
  • The sales quantity column clearly has outliers
  • The negative quantities look like returns; for this first pass we will remove them
  • Most quantities are below 15,000 or thereabouts
  • The 95th percentile is 36 units per invoice line
  • For this pass we will filter the dataset down to quantities below 1,000
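
Before applying that filter it can be worth quantifying how much data the return rows represent. A minimal sketch on a toy frame (the `Quantity` column name matches the retail dataset; the values here are made up):

```python
import pandas as pd

# Toy stand-in for the transaction data
data = pd.DataFrame({'Quantity': [5, -2, 12, 36, -1, 3, 1500]})

# Share of rows that are returns (negative quantities)
returns_share = (data.Quantity < 0).mean()
print(f"{returns_share:.1%} of rows are returns")

# Same filter as used in the next step: drop returns and extreme quantities
sales = data.loc[(data.Quantity >= 0) & (data.Quantity < 1000)].copy()
print(len(sales), "rows kept")
```

If returns turn out to be a large share of the data, a later pass could model them explicitly instead of dropping them.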

Step 2-: Create the Metrics and Build the Model

Creating the Recency Metric

#Filter out returns and extreme quantities
sales = data.loc[(data.Quantity >= 0) & (data.Quantity < 1000), :].copy()

#Getting the first and last invoice date per customer
sales.sort_values(by=['CustomerID', 'InvoiceDate'], inplace=True)
first_sale = sales.groupby('CustomerID')['InvoiceDate'].first().rename('first_sale')
last_sale = sales.groupby('CustomerID')['InvoiceDate'].last().rename('last_sale')

#Merging the first and last dates into one dataframe
#(renaming above avoids two identical 'InvoiceDate' columns)
sale_date = pd.concat([first_sale, last_sale], axis=1)

#Defining the recency metric: days from the last sale to a fixed
#comparison date just after the end of the dataset
sale_date['datecomp'] = datetime.datetime.strptime('2012-01-01', '%Y-%m-%d')
sale_date['recency'] = (sale_date.datecomp - sale_date.last_sale).dt.days

Creating the frequency Metric

#Creating the frequency Metric
freq=sales.groupby('CustomerID')['InvoiceNo'].count()
# Adding the Frequency metric to the main dataset
sale_date['freq']=freq
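
One caveat worth noting: `count()` on `InvoiceNo` counts line items, not distinct invoices, so a customer who buys ten products on one invoice contributes ten to the metric. If frequency should mean number of orders, `nunique()` is the alternative. A small sketch with the same column names (values made up):

```python
import pandas as pd

# Customer 1 has two invoices (A1 with two lines, A2); customer 2 has one
sales = pd.DataFrame({'CustomerID': [1, 1, 1, 2],
                      'InvoiceNo': ['A1', 'A1', 'A2', 'B1']})

line_items = sales.groupby('CustomerID')['InvoiceNo'].count()    # rows per customer
orders     = sales.groupby('CustomerID')['InvoiceNo'].nunique()  # distinct invoices
print(line_items.to_dict())  # {1: 3, 2: 1}
print(orders.to_dict())      # {1: 2, 2: 1}
```

Either definition works for clustering as long as it is applied consistently; they just segment on slightly different behaviors.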

Creating the Monetary value metric

#Create a Monetary Value metric
sales['Monetaryval']=sales.Quantity*sales.UnitPrice
Mval=sales.groupby('CustomerID')['Monetaryval'].sum()
#Adding the Monetary value Metric to the dataset
sale_date['Mval']=Mval
sale_date
fig 4-: Final dataset with Recency, Frequency and Monetary Value
#Selecting only the needed features
select_df = sale_date.loc[:, ['recency', 'freq', 'Mval']].copy()

#Standardizing and scaling the data so no single metric dominates
X = StandardScaler().fit_transform(select_df)

#Running K-Means with multiple values for the number of clusters
ssd = []
K = range(1, 10)
for k in K:
    model = KMeans(n_clusters=k)
    model.fit(X)
    ssd.append(model.inertia_)

#Plotting the number of clusters versus the Within Cluster Sum of Square Errors
plt.figure(figsize=(12, 5))
plt.plot(K, ssd)
plt.xlabel('Number of clusters', fontsize=14)
plt.ylabel('WSSE', fontsize=14)
plt.title('Elbow plot', fontsize=14)
plt.show()
  • Run the model with multiple values of K (the number of clusters).
  • Plot the values of K against the WSSE (within-cluster sum of squared errors, a measure of cluster tightness).
  • Choose the value of K where the incremental drop in WSSE starts flattening out.
Fig 5-: Identifying optimal number of clusters
  • Obvious elbow point at K=3
  • Another, less obvious, elbow point at K=4
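An optional cross-check for the elbow read, not part of the original workflow, is the silhouette score: a higher average score means points sit closer to their own cluster than to the next nearest one. A sketch on synthetic blobs standing in for the scaled RFM features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated blobs in 3-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, (30, 3)) for c in (0, 4, 8)])

# Silhouette is only defined for 2+ clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On data with a genuinely distinct structure the score peaks at the natural cluster count, which can help decide between two plausible elbow points like K=3 and K=4.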
#Running the model with 4 clusters
model = KMeans(n_clusters=4)
model.fit(X)

#Adding the cluster labels to the dataset
select_df['labels'] = model.labels_

#Let's look at the clusters: customer count and the mean of each metric.
#CustomerID is the index of select_df, so cluster size is a row count.
select_df.groupby('labels').agg(customers=('freq', 'count'),
                                freq=('freq', 'mean'),
                                Mval=('Mval', 'mean'),
                                recency=('recency', 'mean'))
Fig 6-: Making sense of the clusters

Cluster 3 orders most frequently, has the highest value and is the most recent → Platinum

  • Cluster 2 orders a little less frequently, has slightly lower value than 3 and is a little less recent → Gold
  • Cluster 0 is lower than 2 and 3 on all three metrics → Silver
  • Cluster 1 contains very infrequent shoppers → Bronze

Final Named Clusters

Fig 7-: Named clusters
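
Attaching those names back to the frame is a one-line `map`. Note that the integer labels K-Means assigns are arbitrary and can change between runs, so the dict below is illustrative; in practice inspect the cluster means first (as in fig 6) and build the mapping to match:

```python
import pandas as pd

# Toy frame with cluster labels already attached
select_df = pd.DataFrame({'labels': [3, 2, 0, 1, 3]})

# Hypothetical label-to-tier mapping; depends on this run's cluster means
tier_map = {3: 'Platinum', 2: 'Gold', 0: 'Silver', 1: 'Bronze'}
select_df['segment'] = select_df['labels'].map(tier_map)
print(select_df['segment'].tolist())
```

Fixing `random_state` in `KMeans` keeps the labels stable across reruns so the mapping does not need rebuilding each time.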

Step 3-: Possible Business outcomes

  • Target premium offers to Gold and Platinum customers. Market new product offerings in exclusive launches to this segment. Target Platinum and Gold customers for in-store credit cards.
  • Use offers, such as a targeted email campaign, to increase the recency of the Silver customers. This segment has the highest number of customers, so increasing frequency here will have a big impact on the bottom line.
