Segmentation using the RFM framework (Recency, Frequency, Monetary value)

Gaurang Mehra
4 min read · Mar 23, 2021


Taking a large customer transaction dataset and segmenting based on recency, frequency and monetary value.

Project Outline-: Using a transaction dataset from an online retailer, segment customers based on recency (how recently they ordered), frequency (how often they ordered) and monetary value (the revenue they generate). This project walks through the entire workflow, from data cleaning to interpreting the results.

Business Need-: Segment customers to identify the most consistently valuable ones.

Technique Used-: K-Means clustering. Rather than the traditional analysis of cutting each metric into deciles, we let the algorithm find the segments. K-Means produces the most distinct clusters/segments, with maximum similarity within a cluster and maximum difference between clusters.
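As a quick illustration of that idea before touching the retail data (a minimal sketch on synthetic 2-D points, not the transaction dataset), K-Means assigns every point to its nearest cluster center, so tight groups end up sharing a label:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # group around (0, 0)
               rng.normal(5, 0.3, (20, 2))])  # group around (5, 5)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Points in the same group get the same label
print(model.labels_)
```

With groups this far apart, each group comes back as its own cluster; the RFM features below play the role of these coordinates.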

Step 1-: Exploratory Data Analysis and data cleanup

Importing Modules and loading data

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import datetime
import seaborn as sns

# Load the transactions and check the dimensions
data = pd.read_excel('onlineretail.xlsx')
data.shape

Loading the data and running the shape method shows ~541K transactions with 8 columns of data

data.head()
fig 1-: sample 5 rows of the data frame

Handle missing values

data.dropna(inplace=True)
data.info()
fig 2-: Dropping missing values

Handling Obvious Outliers

#Looking at the distribution for the sales quantity column
plt.hist(data.Quantity)
plt.show()
fig 3-: distribution of sales Quantity
# 5th and 95th percentiles of Quantity
# (not true quartiles, despite the Q1/Q3 names)
Q3 = data.Quantity.quantile(0.95)
Q1 = data.Quantity.quantile(0.05)
print(Q1, " ", Q3)
# 1.0 36.0
  • The sales quantity column clearly has outliers
  • The negative quantities look like returns; for this first pass we will remove them
  • Most quantities are below 15,000 or thereabouts
  • The 95th percentile is 36 units per invoice line
  • For this pass we will filter the dataset down to quantities below 1,000
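
Before applying that filter it can be worth quantifying how much data the return rows represent. A minimal sketch on a toy frame (the `Quantity` column name matches the retail dataset; the values here are made up):

```python
import pandas as pd

# Toy stand-in for the transaction data
data = pd.DataFrame({'Quantity': [5, -2, 12, 36, -1, 3, 1500]})

# Share of rows that are returns (negative quantities)
returns_share = (data.Quantity < 0).mean()
print(f"{returns_share:.1%} of rows are returns")

# Same filter as used in the next step: drop returns and extreme quantities
sales = data.loc[(data.Quantity >= 0) & (data.Quantity < 1000)].copy()
print(len(sales), "rows kept")
```

If returns turn out to be a large share of the data, a later pass could model them explicitly instead of dropping them.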

Step 2-: Create the Metrics and Build the Model

Creating the Recency Metric

#Filter out returns and extreme quantities
sales = data.loc[(data.Quantity >= 0) & (data.Quantity < 1000), :].copy()

#Getting the first and last invoice date per customer
sales.sort_values(by=['CustomerID', 'InvoiceDate'], inplace=True)
first_sale = sales.groupby('CustomerID')['InvoiceDate'].first().rename('first_sale')
last_sale = sales.groupby('CustomerID')['InvoiceDate'].last().rename('last_sale')

#Merging the first and last dates into one dataframe
#(renaming above avoids two identical 'InvoiceDate' columns)
sale_date = pd.concat([first_sale, last_sale], axis=1)

#Defining the recency metric: days from the last sale to a fixed
#comparison date just after the end of the dataset
sale_date['datecomp'] = datetime.datetime.strptime('2012-01-01', '%Y-%m-%d')
sale_date['recency'] = (sale_date.datecomp - sale_date.last_sale).dt.days

Creating the frequency Metric

#Creating the frequency Metric
freq=sales.groupby('CustomerID')['InvoiceNo'].count()
# Adding the Frequency metric to the main dataset
sale_date['freq']=freq
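
One caveat worth noting: `count()` on `InvoiceNo` counts line items, not distinct invoices, so a customer who buys ten products on one invoice contributes ten to the metric. If frequency should mean number of orders, `nunique()` is the alternative. A small sketch with the same column names (values made up):

```python
import pandas as pd

# Customer 1 has two invoices (A1 with two lines, A2); customer 2 has one
sales = pd.DataFrame({'CustomerID': [1, 1, 1, 2],
                      'InvoiceNo': ['A1', 'A1', 'A2', 'B1']})

line_items = sales.groupby('CustomerID')['InvoiceNo'].count()    # rows per customer
orders     = sales.groupby('CustomerID')['InvoiceNo'].nunique()  # distinct invoices
print(line_items.to_dict())  # {1: 3, 2: 1}
print(orders.to_dict())      # {1: 2, 2: 1}
```

Either definition works for clustering as long as it is applied consistently; they just segment on slightly different behaviors.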

Creating the Monetary value metric

#Create a Monetary Value metric
sales['Monetaryval']=sales.Quantity*sales.UnitPrice
Mval=sales.groupby('CustomerID')['Monetaryval'].sum()
#Adding the Monetary value Metric to the dataset
sale_date['Mval']=Mval
sale_date
fig 4-: Final dataset with Recency, Frequency and Monetary Value
#Selecting only the needed features
select_df = sale_date.loc[:, ['recency', 'freq', 'Mval']].copy()

#Standardizing and scaling the data so no single metric dominates
X = StandardScaler().fit_transform(select_df)

#Running K-Means with multiple values for the number of clusters
ssd = []
K = range(1, 10)
for k in K:
    model = KMeans(n_clusters=k)
    model.fit(X)
    ssd.append(model.inertia_)

#Plotting the number of clusters versus the Within Cluster Sum of Square Errors
plt.figure(figsize=(12, 5))
plt.plot(K, ssd)
plt.xlabel('Number of clusters', fontsize=14)
plt.ylabel('WSSE', fontsize=14)
plt.title('Elbow plot', fontsize=14)
plt.show()
  • Run the model with multiple values of K (the number of clusters).
  • Plot the values of K against the WSSE (within-cluster sum of squared errors, a measure of cluster tightness).
  • Choose the value of K where the incremental drop in WSSE starts flattening out.
Fig 5-: Identifying optimal number of clusters
  • Obvious elbow point at K=3
  • Another, less obvious, elbow point at K=4
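An optional cross-check for the elbow read, not part of the original workflow, is the silhouette score: a higher average score means points sit closer to their own cluster than to the next nearest one. A sketch on synthetic blobs standing in for the scaled RFM features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated blobs in 3-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, (30, 3)) for c in (0, 4, 8)])

# Silhouette is only defined for 2+ clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On data with a genuinely distinct structure the score peaks at the natural cluster count, which can help decide between two plausible elbow points like K=3 and K=4.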
#Running the model with 4 clusters
model = KMeans(n_clusters=4)
model.fit(X)

#Adding the cluster labels to the dataset
select_df['labels'] = model.labels_

#Let's look at the clusters: customer count and the mean of each metric.
#CustomerID is the index of select_df, so cluster size is a row count.
select_df.groupby('labels').agg(customers=('freq', 'count'),
                                freq=('freq', 'mean'),
                                Mval=('Mval', 'mean'),
                                recency=('recency', 'mean'))
Fig 6-: Making sense of the clusters

Cluster 3 orders most frequently, has the highest value and is the most recent → Platinum

  • Cluster 2 orders a little less frequently, has slightly lower value than 3 and is a little less recent → Gold
  • Cluster 0 is lower than 2 and 3 on all three metrics → Silver
  • Cluster 1 contains very infrequent shoppers → Bronze

Final Named Clusters

Fig 7-: Named clusters
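
Attaching those names back to the frame is a one-line `map`. Note that the integer labels K-Means assigns are arbitrary and can change between runs, so the dict below is illustrative; in practice inspect the cluster means first (as in fig 6) and build the mapping to match:

```python
import pandas as pd

# Toy frame with cluster labels already attached
select_df = pd.DataFrame({'labels': [3, 2, 0, 1, 3]})

# Hypothetical label-to-tier mapping; depends on this run's cluster means
tier_map = {3: 'Platinum', 2: 'Gold', 0: 'Silver', 1: 'Bronze'}
select_df['segment'] = select_df['labels'].map(tier_map)
print(select_df['segment'].tolist())
```

Fixing `random_state` in `KMeans` keeps the labels stable across reruns so the mapping does not need rebuilding each time.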

Step 3-: Possible Business outcomes

  • Target premium offers to Gold and Platinum customers. Market new product offerings in exclusive launches to this segment. Target Platinum and Gold customers for in-store credit cards.
  • Use offers, such as a targeted email campaign, to increase the recency of the Silver customers. This segment has the highest number of customers, so increasing frequency here will have a big impact on the bottom line.
