Customer Segmentation with RFM

On a Real-World Business Problem

Deniz Cansu Turan
8 min read · Mar 3, 2023

Firstly, what is RFM?

RFM is a key indicator consisting of three metrics that capture customers’ current relationship with a company. RFM analysis is a widely used, data-driven, rule-based customer segmentation technique built on the Recency, Frequency, and Monetary metrics computed from customer data.

RFM analysis is performed to better meet the needs of different kinds of customers. Customers are grouped according to their purchasing habits so that tailored business strategies can be developed for each group. Such individual treatment can prevent customer churn and help loyal customers stay loyal.

Three key factors:

Recency refers to the length of time since the customer’s last interaction. The smallest recency value corresponds to the latest purchase, so the smaller the better: more recent customers are more valuable.

Recency = Date of the analysis - Date of last purchase

Frequency refers to how often a customer makes a purchase. Frequent customers are more valuable.

Monetary refers to the total amount each customer has spent with the company.
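To make these three definitions concrete, here is a toy sketch (on a hypothetical mini transaction table, not this article’s dataset) computing the metrics per customer:

```python
import datetime as dt

import pandas as pd

# Hypothetical mini transaction table (column names mirror the ones used later)
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "Invoice": ["A1", "A2", "B1"],
    "TotalPrice": [10.0, 25.0, 40.0],
    "InvoiceDate": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-06-15"]),
})
analysis_date = dt.datetime(2011, 12, 11)

rfm_toy = toy.groupby("Customer ID").agg(
    recency=("InvoiceDate", lambda d: (analysis_date - d.max()).days),  # days since last purchase
    frequency=("Invoice", "nunique"),                                   # distinct invoices
    monetary=("TotalPrice", "sum"),                                     # total spend
)
print(rfm_toy)  # customer 1: recency 10, frequency 2, monetary 35.0
```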

How to compare these metrics?

These metrics are converted into scores so that they share the same scale and can be compared with each other. This conversion can also be called standardization. The scores derived from the metrics are concatenated to form an RFM score for each individual customer. For each metric, customers are ranked from 1 to 5 according to the value they bring to the company.

Since RFM scores can span a wide range, a mapping is applied to group them into segments that carry meaningful insights.
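As a toy sketch of this conversion (hypothetical values, not the article’s data), pd.qcut assigns the 1–5 scores and string concatenation forms the RFM score; note the reversed labels for recency, where smaller is better:

```python
import pandas as pd

# Hypothetical recency (days) and frequency (purchase counts) for 10 customers
recency = pd.Series([2, 10, 25, 60, 120, 200, 300, 340, 20, 5])
frequency = pd.Series([40, 25, 12, 8, 6, 4, 3, 1, 15, 30])

# Smallest recency falls in the best bin, so labels run 5..1
r_score = pd.qcut(recency, 5, labels=[5, 4, 3, 2, 1])
# Highest frequency gets 5; rank(method="first") guards against tied values
f_score = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

rfm_score = r_score.astype(str) + f_score.astype(str)
print(rfm_score.iloc[0])  # first customer: recency 2, frequency 40 -> "55"
```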

Segmentation through RFM scores:

Table 1: RFM Model

Segmentation is commonly performed through the 2-D grid shown in Table 1. This way, the RFM scores become interpretable. RFM analysis can thus help companies get to know their customers and improve their marketing strategies accordingly.

As seen, the monetary score is not on the graph, since in CRM analytics frequency and recency are more significant than monetary value. When you think about it, analyzing the monetary value of a customer with few transactions is not that informative compared to frequent customers. While the monetary value is not explicitly represented in the graph, it is still an important component of the RFM analysis and is used when interpreting the segments.

Solving a business problem through RFM:

Let’s see how these concepts are applied to a real-world dataset and what insights can be extracted from the data. You can find the full code on my GitHub profile for a better demonstration.

During a data analysis, there are certain steps to be followed.

1. Business Problem

A UK-based e-commerce company wants to divide its customers into segments and define marketing strategies according to these segments. A dataset consisting of the sales between 01/12/2009–09/12/2011 exists.

2. Understanding the Data

The dataset includes 8 variables, namely Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID, and Country. Invoices that start with “C” are cancelled transactions. To understand what we are up against, we need to take a look at the data. The analysis will be done for the years 2010–2011; however, the same process works for 2009–2010 as well, since it will be generalized into a function at the end.

Primary adjustments (importing the libraries and setting data display preferences):

import datetime as dt
import pandas as pd
pd.set_option('display.max_columns', None) #to see all the columns
pd.set_option('display.float_format', lambda x: '%.3f' % x) #to have 3 decimal points

Reading the data:

df_ = pd.read_excel("excel-file-destination.xlsx", sheet_name="Year 2010-2011")
df = df_.copy() # It takes time to read the file, thus a copy is taken after reading it
# so we do not have to read the file and wait repeatedly throughout the analysis

Let’s take a glance at the dataset:

df.head() #unless specified otherwise, returns the first 5 rows of the dataset
df.shape #returns the number of rows and columns
df.isnull().sum() #returns the number of null values in every column

Now, we have an idea of what the dataset is about. It seems that the Invoice column values are not unique. This makes sense, since most purchases contain multiple items and the data has a row for every item, even within the same purchase.

We are dealing with a large dataset of 541,910 rows and 8 columns. But should we consider all the data for the analysis?

We have missing values, mostly in the Customer ID and Description variables. Since the analysis is customer specific, rows with a missing Customer ID carry no information and will have to be deleted.

Let’s answer some questions:

# Number of unique products
df["Description"].nunique()

# How many times each product appears in invoices
df["Description"].value_counts().head()

# Most ordered product (including quantities)
df.groupby("Description").agg({"Quantity": "sum"}).sort_values("Quantity", ascending=False).head()

# Number of unique invoices
df["Invoice"].nunique()

# Add a new column for TotalPrice
df["TotalPrice"] = df["Quantity"] * df["Price"]

# Find the total cost in every invoice
df.groupby("Invoice").agg({"TotalPrice": "sum"}).head()
# needed for monetary metric

3. Data Preparation:

Data preparation plays an important role in the analysis. The data needs to be manipulated in a way that allows inferences to be made. The problems in this dataset and their solutions are explained in the code below.

df.shape # To observe the decreasing number of rows as we make adjustments
df.isnull().sum() # Observed problem: missing Customer ID data, needs to be removed
df.describe().T # Observed problem: negative Quantity values, meaningless, need to be removed

df = df[(df['Quantity'] > 0)] # To solve the negative Quantity values problem

df.dropna(inplace=True)
# To solve the missing Customer ID problem
# inplace=True applies the change to the dataframe directly, without reassignment

df["Invoice"] = df["Invoice"].astype(str)
df = df[~df["Invoice"].str.contains("C", na=False)]
# Cancelled invoices do not need to be in the analysis, so they are removed.
# "~" negates the condition: keep everything except the matching rows

As seen, the number of rows decreased considerably after data preparation, and the TotalPrice column was added.

4. Calculating RFM Metrics

Until now, we basically prepared the data so that we can perform the RFM analysis on it. Now let’s start the analysis by calculating the metrics. Since we no longer live in 2010–2011, we need to establish a date on which the analysis is assumed to be performed. For this purpose, let’s find the latest date in the data and pick a day shortly after it as the analysis date.

Then, the InvoiceDate, Invoice, and TotalPrice columns are aggregated per Customer ID to obtain the RFM data frame containing the Recency, Frequency, and Monetary metrics, respectively.

df["InvoiceDate"].max()  # latest date on data

today_date = dt.datetime(2011, 12, 11)
# added 2 days to last day to form today's date
# utilized datetime library so that today_date's datatype is datetime

type(today_date)

rfm = df.groupby('Customer ID').agg(
    {'InvoiceDate': lambda InvoiceDate: (today_date - InvoiceDate.max()).days,
     'Invoice': lambda Invoice: Invoice.nunique(),
     'TotalPrice': lambda TotalPrice: TotalPrice.sum()})

# In order: Recency, Frequency, and Monetary are found.
# Recency is the difference between today's date and the last purchase, at the Customer ID level.
# Frequency is the number of unique invoices belonging to the same customer.
# Monetary is the sum of the total prices for the same customer, so we know how much each customer brings to the company.

rfm.head()
# check the data
rfm.columns = ['recency', 'frequency', 'monetary']
# change the column names

rfm.describe().T # to see if there is an unusual situation

rfm = rfm[rfm["monetary"] > 0]
# min monetary = 0 is unusual, so they are excluded.

rfm.shape

Now, we have transformed the invoice-level table into a customer-based table. We can see the time since each customer’s last purchase, their purchase frequency, and their total spending. For instance, Customer 12346’s last purchase was 326 days ago, this customer made only 1 purchase, and the total amount they spent with this company is 77,183.6 GBP. Now we have 4,338 customers and their aggregated, customer-level information to focus on.

5. Calculating RFM Scores:

While calculating RFM scores, the first thing to consider is that recency scores are assigned in reverse order compared to frequency and monetary, because for recency a smaller value is better.

rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
# recency metric is divided into 5 equal-sized parts and
# labelled such that the greatest recency gets 1 and the smallest recency gets 5.

rfm["frequency_score"] = pd.qcut(rfm['frequency'].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
# the problem with frequency is that some values repeat so often that
# qcut cannot split them into distinct bins (duplicate bin edges).
# rank(method="first") solves this by ranking tied values in order of appearance.

rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])

rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +
                    rfm['frequency_score'].astype(str))
# by the RFM definition, the recency and frequency scores are concatenated
# as strings to form the final RFM score
# the monetary score is useful for observation, but it is not used in forming the RFM score

rfm.describe().T
# newly created scores do not appear in the describe output
# since they are categorical labels, not numeric columns

rfm[rfm["RFM_SCORE"] == "55"]
# champions

rfm[rfm["RFM_SCORE"] == "11"]
# hibernating
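To see why rank(method="first") is needed, here is a toy demonstration with hypothetical frequency values: plain qcut raises an error on the duplicate bin edges created by ties, while ranking first gives every value a unique position:

```python
import pandas as pd

# Hypothetical frequencies with many ties
freq = pd.Series([1, 1, 1, 1, 2, 2, 3, 5, 8, 13])

# Direct qcut fails: repeated values produce duplicate quantile edges
try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
except ValueError as err:
    print("qcut failed:", err)

# Ranking ties in order of appearance makes every position unique
scores = pd.qcut(freq.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(scores.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```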

6. Creating and Analysing RFM Segments:

In order to increase the legibility of the table, we are going to label each RFM score with the segment it corresponds to in Table 1.

Regular expressions (regex) describe patterns that can be matched within a series. This is the part where we extract insights from the data.

# regex
# RFM Naming (Pattern Matching)
seg_map = {
r'[1-2][1-2]': 'hibernating',
r'[1-2][3-4]': 'at_Risk',
r'[1-2]5': 'cant_loose',
r'3[1-2]': 'about_to_sleep',
r'33': 'need_attention',
r'[3-4][4-5]': 'loyal_customers',
r'41': 'promising',
r'51': 'new_customers',
r'[4-5][2-3]': 'potential_loyalists',
r'5[4-5]': 'champions'
}
# Basically the R-F table is coded using regex.

rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex=True)
segment_analysis = rfm[["segment", "recency", "frequency", "monetary"]].groupby("segment").agg(["mean", "count"])


rfm[rfm["segment"] == "cant_loose"].head()
cant_loose_customers = rfm[rfm["segment"] == "cant_loose"].index
# A list of the customers the company should be more careful with is formed.
# This way, every customer segment is accessible.

new_df = pd.DataFrame()
# Empty dataframe is formed

new_df["new_customer_id"] = rfm[rfm["segment"] == "new_customers"].index
new_df["new_customer_id"] = new_df["new_customer_id"].astype(int)
# To get rid of the decimals

new_df.to_csv("new_customers.csv")
rfm.to_csv("rfm.csv")
# Segment information is exported to CSV files.

7. Functionalization:

According to basic programming principles such as DRY (Don’t Repeat Yourself), wrapping this entire process in a function is a must. That way we can apply the function, and hence the RFM analysis, to different data frames repeatedly.

A clean version of all the steps above is written in a function named create_rfm, with an option to create a CSV file:

def create_rfm(dataframe, csv=False):

Since this part is a repetition of the code above, you can take a look at it on my GitHub profile, where I will be posting regularly from now on.
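For readers who want a self-contained version, here is a hedged sketch of what such a create_rfm function could look like, assembled purely from the steps shown above; it is not the exact function from my GitHub (for instance, the analysis date is derived here as the dataset’s latest date plus two days rather than hard-coded):

```python
import datetime as dt

import pandas as pd

def create_rfm(dataframe, csv=False):
    # 1. Data preparation (same steps as above)
    dataframe = dataframe.copy()
    dataframe["TotalPrice"] = dataframe["Quantity"] * dataframe["Price"]
    dataframe = dataframe[dataframe["Quantity"] > 0].dropna()
    dataframe["Invoice"] = dataframe["Invoice"].astype(str)
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]

    # 2. RFM metrics (analysis date: latest date in the data plus two days)
    today_date = dataframe["InvoiceDate"].max() + dt.timedelta(days=2)
    rfm = dataframe.groupby("Customer ID").agg(
        {"InvoiceDate": lambda date: (today_date - date.max()).days,
         "Invoice": lambda inv: inv.nunique(),
         "TotalPrice": lambda price: price.sum()})
    rfm.columns = ["recency", "frequency", "monetary"]
    rfm = rfm[rfm["monetary"] > 0]

    # 3. RFM scores (monetary score omitted from the final score, as above)
    rfm["recency_score"] = pd.qcut(rfm["recency"], 5, labels=[5, 4, 3, 2, 1])
    rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5,
                                     labels=[1, 2, 3, 4, 5])
    rfm["RFM_SCORE"] = (rfm["recency_score"].astype(str) +
                        rfm["frequency_score"].astype(str))

    # 4. Segment naming through the same regex map
    seg_map = {
        r'[1-2][1-2]': 'hibernating', r'[1-2][3-4]': 'at_Risk',
        r'[1-2]5': 'cant_loose', r'3[1-2]': 'about_to_sleep',
        r'33': 'need_attention', r'[3-4][4-5]': 'loyal_customers',
        r'41': 'promising', r'51': 'new_customers',
        r'[4-5][2-3]': 'potential_loyalists', r'5[4-5]': 'champions',
    }
    rfm["segment"] = rfm["RFM_SCORE"].replace(seg_map, regex=True)

    if csv:
        rfm.to_csv("rfm.csv")
    return rfm[["recency", "frequency", "monetary", "segment"]]
```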

P.S. Thank you for devoting your time to reading all of this. I share as I learn, so a new data science related article will be coming every week. I hope this post helps you too. Please don’t hesitate to leave a comment :) Best of luck on your own data science journey.

Connect with me on LinkedIn

Connect with me on Medium

