RFM Analysis for Customer Segmentation

Oğuzcan Özkara
turkcell
Published in
6 min readJan 2, 2022

CRM (Customer Relationship Management) is one of the acronyms we have heard the most in the business world for the last 20 years. The aim of CRM is to maximize customer satisfaction by responding to customer expectations at the optimum level. Nowadays, the meaning of these concepts has expanded even more. Especially in the marketing field, many companies segment their customers, displaying the most appropriate behavior towards them and making recommendations. While doing this, they apply some data science algorithms to the data they have.

RFM is one of the techniques used for CRM and customer segmentation. It enables customers to be divided into groups based on their purchasing behaviors and companies to develop strategies specific to these groups.

https://clevertap.com/blog/rfm-analysis/
  • Today, we will be focussing on a business problem which is about retail data.

Data Set Information:

This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011.The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

Attribute / Variable Information:

Invoice : There is a unique number for each invoice. If the invoice number starts with the letter C it stands for canceled transactions
StockCode: There is a unique number for each product.
Description: Product name
Quantity: Indicates how many products were sold.
InvoiceDate: Invoice date
Price: Product unit price (Pound)
Customer ID: Unique customer number
Country: Name of the country where the customer lives

You can access the dataset from the link below. https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

1. Data Preprocessing

There are 2 different sheets in online_retail_II.xlsx file . We will use “Year 2010–2011".

1.a Understanding and preparing data

1.b Examine descriptive statistics of data

import datetime as dt
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
## Loading data
df_= pd.read_excel("Datasets/online_retail_II.xlsx", sheet_name= "Year 2010-2011"

## 1.a Storing the raw data in a different dataframe is always reasonable in case of losing any data.
df=df_.copy()# 1.b Examine descriptive statistics of datadf.head()df.shapedf.describe().Tdf.columns

1.c Are there any missing observations in the dataset? If so, how many missing observations are there in each variable?

df.isnull().sum() ## 135080 records for "Customer ID"  and  1454 records for "Description"

1.d Excluding missing observations from the dataset

Use "inplace =True" for making our changes permanent. When you examine the dataset deeply, you will notice that there are some bad data for our RFM analysis.

Firstly, we have to be aware of negative values in Quantity and Price columns. The second task is to exclude canceled transactions from the data frame because those records are not useful for our RFM analysis.

1.e Have a look at the number of unique products

1.f Let’s see how many of each product there are

1.e and 1.f

1.g Let’s list the 5 most ordered products in descending order

df.pivot_table (values="Quantity" , index="StockCode", aggfunc="count").sort_values("Quantity", ascending=False).head(5) 
# Also 3rd way for 1.f :)
Result of 1.g

1.h Create a variable named ‘Total Price’ that represents the total earnings per invoice in order to use in upcoming processes of RFM Analysis

df["TotalPrice"] = df["Quantity"] * df["Price"]

2. CALCULATING RFM METRICS

Each metric in RFM will be calculated specifically for each customer. Let’s define Recency, Frequency, and Monetary metrics that milestones for RFM Analysis process.

Recency => How recent was the customer's last purchase? Number of days between analysis date and last invoice date.

Frequency=> How often did this customer make a purchase in a given period? Number of purchases.

Monetary => How much money did the customer spend in a given period? In other words, how much money will the company potentially earn from a particular customer?

For recency, we should know the newest date of the invoice in our dataset. According to that date, we can specify the analysis date which will be used for the next steps of our analysis.

The maximum date is 2011-12-09 , Instead of using today's date, we decide the most rational date, so I choose 2 days later of maximum date.

df[“InvoiceDate”].max()
today_date = dt.datetime(2011, 12, 11)

Now, we are going to create a RFM variable to keep all of the metrics together.

rfm = df.groupby('Customer ID').agg({'InvoiceDate': lambda InvoiceDate: (today_date - InvoiceDate.max()).days,
'Invoice': lambda Invoice: Invoice.nunique(),
'TotalPrice': lambda TotalPrice: TotalPrice.sum()})


## Control if there is any bad data.
rfm=rfm[rfm["TotalPrice"] > 0]

## We made our column names suitable for rfm analysis
rfm.columns = ['recency', 'frequency', 'monetary']

rfm.head()

3. CALCULATING RFM SCORES

In this part of my article, we will be calculating scores and standardizing RFM metrics we formed last part. qcut is a Quantile-based discretization function. It divides data into equal-sized bins. Let’s add new columns to data frame to keep RFM scores.

If a customer has made a recent purchase, it is more valuable and should have a higher score. So there is an inverse proportional relationship between frequency score and label.

rfm["recency_score"]= pd.qcut(rfm['recency'], 5 ,labels=[5,4,3,2,1])

We use rank method below, why? I answer your question with one example. Assume that we have a variable consisting of a hundred piece "10" value. Even if you rank/quantile that dataset, you will possibly see "10" values in more than one different quantiles. That's why we use the rank method because we want the distribution to be correct.

In this business problem, we’ll be focussing on recency and frequency scores and we stick to RFM Segmentation Heatmap/Graph below.

RFM Heatmap/ Graph

As you may think, the highest customer score might be 55, and the lowest is 11.

rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +  rfm['frequency_score'].astype(str))

rfm[rfm["RFM_SCORE"] == "55"].head()
The highest RFM Scores

4. CREATING & ANALYSİS RFM SEGMENTS

We formed/mapped our RFM Segmentation heatmap/graph in code blocks via using Regex.

# Naming / Mapping RFM Segmentations
seg_map = {
r'[1-2][1-2]': 'hibernating',
r'[1-2][3-4]': 'at_Risk',
r'[1-2]5': 'cant_loose',
r'3[1-2]': 'about_to_sleep',
r'33': 'need_attention',
r'[3-4][4-5]': 'loyal_customers',
r'41': 'promising',
r'51': 'new_customers',
r'[4-5][2-3]': 'potential_loyalists',
r'5[4-5]': 'champions'
}
rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex=True)

5. Result

To be honest, the rest is up to the request that comes from manager/business teams. If you want you can set-apart “champions” segment then you can send them a congratulatory address or evaluate your “hibernating” customers separately. Based on requirements is changeable.

As you can see, all segments(10) are listed by grouping below. Their statistical data indicates that they have different features.

rfm[["segment", "recency", "frequency", "monetary"]].groupby("segment").agg(["mean", "count"])

In addition to these, we can see the number of customers for each segment from the output above. The hibernating segment is at the top of the list in the aspect of segment distribution. Show this with a pie chart.

  • Additionally, we can separate all segments from each other and store them in different files. As an example, let’s try extracting customer ids in just the “new customers” segment from the data frame in order to load into a new CSV file named new_customers.csv

You can access all the equipment and code on Github from the link below.

https://github.com/oguzcan81/RFM-Analysis

--

--