Understand Your Customer Behaviour with Customer Segmentation Methods

A Friendly Introduction to RFM Analysis

Published in Data Folks Indonesia · 7 min read · Oct 28, 2021

Image by **Daniel Bernard** on **Unsplash**

Understanding customer behaviour helps a seller or business build a strong relationship with customers, which can lead to new sales. Once the seller knows how customers make decisions and act when purchasing goods and services, the seller can focus its marketing efforts on current products and new product launches.

To understand their behaviour, we can use customer segmentation. This technique divides customers into several groups based on their characteristics. Here, I will talk about RFM, a method that uses customer behaviour: how much the customer spends on products, how often the customer buys, and how long it has been since the last purchase.

What is RFM?

RFM stands for Recency, Frequency, and Monetary value. This method uses historical transaction data to derive these three values. To learn how customers buy products, we start by extracting the three RFM indicators:

Recency: how long it has been since the customer’s last purchase, measured from a given reference date
Frequency: how many purchases the customer has made
Monetary: how much money the customer has spent on purchases
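
To make the three indicators concrete, here is a minimal sketch in plain Python. The transactions, the customer IDs, and the reference date are made up for illustration and do not come from the dataset used later.

```python
from datetime import date

# Hypothetical toy transactions: (customer_id, purchase date, amount)
transactions = [
    ("A", date(2015, 3, 1), 40),
    ("A", date(2015, 3, 9), 60),
    ("B", date(2015, 1, 10), 25),
]
reference_date = date(2015, 3, 10)  # assumed "given date" for recency

rfm_toy = {}
for customer, day, amount in transactions:
    entry = rfm_toy.setdefault(
        customer, {"last": day, "frequency": 0, "monetary": 0}
    )
    entry["last"] = max(entry["last"], day)  # most recent purchase so far
    entry["frequency"] += 1                  # one more transaction
    entry["monetary"] += amount              # running total spent

for entry in rfm_toy.values():
    # Recency: days between the reference date and the last purchase
    entry["recency"] = (reference_date - entry["last"]).days

print(rfm_toy["A"])
```

Customer A bought twice, most recently one day before the reference date, so A has a low recency, a frequency of 2, and a monetary value of 100.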

How do we build an RFM model?

Once we have each RFM value, we can use them to rank each customer and determine which segment they belong to. This takes several steps:

1. Calculate the percentile values for each customer

For example, customer A is in the 93rd percentile of frequency.

2. Compare these to the overall percentiles

For instance:

If the percentile is below the 45th, the F score is 1.
If the percentile is from the 45th to the 80th, the F score is 2.
If the percentile is above the 80th, the F score is 3.

Since customer A is above the 80th percentile of frequency, they receive an F score of 3.
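
The thresholds above can be sketched as a small function. This is only an illustration: the helper names `percentile_rank` and `f_score` and the list of purchase counts are hypothetical, and the percentile rank used here (share of customers with a value less than or equal to the given one) is one of several common definitions.

```python
def percentile_rank(value, population):
    """Percentage of the population with a value <= `value` (assumed definition)."""
    return 100 * sum(v <= value for v in population) / len(population)


def f_score(percentile):
    """Map a frequency percentile to an F score using the example thresholds."""
    if percentile < 45:
        return 1
    if percentile <= 80:
        return 2
    return 3


# Hypothetical purchase counts for ten customers
purchase_counts = [2, 5, 7, 9, 12, 15, 18, 20, 22, 30]
rank = percentile_rank(22, purchase_counts)
print(rank, f_score(rank))
```

A customer in the 93rd percentile, like customer A, gets `f_score(93) == 3`.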

3. Combine the three fields (R, F, and M)

We can get the overall RFM score by combining the three fields.

E.g.:

Customer A receives an F score of 3, an R score of 2, and an M score of 1. Written in R, F, M order, the RFM score for Customer A is 231.
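
As a quick sketch, combining the fields is just string concatenation in R, F, M order. The function name `rfm_code` is hypothetical.

```python
def rfm_code(r, f, m):
    """Concatenate the three scores in R, F, M order."""
    return f"{r}{f}{m}"


print(rfm_code(r=2, f=3, m=1))  # prints 231
```

The result is best treated as a label rather than a number: 231 means "R=2, F=3, M=1", not two hundred and thirty-one.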

Let’s jump into the code!

Import Libraries & Data

import pandas as pd
import time
import requests
from io import StringIO
import seaborn as sns
import matplotlib.pyplot as plt

# read data from URL
url = "https://raw.githubusercontent.com/alifiaharmd/ML-DLPlayground/main/Dataset/Retail_Data_Transactions.csv"
data = pd.read_csv(StringIO(requests.get(url).text))

Data Understanding

Check the number of rows and columns

data.shape

output: (125000, 3)

Check the first 5 rows of the data

# head of data
data.head()

output:

+-------------+------------+-------------+
| customer_id | trans_date | tran_amount |
+-------------+------------+-------------+
| CS5295      | 11-Feb-13  |          35 |
| CS4768      | 15-Mar-15  |          39 |
| CS2122      | 26-Feb-13  |          52 |
| CS1217      | 16-Nov-11  |          99 |
| CS1850      | 20-Nov-13  |          78 |
+-------------+------------+-------------+

More data understanding code:

# number of unique customer_id
len(data.customer_id.unique())

# total amount of transaction
data.tran_amount.sum()

# total unique date
len(data.trans_date.unique())

Data Preparation

a. Recency

We will take the trans_date and customer_id columns, since the Recency value is derived from trans_date. Then we will find the latest transaction date for each customer_id.

# Recency uses two of the three columns in data (copy so we do not modify a view of data):
recency = data[['trans_date', 'customer_id']].copy()
# convert trans_date from text to datetime
recency['trans_date'] = pd.to_datetime(recency.trans_date)
recency.head()
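
The trans_date values shown in data.head() look like day, abbreviated month name, and two-digit year. As a hedged sketch, passing an explicit format to pd.to_datetime avoids relying on format inference. It assumes every row follows the same pattern, which is not verified here.

```python
import pandas as pd

# Two dates copied from the data.head() output above
raw_dates = pd.Series(["11-Feb-13", "15-Mar-15"])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(raw_dates, format="%d-%b-%y")
print(parsed)
```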

Here, now refers to the latest date available in the data, which we will use as the reference point for recency:

now = max(recency['trans_date'])
# alternative: now = pd.to_datetime('today')

The groupby function in pandas groups rows that share the same index or column values. As we saw above, a single customer can appear in more than one transaction, so we group by customer_id and keep each customer’s latest transaction date:

recency = recency.groupby(['customer_id']).max()

Recency refers to the time since the last purchase, so let’s find the number of days:

recency_days = now - recency['trans_date']
recency_days = pd.DataFrame(recency_days)
recency_days.head()

output:
+-------------+------------+
| customer_id | trans_date |
+-------------+------------+
| CS1112      | 61 days    |
| CS1113      | 35 days    |
| CS1114      | 32 days    |
| CS1115      | 11 days    |
| CS1116      | 203 days   |
+-------------+------------+

Create a new dataset from the recency_days dataframe for further analysis:

# .dt.days extracts the whole number of days from each timedelta; stored as float
recency = pd.DataFrame(recency_days['trans_date'].dt.days.astype(float))
recency.columns = ['recency']
recency.head()

output:
+-------------+------------+
| customer_id | recency    |
+-------------+------------+
| CS1112      |       61.0 |
| CS1113      |       35.0 |
| CS1114      |       32.0 |
| CS1115      |       11.0 |
| CS1116      |      203.0 |
+-------------+------------+

b. Frequency

Frequency is obtained by counting the number of transactions for each customer.

# rename on the selection directly, so nothing is changed in place on a slice of data
frequency = data[['customer_id', 'trans_date']].rename(columns={"trans_date": "frequency"})

Count the number of times the customer has made purchases:

frequency = frequency.groupby(['customer_id']).count()
frequency.head()

output:
+-------------+-----------+
| customer_id | frequency |
+-------------+-----------+
| CS1112      |        15 |
| CS1113      |        20 |
| CS1114      |        19 |
| CS1115      |        22 |
| CS1116      |        13 |
+-------------+-----------+

c. Monetary

The monetary value is obtained by summing the amounts of all purchases made by each customer.

# Monetary refers to the total money spent by a customer over time:
monetary = data[['customer_id', 'tran_amount']].rename(columns={"tran_amount": "monetary"})
# Sum up all the transactions of every respective customer:
monetary = monetary.groupby(['customer_id']).sum()
monetary.head()

output:
+-------------+----------+
| customer_id | monetary |
+-------------+----------+
| CS1112      |     1012 |
| CS1113      |     1490 |
| CS1114      |     1432 |
| CS1115      |     1659 |
| CS1116      |      857 |
+-------------+----------+

RFM Analysis

Combine 3 variables: Recency, Frequency, and Monetary (RFM)

# Finally, concatenate the dataframes:
rfm = pd.concat([recency, frequency, monetary], axis=1)
rfm.head()

output:
+-------------+---------+-----------+----------+
| customer_id | recency | frequency | monetary |
+-------------+---------+-----------+----------+
| CS1112      |    61.0 |        15 |     1012 |
| CS1113      |    35.0 |        20 |     1490 |
| CS1114      |    32.0 |        19 |     1432 |
| CS1115      |    11.0 |        22 |     1659 |
| CS1116      |   203.0 |        13 |      857 |
+-------------+---------+-----------+----------+
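
As an alternative sketch, the three values can also be computed in a single groupby with named aggregation instead of building three dataframes and concatenating them. The toy transactions below are made up. Only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the dataset
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B", "B", "C"],
    "trans_date": pd.to_datetime([
        "2015-01-01", "2015-03-01", "2014-12-15",
        "2015-02-01", "2015-03-10", "2014-06-30",
    ]),
    "tran_amount": [40, 60, 25, 35, 50, 90],
})

now = toy["trans_date"].max()  # latest date in the toy data

rfm_toy = toy.groupby("customer_id").agg(
    last_purchase=("trans_date", "max"),   # latest transaction per customer
    frequency=("trans_date", "count"),     # number of transactions
    monetary=("tran_amount", "sum"),       # total amount spent
)
# Recency in whole days since the last purchase
rfm_toy["recency"] = (now - rfm_toy["last_purchase"]).dt.days

print(rfm_toy[["recency", "frequency", "monetary"]])
```

The result has one row per customer, like the rfm dataframe above, with the only difference being that recency is stored as an integer.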

Visualise the data distribution

Plot the number of days since each customer’s last purchase:

plt.figure(figsize=(8,8))
sns.set_context("poster")
# histogram with a kernel density estimate overlaid
sns.histplot(rfm['recency'], kde=True)
plt.xlabel('Days since last purchase')

Plot the number of times the customer has made a purchase:

plt.figure(figsize=(8,8))
sns.set_context("poster")
# histogram with a kernel density estimate overlaid
sns.histplot(rfm['frequency'], kde=True)

Plot the total revenue that each customer brought in to the shop:

plt.figure(figsize=(8,8))
sns.set_context("poster")
# histogram with a kernel density estimate overlaid
sns.histplot(rfm['monetary'], kde=True)
plt.xlabel('IDR')

Let’s use quantiles to split the data into 3 categories, since we will use an RFM scale of 1 to 3.

rfm.quantile([.33, .66, 1], axis=0)

output:

+-----------+---------+-----------+----------+
| quantiles | recency | frequency | monetary |
+-----------+---------+-----------+----------+
|      0.33 |    30.0 |      16.0 |    973.0 |
|      0.66 |    85.0 |      20.0 |   1414.0 |
|      1.00 |   857.0 |      39.0 |   2933.0 |
+-----------+---------+-----------+----------+

Copy the RFM dataset so that it isn't affected by the changes:

RFMscores = rfm.copy()

Use pd.qcut to assign each customer to a quantile bin for each variable. The recency labels are reversed because a more recent purchase (fewer days) should earn a higher score:

RFMscores['recency_score']    = pd.qcut(RFMscores['recency'], 3, labels=[3, 2, 1])
RFMscores['frequency_score']  = pd.qcut(RFMscores['frequency'], 3, labels=[1, 2, 3])
RFMscores['monetary_score']   = pd.qcut(RFMscores['monetary'], 3, labels=[1, 2, 3])
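
To see what pd.qcut does with reversed labels, here is a small sketch on six made-up recency values. The numbers are hypothetical and chosen so that each of the three bins receives two customers.

```python
import pandas as pd

# Hypothetical days since last purchase for six customers
recency_toy = pd.Series([5, 10, 40, 60, 200, 300])

# Three equal-sized bins; the lowest recency values get the highest label
scores = pd.qcut(recency_toy, 3, labels=[3, 2, 1])
print(scores.astype(int).tolist())  # prints [3, 3, 2, 2, 1, 1]
```

The two most recent buyers (5 and 10 days) score 3, while the two who have been away longest (200 and 300 days) score 1.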

RFM score result

Create RFMscore dataframe for new data with RFM result

RFMscores = RFMscores.reset_index()
# Convert data type
RFMscores['recency_score'] = RFMscores.recency_score.astype(int)
RFMscores['frequency_score'] = RFMscores.frequency_score.astype(int)
RFMscores['monetary_score'] = RFMscores.monetary_score.astype(int)

Create a new column for RFM Score

RFMscores['rfm_score'] = RFMscores['recency_score'].map(str) + RFMscores['frequency_score'].map(str) + RFMscores['monetary_score'].map(str)

Attach a segment label to each RFM score using an external reference dataset

Import the external data

url2 = "https://gitlab.com/priagungkhusuma/rfm_segmentation/-/raw/master/rfm_score_dim.csv"
# on_bad_lines='skip' drops malformed rows (requires pandas 1.3 or later)
rfm_score_reference = pd.read_csv(StringIO(requests.get(url2).text), on_bad_lines='skip')
rfm_score_reference['rfm_score'] = rfm_score_reference.rfm_score.map(str)

Check the dataset

rfm_score_reference.head()

output:
+----------------+-----------+
|  segment_name  | rfm_score |
+----------------+-----------+
| ABOUT TO SLEEP |       112 |
| ABOUT TO SLEEP |       113 |
| ABOUT TO SLEEP |       121 |
| ABOUT TO SLEEP |       122 |
| ABOUT TO SLEEP |       131 |
+----------------+-----------+

Merge the two datasets

RFMscores = pd.merge(RFMscores, rfm_score_reference, on=['rfm_score'])
# Save it as csv file
RFMscores.to_csv('rfm_score.csv', index=False)
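
One thing to keep in mind: pd.merge performs an inner join by default, so a customer whose rfm_score has no row in the reference table would silently disappear from the result. The sketch below, on made-up data, uses how='left' with indicator=True to list such customers. The "LOYAL" label and the "999" score are hypothetical values for illustration and are not taken from the reference file.

```python
import pandas as pd

# Hypothetical scored customers; "999" is deliberately not in the reference
scores_toy = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "rfm_score": ["333", "112", "999"],
})
# Hypothetical reference table ("LOYAL" is an assumed label)
reference_toy = pd.DataFrame({
    "segment_name": ["LOYAL", "ABOUT TO SLEEP"],
    "rfm_score": ["333", "112"],
})

# A left join keeps every customer; _merge records where each row came from
merged = scores_toy.merge(
    reference_toy, on="rfm_score", how="left", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["customer_id", "rfm_score"]])
```

If unmatched is empty, the inner join and the left join give the same customers.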

Results & Insights

Treemap Visualisation

Most of the customers are about to sleep. These customers need more attention so that they come back to buy from the store, for example through promos and discounts. The proportions of lost and loyal customers are similar, at around 13% each.
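
The proportions behind a chart like this can be computed with value_counts. The sketch below uses a made-up list of segment labels, so its numbers are not the results reported above, and the "LOYAL" and "LOST" spellings are assumptions about how the reference file names those segments.

```python
import pandas as pd

# Hypothetical segment labels for eight customers
segments_toy = pd.Series(
    ["ABOUT TO SLEEP"] * 4 + ["LOYAL"] * 2 + ["LOST"] * 2,
    name="segment_name",
)

# Share of customers in each segment, as a fraction of all customers
shares = segments_toy.value_counts(normalize=True)
print(shares)
```

On the real data, the same call on RFMscores['segment_name'] would give the share of each segment.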

That’s all about the RFM method. It is commonly used in marketing to understand customers and make better decisions when designing targeted campaigns.

In this article, I discussed data analysis. If you are interested in machine learning and artificial intelligence, I recently wrote an article about machine learning (ML) algorithms for clustering. You can read it here:

Unsupervised Learning: Hierarchical Clustering and DBSCAN | by Alifia C Harmadi | Analytics Vidhya | Medium

Moreover, if you would like to read more about data analysis, machine learning, deep learning, data visualisation, and even stories of working in data analytics roles, you can find them here: Data Folks Indonesia (Medium). This publication is from Jakarta AI Research.

Feel free to join our Discord channel to get more information and make some friends!

Jakarta AI Research Discord invitation link: https://discord.gg/6v28dq8dRE

I think that’s enough from me! See you in other articles 👋👋👋