RFM with Python (Study Case: Ecommerce Dataset)

Katarina Nimas Kusumawati
Aug 22, 2021 · 10 min read


Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways. One of the techniques of customer segmentation is RFM (recency, frequency, monetary). RFM segmentation allows marketers to target specific clusters of customers with communications that are much more relevant to their behavior.

Objective

Create the best clustering for customer segmentation.

RFM

RFM Analysis is a technique to group customers based on the recency, frequency, and monetary total of their recent transactions to identify the best customers and perform targeted marketing campaigns.

Recency: How recent was the customer’s last purchase?

Frequency: How often did this customer make a purchase in a given period?

Monetary: How much money did the customer spend in a given period?

Data Preparation

Import Library

First, we need to prepare the libraries that will be used.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

numpy: a library for fast numerical computation on arrays.

pandas: a library that provides data structures and data analysis tools.

matplotlib: a multiplatform data visualization library built on NumPy arrays.

seaborn: Python data visualization library based on matplotlib.

Load Data

The data is taken from a data warehouse. Only the following tables will be used.

Dimension Customers
Fact Order
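
The loading step itself is not shown here; as a minimal sketch, assuming the two tables were exported from the warehouse to CSV files (the file names are hypothetical):

# Hypothetical file names; the original tables live in a data warehouse
df_customers = pd.read_csv('dim_customers.csv')  # customer dimension table
df_orders = pd.read_csv('fact_orders.csv')       # order fact table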

Convert order_date_key data type to datetime

The data type of order_date_key is int64, but we need it to be a datetime so we can compute recency. This is how I convert the integer to datetime.

# Convert order_date_key to datetime
get_time = ['order_date_key']
for column in get_time:
    df_orders.loc[:, column] = pd.to_datetime(df_orders[column], format='%Y%m%d')

Join Between Customers & Orders

Inner join the customer dimension table and the order fact table, keeping only the user_name, order_date_key, order_id, and total_payment columns.

# Join between df_customers and df_orders
df = pd.merge(df_customers, df_orders, on='customer_key', how='inner').loc[:, ['user_name', 'order_date_key', 'order_id', 'total_payment']]

Build Recency, Frequency, and Monetary

This is how I build the RFM table. I add one day to the latest order date (time_limit) so that the most recent order does not get a recency of 0; a value of 0 could make the data ambiguous.

Build RFM
# Time limit
import datetime as dt
time_limit = max(df.order_date_key) + dt.timedelta(1)
# Calculate recency (days since last order), frequency (unique orders), and monetary (total payment) per customer
df = df.groupby('user_name').agg({'order_date_key': lambda x: (time_limit - x.max()).days,
                                  'order_id': pd.Series.nunique,
                                  'total_payment': 'sum'})
# Rename columns
df.rename(columns={'order_date_key': 'recency',
                   'order_id': 'frequency',
                   'total_payment': 'monetary'}, inplace=True)

Scaling (For Visualization)

The reason I do this is so that the visualizations of the different models can be compared with one another, since they then refer to the same scale.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled_vis = scaler.fit_transform(df)
df_scaled_vis = pd.DataFrame(df_scaled_vis, index=df.index, columns=df.columns)
df_scaled_vis.reset_index(level=0, inplace=True)
Result from Scaling

Modelling For RFM

Handling Outliers for RFM

I handle the outliers using the IQR rule. The outliers on recency and monetary are removed because there are only a few of them. If I removed the outliers on frequency, only frequency 1 would remain, so I keep them. The purpose of handling outliers is to get clean quantiles; I still include all of the data when building the RFM segments.

# Handling outliers using IQR
def limit(x):
    Q1 = df_rfm[x].quantile(0.25)
    Q3 = df_rfm[x].quantile(0.75)
    IQR = Q3 - Q1
    minimum_limit = Q1 - (1.5 * IQR)
    maximum_limit = Q3 + (1.5 * IQR)
    print('Minimum limit', minimum_limit)
    print('Maximum limit', maximum_limit)
    print('Outliers percentage under the minimum limit: {} %'.format(df_rfm[(df_rfm[x] < minimum_limit)].shape[0] / df_rfm.shape[0] * 100))
    print('Outliers percentage over the maximum limit: {} %'.format(df_rfm[(df_rfm[x] > maximum_limit)].shape[0] / df_rfm.shape[0] * 100))
Recency
Frequency
Monetary

The red rows mark the outliers that will be dropped.

After handling outliers, check the quantiles on all three attributes.

df_rfm.describe(include='all')

· The quantiles will be used as dividers for the RFM segments, except for frequency.

· RFM is divided into 4 segments, where the segment with the lowest value is 1 and the highest is 4.

· For monetary, segment 1 is for customers who spend very little money, while segment 4 is for customers who spend a lot.

· The recency is reversed: segment 4 is for customers who ordered recently, while segment 1 is for customers who haven't ordered in a long time.

· For frequency, segment 1 is for customers who shopped 1 time, segment 2 for 2 times, segment 3 for 3 times, and segment 4 for customers who shopped 4 or more times.

Limitation

Form the RFM Segments

The RFM segments are built from the quantiles, which can be seen using the describe function in pandas.

Recency Segment
Frequency Segment
Monetary Segment
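
The segment-building code itself appears only in the images above; here is a minimal sketch of the rules described earlier, assuming df_rfm holds the raw recency, frequency, and monetary columns (the RFM_R/RFM_F/RFM_M column names are assumptions):

# Sketch of the segment rules; the article derives its exact bin edges from describe()
# Recency is reversed, so the most recent buyers get score 4
df_rfm['RFM_R'] = pd.qcut(df_rfm['recency'], q=4, labels=[4, 3, 2, 1]).astype(int)
# Frequency uses the fixed rule above: 1x, 2x, 3x, and >= 4x purchases
df_rfm['RFM_F'] = df_rfm['frequency'].clip(upper=4).astype(int)
# Monetary: the lowest spenders get 1, the highest get 4
df_rfm['RFM_M'] = pd.qcut(df_rfm['monetary'], q=4, labels=[1, 2, 3, 4]).astype(int)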

Visualize the distribution of the RFM using plotly

from plotly.subplots import make_subplots
import plotly.graph_objects as go
df_count_r = df_rfm2.groupby('RFM_R')['user_name'].nunique().reset_index()
df_count_f = df_rfm2.groupby('RFM_F')['user_name'].nunique().reset_index()
df_count_m = df_rfm2.groupby('RFM_M')['user_name'].nunique().reset_index()
fig = make_subplots(rows=3, cols=1, shared_yaxes=True, subplot_titles=('Recency','Frequency', 'Monetary'))
fig.add_trace(go.Bar(x=df_count_r['RFM_R'], y=df_count_r['user_name']), 1, 1)
fig.add_trace(go.Bar(x=df_count_f['RFM_F'], y=df_count_f['user_name']), 2, 1)
fig.add_trace(go.Bar(x=df_count_m['RFM_M'], y=df_count_m['user_name']), 3, 1)
fig.update_layout(showlegend=False, width=800, height=1200)
fig.show()
Visualization using Plotly

After forming the 4 segments, they are combined into RFM groups. Each RFM group has a different name and marketing strategy.
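
The actual group names and strategies appear in the image below; purely as an illustration, a hypothetical mapping from scores to named groups could look like this:

# Hypothetical group names and rules for illustration only;
# the article's real groups are shown in its RFM Segment image
def rfm_group(row):
    if row['RFM_R'] >= 3 and row['RFM_F'] >= 3 and row['RFM_M'] >= 3:
        return 'Best Customers'
    if row['RFM_R'] <= 2 and row['RFM_F'] >= 3:
        return 'At Risk'
    if row['RFM_R'] >= 3 and row['RFM_F'] == 1:
        return 'New Customers'
    return 'Others'
df_rfm2['RFM_Group'] = df_rfm2.apply(rfm_group, axis=1)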

RFM Segment

Visualization of RFM

Advantage:

•Can divide customers based on specific criteria (highest shopping frequency, highest spending, customers who have high frequency, monetary, and high recency)

RFM Cluster
Count Number of Customer in Each Cluster (RFM)
RFM 3D Visualization
Legend
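
The 3D charts are rendered as images in this article; a minimal plotly sketch of such a plot, assuming the cluster label lives in a column named RFM_Group (a hypothetical name from the sketch above):

import plotly.express as px
# One point per customer, colored by its RFM group
fig = px.scatter_3d(df_rfm2, x='recency', y='frequency', z='monetary', color='RFM_Group')
fig.show()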

Normalization for K-Means, Birch, and Optics Model

In this part, I only normalize the frequency, using np.log. Normalization is only applied to frequency because the number of customers at each frequency is highly unequal.

Before Normalization
# Convert the data into a normal distribution
# More reliable predictions are made if the predictors and the target variable are normally distributed
from scipy import stats
df_normal = df[['recency', 'monetary']].copy()
df_normal['frequency'] = np.log(df['frequency']).round(3)
After Normalization

Scaling for K-Means, Birch, and Optics

Scaling uses StandardScaler. Feature scaling is a method used to normalize the range of the independent variables or features of the data.

Equation for scaling: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_normal)
After Scaling

K-Means

K-Means: the k-means algorithm takes an input parameter, k, and partitions a set of n objects into k clusters so that intracluster similarity is high while intercluster similarity is low.

The algorithm works as follows: first, it randomly selects k objects, each of which initially represents a cluster center. Each remaining object is assigned to the cluster it is most similar to, based on its distance to the cluster mean. The algorithm then recomputes the mean of each cluster. This process iterates until the criterion function converges.
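
To make that loop concrete, here is a minimal NumPy sketch of the assign/update iteration (for illustration only; the actual clustering below uses scikit-learn):

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly pick k points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest center (Euclidean distance)
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        # 3. Recompute each center as the mean of its points (assumes no cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop once the centers converge
            break
        centers = new_centers
    return labels, centers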

Here I take two approaches to choose the number of clusters k: the elbow method and the silhouette score².

K-Means Elbow Method

from sklearn.cluster import KMeans
wcss = {}
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=100)
    kmeans.fit(df_kmeans)
    wcss[i] = kmeans.inertia_
# Visualize
fig, ax = plt.subplots(figsize=(16,9))
sns.pointplot(x=list(wcss.keys()), y=list(wcss.values()))
plt.title('K-Means Elbow')
plt.xlabel('K Numbers')
plt.ylabel('WCSS')
plt.show()
K-Means Elbow

As you can see, it is difficult to identify the elbow in this visualization, so I use another approach: the silhouette score.

K-Means Silhouette Score

from sklearn.metrics import silhouette_score
silhouette_sc = {}
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=100)
    kmeans.fit(df_kmeans)
    silhouette_sc[i] = silhouette_score(df_kmeans, kmeans.labels_, metric='euclidean')
# Visualize
fig, ax = plt.subplots(figsize=(16,9))
sns.pointplot(x=list(silhouette_sc.keys()), y=list(silhouette_sc.values()))
plt.title('K-Means Silhouette')
plt.xlabel('K Numbers')
plt.ylabel('Silhouette')
plt.show()
K-Means Silhouette Score

To see each silhouette score

for key, value in silhouette_sc.items():
    print('-------------------')
    print('n_clusters = ', key)
    print('Silhouette score = ', value)
K-Means Silhouette Score

From the silhouette scores I pick k = 4. Why not 2, despite its higher score? Because a division into only 2 clearly separated clusters is too coarse to be useful for segmentation.

from sklearn.cluster import KMeans
# Pick the cluster
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=100)
kmeans.fit(df_kmeans)

Visualization of K-Means

Advantage:

  • Can separate customers who have a high shopping frequency, high spending, and high recency.
  • 0: High recency, low frequency, low monetary
  • 1: Low recency, low frequency, low monetary
  • 2: High frequency
  • 3: High monetary, low frequency
K-Means Cluster
Count Number of Customers in Each Cluster (KMeans)
K-Means 3D Visualization

Birch

Birch:

BIRCH is an integrated hierarchical clustering method. It introduces two concepts, the clustering feature (CF) and the clustering feature tree (CF tree), which are used to summarize clusters. These structures help the clustering method achieve good speed and scalability on large databases.

How does the BIRCH algorithm work? There are two phases:
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve its clustering structure.
Phase 2: BIRCH applies a selected clustering algorithm to cluster the leaf nodes of the CF tree².
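
Scikit-learn exposes Phase 1's incremental CF-tree construction through partial_fit, which is what makes BIRCH practical on data that does not fit in memory at once. A sketch, assuming the df_birch used below:

from sklearn.cluster import Birch
import numpy as np
birch = Birch(n_clusters=4)
# Phase 1: feed the data in chunks; the CF tree is updated incrementally
for chunk in np.array_split(df_birch, 10):
    birch.partial_fit(chunk)
# The final labels reflect Phase 2, the global clustering of the CF-tree leaf entries
labels = birch.predict(df_birch)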

Birch Silhouette Score

from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score
silhouette_sc = {}
for i in range(2, 11):
    birch = Birch(n_clusters=i)
    birch.fit(df_birch)
    silhouette_sc[i] = silhouette_score(df_birch, birch.predict(df_birch), metric='euclidean')
# Visualize
fig, ax = plt.subplots(figsize=(16,9))
sns.pointplot(x=list(silhouette_sc.keys()), y=list(silhouette_sc.values()))
plt.title('Birch Silhouette')
plt.xlabel('n Clusters')
plt.ylabel('Silhouette')
plt.show()
Birch Silhouette Score

From the silhouette scores I again pick 4 clusters. Why not 2, despite its higher score? Because a division into only 2 clearly separated clusters is too coarse to be useful for segmentation.

# Creating the BIRCH clustering model
from sklearn.cluster import Birch
birch = Birch(n_clusters = 4)
birch.fit(df_birch)

Visualization of Birch

Advantage:

  • Can separate customers who have a high shopping frequency, and divide the remaining customers into 3 segments based on spending
  • 0: High monetary
  • 1: Middle monetary
  • 2: Low monetary
  • 3: High frequency
Birch Cluster
Count Number of Customers in Each Cluster (Birch)
Birch 3D Visualization

Optics

Optics:

OPTICS (Ordering Points To Identify the Clustering Structure) is closely related to DBSCAN: it finds core samples of high density and expands clusters from them¹. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius, and it is better suited to large datasets than the current scikit-learn implementation of DBSCAN.

Optics Silhouette Score

from sklearn.cluster import OPTICS
from sklearn.metrics import silhouette_score
silhouette_sc = {}
for i in range(2, 11):
    optics = OPTICS(max_eps=i, min_samples=300)
    optics.fit(df_optics)
    silhouette_sc[i] = silhouette_score(df_optics, optics.labels_, metric='euclidean')
# Visualize
fig, ax = plt.subplots(figsize=(16,9))
sns.pointplot(x=list(silhouette_sc.keys()), y=list(silhouette_sc.values()))
plt.title('Optics Silhouette')
plt.xlabel('Epsilon')
plt.ylabel('Silhouette')
plt.show()
Optics Silhouette Score

The silhouette score is highest for max_eps values between 6 and 10; I choose 6.

# Creating the OPTICS clustering model
from sklearn.cluster import OPTICS
optics = OPTICS(min_samples=300, max_eps=6.0)
optics.fit(df_optics)
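
OPTICS output is commonly inspected with a reachability plot, where valleys correspond to clusters; a minimal sketch using the fitted model's attributes:

# Reachability distances in the cluster order computed by OPTICS;
# valleys correspond to clusters (the first value is infinite by definition)
reachability = optics.reachability_[optics.ordering_]
plt.figure(figsize=(16, 5))
plt.plot(reachability)
plt.xlabel('Points (cluster order)')
plt.ylabel('Reachability distance')
plt.show()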

Visualization of Optics

Advantage:

  • Can separate the customers who have the highest frequency or monetary
  • -1: Can be regarded as outliers, with either very high monetary or frequency
  • 0: Low frequency
  • 1: Middle frequency
Optics Cluster
Count Number of Customers in Each Cluster
Optics 3D Visualization

Silhouette Score Summary

The table above shows that Birch has the highest silhouette score. In terms of dividing customer segments, RFM is a good model because it can adapt to marketing needs.

However, among k-means, Birch, and Optics, in my opinion the best is k-means, because it also divides customers based on recency. From a marketing perspective, customers with low recency, frequency, and monetary can be marked as not needing outreach, while the segments that need more attention stand out.

Meanwhile, the cluster distribution of the Birch and Optics models is uneven.

Thank you for reading!

References:

[1] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. “OPTICS: ordering points to identify the clustering structure.” ACM SIGMOD Record 28, no. 2 (1999): 49–60.

[2] Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann, 2001.
