RFM with Python (Case Study: Ecommerce Dataset)
Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways. One of the techniques of customer segmentation is RFM (recency, frequency, monetary). RFM segmentation allows marketers to target specific clusters of customers with communications that are much more relevant to their behavior.
Objective
Create the best clustering for customer segmentation.
RFM
RFM Analysis is a technique to group customers based on the recency, frequency, and monetary total of their recent transactions to identify the best customers and perform targeted marketing campaigns.
Recency: How recent was the customer’s last purchase?
Frequency: How often did this customer make a purchase in a given period?
Monetary: How much money did the customer spend in a given period?
Data Preparation
Import Library
First, we need to import the libraries that we will use.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
numpy: a library for efficient numerical computation on arrays.
pandas: a Python library providing data structures and data analysis tools.
matplotlib: a multiplatform data visualization library built on NumPy arrays.
seaborn: a Python data visualization library based on matplotlib.
Load Data
Data is taken from the data warehouse. Only the following tables will be used.
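The loading step itself is not shown here; below is a minimal sketch, assuming the warehouse is queried through SQLAlchemy with pandas. The connection string and table names are placeholders, not the actual warehouse details.
# A minimal sketch of loading the two tables; the connection string and
# table names are placeholders, not the actual warehouse details
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:5432/warehouse')
df_customers = pd.read_sql('SELECT * FROM dim_customers', engine)
df_orders = pd.read_sql('SELECT * FROM fact_orders', engine)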
Convert order_date_key data type to datetime
The data type of order_date_key is int64, but we need it to be a datetime data type so we can compute recency. This is how I convert the integer to datetime.
# Convert order_date_key to datetime
get_time = ['order_date_key']
for column in get_time:
    df_orders.loc[:, column] = pd.to_datetime(df_orders[column], format='%Y%m%d')
Join Between Customers & Orders
Inner join the customer & fact orders dimension table and only use the user_name, order_date_key, order_id, and total_payment columns.
# Join between df_customers and df_orders
df = pd.merge(df_customers, df_orders, on='customer_key', how='inner').loc[:, ['user_name', 'order_date_key', 'order_id', 'total_payment']]
Build Recency, Frequency, and Monetary
This is how I build the RFM table. The reason I add one day to the maximum order date (time_limit + 1) is so that the most recent order does not produce a recency of 0, since a recency of 0 could make the data ambiguous.
# Time limit
import datetime as dt
time_limit = max(df.order_date_key) + dt.timedelta(1)

# Calculate recency, the number of distinct orders, and total payment per customer
df = df.groupby('user_name').agg({'order_date_key': lambda x: (time_limit - x.max()).days,
                                  'order_id': pd.Series.nunique,
                                  'total_payment': 'sum'})

# Rename columns
df.rename(columns={'order_date_key': 'recency',
                   'order_id': 'frequency',
                   'total_payment': 'monetary'}, inplace=True)
Scaling (For Visualization)
The reason I do this is so that the visualizations of the different models can be compared with each other, since they then refer to the same scale.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled_vis = scaler.fit_transform(df)
df_scaled_vis = pd.DataFrame(df_scaled_vis, index=df.index, columns=df.columns)
df_scaled_vis.reset_index(level=0, inplace=True)
Modelling For RFM
Handling Outliers for RFM
I handle outliers using the IQR method. Outliers on recency and monetary are removed because there are only a few of them on those two attributes. Meanwhile, if we removed the frequency outliers, only frequency 1 would remain, so I keep them. The purpose of handling outliers is to get clean quantiles; I still include all of the data when forming the RFM segments.
# Handling outlier using IQR
def limit(x):
    Q1 = df_rfm[x].quantile(0.25)
    Q3 = df_rfm[x].quantile(0.75)
    IQR = Q3 - Q1
    minimum_limit = Q1 - (1.5 * IQR)
    maximum_limit = Q3 + (1.5 * IQR)
    print('Minimum limit', minimum_limit)
    print('Maximum limit', maximum_limit)
    print('Outliers percentage under the minimum limit: {} %'.format(df_rfm[(df_rfm[x] < minimum_limit)].shape[0] / df_rfm.shape[0]*100))
    print('Outliers percentage over the maximum limit: {} %'.format(df_rfm[(df_rfm[x] > maximum_limit)].shape[0] / df_rfm.shape[0]*100))
The red rows mark the outliers that will be dropped.
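The drop itself is not shown above; here is a minimal sketch of removing the recency and monetary outliers into a separate copy (df_rfm_clean is my own name). Per the note above, the full df_rfm is still used when forming the RFM segments.
# A sketch of dropping recency and monetary outliers into a separate copy;
# df_rfm_clean is a hypothetical name, and the full df_rfm is kept for segmentation
def iqr_limits(column):
    Q1 = df_rfm[column].quantile(0.25)
    Q3 = df_rfm[column].quantile(0.75)
    IQR = Q3 - Q1
    return Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

df_rfm_clean = df_rfm.copy()
for column in ['recency', 'monetary']:
    low, high = iqr_limits(column)
    df_rfm_clean = df_rfm_clean[(df_rfm_clean[column] >= low) & (df_rfm_clean[column] <= high)]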
After handling outliers, check the quantiles on all three attributes.
df_rfm.describe(include='all')
· The quantiles will be used as dividers for the RFM segments, except for frequency.
· RFM is divided into 4 segments, where the segment with the lowest value is 1 and the highest is 4.
· For monetary, segment 1 is for customers who spend very little money, while segment 4 is for customers who spend a lot.
· Recency is reversed: segment 4 is for customers who ordered recently, while segment 1 is for customers who have not ordered for a long period.
· For frequency, segment 1 is for customers who shopped once, segment 2 for those who shopped twice, segment 3 for those who shopped three times, and segment 4 for those who shopped four or more times.
Form the RFM Segments
The RFM segments are built from the quantiles, which can be seen using the describe function in Pandas, as sketched below.
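The assignment itself is a quantile lookup; here is a minimal sketch, assuming the column names used in the plotting code below (RFM_R, RFM_F, RFM_M) and the rules from the list above. The helper functions are my own names, not the article's code.
# A sketch of assigning the R, F, and M segments; helper names are mine
quantiles = df_rfm[['recency', 'monetary']].quantile([0.25, 0.5, 0.75]).to_dict()

def r_segment(x):
    # Recency is reversed: a lower recency (more recent order) gets a higher segment
    if x <= quantiles['recency'][0.25]:
        return 4
    elif x <= quantiles['recency'][0.5]:
        return 3
    elif x <= quantiles['recency'][0.75]:
        return 2
    return 1

def m_segment(x):
    if x <= quantiles['monetary'][0.25]:
        return 1
    elif x <= quantiles['monetary'][0.5]:
        return 2
    elif x <= quantiles['monetary'][0.75]:
        return 3
    return 4

df_rfm2 = df_rfm.reset_index()
df_rfm2['RFM_R'] = df_rfm2['recency'].apply(r_segment)
df_rfm2['RFM_F'] = df_rfm2['frequency'].clip(upper=4)  # 1, 2, 3, and >= 4 becomes 4
df_rfm2['RFM_M'] = df_rfm2['monetary'].apply(m_segment)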
Visualize the distribution of the RFM using plotly
from plotly.subplots import make_subplots
import plotly.graph_objects as go

df_count_r = df_rfm2.groupby('RFM_R')['user_name'].nunique().reset_index()
df_count_f = df_rfm2.groupby('RFM_F')['user_name'].nunique().reset_index()
df_count_m = df_rfm2.groupby('RFM_M')['user_name'].nunique().reset_index()

fig = make_subplots(rows=3, cols=1, shared_yaxes=True, subplot_titles=('Recency', 'Frequency', 'Monetary'))
fig.add_trace(go.Bar(x=df_count_r['RFM_R'], y=df_count_r['user_name']), 1, 1)
fig.add_trace(go.Bar(x=df_count_f['RFM_F'], y=df_count_f['user_name']), 2, 1)
fig.add_trace(go.Bar(x=df_count_m['RFM_M'], y=df_count_m['user_name']), 3, 1)
fig.update_layout(showlegend=False, width=800, height=1200)
fig.show()
After forming the 4 segments, they are grouped into RFM groups. Each RFM group has a different name and marketing strategy.
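The actual group names and rules are not reproduced here; below is a hypothetical sketch of how segment combinations can be mapped to named groups with different strategies.
# Hypothetical group names and rules, for illustration only
def rfm_group(row):
    if row['RFM_R'] >= 3 and row['RFM_F'] >= 3 and row['RFM_M'] >= 3:
        return 'Best customers'      # reward and retain
    elif row['RFM_R'] <= 2 and row['RFM_F'] >= 3:
        return 'At risk'             # win-back campaign
    elif row['RFM_F'] == 1:
        return 'One-time customers'  # activation campaign
    return 'Regular customers'       # regular promotions

df_rfm2['RFM_Group'] = df_rfm2.apply(rfm_group, axis=1)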
Visualization of RFM
Advantage:
• Can divide customers based on specific criteria (highest shopping frequency, highest spending, or customers with high recency, frequency, and monetary at once)
Normalization for K-Means, Birch, and Optics Model
In this part, I only normalize the frequency using np.log. Normalization is carried out only on frequency because the number of customers at each frequency is very unequal (the distribution is heavily skewed).
# Convert the data into a normal distribution
# More reliable predictions are made if the predictors and the target variable are normally distributed
from scipy import stats

df_normal = df[['recency', 'monetary']]
df_normal['frequency'] = df[['frequency']].apply(np.log, axis=1).round(3)
Scaling for K-Means, Birch, and Optics
Scaling is using StandardScaler. Feature scaling is a method used to normalize the range of independent variables or features of data.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_normal)
K-Means
K-Means: the k-means algorithm takes an input parameter, k, and partitions a set of n objects into k clusters so that intracluster similarity is high while intercluster similarity is low.
The process of the k-means algorithm is as follows: first, the algorithm randomly selects k objects, each of which initially represents a cluster mean or center. Each remaining object is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. Then a new mean is computed for each cluster. This process iterates until the criterion function converges, as sketched below.
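To make that process concrete, here is a minimal NumPy sketch of the iteration just described (my own illustration, not the scikit-learn implementation used below):
# A minimal k-means sketch: pick k random objects as centers, assign each
# object to its nearest center, recompute the means, and repeat until stable
def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) cluster center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # Recompute each cluster's mean as its new center (empty clusters not handled)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # the criterion function has converged
            break
        centers = new_centers
    return labels, centers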
Here I take two approaches to choose the number of clusters k, namely the elbow method and the silhouette score².
K-Means Elbow Method
from sklearn.cluster import KMeans

wcss = {}
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=100)
    kmeans.fit(df_kmeans)
    wcss[i] = kmeans.inertia_

# Visualize
fig, ax = plt.subplots(figsize=(16, 9))
sns.pointplot(x=list(wcss.keys()), y=list(wcss.values()))
plt.title('K-Means Elbow')
plt.xlabel('K Numbers')
plt.ylabel('WCSS')
plt.show()
As you can see, it is difficult to identify the elbow in this visualization. So I use another approach, namely the Silhouette Score.
K-Means Silhouette Score
from sklearn.metrics import silhouette_score

silhouette_sc = {}
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=100)
    kmeans.fit(df_kmeans)
    silhouette_sc[i] = silhouette_score(df_kmeans, kmeans.labels_, metric='euclidean')

# Visualize
fig, ax = plt.subplots(figsize=(16, 9))
sns.pointplot(x=list(silhouette_sc.keys()), y=list(silhouette_sc.values()))
plt.title('K-Means Silhouette')
plt.xlabel('K Numbers')
plt.ylabel('Silhouette')
plt.show()
To see each silhouette score:
for key, value in silhouette_sc.items():
    print('-------------------')
    print('n_clusters = ', key)
    print('Silhouette score = ', value)
From the silhouette scores, the highest point is taken, which is 4. Why not 2? Because splitting the customers into only 2 clusters, although they are clearly separated, is too coarse to be useful for segmentation.
from sklearn.cluster import KMeans

# Pick the cluster
kmeans = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 100)
kmeans.fit(df_kmeans)
Visualization of K-Means
Advantage:
- Can separate customers who have a high shopping frequency, high spending, and high recency.
- 0: High recency, low frequency, low monetary
- 1: Low recency, low frequency, low monetary
- 2: High frequency
- 3: High monetary, low frequency
Birch
Birch:
BIRCH is an integrated hierarchical clustering method. BIRCH introduces two concepts, the clustering feature and the clustering feature tree (CF tree), which are used to summarize clusters. These structures help the clustering method achieve good speed and scalability on large databases.
How does the BIRCH algorithm work? There are two phases:
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve its clustering structure.
Phase 2: BIRCH applies a selected clustering algorithm to cluster the leaf nodes of the CF tree².
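In scikit-learn's Birch, the threshold and branching_factor parameters shape the CF tree of phase 1, while n_clusters selects the global clustering of phase 2. A minimal sketch (the parameter values here are illustrative, not the ones used later):
# Illustrative only: threshold and branching_factor shape the CF tree (phase 1);
# n_clusters applies a global clustering to its leaf entries (phase 2)
from sklearn.cluster import Birch

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = birch.fit_predict(df_birch)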
Birch Silhouette Score
from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score

silhouette_sc = {}
for i in range(2, 11):
    birch = Birch(n_clusters=i)
    birch.fit(df_birch)
    silhouette_sc[i] = silhouette_score(df_birch, birch.predict(df_birch), metric='euclidean')

# Visualize
fig, ax = plt.subplots(figsize=(16, 9))
sns.pointplot(x=list(silhouette_sc.keys()), y=list(silhouette_sc.values()))
plt.title('Birch Silhouette')
plt.ylabel('Silhouette')
plt.show()
From the silhouette scores, the highest point is again taken at 4. Why not 2? For the same reason as with k-means: only 2 clusters would be too coarse for segmentation.
# Creating the BIRCH clustering model
from sklearn.cluster import Birch
birch = Birch(n_clusters = 4)
birch.fit(df_birch)
Visualization of Birch
Advantage:
- Can separate customers who have a high shopping frequency, and divide the remaining customers into 3 segments based on their spending
- 0: High monetary
- 1: Middle monetary
- 2: Low monetary
- 3: High frequency
Optics
Optics:
OPTICS (Ordering Points To Identify the Clustering Structure), closely related to DBSCAN, finds core samples of high density and expands clusters from them¹. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius, which makes it better suited for large datasets than the current sklearn implementation of DBSCAN.
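Because OPTICS keeps the full reachability ordering, flat DBSCAN-style clusterings for any eps below max_eps can be extracted after a single fit. A minimal sketch using scikit-learn's cluster_optics_dbscan, reusing the parameters from the model fitted below; the eps values here are illustrative:
# A sketch of extracting DBSCAN-style labels at several eps values from one
# fitted OPTICS model; the eps values chosen here are illustrative
from sklearn.cluster import OPTICS, cluster_optics_dbscan

optics = OPTICS(min_samples=300, max_eps=6.0).fit(df_optics)
for eps in (1.0, 2.0, 3.0):
    labels = cluster_optics_dbscan(reachability=optics.reachability_,
                                   core_distances=optics.core_distances_,
                                   ordering=optics.ordering_, eps=eps)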
Optics Silhouette Score
from sklearn.cluster import OPTICS
from sklearn.metrics import silhouette_score

silhouette_sc = {}
for i in range(2, 11):
    optics = OPTICS(max_eps=i, min_samples=300)
    optics.fit(df_optics)
    silhouette_sc[i] = silhouette_score(df_optics, optics.labels_, metric='euclidean')

# Visualize
fig, ax = plt.subplots(figsize=(16, 9))
sns.pointplot(x=list(silhouette_sc.keys()), y=list(silhouette_sc.values()))
plt.title('Optics Silhouette')
plt.xlabel('Epsilon')
plt.ylabel('Silhouette')
plt.show()
The highest points lie between epsilon 6 and 10; I choose 6.
# Creating the OPTICS clustering model
from sklearn.cluster import OPTICS

optics = OPTICS(min_samples=300, max_eps=6.0)
optics.fit(df_optics)
Visualization of Optics
Advantage:
- Can separate the customers who have the highest frequency or monetary
- -1: Can be regarded as outliers, with either very high monetary or very high frequency
- 0: Low frequency
- 1: Middle frequency
Silhouette Score Summary
In the table above, the highest silhouette score belongs to Birch. In terms of dividing customer segments, RFM is a good model because it can adapt to marketing needs.
However, among k-means, Birch, and OPTICS, in my opinion the best is k-means, because it also divides customers based on recency. From a marketing perspective, customers with low recency, frequency, and monetary can be marked as not worth reaching out to, while the segments that need more attention stand out.
Meanwhile, the distribution across clusters in the Birch and OPTICS models is uneven.
Thank you for reading!
References:
[1] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. “OPTICS: ordering points to identify the clustering structure.” ACM SIGMOD Record 28, no. 2 (1999): 49–60.
[2] Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.