Segmentation of customers in online retail databases using Python, including RFM analysis and clustering.
Give me a clap if you find my article useful.👏👏👏
GitHub
https://melodyyip.github.io/RFM-UCI-onlineStore/
Getting started
In the following analysis, I am going to use the Online Retail Data Set, which was obtained from the UCI Machine Learning repository. The link to the data can be found here.
In this article, I am going to show how to carry out customer segmentation and RFM analysis on online retail data using Python.
Data
This data set contains all of the transactions recorded for an online retailer based and registered in the UK between 2009-12-01 and 2011-12-09. The retailer specializes in all-occasion gift items. Most of the retailer’s customers are wholesalers.
Column Descriptions
Plan
1. Reading data and preprocessing
2. Create Recency Frequency Monetary (RFM) table
3. Model — Clustering with K-means algorithm
4. Interpret the result
Step 1: Reading data and preprocessing
After downloading the csv file, we can load it into a Pandas dataframe using the pandas.read_csv function.
We need to make sure the data is clean before starting our analysis. As a reminder, we should check for:
- Duplicate records
- Consistent formatting
- Missing values
- Obviously wrong values
I am not going to show this step here; you can check my GitHub for this part.
Check my GitHub to take a look at all my projects🍀
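For readers who want a starting point, the checklist above could be sketched roughly as follows. This is only an illustration on a tiny made-up frame, not the cleaning code from the GitHub project. The helper name `basic_clean`, the sample rows, and the rule of dropping negative prices are all assumptions; the column names follow the ones used later in this article.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper covering the four checks listed above."""
    out = df.drop_duplicates().copy()                         # duplicate records
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])   # consistent formatting
    out = out.dropna(subset=["Customer ID"])                  # missing values
    out = out[out["Price"] >= 0]                              # obviously wrong values (assumed rule)
    out["Revenue"] = out["Quantity"] * out["Price"]           # line total, used later for the M in RFM
    return out.reset_index(drop=True)

# Made-up rows: one exact duplicate, one return (negative quantity), one missing customer
raw = pd.DataFrame({
    "Invoice": ["A1", "A1", "A2", "A3", "A4"],
    "Quantity": [12, 12, 6, -2, 3],
    "InvoiceDate": ["2009-12-01 07:45", "2009-12-01 07:45", "2009-12-01 07:46",
                    "2009-12-02 09:00", "2009-12-02 10:00"],
    "Price": [6.95, 6.95, 2.10, 1.25, 4.00],
    "Customer ID": [13085.0, 13085.0, 13085.0, 13078.0, None],
})
df = basic_clean(raw)
print(df)
```

Returns (negative quantities) are kept on purpose here, because the article later notes that MonetaryValue contains negative values.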
Step 2: Create Recency Frequency Monetary (RFM) table
RFM is a basic customer segmentation technique based on purchasing behavior. The behavior is captured using only three customer data points:
Recency: how recent the last purchase was. How many days ago was their last purchase?
Frequency: the total number of purchases. How many times has the customer purchased from our store?
Monetary: the monetary value of the customer's purchases. How much has this customer spent? The window can be limited, for example to the last two years, or cover all time.
RFM analysis helps businesses segment their customer base into homogeneous groups so that they can engage each group with a different targeted marketing strategy. Sometimes RFM is also used to identify High-Value Customers (HVCs).
Before we go further, here is how the RFM table is built.
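# RFM table
# Reference date for recency (assumption: one day after the last transaction; pandas imported as pd)
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
# Aggregate data by each customer
rfm = df.groupby('Customer ID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,  # days since the last purchase
    'Invoice': lambda x: len(x),                              # number of invoice lines
    'Revenue': lambda x: x.sum()                              # total amount spent
}).reset_index()
rfm['InvoiceDate'] = rfm['InvoiceDate'].astype(int)
# Rename columns
rfm.rename(columns={'InvoiceDate': 'Recency',
                    'Invoice': 'Frequency',
                    'Revenue': 'MonetaryValue'}, inplace=True)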
Right now, the dataset consists of the recency, frequency, and monetary value columns. We cannot use it for clustering yet, because the data needs more preprocessing.
Manage Skewness and Scaling
We have to make sure that the data meet the following assumptions: the variables are not skewed, and they have the same mean and variance.
Because of that, we have to manage the skewness of the variables. Here are the visualizations of each variable.
As we can see from above, we have to transform the data, so it has a more symmetrical form. There are some methods that we can use to manage the skewness:
- log transformation
- square root transformation
- Box-Cox transformation
Note: the log and Box-Cox transformations can be used only if the variable has strictly positive values.
Based on the skewness of each variable, we will use the Box-Cox transformation for the variables. The exception is MonetaryValue, because it includes negative values. To handle this variable, we can apply a cube root transformation instead.
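The transformation step is not shown in the article, so here is a minimal sketch of how it might look, assuming SciPy's `stats.boxcox` for Recency and Frequency and NumPy's `np.cbrt` for MonetaryValue. The frame `rfm_demo` is randomly generated stand-in data and the name `customers_fix` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Stand-in data (assumption): positive Recency/Frequency, MonetaryValue that can be negative
rng = np.random.default_rng(0)
rfm_demo = pd.DataFrame({
    "Recency": rng.integers(1, 365, size=200).astype(float),
    "Frequency": rng.integers(1, 50, size=200).astype(float),
    "MonetaryValue": rng.normal(500, 800, size=200),
})

customers_fix = pd.DataFrame({
    # stats.boxcox returns (transformed values, fitted lambda); input must be strictly positive
    "Recency": stats.boxcox(rfm_demo["Recency"])[0],
    "Frequency": stats.boxcox(rfm_demo["Frequency"])[0],
    # The cube root is defined for negative numbers, unlike log or Box-Cox
    "MonetaryValue": np.cbrt(rfm_demo["MonetaryValue"]),
})

# Compare skewness before and after the transformation
print(rfm_demo.skew().round(2))
print(customers_fix.skew().round(2))
```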
The variables do not have the same mean and variance, so we have to normalize them. To do so, we can use the StandardScaler object from the scikit-learn library.
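As a rough sketch of the scaling step, the snippet below standardizes a small made-up frame so that every column has mean 0 and variance 1. The input values are invented for illustration; the output name `customers_normalized` matches the variable used when fitting the model later in this article.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up, already transformed RFM values (assumption)
customers_fix = pd.DataFrame({
    "Recency": [1.2, 3.4, 5.1, 2.2, 4.8],
    "Frequency": [0.5, 2.5, 1.0, 3.0, 1.5],
    "MonetaryValue": [-2.0, 4.0, 7.5, 3.1, 9.9],
})

scaler = StandardScaler()
customers_normalized = pd.DataFrame(
    scaler.fit_transform(customers_fix),   # subtract the column mean, divide by the column std
    columns=customers_fix.columns,
    index=customers_fix.index,
)

print(customers_normalized.mean().round(2))       # approximately 0 for every column
print(customers_normalized.std(ddof=0).round(2))  # approximately 1 for every column
```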
Finally, we can do clustering using that data.
Step 3: Model - Clustering with K-means algorithm
To segment the customers, we can use the K-means algorithm.
K-means is an unsupervised learning algorithm that uses geometric distance to decide which cluster each data point belongs to. Given a set of centroids, we calculate the distance from each data point to every centroid and assign the point to the centroid with the smallest distance. The centroids are then recomputed from their assigned points, and the process repeats until the total distance no longer changes significantly.
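To make the loop described above concrete, here is a toy NumPy version of it. It is a teaching sketch only, not the scikit-learn implementation used later; the function name `kmeans_sketch` and the two synthetic blobs are assumptions, and it stops when the centroids stop moving, which is a closely related stopping rule to the one described in the text.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Toy K-means loop for illustration (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Each point joins the cluster of its nearest centroid
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # stop once the centroids barely move
            break
    return labels, centroids

# Two well-separated synthetic blobs (assumption, for demonstration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids.round(2))
```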
Determine the Optimal K
To get the best out of the clustering, we have to choose the hyperparameter k, the number of clusters, that fits the data. We can use the elbow method to decide.
In the elbow plot, the x-axis is the value of k and the y-axis is the SSE (sum of squared errors) of the data. We pick the k after which the SSE stops dropping sharply and follows a roughly linear trend for the next consecutive values of k. From the above plot, k = 4 is the best choice for our model because the following k values tend to follow a linear trend.
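The elbow computation itself is not shown, so the following is a hedged sketch of how the SSE values could be collected with scikit-learn, using the `inertia_` attribute of a fitted `KMeans` model. The four synthetic blobs stand in for the normalized RFM data and are an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized RFM data: four blobs (assumption)
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
X = np.vstack([c + rng.normal(0, 0.6, size=(60, 2)) for c in centers])

sse = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = km.inertia_  # sum of squared distances of points to their closest centroid

for k, value in sse.items():
    print(k, round(value, 1))

# To draw the elbow plot, the dictionary can be passed to any plotting library,
# for example matplotlib: plt.plot(list(sse.keys()), list(sse.values()), marker="o")
```

On data like this the SSE falls steeply up to k = 4 and flattens afterwards, which is the "elbow" the method looks for.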
Fit the model
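from sklearn.cluster import KMeans
# customers_normalized is the transformed and scaled RFM data from the previous step
model = KMeans(n_clusters=4, random_state=42)
model.fit(customers_normalized)
model.labels_.shape
# Attach the cluster label of each customer to the RFM table
rfm["Cluster"] = model.labels_
rfm.head()
# Average RFM values and number of customers per cluster
rfm.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']}).round(1)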
From the above table, we can compare the mean values of the recency, frequency, and monetary metrics across the four clusters. It seems that we get a more detailed view of our customer base using k=4.
Another commonly used method to compare the cluster segments is the snake plot. Snake plots are commonly used in marketing research to understand customer perceptions.
Cluster Exploration and Visualization
Snake Plots
A snake plot requires the normalized dataset and the cluster labels. It gives a good visualization of how the clusters differ from each other.
From the above snake plot, we can see the distribution of recency, frequency, and monetary metric values across the four clusters. The four clusters seem to be separate from each other, which indicates a good heterogeneous mix of clusters.
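The data preparation behind a snake plot can be sketched as below: the normalized table is melted into long format so that each row holds one metric value for one customer. The six rows and their cluster labels are made up, and the names `melted` and `snake` are hypothetical.

```python
import pandas as pd

# Made-up normalized RFM values and cluster labels (assumption)
customers_normalized = pd.DataFrame({
    "Recency": [-1.0, -0.8, 0.1, 0.3, 1.2, 1.1],
    "Frequency": [1.3, 1.0, 0.0, -0.2, -1.0, -1.1],
    "MonetaryValue": [1.1, 1.4, -0.1, 0.1, -1.2, -1.3],
})
cluster_labels = [0, 0, 1, 1, 2, 2]

df_norm = customers_normalized.copy()
df_norm["Cluster"] = cluster_labels

# Long format: one row per (customer, metric) pair
melted = df_norm.melt(
    id_vars="Cluster",
    value_vars=["Recency", "Frequency", "MonetaryValue"],
    var_name="Metric",
    value_name="Value",
)

# Mean normalized value per cluster and metric: the points a snake plot connects
snake = melted.groupby(["Cluster", "Metric"])["Value"].mean().unstack("Metric")
print(snake)

# With seaborn installed, the plot itself could be drawn with:
# seaborn.lineplot(data=melted, x="Metric", y="Value", hue="Cluster")
```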
Scatter Plot
A scatter plot lets us inspect the relationship between pairs of variables. Outliers are removed from the plots to keep the visualization clear. They are still taken into consideration in the model development and are excluded only for visualization purposes.
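The article does not say how the outliers were filtered, so the snippet below shows just one possible approach: cutting each plotted variable at an upper quantile. The 95th percentile threshold, the sample rows, and the name `plot_df` are assumptions, and the original `rfm_demo` table is left untouched for modeling.

```python
import pandas as pd

# Made-up RFM rows; the last customer is an extreme outlier (assumption)
rfm_demo = pd.DataFrame({
    "Recency": [5, 20, 40, 90, 200, 3],
    "Frequency": [12, 8, 5, 2, 1, 900],
    "MonetaryValue": [300.0, 150.0, 90.0, 40.0, 10.0, 50000.0],
})

cols = ["Frequency", "MonetaryValue"]
upper = rfm_demo[cols].quantile(0.95)           # per-column cut-off (assumed threshold)
mask = (rfm_demo[cols] <= upper).all(axis=1)    # keep rows below the cut-off in every column

plot_df = rfm_demo[mask]   # used only for plotting
print(len(rfm_demo), len(plot_df))
```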
Recency vs Frequency
High purchase frequency is found among customers whose most recent purchase was within the last month.
Frequency vs Monetary
Customers who buy frequently spend less money.
Recency vs Frequency vs Monetary
In the above plot, the color indicates the cluster. We can see how the customers are spread along the Recency, Frequency and Monetary dimensions. Customers in Cluster 1 have made recent purchases with a high frequency, but with lower amounts. The reason could be that these customers frequently purchase accessories that are not so expensive.
Relative importance of attributes by cluster
From the above analysis, we can see that there should be 4 clusters in our data. The heatmap above shows the relative importance of each attribute within each cluster. Monetary Value stands out strongly and positively in Cluster 3, with a relative importance score of 18.21.
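One common way to compute a relative importance score is to divide each cluster's average by the population average and subtract 1, so that 0 means "same as the average customer". Whether the heatmap above was produced exactly this way is an assumption; the small table below is made up to keep the arithmetic easy to follow.

```python
import pandas as pd

# Made-up RFM table with cluster labels (assumption)
rfm_demo = pd.DataFrame({
    "Recency": [10, 20, 200, 300],
    "Frequency": [10, 30, 2, 2],
    "MonetaryValue": [1000.0, 3000.0, 50.0, 150.0],
    "Cluster": [0, 0, 1, 1],
})

cluster_avg = rfm_demo.groupby("Cluster").mean()          # mean of each metric per cluster
population_avg = rfm_demo.drop(columns="Cluster").mean()  # mean of each metric over all customers

# Positive: the cluster is above the population average on that metric; negative: below
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))

# With seaborn installed, the heatmap could be drawn with:
# seaborn.heatmap(relative_imp, annot=True, fmt=".2f", cmap="RdYlGn")
```

Because the score is a ratio to the population mean, it is unbounded above, which is how a value as large as 18.21 can occur.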
Using the RFM segmentation to identify the type of customer according to RFM score
Calculate the overall RFM score
This step can be done in two ways:
- Concatenation (creates segments): here we concatenate (not add) the individual RFM scores like strings and get labeled segments in return. Our best segment will be 444 and our worst will be 111, signifying the lowest score on all three of the RFM categories.
- Addition (creates a score): here we add the individual RFM scores like numbers and get a single number indicating the customer score. The score ranges from 3 to 12, and we can use it to create more human-friendly labelled categories.
# Define quartiles for the RFM scores
quantiles = rfm[['Recency', 'Frequency', 'MonetaryValue']].quantile(q=[0.25, 0.5, 0.75])
quantiles = quantiles.to_dict()

def RFMScore(x, p, d):
    """Return the quartile score (1-4) of value x for column p, given the quantile dict d."""
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

# Recency is reversed so that 4 means the most recent purchase (fewer days is better)
rfm['R'] = 5 - rfm['Recency'].apply(RFMScore, args=('Recency', quantiles,))
rfm['F'] = rfm['Frequency'].apply(RFMScore, args=('Frequency', quantiles,))
rfm['M'] = rfm['MonetaryValue'].apply(RFMScore, args=('MonetaryValue', quantiles,))

# Concat RFM quartile values to create RFM Segments
rfm['RFM_Segment'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)

# Calculate RFM_Score
rfm['RFM_Score'] = rfm[['R', 'F', 'M']].sum(axis=1)
After calculations on the RFM data we can create customer segments that are actionable and easy to understand.
# Create human friendly RFM labels
# Keys are regular expressions matched against the two-digit R+F string.
# The patterns allow scores up to 5; with the quartile scores (1-4) used here,
# the patterns that require a 5 never match.
segt_map = {
    r'[1-2][1-2]': 'Hibernating',
    r'[1-2][3-4]': 'At risk',
    r'[1-2]5': 'Can\'t lose them',
    r'3[1-2]': 'About to sleep',
    r'33': 'Need attention',
    r'[3-4][4-5]': 'Loyal customers',
    r'41': 'Promising',
    r'51': 'New customers',
    r'[4-5][2-3]': 'Potential loyalists',
    r'5[4-5]': 'Champions'
}
# rfm['Segment'] = rfm['R'].map(str) + rfm['F'].map(str)+ rfm['M'].map(str)
rfm['Segment'] = rfm['R'].map(str) + rfm['F'].map(str)
rfm['Segment'] = rfm['Segment'].replace(segt_map, regex=True)

# Create some human friendly labels for the scores
# Later assignments overwrite earlier ones, so the highest matching threshold wins
rfm['Score'] = 'Green'
rfm.loc[rfm['RFM_Score'] > 5, 'Score'] = 'Bronze'
rfm.loc[rfm['RFM_Score'] > 7, 'Score'] = 'Silver'
rfm.loc[rfm['RFM_Score'] > 9, 'Score'] = 'Gold'
rfm.loc[rfm['RFM_Score'] > 10, 'Score'] = 'Platinum'
Tree map of the customer segment and score
Interpret the result
Based on the RFM analysis, 8% of customers are loyal customers who tend to spend large amounts of money when buying. There are also groups of customers who are already lost and groups who are likely to be lost in the near future.
Step 4: Actions to take to retain customers
Here’s a handy chart of all the RFM segments, with some actionable tips for each that you can implement straight away!
Further analysis
- Adding new variables such as Tenure: the number of days since the first transaction by each customer. This tells us how long each customer has been with the system.
- Conducting deeper segmentation of customers based on their geographical location and on demographic and psychographic factors.
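As a small illustration of the Tenure idea, the sketch below computes the number of days between each customer's first transaction and a reference date. The transactions are made up, and using one day after the last transaction as the reference date is an assumption.

```python
import pandas as pd

# Made-up transactions for two customers (assumption)
df_demo = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2010-01-01", "2010-06-01", "2011-03-15", "2011-04-01", "2011-11-30"]
    ),
})

# Reference date: one day after the last transaction in the data (assumed convention)
snapshot_date = df_demo["InvoiceDate"].max() + pd.Timedelta(days=1)

# Tenure: days between each customer's first purchase and the reference date
first_purchase = df_demo.groupby("Customer ID")["InvoiceDate"].min()
tenure = (snapshot_date - first_purchase).dt.days.rename("Tenure")
print(tenure)
```

The resulting series is indexed by customer, so it can be joined onto the RFM table as a fourth column before clustering.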
Check my GitHub to take a look at the whole project🍀