RFM Analysis

9 min readMar 5, 2023

|Step by Step Customer Segmentation Project on FLO Dataset|

This is an article where I share the details of Customer Segmentation Project with RFM Analysis. To better understand, you can first view my Customer Segmentation with RFM article where I discussed basics of RFM and explained another project.

Now that we know what RFM is, how to perform it on datasets, and how to make conclusions; let’s use it on a different real-world dataset. As always, you can check the code on my GitHub Profile.

1.Business Problem:

An online shoe store (FLO) wants to segment their customers and define marketing strategies according to these segments. For this purpose, behaviors of customers will be defined and groups will be formed according to clusters in behaviors.

2.Understanding and Preparing the Data

Dataset includes information obtained from the past shopping behavior of OmniChannel (both online and offline shoppers) customers whose last purchases was in 2020–2021.

Primary adjustments (importing the libraries, setting data display preferences, and reading the dataset):

import pandas as pd
import datetime as dt

pd.set_option('display.max_columns', None) # to see all the columns

df_ = pd.read_csv("csv file location")
df = df_.copy() # for better performance

Getting to know the data, what do we have to analyze?

    df.head() # First 5 rows:
    df.columns # Column names:
    df.describe().T # Descriptive statistics
    df.shape
    df.isnull().sum() # Empty values
    df.dtypes # Type of the variables
    df["master_id"].nunique() # To check if the customer ids repeat themselves or not

Well, column names are self-explanatory but still let’s be clear:

master_id: Unique customer ID
order_channel : The platform that is used for the purchase (Android, ios, Desktop, Mobile, Offline)
last_order_channel : The platform that is used for the last purchase
first_order_date : Date of the first purchase
last_order_date : Date of the last purchase
last_order_date_online : Date of the last online purchase
last_order_date_offline : Date of the last offline purchase
order_num_total_ever_online : Total number of online purchases
order_num_total_ever_offline : Total number of offline purchases
customer_value_total_ever_offline : Total money spent on offline purchases
customer_value_total_ever_online : Total money spent on online purchases
interested_in_categories_12 : Category list that the customer purchased from in the last 12 months

Now, we have a better understanding. There are 19945 rows for each customer. To make sure customer ids do not repeat themselves, number of unique ids is checked, and it is same with the row number. It seems that there is no null value in dataset, so we are good for now:)

The aim is to perform RFM analysis, therefore we need to prepare the dataset in this way. One thing that bothers, dates are in object type, so it needs to be changed. Also, customers are both online and offline shoppers so order number and customer value data need to be combined on customer basis.

Total number of orders will be frequency metric for each customer.

Total customer value will be monetary metric for each customer.

#Customer's total number of purchase:
df["total_order_number"] = df["order_num_total_ever_online"] + df["order_num_total_ever_offline"]
# Customer's total spending:
df["total_spending"] = df["customer_value_total_ever_online"] + df["customer_value_total_ever_offline"]

# we can perform mathematical operations on columns.
# by assigning a nonexisting column name to a operation, a new column is created. 

# Converting date variables type into date type:
df['first_order_date'] = pd.to_datetime(df['first_order_date'])
df['last_order_date'] = pd.to_datetime(df['last_order_date'])
df['last_order_date_online'] = pd.to_datetime(df['last_order_date_online'])
df['last_order_date_offline'] = pd.to_datetime(df['last_order_date_offline'])

# to_datetime function from Pandas library is used on date columns 
# so that we can perform operations on dates to find the recency of each customer.

df.dtypes
# check type

We may need some insights on data such as:

What is the breakdown of the number of customers, average number of products purchased, and average spend across shopping channels?
Who are the top 5 customers with the highest revenue?
Who are the top 5 customers with the most orders?

How to answer these questions:

# Breakdown
a = df.groupby(["order_channel"]).agg({"master_id": "count", "total_order_number": "mean","total_spending": "mean"})
a.columns = ['number of customers', 'average number of products purchased', 'average spend']

# While groupby function make us able to group all the data on one conditional column,
# with agg. we can specify the operations needed on different columns seperately. 
# Then column names are changed to make them more understandable.

# Top 5 customers with the highest revenue

df[["master_id", "total_spending"]].sort_values(by = "total_spending", ascending= False).head()

# Top 5 customers with the most orders

df[["master_id", "total_order_number"]].sort_values(by = "total_order_number", ascending= False).head()

# With double brackets, spesific columns are extracted and sorted.

# sort_values function sorts the values in ascending order by default. 
# Since the top customers are needed, ascending= False have to be specified.

Now we know what the most common order channel that is preferred by the customers is, who the most frequent customers are, and which customers bring the most monetary value.

We may need to repeat this process on other datasets with same columns, maybe for another time scale, therefore a function is created just for the preparation part. Thus, with this function we can know another dataset in seconds. Later on, RFM analysis will be functionalized too.

This part is a little bit long; as always, you can check the code on my GitHub Profile.

def preliminary(dataframe, do_print=False):

    ***

    if do_print:

        ***
    
    return breakdown, top10renevue, top10order

breakdown, top10renevue, top10order = preliminary(df, do_print=False)

Here, I have added a condition to print. Function returns the breakdown, top10renevue, top10order data frames, however if you wish to access the other statistics you can just call preliminary(df, do_print=True).

Let’s continue with the RFM analysis.

3.Calculation of RFM Metrics

Now that the data frame is ready, RFM Metrics can be calculated.

After this point, all the operations can be done differently, maybe more efficiently. Feel free to comment for corrections or advice:)

df["last_order_date"].max()
# Data is not up to date, by checking the last data date a date is established to perform the analysis.

today_date = dt.datetime(2021, 6, 1)
# Date of the analysis

rfm = df.groupby('master_id').agg({'last_order_date': lambda last_order_date: (today_date - last_order_date).astype('timedelta64[D]').astype(int),
                                    "total_order_number": sum,
                                    "total_spending": sum})

# .astype('timedelta64[D]') function converts datetime to a number in days 
# but as float so I also added .astype(int) to lose the decimals.

# With agg., we can perform built-in functions 
# or we can define new functions to perform on columns.

rfm.columns = ['recency', 'frequency', 'monetary']

rfm.describe().T
# to check if there is an unusual situation

rfm.shape

Note that, master id is not in columns anymore since we used groupby. It is index in RFM table.

4.Calculation of RFM Scores

Customers are ranked inside according to metrics and these ranks are their scores.

rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
# recency metrics divided 5 equal parts acoording to their sizes and
# labelled such a way that greatest recency got 1, the smallest recency got 5.

rfm["frequency_score"] = pd.qcut(rfm['frequency'].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
# the problem in frequency is some numbers are too repetitive that
# qcut function can not label the same frequency number diffently
# rank method solves this problem by assigning the first comer number to first label.

rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])

rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +
                    rfm['frequency_score'].astype(str))
# by RFM definition, string is created with recency and frequency score
# and formed final RFM Score
# monetary score is necessary for observation, but it is not used in forming RFM Score

5.Scores to segments

Segment names are added according to Table 1.

seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}
#  Basically, the R-F table is coded using regex.

rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex=True)

RFM data frame is changed repeatedly. Let’s see what it has become.

RFM analysis is finished on customer basis, now it is time use this frame to make observations, answer some questions, use it to become better at relationships with the customers.

ACTION!

Examine the RFM frame segment wise.

segment_analysis = rfm.groupby('segment').agg({'recency': "mean",
                                                "frequency": "mean",
                                                "monetary": "mean",
                                                "segment": "count"})
segment_analysis.columns = ['recency(mean)', 'frequency(mean)', 'monetary(mean)', "count"]

For instance, most frequent customers are in cant loose segment. Even though they were frequent, they seem to be on the edge of stopping to buy staff on this shop according to their recency metric. Noticing this may make company to be able to define specific marketing strategies for these customers.

2. With RFM analysis, find the customers in the relevant profile for 2 cases and save the customer IDs to csv.

FLO includes a new women’s shoe brand. The product prices of the brand it includes are above the general customer preferences. For this reason, customers in the profile who will be interested in the promotion of the brand and product sales are requested to be contacted privately. From their loyal customers (champions,loyal_customers), the customers who shop in the women category with an average of 250 TL or more will be contacted privately. Import Id numbers of these customers to csv file and save it as new_brand_target_customer_id.cvs.

Category info and (segment and monetary) info are on different data frames. So, I have two solutions to solve this problem.

First solution:

Two lists of IDs will be found, and their intersection is going to be extracted:

loyals_over250_id = list(rfm[(((rfm['segment'] == "champions") 
                            | (rfm['segment'] == "loyal_customers")) 
                            & (rfm['monetary'] > 250 ))].index)

# Master IDs are exctracted from rfm table according to needed segments 
# and monetary condition, since Master IDs are index, index function is used
# then converted to a list to easiness of upcoming operations.

df["interested_in_categories_12"] = df["interested_in_categories_12"].astype(str)
women = list(df[df["interested_in_categories_12"].str.contains("KADIN")]["master_id"])

# Master IDs are exctracted from df table according to needed category 
# then converted to a list

new_brand_target_customer_id = pd.DataFrame(list(set(loyals_over250_id).intersection(women)))

# intersection of these lists is taken as target customer group
# then, converted to dataframe using pandas to be able to save it as csv file

new_brand_target_customer_id.columns = ["new_brand_target_customer_id"]
# saved as csv.

Second solution:

After first solution, I thought maybe I would not need 2 tables if I were to manipulate the rfm table to have category information.

This part is a repetition of preparing RFM table with one difference: RFM2 table have category column.

rfm2 = df.groupby('master_id').agg({'last_order_date': lambda last_order_date: (today_date - last_order_date).astype('timedelta64[D]').astype(int),
                                    "total_order_number": sum,
                                    "total_spending": sum,
                                    "interested_in_categories_12": sum})

rfm2.columns = ['recency', 'frequency', 'monetary', "interested_in_categories_12"]
rfm2["recency_score"] = pd.qcut(rfm2['recency'], 5, labels=[5, 4, 3, 2, 1])
rfm2["frequency_score"] = pd.qcut(rfm2['frequency'].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
rfm2["monetary_score"] = pd.qcut(rfm2['monetary'], 5, labels=[1, 2, 3, 4, 5])
rfm2["RFM_SCORE"] = (rfm2['recency_score'].astype(str) +
                    rfm2['frequency_score'].astype(str))

seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

rfm2['segment'] = rfm2['RFM_SCORE'].replace(seg_map, regex=True)
rfm2["interested_in_categories_12"] = rfm2["interested_in_categories_12"].astype(str)

#######################################################################
second_new_brand_target_customer_id = pd.DataFrame(list(rfm2[((rfm2['segment'] == "champions") | (rfm2['segment'] == "loyal_customers"))
                                              & (rfm2['monetary'] > 250 )
                                              & (rfm2["interested_in_categories_12"].str.contains("KADIN"))].index))

second_new_brand_target_customer_id.to_csv("second_new_brand_target_customer_id.csv")

As seen above, we have come to the same conclusion with both approaches.

Up to 40% discount is planned for Men’s and Children’s products. Customers related to this sale are “can’t loose” customers, those who are asleep and new customers. Enter the ids of the customers in the appropriate profile into the csv file as discount_target_customer_ids.csv

First solution:

Same procedure is followed with first problem.

target_segments_id = list(rfm[(rfm['segment'] == "at_Risk") 
                                | (rfm['segment'] == "hibernating") 
                                | (rfm['segment'] == 'cant_loose') 
                                | (rfm['segment'] == "new_customers")].index)


df["interested_in_categories_12"] = df["interested_in_categories_12"].astype(str)
men_or_child = list(df[df["interested_in_categories_12"].str.contains("ERKEK") 
                     | df["interested_in_categories_12"].str.contains("COCUK")]["master_id"])


discount_target_customer_ids = pd.DataFrame(list(set(target_segments_id).intersection(men_or_child)))
discount_target_customer_ids.columns = ["discount_target_customer_ids"]

discount_target_customer_ids.to_csv("discount_target_customer_ids.csv")

Second solution:

second_discount_target_customer_ids = pd.DataFrame(list(rfm2[((rfm2['segment'] == "at_Risk") | (rfm2['segment'] == "hibernating")
                                                             | (rfm2['segment'] == 'cant_loose') | (rfm2['segment'] == "new_customers"))
                                                             & ((rfm2["interested_in_categories_12"].str.contains("ERKEK"))
                                                             | (rfm2["interested_in_categories_12"].str.contains("COCUK")))].index))

second_discount_target_customer_ids.to_csv("second_discount_target_customer_ids.csv")

# since rfm2 is already defined, target group ids could directly be extracted.

Again, same result with both approaches.

Finally, functionalization of the whole process:

Function returns rfm2 table since apparently, we need the categories that customers ordered from. Forming csv file is an option.

def final(dataframe, csv=False):

        ***

    if csv:
        rfm2.to_csv("rfm.csv")

    return rfm2

final(df)

P.S. Thank you a lot for your time. I like to share my projects on Medium too beside GitHub so that I can see my reasoning over time. Don’t hesitate to leave a comment if you think there is a better way to perform the parts of the analysis:) Best of luck!!

Connect with me on LinkedIn

Connect with me on Medium

Connect with me on GitHub