Data-Driven Customer Segmentation: RFM and CLTV Analysis Using Python

Jul 12, 2023

1.0 RFM Analysis with Python

RFM analysis is a powerful technique widely employed in customer segmentation to gain valuable insights into customer behavior. It leverages three key parameters: Recency, Frequency, and Monetary value.

Recency refers to the time elapsed since a customer’s last transaction or interaction with the business. It helps identify whether customers are active, dormant, or at risk of churning. Frequency measures how often customers engage with the company within a specific timeframe, indicating customer loyalty and engagement levels. Monetary value is the total amount each customer has spent, which serves as a proxy for their lifetime value to the business.

When performing RFM analysis, a database-driven segmentation approach is commonly used. This involves organizing customer data within a database, extracting the necessary RFM metrics, and applying algorithms or techniques to group customers based on their RFM scores. These scores assign a numerical value to each customer for each of the three metrics: recency, frequency, and monetary value.

1.1 Dataset

I will use the Online Retail dataset for the RFM segmentation. You can reach the dataset from the following Kaggle link:

Online Retail II Data Set from ML Repository

A real online retail transaction data set of two years.

www.kaggle.com

Dataset column descriptions:

InvoiceNo: Invoice number. Nominal. A 6-digit integral number is uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation.
StockCode: Product (item) code. Nominal. A 5-digit integral number is uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
CustomerID: Customer number. Nominal. A 5-digit integral number is uniquely assigned to each customer.
Country: Country name. Nominal. The name of the country where a customer resides.

1.2 Calculation of RFM Metrics

For Recency, we find the date of the most recent purchase in the data frame and add 1–2 days to it to obtain the analysis date, ‘today_date’. Then, by subtracting each user’s last purchase date from ‘today_date’, we obtain the Recency value in days.

Frequency is essentially the transaction count for each user. Counting the distinct invoices that belong to a user gives us their frequency.

Lastly, the Monetary metric represents the total amount of money each user spends. The sum of the user’s payments will be sufficient to obtain this metric.

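import datetime as dt
import pandas as pd

# Assumes df holds the transactions and already has a TotalPrice column
# (quantity times unit price for each invoice line)

# Latest invoice date in the data, used as the reference for choosing the analysis date
invoice_date = df["InvoiceDate"].max()

# Analysis date for the Recency metric, set shortly after the latest invoice date
today_date = dt.datetime(2010, 12, 11)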

# Create RFM data frame with groupby, use lambda functions for calculating Recency, Frequency, and Monetary metrics
rfm_df = df.groupby("Customer ID").agg({'InvoiceDate': lambda date: (today_date - date.max()).days,
                                        "Invoice": lambda invoice: invoice.nunique(),
                                        "TotalPrice": lambda totalp: totalp.sum()})

# Rename columns for appropriate use

rfm_df.columns = ["recency", "frequency", "monetary"]
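To see what this aggregation produces, here is a minimal sketch on a made-up toy table. The data, the `toy` and `toy_rfm` names, and the choice to derive the analysis date as "latest invoice plus two days" are assumptions for illustration only; the column names follow the code above.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror the article's code
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2, 3],
    "Invoice": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01", "2010-12-08", "2010-11-20",
        "2010-11-20", "2010-12-09", "2010-10-01",
    ]),
    "TotalPrice": [10.0, 15.0, 5.0, 7.5, 20.0, 100.0],
})

# Analysis date derived from the data: two days after the latest invoice
today_date = toy["InvoiceDate"].max() + pd.Timedelta(days=2)

# One row per customer: days since last purchase, distinct invoices, total spend
toy_rfm = toy.groupby("Customer ID").agg(
    recency=("InvoiceDate", lambda d: (today_date - d.max()).days),
    frequency=("Invoice", "nunique"),
    monetary=("TotalPrice", "sum"),
)
print(toy_rfm)
```

Customer 2 has three invoice lines but only two distinct invoices, which is why the frequency is counted with `nunique` rather than a plain row count.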

When applying RFM segmentation with Python, we convert the RFM metrics into scores because the raw metrics are measured in different units (days, counts, and currency) and cannot be compared directly.

When converting RFM values into scores, we use the qcut() function, which splits the metric values into five equal-sized quantile groups and labels them with scores from 1 to 5.

rfm_df["monetary_score"] = pd.qcut(rfm_df["monetary"], 5, labels=[1, 2, 3, 4, 5])

When scoring the Frequency value, we encounter a problem. Many customers share the same frequency value (for example, a single purchase), so several of the quantile boundaries computed by qcut() can coincide, and the function raises an error because the bin edges are not unique.

Therefore, in the frequency scoring step, we first apply rank(method="first"), which gives tied values distinct ranks in order of appearance, and then pass those ranks to qcut(), thus resolving the issue.

rfm_df["frequency_score"] = pd.qcut(rfm_df["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
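The tie problem is easy to reproduce on a small series. The following is a sketch with made-up frequencies (the `freq` values are hypothetical): calling qcut() on the raw values fails because the bin edges repeat, while ranking first gives every row a unique position.

```python
import pandas as pd

# Hypothetical frequencies: most customers bought exactly once
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

# Direct qcut fails: the lowest quantile edges all equal 1
try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
    raised = False
except ValueError as err:
    raised = True
    print("qcut failed:", err)

# Ranking with method="first" breaks ties by order of appearance,
# so the ranks 1..10 can be split into five equal groups
scores = pd.qcut(freq.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(scores.astype(int).tolist())
```

One consequence worth keeping in mind: customers with the same frequency can end up with different scores, depending only on their position in the data frame.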

It is important to remember that the qcut() function divides the data from the smallest to the largest values. In the case of the Recency metric, a smaller value represents a more recent purchase and should therefore receive a higher score, so we reverse the order of the labels.

rfm_df["recency_score"] = pd.qcut(rfm_df["recency"], 5, labels=[5, 4, 3, 2, 1]) 

# Create final RFM score
rfm_df["RFM_SCORE"] = (rfm_df["recency_score"].astype(str) + rfm_df["frequency_score"].astype(str))
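As a small illustration of the reversed labels and the string concatenation, here is a sketch on made-up values; the `toy_rfm` frame and its numbers are assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical metrics for ten customers, ordered from most to least recent
toy_rfm = pd.DataFrame({
    "recency": [1, 10, 20, 30, 40, 50, 60, 70, 80, 90],
    "frequency": [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
})

# Reversed labels: the smallest recency values get the highest score
toy_rfm["recency_score"] = pd.qcut(toy_rfm["recency"], 5, labels=[5, 4, 3, 2, 1])

# Frequency keeps ascending labels; rank first to avoid duplicate bin edges
toy_rfm["frequency_score"] = pd.qcut(
    toy_rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]
)

# Two-character code: recency score followed by frequency score
toy_rfm["RFM_SCORE"] = (
    toy_rfm["recency_score"].astype(str) + toy_rfm["frequency_score"].astype(str)
)
print(toy_rfm)
```

The score is a string, not a number: "51" means recency score 5 and frequency score 1, which is what the segment patterns match against.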

1.3 Creating RFM Segments

In CRM analytics studies, the frequency of customer interactions, in other words, transactions, is considered more important than monetary value. This is because a customer who keeps engaging with us can be led to generate more revenue through strategies that increase sales, whereas monetary value tells us little about a customer who rarely or never transacts.

Therefore, we perform the segmentation process by focusing on two dimensions: Recency and Frequency.

In this final step, we define the score intervals of the RFM segments as regex patterns.

Then, we apply the replace() method to the column containing the scores. This replaces each score with its corresponding segment name, and we save the result to a new segment column, completing our process.

seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

rfm_df['segment'] = rfm_df['RFM_SCORE'].replace(seg_map, regex=True)
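A quick sanity check for a map like this is to run every possible two-digit score through it and confirm that none is left unmapped. The sketch below repeats the segment patterns from above so that it is self-contained; the `all_scores` and `unmapped` names are introduced here for illustration.

```python
import itertools
import pandas as pd

# Same patterns as the article's segment map
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions',
}

# All 25 combinations of recency score and frequency score
all_scores = pd.Series(
    [f"{r}{f}" for r, f in itertools.product(range(1, 6), repeat=2)]
)
segments = all_scores.replace(seg_map, regex=True)

# Any score that comes back unchanged matched no pattern
unmapped = all_scores[segments == all_scores]
print(pd.DataFrame({"score": all_scores, "segment": segments}))
print("unmapped scores:", unmapped.tolist())
```

If the patterns are later edited, rerunning this check shows whether a gap has been introduced.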

The resulting customer segments generated by RFM analysis can be valuable for businesses. By grouping customers with similar purchasing patterns, preferences, and behaviors, organizations can tailor their marketing strategies, product offerings, and customer experiences to the specific needs of each group.

For example, high-value customers who recently purchased may receive personalized offers to encourage repeat purchases. Dormant customers with low monetary value may be targeted with re-engagement campaigns to regain their interest.

2.0 CLTV Customer Segmentation with Python

CLTV (Customer Lifetime Value) is a crucial metric that quantifies a customer's monetary worth to a company over the entire duration of their relationship and communication. It encompasses not only the immediate revenue generated from individual purchases but also factors in the potential for future transactions and customer loyalty.

By calculating CLTV, businesses gain insights into the value and profitability of each customer. This knowledge allows them to identify high-value customers that contribute significantly to the company’s revenue. With this understanding, businesses can focus their resources and efforts on nurturing and retaining these valuable customers, maximizing long-term profitability.

2.1 The Calculations Underlying CLTV Segmentation

CLTV can be calculated with the following formula:

CLTV = (Customer Value / Churn Rate) * ProfitMargin

Let’s break down the values in this formulation and closely examine their statistical calculations.

# Customer Value = Average Order Value * Purchase Frequency

# Average Order Value = Total Price / Total Transaction

# Purchase Frequency = Total Transaction / Total Number of Customers

# Repeat Rate = Number of Customers with Multiple Purchases / Total Customers

# Churn Rate = 1 - Repeat Rate

When the obtained CLTV values are used to rank and divide customers into groups at specific points, a segmentation occurs, and customers are categorized into segments.
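To make the formulas concrete, here is a worked sketch on three made-up customers. The figures and the `toy` frame are hypothetical, and the profit margin is taken as 10% of each customer's total spend, the same flat rate used in the calculation code later in this article.

```python
import pandas as pd

# Hypothetical per-customer totals
toy = pd.DataFrame(
    {"Total_Transaction": [1, 2, 5], "Total_Price": [100.0, 200.0, 300.0]},
    index=["A", "B", "C"],
)
n_customers = toy.shape[0]

# Customer Value = Average Order Value * Purchase Frequency
toy["Average_Order_Value"] = toy["Total_Price"] / toy["Total_Transaction"]
toy["Purchase_Frequency"] = toy["Total_Transaction"] / n_customers
toy["Customer_Value"] = toy["Average_Order_Value"] * toy["Purchase_Frequency"]

# Repeat Rate = share of customers with more than one purchase (2 of 3 here)
repeat_rate = (toy["Total_Transaction"] > 1).mean()
churn_rate = 1 - repeat_rate

# Assumed flat 10% margin on total spend
toy["Profit_Margin"] = toy["Total_Price"] * 0.10

# CLTV = (Customer Value / Churn Rate) * Profit Margin
toy["CLTV"] = toy["Customer_Value"] / churn_rate * toy["Profit_Margin"]
print(toy.round(2))
```

Note that the churn rate is a single number for the whole customer base, so it scales every customer's CLTV by the same factor and does not change their ranking. The formula also divides by the churn rate, so it breaks down if every customer has purchased more than once.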

2.2 Dataset

I will also use the Online Retail dataset for the CLTV segmentation. You can reach the dataset and my whole CLTV segmentation work from the following Kaggle link:

Customer Segmentation with CLTV

Explore and run machine learning code with Kaggle Notebooks | Using data from E-Commerce Data

www.kaggle.com

2.3 CLTV Calculations

# Find the total price for CLTV calculation

df["Total_Price"] = df["Quantity"] * df["UnitPrice"]

# Create cltv_df with the total price, total transaction, and total unit for calculation

cltv_df = df.groupby("CustomerID").agg({"InvoiceNo": lambda x: x.nunique(),
                             "Quantity": "sum",
                             "Total_Price": "sum"})

# Rename columns for CLTV calculation

cltv_df.columns = ["Total_Transaction", "Total_Unit", "Total_Price"]

Once we create the CLTV data frame, we can calculate all other values using it.

# Calculate the average order value

cltv_df["Average_Order_Value"] = cltv_df["Total_Price"] / cltv_df["Total_Transaction"]

# Calculate purchase frequency

cltv_df["Purchase_Frequency"] = cltv_df["Total_Transaction"] / cltv_df.shape[0]

# Calculate customer value

cltv_df["CV"] = cltv_df["Average_Order_Value"] * cltv_df["Purchase_Frequency"]

# Calculate churn rate

Repeat_Rate = cltv_df[cltv_df["Total_Transaction"] > 1].shape[0] / cltv_df.shape[0]

Churn_Rate = 1 - Repeat_Rate

# Calculate profit margin (assumes a flat 10% margin on each customer's total spend)

cltv_df["Profit_Margin"] = cltv_df["Total_Price"] * 0.10

# Calculate the customer lifetime value

cltv_df["CLTV"] = (cltv_df["CV"] / Churn_Rate) * cltv_df["Profit_Margin"]

2.4 Analyzing the Segments and Finalizing the Segmentation Criteria

# Segmentation based on CLTV

cltv_df["segments"] = pd.qcut(cltv_df["CLTV"], 5, labels= ["D", "C", "B", "A", "S"])

# Descriptive stats for segments

cltv_df.groupby("segments").agg(["count", "mean", "sum"])

After completing the segmentation process, our next step should be to analyze the segments, both to settle on a suitable number of segments and to obtain segment averages aligned with our business objectives. This stage is challenging without a solid understanding of the business and the details of the project, because it is hard to judge which split is best. When we lack that knowledge, we aim for a logical segmentation by using statistical approaches and data visualization to gather as much information as possible from the data.

For instance, I complete the segmentation process by using a radar chart to visualize the distribution of segments across the data.

# Radar chart
import numpy as np
import matplotlib.pyplot as plt

r_segment_stats = cltv_df.groupby("segments")[["Average_Order_Value", "Total_Transaction", "Profit_Margin"]].mean()

# Normalize the data
segment_stats_normalized = (r_segment_stats - r_segment_stats.min()) / (r_segment_stats.max() - r_segment_stats.min())

# Create a radar chart to visualize the statistics by segments
angles = np.linspace(0, 2 * np.pi, len(r_segment_stats.columns), endpoint=False).tolist()
angles += angles[:1]  # Close the plot
plt.figure(figsize=(8, 8))
for segment in segment_stats_normalized.index:
    values = segment_stats_normalized.loc[segment].tolist()
    values += values[:1]  # Close the plot
    plt.polar(angles, values, marker='o', label=segment)
plt.xticks(angles[:-1], r_segment_stats.columns)
plt.yticks([0.25, 0.5, 0.75, 1.0])
plt.title('Segment Comparison')
plt.legend(title='Segments')
plt.show()

Appreciation and Acknowledgment

This article provides the core steps for performing RFM and CLTV customer segmentation using Python. To keep it concise, it does not include the Python code for data preprocessing, exploratory data analysis, and other actions that may be part of these processes. RFM and CLTV segmentation studies depend heavily on the available data and the specific goals of the analysis. Nevertheless, I have shared the fundamental steps involved in these processes.

I hope you have found this article on customer segmentation using CLTV and RFM with Python helpful.

Please don’t hesitate to reach out with any questions or suggestions, and if you liked it, feel free to give it a clap. Thank you for reading!