RFM Model Explanation: How to Turn Historical Data Into Future Sales.

Learn how to predict customer spending using historical transactions with the RFM model.

Asish Biswas
AnalyticSoul
5 min readJun 11, 2024

--

Welcome to the final lesson in our chapter on Customer Lifetime Value and Revenue Prediction! In this lesson, we will predict the quarterly spending of a customer from historical transactions. We will use the CDNOW dataset, which contains the historical online purchases of compact disks, and apply the famous RFM (recency, frequency, and monetary) model to predict quarterly sales numbers.

Feature details

Here is the detail of the dataset.

RFM model

RFM model is a method used in marketing to analyze and segment customers based on their historical purchasing behavior. The three elements of the RFM model are:

  • Recency: How recently a customer made a purchase.
  • Frequency: The number of purchases made by a customer in a given time period.
  • Monetary: The total amount spent by a customer.

Data processing

Let’s first load the dataset, and change the data type of the column date from string to datetime. Then we calculate the revenue from quantity and unit price. We also derive the year_month from the date attribute. This attribute will be used to understand customer transactions per month.

import pandas as pd
import numpy as np

df_cdnow_tr = pd.read_csv('data/cdnow.csv', index_col=0)

# fix the data type and parse datetime
df_cdnow_tr['date'] = pd.to_datetime(df_cdnow_tr['date'])

# calculate new attributes
df_cdnow_tr['year_month'] = df_cdnow_tr['date'].dt.to_period('M')
df_cdnow_tr['revenue'] = df_cdnow_tr['price'] * df_cdnow_tr['quantity']

df_cdnow_tr.head()
CD Now Dataset

Method

Our idea is to split the dataset and keep 90 days of transactions as the target and use the rest of the transactions as our training data. Then we’ll use the training data to train a regression model and try to predict last quarter’s sales as well as if a customer will make a purchase.

(Image by author)

Step 1: Cutoff date calculation

Split the transactions before and after the cutoff date, and prepare the target dataset. Then we aggregate the total amount of sales per customer. We will also set a flag to indicate that the customer was active in the last quarter. We define the days in a quarter as 90, and we want to predict quarterly sales per customer. Then we split the main dataset into two based on our cutoff date. Then we aggregate the last quarter’s total revenue per customer and also assign a flag for those customers as active.

# earliest and latest transaction dates
print('earliest date: ', min(df_cdnow_tr['date']))
print('latest date: ', max(df_cdnow_tr['date']))

n_days = 90
cutoff_date = max(df_cdnow_tr['date']) - pd.to_timedelta(n_days, unit='d')

# split the original dataset into two using the cutoff date
df_in = df_cdnow_tr[df_cdnow_tr['date'] < cutoff_date]
df_out = df_cdnow_tr[df_cdnow_tr['date'] >= cutoff_date]

df_targets = (
df_out
.drop(columns=['quantity'])
.groupby('customer_id')
.sum('revenue')
.rename({'revenue': 'spend_90_days'}, axis=1)
.assign(spend_90_flag=1)
.reset_index()
)
print(df_targets.head())

Step 2: RFM base preparation

Now, we’ll process the transactions before the cutoff date. We will calculate the recency, frequency, and monetary values.

We aggregate the transactions before the cutoff date using the groupby() function and calculate the first and last transaction dates. At the same time we calculate the number of unique transaction dates, which is our frequency. We also aggregate the total revenue which is our monetary value. Then we calculate the recency using the cutoff date and last transaction date.

# split the original dataset into two using the cutoff date
df_in = df_cdnow_tr[df_cdnow_tr['date'] < cutoff_date]
df_out = df_cdnow_tr[df_cdnow_tr['date'] >= cutoff_date]

# prepare the base of RFM
df_in = df_in.groupby('customer_id').agg({
'date': ['min', 'max', 'nunique'],
'price': 'sum'
})

# column rename + calculate FM
df_in.columns = ['_'.join(col).strip() for col in df_in.columns.values]
df_in.rename(columns={
'date_min': 'first_transaction',
'date_max': 'last_transaction',
'date_nunique': 'frequency',
'price_sum': 'monetary'
}, inplace=True)

# calculate R (recency)
df_in['recency'] = (cutoff_date - df_in['last_transaction']).dt.days
df_in = df_in.drop(columns=['first_transaction', 'last_transaction']).reset_index()

print(df_in.head())

Step 3: Merge data frames

We merge the RFM dataset with the target dataset where the last quarter’s total revenue is stored. This is going to be the base dataset for our regression model.

df_features = df_in.merge(df_targets, how='left', on='customer_id')
df_features.fillna(0, inplace=True)

print(df_features.head())

Predict future sales

Now that our base dataset is ready, let’s use a linear regression model to predict the next quarter’s revenue per customer.

# seperate the features and the target label
X = df_features[['frequency', 'monetary', 'recency']]
y = df_features['spend_90_days']

# split train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# build and train the lr model
model = LinearRegression()
model.fit(X_train, y_train)

# predict
y_pred = model.predict(X_test)

print("Root mean squared error: %.2f" % mean_squared_error(y_test, model.predict(X_test), squared=False))
  • We identify the target label and the feature columns.
  • We split the training and testing datasets.
  • We build and train a linear regression model.
  • We predict with the test dataset.

The model’s prediction is quite good. The RMSE is around 424, which is not bad.

Visualize results

Let’s have a look at how predictions compare with actual revenue for the last quarter with a histogram.

predictions = pd.DataFrame({'y_test': y_test.array, 'y_pred': y_pred})
predictions = predictions[(predictions['y_test'] > 0) & (predictions['y_test'] < 1000)]

# plot predictions against the actual value
plt.figure(figsize=(8, 6))
sns.histplot(data=predictions, x='y_test', kde=True, color='blue', alpha=0.4)
sns.histplot(data=predictions, x='y_pred', kde=True, color='red', alpha=0.4)

plt.legend(['y_test', 'y_pred'])
plt.ylabel('Revenue')
plt.show()

Learn more by coding along the Notebook.

Congratulations on reaching this far. We have covered a range of important topics in the field of marketing analytics and machine learning. You have learned customer revenue prediction with linear regression, customer segmentation, churn prediction with logistic regression, and customer lifetime value prediction. We hope you enjoyed this course and learned about applied machine learning in marketing analytics. Thank you for taking part in this journey.

Recap

Join the community

Join our vibrant learning community on Discord! You’ll find a supportive space to ask questions, share insights, and collaborate with fellow learners. Dive in, collaborate, and let’s grow together! We can’t wait to see you there!

Thanks for reading! If you like this tutorial series make sure to clap (up to 50!) and let’s connect on LinkedIn and follow me on Medium to stay updated with my new articles.

Support me at no extra cost by joining Medium via this referral link.

--

--