Customer Lifetime Value (CLTV) Prediction Using BG/NBD and Gamma-Gamma Models in Python

Nesrin ÖZCAN · Akkim Akademi · Dec 28, 2022

Customer Lifetime Value (CLTV) is an important metric in marketing: it is the predicted total amount of money a customer is expected to spend on your products. It gives your company a sensible view of the financial value each customer provides. CLTV can be used as a KPI and as a loyalty measure. It can be calculated from the existing purchase history of your customers, and it can also be predicted for selected time periods in the future. For CLTV prediction, either statistical models or deep learning models can be used. In this project I will use the statistical BG/NBD and Gamma-Gamma models, with the Python lifetimes package as the main tool. First I will import the necessary libraries and then read the data.csv file. You can reach the Kaggle dataset here.

import datetime as dt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.plotting import plot_period_transactions
from lifetimes.utils import summary_data_from_transaction_data

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Data Preprocessing

Looking at null values, the Description and CustomerID features have many. I will not use the Description feature, so its missing values do not matter. However, I will calculate CLTV per CustomerID, so rows with a null CustomerID are useless and I will drop them from the dataset.

The Invoice Date feature has the object data type, so I will convert it to datetime. The Amount and Price values also have to be positive.

# Read the transactions and keep only the columns needed for CLTV
df = pd.read_csv('data.csv', encoding='unicode_escape')
df.columns = ["INVOICE_ID", "STOCK_ID", "DESCRIPTION", "AMOUNT", "INVOICE_DATE", "PRICE", "COMPANY_ID", "COUNTRY"]
df = df[["INVOICE_ID", "STOCK_ID", "AMOUNT", "INVOICE_DATE", "PRICE", "COMPANY_ID"]]
df.dropna(inplace=True)  # after dropping DESCRIPTION, this removes rows with a null CustomerID
df["INVOICE_DATE"] = pd.to_datetime(df["INVOICE_DATE"])
df = df[(df["AMOUNT"] > 0) & (df["PRICE"] > 0)]  # keep only positive quantities and prices

The dataset has the numeric features Price and Amount. I will take a look at outliers in these features with boxplots.

sns.boxplot(df["PRICE"], orient="h",notch=True, showcaps=False)
sns.boxplot(df["AMOUNT"], orient="h",notch=True, showcaps=False)

To get more meaningful boxplots, we have to deal with the outliers. Handling outliers is a necessary step in data analysis: values that are much bigger or much smaller than expected have a big impact on statistics like the mean and the standard deviation, so they have to be examined before prediction and analysis. There are several ways to deal with outliers; for this project I will compute thresholds from the 0.05 and 0.95 quantiles and cap the outliers at those threshold values.

def outlier_thresholds(dataframe, variable):
    quartile1 = dataframe[variable].quantile(0.05)
    quartile3 = dataframe[variable].quantile(0.95)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
    # cap only the upper outliers; the lower limit is not needed here
    # because AMOUNT and PRICE were already filtered to positive values
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

replace_with_thresholds(df, "AMOUNT")
replace_with_thresholds(df, "PRICE")

After replacing outliers with threshold values, let’s look at the boxplots again.
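
Re-running the same boxplot calls as above shows the capped distributions:

sns.boxplot(df["PRICE"], orient="h", notch=True, showcaps=False)
plt.show()
sns.boxplot(df["AMOUNT"], orient="h", notch=True, showcaps=False)
plt.show()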

Now the data distributions for Price and Amount features make more sense for our CLTV project.

I need an ending date for the CLTV calculation. I will find the max date in my dataset, add 2 days to it, and use the result as the end of the observation period. We also need a Total Price feature to be able to calculate CLTV.

from datetime import timedelta
df["TOTAL_PRICE"] = df["AMOUNT"] * df["PRICE"]
today_date= df.INVOICE_DATE.max() + timedelta(days=2)
today_date

Output : Timestamp('2011-12-09 12:50:00')

Frequency, Recency, Tenure and Monetary values

We need frequency, recency, tenure and monetary values for the further calculations. We could derive these values in a few steps ourselves, but the lifetimes library already has a function that calculates them easily: summary_data_from_transaction_data returns a dataframe with frequency, recency, T and monetary_value columns. The function has a "freq" parameter that lets you set the time unit: "D" for daily, "W" for weekly, "M" for monthly and "Y" for yearly. I will set it to "W" to get weekly results.

frequency: the number of periods in which the customer made a purchase (with include_first_transaction=True below, the first purchase is counted as well)

recency: the customer's age at the time of the most recent purchase, i.e. the time between the first and the last purchase

T (tenure): the customer's age, i.e. the time between the first purchase and the end of the observation period (here in weeks)

monetary_value: the average value of the customer's purchases

cltv_df = summary_data_from_transaction_data(df, 'COMPANY_ID', 'INVOICE_DATE', 'TOTAL_PRICE',
                                             include_first_transaction=True,
                                             freq="W",
                                             observation_period_end=today_date)
cltv_df.describe().T

Modeling and Evaluation

BG/NBD stands for "Beta Geometric/Negative Binomial Distribution". The BG/NBD model uses each customer's historical purchase behavior to forecast their future activity: while a customer is active, their transactions follow a Poisson process with a Gamma-distributed rate across customers (hence the negative binomial), and after each transaction the customer may drop out with a Beta-distributed probability (hence the beta-geometric).

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(cltv_df['frequency'],
        cltv_df['recency'],
        cltv_df['T'])
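
The plot_period_transactions helper imported at the top can be used here to sanity-check the fit: it compares the actual and the model-predicted number of customers for each repeat-transaction count.

# actual vs. predicted number of customers per repeat-purchase count
plot_period_transactions(bgf)
plt.show()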

ggf = GammaGammaFitter(penalizer_coef=0.01)
ggf.fit(cltv_df['frequency'], cltv_df['monetary_value'])
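
The Gamma-Gamma model assumes that purchase frequency and monetary value are roughly uncorrelated, and the lifetimes documentation suggests checking this before relying on the model. A quick check on our summary dataframe:

# the correlation should be close to zero for the Gamma-Gamma assumption to hold
cltv_df[["frequency", "monetary_value"]].corr()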

cltv = ggf.customer_lifetime_value(bgf,
                                   cltv_df['frequency'],
                                   cltv_df['recency'],
                                   cltv_df['T'],
                                   cltv_df['monetary_value'],
                                   time=4,  # projection horizon in months
                                   freq="W",  # time unit of T in the fitted models (weekly)
                                   discount_rate=0.01)
cltv = cltv.reset_index()
cltv_final = cltv_df.merge(cltv, on="COMPANY_ID", how="left")
cltv_final["COMPANY_ID"] = cltv_final["COMPANY_ID"].astype(int)
cltv_final.sort_values('clv', ascending=False).head()

Segmentation according to CLTV values

I will divide the customers into four equal-sized segments (A to D) based on their CLV values.

cltv_final["segment"] = pd.qcut(cltv_final["clv"], 4, labels=["D", "C", "B", "A"])
cltv_final[cltv_final["segment"] =="A"].sort_values('clv', ascending=False).head()
cltv_final[cltv_final["segment"] =="B"].sort_values('clv', ascending=False).head()

Further Insights

Expected number of purchases with the BG/NBD model

Fitting the BetaGeoFitter model to the dataset is enough for the remaining steps of CLTV prediction, but I will apply some other functions to this model to get more insight into future purchases.

from lifetimes.plotting import plot_frequency_recency_matrix
from lifetimes.plotting import plot_probability_alive_matrix

plot_frequency_recency_matrix(bgf, max_frequency=50, max_recency=50)
plot_probability_alive_matrix(bgf)

Customers with higher recency and higher frequency are expected to purchase more products. As frequency increases, the probability of being alive stays low if recency is low, and is higher if recency is high. Seeing the probability of being alive for each individual customer makes the second graph clearer, so I will look into the latency measure.
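
The fitted BG/NBD model can already produce this per-customer probability. A minimal sketch (prob_alive is a column name I am introducing here):

# probability that each customer is still "alive" given their purchase history
cltv_final["prob_alive"] = bgf.conditional_probability_alive(cltv_final['frequency'],
                                                             cltv_final['recency'],
                                                             cltv_final['T'])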

Purchase Latency

Purchase latency is a measure of the number of days between a customer's orders. After the first order, a p_alive (probability of being alive) value starts to be tracked for each customer. The average elapsed time between orders is recalculated after each order, which results in a different p_alive.

If a customer orders again after a gap that is much longer than their previous average interval, the p_alive value decreases more slowly over the next long gap. In other words, the probability of being alive drops less steeply, and then the average interval and the probability-of-being-alive function are calculated again. This repeats as long as the customer keeps purchasing, as you will see below for the customer with COMPANY_ID = 12352.

from matplotlib.pyplot import figure
from lifetimes.plotting import plot_history_alive
figure(num=None, figsize=(10, 4), dpi=80, facecolor='w', edgecolor='k')
days_since_birth = 700
plot_history_alive(bgf, days_since_birth, df.loc[df["COMPANY_ID"] == 12352.00], 'INVOICE_DATE')

Next, I will calculate the expected number of purchases using the BG/NBD model and add it as a column to my dataframe.

cltv_final["expected_purc_1_month"] = bgf.conditional_expected_number_of_purchases_up_to_time(4,
cltv_final['frequency'],
cltv_final['recency'],
cltv_final['T'])

Testing purchase predictions with a holdout group

from lifetimes.utils import calibration_and_holdout_data
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases

summary_cal_holdout = calibration_and_holdout_data(df,
                                                   'COMPANY_ID',
                                                   'INVOICE_DATE',
                                                   calibration_period_end='2011-08-01',
                                                   observation_period_end=today_date)
bgf.fit(summary_cal_holdout['frequency_cal'],
        summary_cal_holdout['recency_cal'],
        summary_cal_holdout['T_cal'],
        tol=1e-04)

plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
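
Beyond the plot, the calibration and holdout predictions can be compared numerically. A minimal sketch using the columns produced by calibration_and_holdout_data (predicted_purchases is a name I am introducing):

# predict each customer's purchases over the holdout period and
# compare the average prediction with what actually happened
summary_cal_holdout["predicted_purchases"] = bgf.predict(summary_cal_holdout["duration_holdout"],
                                                         summary_cal_holdout["frequency_cal"],
                                                         summary_cal_holdout["recency_cal"],
                                                         summary_cal_holdout["T_cal"])
summary_cal_holdout[["frequency_holdout", "predicted_purchases"]].mean()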

GAMMA-GAMMA Model

I will use the Gamma-Gamma model's conditional_expected_average_profit function, which calculates the predicted average profit per transaction for each customer.

cltv_final["expected_average_profit"] = ggf.conditional_expected_average_profit(cltv_final['frequency'],
cltv_final['monetary_value'])
cltv_final.sort_values("expected_purc_1_month", ascending=False).head()

Conclusion

In conclusion, I have predicted CLTV for the next 4 months, along with the expected number of purchases for the next month and the expected average profit per transaction for each customer. Being aware of your best customers and predicting who is likely to churn is useful for creating different marketing strategies for different customer segments. This approach is quite simple to apply compared to machine learning models, but that simplicity comes with some disadvantages: for pure prediction, machine learning and deep learning models will generally perform better. For CLTV calculation, customer segmentation and the purchase latency measure, however, this model is a solid option.

You can find the Jupyter Notebook for this article here.
