A Bayesian Approach to Customer Lifetime Valuation

Laurin Brechter
6 min read · Jul 30, 2023


Link to Code in GitHub

Introduction

After looking at causal inference with PyMC last week, I wanted to turn to another domain where Bayesian methods can be very helpful: marketing.

We will look at a common problem in marketing: Customer Lifetime Valuation (CLV). We can also understand this as estimating the economic worth of a customer to the company. More precisely, we want to estimate the net cash flow (gains minus costs) that a customer will accumulate throughout their lifetime. In this case, lifetime refers to the time that the customer will spend with the company. We can express CLV in the following way, where CF_i(t) is the cash flow that customer i will yield in period t and T_i is the length of that customer's lifetime. Note that we are always splitting time into discrete periods (e.g. months or weeks). Since money that we will only receive in the future is less valuable, we have to discount future cash flows. This makes sense intuitively, as $100 now is more valuable to us than $100 a year from now.
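Written out, with d as the per-period discount rate (summing from t = 1 to T_i is one common indexing convention):

CLV_i = \sum_{t=1}^{T_i} \frac{CF_i(t)}{(1 + d)^t}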

Formula by author, adapted from (link)

There are two main unknowns for each customer. Firstly, a priori, we don't know T_i, i.e. how long a customer will stay with the company. Secondly, we don't know how much revenue they will earn the company in each period (the CF function). The former problem is worse in a non-contractual setting. This means that the customer can simply stop buying and doesn't have to terminate a contract to stop being a buyer. As a consequence, the company cannot pin T_i down to an exact point in time.

Any model will first need features from which it can estimate the CLV. These can be extracted from the customers' transaction history. Many models use the following attributes:

  • Frequency: Number of repeat transactions that a customer made (Number of transactions -1)
  • Recency: Number of time periods that have gone by since the last transaction of the customer.
  • Age: Number of time periods between the first transaction and the present.
  • Monetary Value: The average monetary value of a transaction.

Note that I am using ‘purchase’ and ‘transaction’ interchangeably here; we can see a transaction as a generalization of a purchase, in case we want to use a CLV model in some other setting.
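To make these definitions concrete, here is a small worked example for a single, hypothetical seller (the numbers are made up purely for illustration):

# Hypothetical transaction history of one seller, observed up to day 50.
purchase_days = [0, 10, 30]
purchase_values = [100.0, 80.0, 120.0]
present_day = 50

frequency = len(purchase_days) - 1                # 2 repeat transactions
recency = present_day - max(purchase_days)        # 20 days since the last transaction
age = present_day - min(purchase_days)            # 50 days since the first transaction
monetary_value = sum(purchase_values) / len(purchase_values)  # 100.0 per transaction on average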

The Dataset

I have taken the Brazilian E-Commerce Public Dataset by Olist from Kaggle. From my understanding, Olist is a platform similar to eBay that serves as a marketplace for sellers to sell their goods to customers.

In our case, we will try to value the sellers on the platform instead of the customers of the sellers. This is because most customers have only made 1 or 2 orders (and therefore 0 or 1 repeat purchases), making analysis difficult. As is often the case with real-world data, the data does not come in a single table ready for analysis. Instead, we have to collect it from three different tables. In the payments table, we can see what the transaction value was. In the orders table, we can see what customers of the seller bought and when. In the order items table, we can see who the seller of the order was. For the sake of brevity I will not include the preprocessing code here; you can find it in my GitHub.
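The gist of the join looks roughly like this (a rough sketch only; the file names are the ones I assume from the Kaggle download, and the actual preprocessing lives in the GitHub repo):

import pandas as pd

# Assumed file names from the Kaggle download of the Olist dataset.
orders = pd.read_csv("olist_orders_dataset.csv")
items = pd.read_csv("olist_order_items_dataset.csv")
payments = pd.read_csv("olist_order_payments_dataset.csv")

# Orders give us the purchase timestamp, order items give us the seller,
# payments give us the transaction value.
merged = (
    orders[["order_id", "order_purchase_timestamp"]]
    .merge(items[["order_id", "seller_id"]], on="order_id")
    .merge(payments[["order_id", "payment_value"]], on="order_id")
)
merged["order_purchase_timestamp"] = pd.to_datetime(merged["order_purchase_timestamp"])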

Once we have joined our tables and removed unnecessary columns, we can aggregate by seller. Note that we are calculating exactly the attributes described in the previous section. I am taking the latest purchase date as the ‘present’, as the data is already two years old. If we haven't done so before, we should also filter out sellers with no repeat purchases (i.e. those that only sold something once).

data = merged.groupby('seller_id').agg(
    # average transaction value of the seller
    monetary_value=('payment_value', 'mean'),
    # number of repeat transactions (number of transactions - 1)
    frequency=('order_id', lambda x: len(x) - 1),
    # days since the seller's last transaction, measured from the latest date in the data
    recency=('order_purchase_timestamp',
             lambda x: (merged['order_purchase_timestamp'].max() - x.max()).days),
    # days between the seller's first transaction and the latest date in the data
    T=('order_purchase_timestamp',
       lambda x: (merged['order_purchase_timestamp'].max() - x.min()).days),
)

data = data.loc[data.frequency != 0]

Data distribution of the features

From the histograms we can see that frequency and monetary value in particular are very fat-tailed. This could make inference hard for our models. We will therefore take the logarithm of these two features.

data["log_freq"] = np.log(data["frequency"]
data["log_monetary"] = np.log(data["monetary_value"]
Features after transformation

Evidently, the distributions of these two features are now much better behaved.

Modelling

We will start by using the BG/NBD model from PyMC Marketing. Note that this model can't estimate the CLV directly but rather estimates how many purchases will happen in the time periods to come. I will not cover the model in depth here, but the paper that introduced it explains it quite well. At a high level, the model assumes that, while a customer is still ‘alive’, transactions follow a Poisson process with rate lambda. After each transaction, the customer can become inactive with probability p. Both p and lambda in turn have their own distribution from which they are drawn: a Beta distribution for p and a Gamma distribution for lambda.
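To build intuition, here is a minimal simulation of that generative story for a handful of customers (the parameter values are arbitrary and purely illustrative; this is not the pymc-marketing implementation):

import numpy as np

rng = np.random.default_rng(42)

def simulate_customer(r, alpha, a, b, T):
    # Draw this customer's individual purchase rate and dropout probability.
    lam = rng.gamma(shape=r, scale=1 / alpha)  # lambda ~ Gamma(r, alpha)
    p = rng.beta(a, b)                         # p ~ Beta(a, b)
    t, n_transactions = 0.0, 0
    while True:
        t += rng.exponential(1 / lam)          # waiting time until the next transaction
        if t > T:                              # observation window ends
            break
        n_transactions += 1
        if rng.random() < p:                   # customer becomes inactive after this transaction
            break
    return n_transactions

# Five simulated customers observed for 365 days (arbitrary parameter values).
print([simulate_customer(r=2.0, alpha=10.0, a=2.0, b=5.0, T=365) for _ in range(5)])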

import pymc as pm
from pymc_marketing import clv

beta_geo_model = clv.BetaGeoModel(
    customer_id=data.index,
    frequency=data["log_freq"],
    recency=data["recency"],
    T=data["T"],
    a_prior=pm.HalfNormal.dist(10),
    b_prior=pm.HalfNormal.dist(10),
    alpha_prior=pm.HalfNormal.dist(10),
    r_prior=pm.HalfNormal.dist(10),
)

Setting up this model is pretty straightforward: we only have to tell PyMC where to find the relevant features in our dataset. Note that, similar to the example notebook on the PyMC Marketing page, I had to set the priors to HalfNormal distributions instead of flat priors, as this yielded better results.
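Fitting then comes down to a single call (a sketch assuming the pymc-marketing API at the time of writing; .fit() runs MCMC sampling under the hood and returns an ArviZ InferenceData object):

import arviz as az

# Sample from the posterior; depending on the pymc-marketing version the
# result may also be stored on the model object itself.
idata = beta_geo_model.fit()

# Posterior summary of the population-level parameters (names follow the model definition).
az.summary(idata, var_names=["a", "b", "alpha", "r"])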

Model Troubleshooting

Bayesian modelling is a highly iterative process in which you often have to repeat the loop of changing the model structure/parameters and looking at results/diagnostics multiple times.

In this case, using the logarithm of frequency instead of the ‘raw’ frequency worked very well, as the model had many divergences without the transformation. Also, switching from Flat to HalfNormal priors greatly reduced the time it took to run the model (an order of magnitude faster).
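A quick way to check for divergences after fitting (assuming the InferenceData object from the fit above):

# Count divergent transitions across all chains.
n_divergences = int(idata.sample_stats["diverging"].sum())
print(f"Divergences: {n_divergences}")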

Model Results
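The frequency/recency heatmap discussed below can be produced with the plotting helper that ships with pymc-marketing (a sketch; the helper mirrors the one from the lifetimes package, and exact names may differ between versions):

from pymc_marketing import clv
import matplotlib.pyplot as plt

# Expected number of purchases in the next t time units (one day here, since
# our unit of time is a day) for every combination of frequency and recency.
clv.plot_frequency_recency_matrix(beta_geo_model, t=1)
plt.show()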

The resulting plot is a heatmap that specifies how many future transactions we can expect from a customer, given that they have a specific frequency and recency. This is a good moment to check whether the results match what we would expect. Generally speaking, sellers with a high historical frequency are also expected to have many future transactions. Note that in this case we have implicitly set the ‘unit of time’ to one day, hence the relatively small numbers. We can also look at the parameters that the model inferred.

Parameters of the model

The parameters ‘a’ and ‘b’ govern the shape of the Beta distribution from which we draw p.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Posterior means of a and b from the fitted model.
a, b = 22.4, 47

# Plot the implied Beta distribution between its 1st and 99th percentiles.
x = np.linspace(beta.ppf(0.01, a, b), beta.ppf(0.99, a, b), 100)
fig, ax = plt.subplots(1, 1)
ax.plot(x, beta.pdf(x, a, b), alpha=0.6)
plt.show()

If we simply plug in the mean values obtained for a and b, we get the following Beta distribution. Note that the probability p of a customer becoming inactive varies from customer to customer; its mean here is a / (a + b) = 22.4 / 69.4 ≈ 0.32. From this distribution we can see that it mostly ranges from about 0.2 to 0.45.

Conclusion

Bayesian methods have proven to be very helpful in the context of marketing analytics, for multiple reasons. Although data is oftentimes ‘big’ at the group level, when we come down to the level of a single individual (e.g. a single customer) we often only have a few observations (e.g. 2–3 purchases). Bayesian modelling can be especially valuable here by means of hierarchical modelling. Additionally, Bayesian methods are well suited for reflecting the uncertainty in the data.

This was just meant to be a short introduction to CLV estimation with PyMC Marketing. I am planning to expand on this in the coming weeks, so stay tuned!
