Individual Customer Lifetime Value (CLV) with R and Stan

Alex Pavlakis
Ordergroove Engineering
9 min read · Jul 13, 2018

INTRODUCTION

As Data Scientists at OrderGroove, we are driven in everything we do by a relationship commerce imperative: we seek to remove all friction from customers’ experiences so that retailers become indispensable in the lives of each and every one of their customers. As a result, all our models and analyses focus on the purchasing behavior of individuals. It’s not enough to understand customers on average; we need a full picture of what makes each one unique.

This approach can be particularly helpful when thinking about customer lifetime value (CLV), or the total amount customers will spend over their ‘lifetime’ as customers. Most companies that go to the effort of estimating CLV are comfortable forecasting customer spending in aggregate, faceting by acquisition channel or other broad categories, but our goal is to understand the potential spending patterns of individual customers in order to unlock the full potential of relationship-building programs. The approach we’ll outline in this post relies on the work of researchers such as OrderGroove advisor Peter Fader and advances in probabilistic programming to estimate full distributions of potential value for every single customer in a dataset. Using Stan, a language for Bayesian inference accessible through R and Python, we can estimate CLV with individual-level granularity.

BACKGROUND

Many companies that take the trouble to understand their customers through statistics like CLV look only at broad averages based on geography, sales channels, or demographics, which can introduce large unseen biases. Companies that ignore individual-level variation in CLV are likely to over- or underestimate their customers’ future spending. This can lead to an incorrect picture of the health of the business, resulting in misallocated resources, misguided strategies, and missed revenue opportunities.

Customers exhibit a wide range of spending behavior, from one-and-done transactions to long-term relationships or subscriptions. Understanding and leveraging those differences lets companies build and protect customer relationships and increase market share. For example, suppose companies A and B have the same average customer churn rate, but company A has a mix of very loyal and one-and-done customers while company B’s customers are homogeneous. Company A can be much more valuable, provided it recognizes this heterogeneity and acts on it. At OrderGroove, we work with hundreds of merchants, so thinking about merchant- and customer-level heterogeneity is important for everything we do.

INDIVIDUAL APPROACH

We illustrate one approach to CLV modeling with a simplified data set including only purchase amounts and dates. This event-log style data is available to most ecommerce companies, and is all that is needed to estimate a baseline CLV model.

The math behind the model can get a little tricky, but the intuition is simple. We break each customer’s spending into three component parts:

  1. Transaction rate, or the number of transactions a customer makes in a given time period;
  2. Dropout rate, or the probability that a customer stops shopping over a given time period; and
  3. Average spending, or a customer’s average transaction amount (a sketch of how these pieces combine follows this list).
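These three pieces combine multiplicatively. Roughly speaking, a customer’s expected lifetime value over a given horizon is their expected number of future transactions times their expected spend per transaction:

$$\mathrm{E}[\mathrm{CLV}] \;\approx\; \mathrm{E}[\text{future transactions}] \times \mathrm{E}[\text{spend per transaction}].$$

The first factor is driven by the transaction and dropout rates, the second by average spending; this is the product the Stan program below computes for each customer.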

One advantage of this model is that these intermediate results can be useful on their own. For instance, dropout rates provide quick survival analyses that give us a sense of which customers are at the highest risk of ‘churning’. If a customer with a high transaction rate or high spend-per-transaction is at risk of churning, we may want to put in some extra effort to keep their business.

In this post, we’ll use Stan to estimate all three of these quantities and put them together into a baseline estimate of CLV. Stan produces these estimates with Hamiltonian Monte Carlo (HMC) and provides full distributions of all parameters and quantities of interest so we can naturally quantify our uncertainty around CLV for each customer. For this example, we’ll use a sample from the cdnow dataset, which is available online and often used for building and evaluating CLV models. It tracks the repeat spending behavior of a set of customers who bought something in early 1997.

The plot below displays purchase timelines for the dataset. Each dot represents a transaction; the x-axis shows when it occurred in time, and the size of the dot shows its relative amount. Some customers have been making transactions regularly throughout the entire period; others have made only a handful of transactions, scattered sparsely across the period. Some customers tend to spend a lot in each transaction; others generally spend little.

In order to estimate distributions for each customer’s transaction rate, dropout rate, average spending, and lifetime value, we use the following Bayesian model of customer lifetime value, based on Fader and Hardie (2003 and 2013).

We have the following data on each customer (a sketch of how these quantities can be derived from a raw event log follows the list):

  • x: total repeat transactions (frequency);
  • t_x: time of most recent transaction (recency);
  • t_cal: time since first transaction; and
  • mx: average transaction amount.
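As a rough sketch (not part of the original pipeline), here is one way these four quantities might be derived from an event log in R. It assumes a hypothetical data frame events with columns customer_id, date, and amount, measures time in approximate months so it lines up with the N_months forecast horizon used later, and treats recency as the time of the last purchase measured from the first purchase, consistent with how t_x and t_cal are used together below.

library(dplyr)

# Hypothetical event log: one row per transaction, with columns
# customer_id, date (class Date), and amount (transaction value).
end_of_period <- max(events$date)

customer_data <- events %>%
  group_by(customer_id) %>%
  summarise(
    first_purchase = min(date),
    x     = n() - 1,                                          # repeat transactions (frequency)
    t_x   = as.numeric(max(date) - first_purchase) / 30,      # recency: months from first to last purchase
    t_cal = as.numeric(end_of_period - first_purchase) / 30,  # months from first purchase to end of period
    mx    = mean(amount)                                      # average transaction amount
  )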

We define the following relationships for each customer:

  • X(t) ~ Poisson(λt). While active, the number of transactions a customer makes in a period of length t follows a Poisson distribution with rate parameter λ;
  • τ ~ Exponential(μ). Each customer has an active period of length τ, which follows an Exponential distribution with dropout rate μ (density μe^(−μτ)); and
  • mx ~ Gamma(px, vx). Each of a customer’s individual transaction amounts follows a Gamma distribution with shape parameter p and scale parameter v, so the average of their x transaction amounts, mx, follows a Gamma distribution with shape parameter px and scale parameter vx.

We define the following relationships for all customers:

  • λ ~ Gamma(rr, α). All customers’ transaction rates are drawn from a common Gamma distribution;
  • μ ~ Gamma(ss, β). All customers’ dropout rates are drawn from a common Gamma distribution; and
  • v ~ Gamma(q, y). The scale parameters of customers’ spending distributions are drawn from a common Gamma distribution. (The full hierarchy is summarized just after this list.)
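Written out compactly, with i indexing customers and the same parameter names as above, the full hierarchy is:

$$
\begin{aligned}
X_i(t) \mid \lambda_i &\sim \mathrm{Poisson}(\lambda_i t), & \lambda_i &\sim \mathrm{Gamma}(rr, \alpha),\\
\tau_i \mid \mu_i &\sim \mathrm{Exponential}(\mu_i), & \mu_i &\sim \mathrm{Gamma}(ss, \beta),\\
m_{x,i} \mid p, v_i &\sim \mathrm{Gamma}(p\,x_i,\; v_i\,x_i), & v_i &\sim \mathrm{Gamma}(q, y),
\end{aligned}
$$

with hyperpriors on rr, α, ss, β, p, q, and y specified in the Stan model block below.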

Below is a full implementation of this model in Stan. We take advantage of all of Stan’s “blocks” to declare and transform our data and parameters, define our prior and likelihood functions, and generate simulations from our model.

In the functions block, we combine the transaction rate and dropout rate distributions into a custom likelihood function.

// Likelihood function
functions {
  // Custom log likelihood combining each customer's transaction rate (lambda)
  // and dropout rate (mu), marginalizing over the unobserved dropout time.
  vector llh(vector x, vector t_x, vector t_cal,
             vector lambda1, vector mu1,
             vector log_lambda, vector log_mu,
             vector log_mu_lambda, int N) {
    vector[N] p_1;
    vector[N] p_2;
    // Term involving recency (time of the last observed transaction, t_x)
    p_1 = x .* log_lambda + log_mu - log_mu_lambda - t_x .* mu1 - t_x .* lambda1;
    // Term involving the full observation window (t_cal)
    p_2 = x .* log_lambda + log_lambda - log_mu_lambda - t_cal .* mu1 - t_cal .* lambda1;
    // Elementwise log of the sum of the two terms
    return(log(exp(p_1) + exp(p_2)));
  }
}

In the data and transformed data blocks, we specify the data that will be used in the model and perform a simple transformation so we have access to total transactions and repeat transactions for each customer.

data {
  int<lower = 1> N;            // Number of customers
  int<lower = 0> N_months;     // Number of months for ltv calibration
  vector<lower = 0>[N] x;      // Repeat transactions per customer (frequency)
  vector<lower = 0>[N] t_x;    // Time of most recent transaction (recency)
  vector<lower = 0>[N] t_cal;  // Time since first transaction
  vector[N] mx;                // Average transaction amount
}
transformed data {
  vector<lower = 0>[N] x_tot;  // Total number of transactions per customer
  x_tot = x + 1;               // Repeat transactions plus the initial purchase
}

In the parameters and transformed parameters blocks, we declare all the parameters that we will estimate in the model: seven population-level parameters plus three for each customer, or 3N + 7 in total. We also create some transformations of those parameters for modeling. We’ll need the logs of some parameters to fit the model, so we take advantage of Stan’s vectorized operations to take the logs of each parameter vector all at once. Then we feed the parameters and data through the custom likelihood function.

parameters {
  vector<lower = 0>[N] lambda; // Transaction rate (customer specific)
  vector<lower = 0>[N] mu;     // Dropout rate (customer specific)
  real<lower = 0> rr;          // Transaction shape parameter
  real<lower = 0> alpha;       // Transaction scale parameter
  real<lower = 0> ss;          // Dropout shape parameter
  real<lower = 0> beta;        // Dropout scale parameter
  real<lower = 0> p;           // Shape of transaction amount gamma
  vector<lower = 0>[N] v;      // Scale of transaction amount gamma (customer specific)
  real<lower = 0> q;           // Shape of scale dist gamma
  real<lower = 0> y;           // Scale of scale dist gamma
}
transformed parameters {
  vector[N] log_lambda;
  vector[N] log_mu;
  vector[N] log_mu_lambda;
  vector[N] log_lik;
  vector<lower = 0>[N] px;     // Shape of each customer's average-spend distribution
  vector<lower = 0>[N] nx;     // Scale of each customer's average-spend distribution
  px = p * x_tot;
  nx = v .* x_tot;
  log_lambda = log(lambda);
  log_mu = log(mu);
  log_mu_lambda = log(mu + lambda);
  log_lik = llh(x, t_x, t_cal, lambda, mu, log_lambda, log_mu, log_mu_lambda, N);
}

In the model block, we specify prior and hyperprior distributions for our model and add the likelihood contributions for the rates and for spending to the log posterior.

model {
  // Priors for rates
  lambda ~ gamma(rr, alpha);
  mu ~ gamma(ss, beta);
  rr ~ exponential(1);
  alpha ~ exponential(1);
  ss ~ exponential(1);
  beta ~ exponential(1);
  // Likelihood for rate
  target += log_lik;
  // Priors for spend
  p ~ exponential(0.1);
  q ~ exponential(0.1);
  y ~ exponential(0.1);
  v ~ gamma(q, y);
  // Likelihood for spend
  mx ~ gamma(px, nx);
}

Finally, in the generated quantities block, we generate full distributions for some quantities of interest for each customer in our data set, such as the probability that they are still active, their expected number of transactions over a given time period, their per-transaction spending, and their total value.

generated quantities {
  vector[N] p_alive;   // Probability that they are active
  vector[N] exp_trans; // Expected number of transactions
  vector[N] mx_pred;   // Per transaction spend
  vector[N] lt_val;    // Lifetime value
  for(i in 1:N) {
    p_alive[i] = 1 / (1 + mu[i] / (mu[i] + lambda[i])
                        * (exp((lambda[i] + mu[i]) * (t_cal[i] - t_x[i])) - 1));
    exp_trans[i] = (lambda[i] / mu[i]) * (1 - exp(-mu[i] * N_months));
    mx_pred[i] = gamma_rng(px[i], nx[i]);
  }
  lt_val = exp_trans .* mx_pred;
}

We estimate the model with Hamiltonian Monte Carlo (HMC) by running four chains of 3000 iterations each.
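In R, the fitting call might look like the sketch below (using rstan and assuming the Stan program above is saved as clv_model.stan, a placeholder filename, with the per-customer vectors prepared as in the earlier data sketch; N_months = 96 is one way to encode a roughly 8-year horizon in months).

library(rstan)

# Data list matching the Stan program's data block.
stan_data <- list(
  N        = nrow(customer_data),
  N_months = 96,                  # forecast horizon in months (roughly 8 years)
  x        = customer_data$x,     # repeat transactions (frequency)
  t_x      = customer_data$t_x,   # recency
  t_cal    = customer_data$t_cal, # time since first transaction
  mx       = customer_data$mx     # average transaction amount
)

# Four chains of 3,000 iterations each, run in parallel.
fit <- stan(
  file   = "clv_model.stan",      # placeholder path to the model listed above
  data   = stan_data,
  chains = 4,
  iter   = 3000,
  cores  = 4
)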

RESULTS

This model can take a bit of time to run on large datasets, but it yields a lot of information. For each customer, we estimate full distributions for their total number of transactions and total spending, as well as their churn probability, transaction rates, and dropout rates. The plot below displays the distributions of churn probabilities for all our customers.
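Continuing the hypothetical rstan fit from above, the per-customer posterior draws can be pulled out and summarized along these lines:

# Posterior draws for the generated quantities (one column per customer).
draws <- rstan::extract(fit, pars = c("p_alive", "exp_trans", "lt_val"))

# Posterior mean churn probability per customer (taken here as 1 - P(alive)).
churn_prob <- 1 - colMeans(draws$p_alive)

# 10th, 50th, and 90th percentiles of expected future transactions per customer.
trans_quantiles <- apply(draws$exp_trans, 2, quantile, probs = c(0.1, 0.5, 0.9))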

We can quickly identify which customers have churned or are at a high risk of churning, and which are probably still active. In this dataset, many customers have either clearly churned or are probably still active, but many others sit somewhere in between: they are at risk of churning and may require some additional marketing efforts or promotions.

Next we look at the distributions of future transactions for each customer, which incorporate estimates of both their transaction rates and their dropout rates. Customers with the highest number of expected future transactions tend to be those with high transaction rates and low dropout rates.

In the image above, we can see that most customers are expected to make few if any transactions, but some may make a lot. It’s tough to see what’s going on with the customers at the top of the plot, because the distributions are so wide. In the plot below, we zoom in on some to see what the distributions look like. The most valuable customers are projected to make 40–50 transactions, and may make over 100.

Next, we look at the distributions of average per-transaction spending for each customer, displayed in the plot below. In this dataset, the vast majority of customers typically spend less than $100 per transaction, but some may spend up to $500.

Finally, we put everything together to obtain distributions of potential ‘lifetime value’ for each customer. For this model, we have calculated potential spending over the next 8 years, but we could set the time-frame to one month, 50 years, or whatever best fits our business needs.

We can quickly identify our potentially most valuable customers, whose value distributions are centered over $1000 with tails that reach past $1500. It might be worth spending a little money on marketing or discounts to make sure these customers stay loyal. Having full distributions for each customer is helpful because it gives us a sense of best and worst case scenarios overall, and lets us quantify our uncertainty about each customer.
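As a rough sketch, reusing the draws object from the previous snippet, one way to surface those customers is to rank them by posterior median lifetime value while keeping an upper quantile as a best-case figure:

# Summarize each customer's lifetime-value distribution and rank by the median.
ltv_summary <- data.frame(
  customer = seq_len(ncol(draws$lt_val)),
  median   = apply(draws$lt_val, 2, median),
  q90      = apply(draws$lt_val, 2, quantile, probs = 0.9)  # best-case tail
)
top_customers <- head(ltv_summary[order(-ltv_summary$median), ], 20)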

NEXT STEPS

Having more data on customers beyond their purchase histories can let us expand the model in important ways. For example, we can use demographic data such as age, gender, and geography to create predictive models of lifetime value for hypothetical customers whom we haven’t seen yet. Then we can identify which customer segments are likely to have the greatest lifetime value, and develop marketing and sales plans accordingly.

APPENDIX

Data simulation and sample modeling code are available at https://github.com/alexpavlakis/clv
