Modeling customers’ churn? Start here

Netta Shachar
May 19, 2020 · 6 min read

One of the most important KPIs for any business is the customer retention rate, i.e. the percentage of customers who will purchase again. The higher the retention rate — the happier the business owner.

We often think of churn rate as the complementary event of the retention rate: churn rate = 1 − retention rate

This makes sense when the business sells a subscription: if a customer didn’t renew her subscription — she churned. However, when individual products are sold, this perspective is inaccurate: a customer who hasn’t made an additional purchase yet may still purchase tomorrow! This means that churn is not directly observable, a crucial point when building a churn prediction model.

At Yotpo, we recently introduced a churn prediction capability for our “Loyalty & Referral” clients. This post describes three modeling frameworks we considered (classification, BTYD and survival models), and shares insights from our journey towards a production model.

But, as with any data science problem, first — the data.

Preparing the learning data

Since our clients are e-commerce stores, we assume throughout that the data comes from such a store. Nevertheless, the ideas and principles presented in this post can easily be extended to other domains, such as gaming.
Purchase history data, from which you infer each customer’s timeline of purchase events, is the starting point of any churn model.

When working with temporal data, you need to ask yourself (and answer) two questions:

  1. How much history (what’s the look-back window) should I use for the learning data? While there is no magic number, these guidelines helped us choose an appropriate window: a) Allow enough customers to make multiple purchases, i.e. if customers purchase once every 3 months on average, don’t use a 3-month window b) If your store underwent significant changes, e.g. turning the store fully vegan, don’t use data collected prior to the change c) If your store has seasonal effects, include all “seasons” in your data, or account for them otherwise.
    Since Yotpo’s solution has to work for various stores, we considered windows ranging between two months and two years. Seasonal effects, including “special events” such as Black Friday, were addressed using a unique sampling technique on customers’ events.
  2. How should I split the data into train/test? Randomly splitting data into train/test (and validation, if necessary) sets is a bad idea when dealing with temporal data: if test instances occur prior to train instances, your test predictions are based on the future! Be sure to split chronologically, so that validation occurs after train and the test set comes last.
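As a minimal sketch of a chronological split (the orders frame, its column names and the cutoff dates below are hypothetical):

```python
import pandas as pd

# Hypothetical purchase-history data: one row per order.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_date": pd.to_datetime(
        ["2019-01-05", "2019-04-02", "2019-02-10", "2019-11-20", "2019-08-01"]
    ),
})

# Chronological cutoffs: everything up to train_end is train,
# then validation, and the most recent data is test.
train_end = pd.Timestamp("2019-06-30")
valid_end = pd.Timestamp("2019-09-30")

train = orders[orders.order_date <= train_end]
valid = orders[(orders.order_date > train_end) & (orders.order_date <= valid_end)]
test = orders[orders.order_date > valid_end]
```

This guarantees that every validation instance is later than all training instances, and every test instance is later still.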


Given purchase history data, the most commonly used features for churn predictions are:
1. Frequency — number of purchases in the specified time period
2. Recency — time since last purchase

Other features/data sources can — and should — be considered, if available. Among the features we explored at Yotpo are: average time between purchases, average purchase amount, newsletter subscription indicator, and others.
You can also get some “feature inspiration” from this paper by Asos, discussing a closely related problem called Lifetime Value (LTV).
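As an illustration of deriving the two basic features from raw orders (the frame, its column names and the snapshot date are our own, not a Yotpo schema):

```python
import pandas as pd

# Hypothetical orders; "snapshot" is the end of the look-back window.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "order_date": pd.to_datetime(
        ["2019-01-01", "2019-03-01", "2019-06-01", "2019-05-15"]
    ),
})
snapshot = pd.Timestamp("2019-07-01")

# Frequency = number of purchases; recency = days since the last one.
features = orders.groupby("customer_id").agg(
    frequency=("order_date", "count"),
    last_purchase=("order_date", "max"),
)
features["recency_days"] = (snapshot - features["last_purchase"]).dt.days
features = features.drop(columns="last_purchase")
```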

Ok, so now we have learning data and some features. Time to talk about…

Modeling Frameworks

1. Classification models

Churn is binary — a customer either buys again or doesn’t. So, binary classification seems like a natural framework.
Given labeled data of churned (label=1) and active (label=0) customers with predictive features, we can choose an algorithm, split the data and start training!
Sounds like a done deal, but building labeled data from purchase history data requires you to make one key decision:
When should we label a customer as “churned”?
To answer this question, consider two stores: Store A sells fruits & vegetables and Store B sells ski gear. A customer of Store A who hasn’t purchased in 3 months has likely churned, whereas in Store B, 3 months is not enough to assume churn.
So, what’s the “correct” number for your store?
Since it depends on your business — find the answer in your data!
Look at the distribution of time between purchases, and set a “churn” threshold at the tail of the distribution. The exact threshold depends on your desired “false churn”/”false active” ratio. This blog post discusses the classification approach in more detail.
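A sketch of this thresholding heuristic, assuming a small hypothetical orders frame and a 95th-percentile tail cutoff (the quantile itself is the knob that trades off “false churn” against “false active”):

```python
import pandas as pd

# Hypothetical orders; in practice this is your full purchase history.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2019-01-01", "2019-02-01", "2019-03-15", "2019-01-10", "2019-05-10"]
    ),
})

# Time between consecutive purchases, per customer, in days.
gaps = (
    orders.sort_values(["customer_id", "order_date"])
          .groupby("customer_id")["order_date"]
          .diff()
          .dropna()
          .dt.days
)

# Put the "churned" threshold at the tail of the gap distribution;
# customers whose recency exceeds it get label = 1 (churned).
churn_threshold_days = gaps.quantile(0.95)
```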

2. “Buy Till You Die” (BTYD) models

This group of unsupervised models quantifies churn probability by assessing the expected number of future transactions and the probability of being “alive”, using only the frequency and recency features.
Pareto/NBD, BG/NBD and MBG/NBD are the three main models of this framework.
All of these models assume that the number of purchases made by active customers follows a Negative Binomial (NBD) counting process, meaning:

  1. Number of transactions ~ Poisson(λ)
  2. Different customers may have different λ values. We assume a prior on transaction rate: λ ~ Gamma(shape=r, scale=α)

The dropout process varies between models:

Pareto/NBD model assumes:

  1. Customers have an unobserved “lifetime” of length τ: Lifetime ~ Exp(μ)
  2. Different customers may have different μ. We assume a prior on lifetime length: μ ~ Gamma(shape=s, scale=β)

BG/NBD model assumes:

  1. A customer may drop out (churn) only immediately after a repeat transaction. The number of transactions until dropout ~ (shifted) Geom(p)
  2. Different customers may have different p values. We assume a prior on drop out probability: p ~ Beta(a, b)

MBG/NBD model assumptions:
Similar to BG/NBD, except that it also allows customers to drop out immediately after their first transaction.

In addition, all models assume independence between purchase rate and dropout rate.

In practice, Pareto/NBD is rarely used due to computational challenges, while BG/NBD is the most commonly used (this blog gives a classic example).
Having said that, in our e-commerce setting we found that MBG/NBD outperformed BG/NBD thanks to its better modeling of one-time buyers (a significant portion of the customers in many online stores), and both models were outperformed by Pareto/NBD.
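To make the BG/NBD dropout story concrete, here is its standard conditional probability-of-being-alive expression written as plain Python. The parameter values in the example are illustrative, not fitted; in practice a library such as `lifetimes` estimates r, α, a, b from the data.

```python
def bgnbd_p_alive(x, t_x, T, r, alpha, a, b):
    """Probability a customer is still "alive" under the BG/NBD model.

    x    : number of repeat transactions (frequency)
    t_x  : time of the last transaction (recency)
    T    : length of the observation period
    r, alpha : Gamma prior on the transaction rate lambda
    a, b     : Beta prior on the dropout probability p
    """
    if x == 0:
        # In BG/NBD a customer can only drop out right after a repeat
        # purchase, so a zero-repeat customer is alive with probability 1.
        return 1.0
    return 1.0 / (1.0 + (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x))


# Illustrative (not fitted) parameters; time measured in weeks.
r, alpha, a, b = 0.25, 4.41, 0.79, 2.43
recent = bgnbd_p_alive(x=2, t_x=30, T=32, r=r, alpha=alpha, a=a, b=b)
stale = bgnbd_p_alive(x=2, t_x=10, T=32, r=r, alpha=alpha, a=a, b=b)
```

With the same frequency, the customer whose last purchase is more recent gets a higher P(alive): exactly the recency effect the framework captures. Note also the x = 0 branch, which is the BG/NBD treatment of one-time buyers that MBG/NBD relaxes.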

3. Survival models

This last framework is less familiar among data scientists, and for no good reason.
Designed to estimate the time until an event occurs, survival models are often used for evaluating the efficacy of medical treatments (“event” = death) or the lifetime of machines (“event” = failure).
Survival models have characteristics that make them perfect for churn prediction:

  1. They account for censored data points, i.e. they take into consideration that an event that didn’t occur during the “observation period” can occur in the future
  2. They allow different start times for different customers
  3. They output a survival curve for each customer, presenting the “no event” probability at each time point in the future

For a detailed introduction to survival models, we recommend this blog.
The most widely used survival models are statistical models, such as Cox-PH and Weibull regression, but some ML models exist as well — this blog suggests using a neural network, this R library uses gradient boosting machines, and this one uses random forests.
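To illustrate the censoring idea without pulling in a survival library, here is a bare-bones Kaplan-Meier estimator (a sketch with made-up numbers, not what we run in production): customers whose churn event was not observed by the end of the window enter as censored, and still contribute to the at-risk counts instead of being wrongly labeled as churned.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve.

    durations : time until churn, or until the end of observation
    observed  : True if churn was observed, False if censored
    Returns (times, survival) with survival[i] = P(no churn by times[i]).
    """
    times, survival = [], []
    s = 1.0
    for t in sorted(set(durations)):
        # Customers still "at risk" just before time t (censored ones count).
        n_at_risk = sum(1 for d in durations if d >= t)
        # Churn events observed exactly at time t.
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        if events:
            s *= 1 - events / n_at_risk
        times.append(t)
        survival.append(s)
    return times, survival


# Five customers: three observed churns, two censored (still active).
times, surv = kaplan_meier(
    durations=[5, 6, 6, 8, 10],
    observed=[True, True, False, True, False],
)
```

The resulting step function is the population-level analogue of the per-customer survival curves that models like Cox-PH and Weibull regression produce.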

Frameworks’ comparison

So, before we wrap up, let's summarize it all into a single table:

| Framework | Supervised? | Requires a churn definition? | Features | Output |
|---|---|---|---|---|
| Classification | Yes | Yes — a “churned” threshold must be set | Any available features | Churn probability per customer |
| BTYD | No | No | Frequency and recency only | P(“alive”) and expected future transactions |
| Survival | Yes (handles censoring) | No — unobserved churn enters as censored | Any available features (e.g. Cox-PH covariates) | Survival curve per customer |

Final thoughts

As usual in data science, there is no one algorithm to rule them all. And, as usual in data science, your data has the answers. This review has not covered all possible approaches, or given the complete how-to for each approach. Nonetheless, we hope it serves as a good starting point for additional research, and that this post gave you some directions to think of — or even better — focused you on the direction that suits your problem best.

Yotpo Engineering