Notes from Industry

Forecasting with cohort-based models

An alternative to time series models for forecasting paid subscriptions

Nicolai Vicol
Towards Data Science
12 min read · Nov 17, 2021



TLDR

A company offering subscriptions (e.g. Wix, Spotify, Dropbox, Grammarly) can forecast its future paid subscriptions using time series models, like ARIMA or Prophet. These models are trained on time series data containing subscriptions by date.

An interesting alternative is to reformat the data to have the subscriptions by users’ registration dates and purchase dates, basically transforming the time series data into tabular data. This makes it possible to apply regression models, like GLM or GBM, which often produce better forecasts and also offer additional insights regarding the attribution of future subscriptions to cohorts of users. These models are called cohort-based models.

What is a Cohort?

By dictionary definition, a cohort is a “group of people with a shared characteristic, usually age”. In our case, the users registered on a given date represent a cohort. For example, the “cohort of 2019–01–01” consists of all users registered on 2019–01–01. Likewise, the “cohort of 2019” includes all users registered during 2019.

A few more definitions before we go further:

  • Registration date: the date when the user registered;
  • Upgrade date (Purchase date): the date when the user purchased a premium subscription;
  • Age of a cohort/user: days since the registration date;
  • Premiums: paid subscriptions. A user first registers, then buys a subscription, sometimes on the same day, other times only after using the product at no cost for a while. Many companies, like Wix, Spotify, and Dropbox, have a “freemium” business model or offer free trial periods for their product.

The figure below illustrates the number of premiums generated by one hypothetical cohort registered on 2019–01–01, during the first 30 days after registration.

Figure 1. Premiums by upgrade date & age for the cohort of the registration date of 2019–01–01. (Image by author)

Cohorts Behave (Almost) Similarly

Usually, cohorts behave similarly. When plotting the premiums of multiple cohorts of different registration dates by upgrade date, we can observe that they have similar shapes.

Figure 2. Cohorts of different registration dates by upgrade date — first 30 days. (Image by author)

The similarity is more evident when we plot the same cohorts by age instead of dates.

Figure 3. Cohorts of different registration dates by age — first 30 days. (Image by author)

Once again, if we plot the same cohorts by age, but now for 365 days after registration, we can observe some long tails, meaning that cohorts generate premiums long after registration.

Figure 4. Cohorts of different registration dates by age — first 365 days. (Image by author)
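As a side note, plots like the ones in Figures 2 to 4 take only a few lines of pandas and matplotlib. Here is a minimal sketch, assuming a hypothetical table with one row per (registration date, upgrade date) pair and a premiums count; the file and column names are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: premiums per registration date and upgrade date
df = pd.read_csv("premiums_by_cohort.csv", parse_dates=["registration_date", "upgrade_date"])

# Age of the cohort at the moment of the upgrade, in days
df["age"] = (df["upgrade_date"] - df["registration_date"]).dt.days

fig, (ax_date, ax_age) = plt.subplots(1, 2, figsize=(12, 4))
for reg_date, one_cohort in df.groupby("registration_date"):
    # The same cohorts, plotted by upgrade date (Figure 2) and by age (Figures 3-4)
    one_cohort.plot(x="upgrade_date", y="premiums", ax=ax_date, legend=False)
    one_cohort.plot(x="age", y="premiums", ax=ax_age, legend=False)

ax_date.set_title("Premiums by upgrade date")
ax_age.set_title("Premiums by age (days since registration)")
plt.show()
```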

There are a few important characteristics to be observed in the figures above:

  • More premiums are generated in the first days after registration;
  • The rate at which new users purchase premiums declines rapidly as they “get older”. The decline is highly non-linear, resembling a power-law relation;
  • A substantial number of subscriptions are purchased long after users registered (the “long tails”).

Forecasting Premiums by Cohorts

Let’s pretend that today is 2020–01–01 and we want to forecast new premiums for the 90 days ahead. These premiums will come from existing cohorts (of users registered until today) and future cohorts (of users registered during the quarter starting tomorrow).

Recent cohorts

First, let’s take all existing cohorts registered during the last 365 days and call them recent.

The threshold of 365 days is chosen arbitrarily here. For some companies, depending on the tail, recent cohorts are those registered during the last 90 days; for others, it can be two years. The idea is to break down existing cohorts into recent and old and apply different prediction models to each. We will return to this later.

When plotting the recent cohorts, they may look like the figure below. Our task is to forecast beyond today’s date, marked in red. The lines to the right of the red mark are unknown to us. That’s what we want to predict.

Figure 5. Recent cohorts. (Image by author)

Forecasting premiums produced by recent cohorts means extrapolating the “tails” of these cohorts into the future. For example, as you can see in the figure below, for an existing cohort registered on 2019–12–15, we have to come up with the red dotted line to guess the true grey line, which is unknown to us but can hopefully be learned from older cohorts. The existing recent cohorts are also the easiest to forecast because we know more about them: most importantly, we know their size and their early dynamics.

Figure 6. Actual premiums of a recent cohort as of today, then the actuals and the forecast afterward. (Image by author)

Future cohorts

We’ll also have to come up with an estimate for future cohorts, born after today. These are marked in blue in the figure below. We don’t know much about future cohorts, except perhaps that we’ll have a new cohort every day during the next 90 days. The cohort born tomorrow will have 90 days to generate premiums, while the cohort registered on the last day of the forecast period will have only one day to generate premiums. Hopefully, these cohorts share the same features as past cohorts, and drawing the blue shapes is an exercise closer to data science than it is to painting.

Figure 7. Recent and future cohorts. (Image by author)

Old cohorts

Unless our product (or company) is younger than one year (the threshold we selected for recent), we will also have old cohorts. These are existing cohorts, registered before the recent ones and long before the forecast period starts. Let’s add them to the plot too and mark them in orange. They appear as a multitude of overlapping lines slightly above zero. There can be many of them: for example, if the history of the product starts in 2010, there will be about 3,285 cohort lines (9 years * 365 registration dates). Despite their small daily numbers, the total premiums generated by old cohorts can account for a substantial portion of total revenue.

Figure 8. Old, recent, and future cohorts. (Image by author)

Let’s now aggregate all cohorts by upgrade date and plot their totals. These are some nice-looking time series, as you can see in the figure below. A few observations to make here:

  • Premiums from recent cohorts drop over time (remember the power-law decay by age).
  • Future cohorts account for an increasing share of total future premiums as we move further into the forecast period.
  • Old cohorts may account for a substantial part of the total.

Figure 9. Total premiums by old, recent, and future cohorts. (Image by author)

Let’s go one step further and sum up all three parts: old, recent, and future. This gets us the time series of total premiums, our target. That is the bold black line in the figure below.

Figure 10. Total of totals of premiums by old, recent, and future cohorts. (Image by author)

The technique described above is exactly what we do to forecast premiums. We break down cohorts into old, recent, and future, and for each part we apply a separate regression model. We do that because these parts differ in their distributions and in the features available. Each of these models predicts premiums for many cohorts (registration dates), and for each cohort it predicts premiums for many upgrade dates in the future. We then aggregate the predictions of each model by upgrade date to obtain a time series for each part: old, recent, and future. In the end, we sum up all three parts to obtain the total premiums by upgrade date. That is exactly what a time series model would give us: premiums by future dates. The cohort-based approach arrives at the same output in a more complicated way, but the extra complexity pays off in forecast accuracy and in additional insights about users.
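As a rough illustration of that final aggregation step, here is a minimal sketch in pandas. The DataFrames pred_old, pred_recent, and pred_future stand in for the outputs of the three models; the column names are hypothetical placeholders, not taken from the author’s code.

```python
import pandas as pd

def total_by_upgrade_date(pred: pd.DataFrame) -> pd.Series:
    """Collapse cohort-level predictions (one row per registration date x
    upgrade date) into a single time series indexed by upgrade date."""
    return pred.groupby("upgrade_date")["premiums_pred"].sum()

# Tiny placeholder outputs of the three regression models (old, recent, future)
pred_old = pd.DataFrame({"upgrade_date": ["2020-01-01", "2020-01-02"], "premiums_pred": [5.0, 4.5]})
pred_recent = pd.DataFrame({"upgrade_date": ["2020-01-01", "2020-01-02"], "premiums_pred": [40.0, 35.0]})
pred_future = pd.DataFrame({"upgrade_date": ["2020-01-01", "2020-01-02"], "premiums_pred": [0.0, 12.0]})

# Total premiums by upgrade date: the same output a time series model would produce
ts_total = (
    total_by_upgrade_date(pred_old)
    .add(total_by_upgrade_date(pred_recent), fill_value=0)
    .add(total_by_upgrade_date(pred_future), fill_value=0)
)
print(ts_total)
```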

In conclusion, the cohort-based approach transforms the time-series forecasting task into a regression task.

Time-series properties of the target to forecast

The time series of totals has some interesting properties that we need to model:

  • seasonality (weekly, yearly)
  • holidays (e.g. Christmas, Independence Day, Easter)
  • sales spikes during special events (e.g. Black Friday, Cyber Monday)

Figure 11. Time-series properties: seasonality, holidays. (Image by author)

And of course, when zooming out a bit, we may discover that there is also a trend that needs to be included in our models (see figure below).

Figure 12. Time-series properties: trend. (Image by author)

Regression Models for Cohort-Based Forecasting

Before we go further, let’s recall the usual time-series method. We have the target we want to forecast by dates, in the form of a time series. A time series is a sequence of date-value pairs. We may also have exogenous variables which can be added to the model. The data may look like the table below.

Table 1. Time series data. (Image by author)

The usual candidate models to employ are Prophet, Holt-Winters, SARIMAX, LSTM, X11, SEATS, etc.

In contrast, the data for cohort models have a double key, as we represent the target by both registration dates and upgrade dates. Each cohort (registration date) has many upgrade dates: a Cartesian product of registration dates and upgrade dates. To these keys, we can join various features like age, events, holidays, seasonality terms, etc. Some are joined by upgrade date (e.g. holidays), others by registration date (e.g. size of the cohort), or even by both keys: the age feature, for example, is computed as the difference between the upgrade date and the registration date. The resulting tabular data has more columns and more rows than the time series data.

Table 2. Cohort data. (Image by author)
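The sketch below shows one way such a cohort table could be assembled with pandas, under illustrative assumptions: the date ranges, the feature names (age, dow, is_black_friday), and the cohort_sizes table are hypothetical placeholders.

```python
import pandas as pd

reg_dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
upg_dates = pd.date_range("2019-01-01", "2020-03-31", freq="D")

# Cartesian product of registration dates and upgrade dates,
# keeping only pairs where the upgrade happens on or after registration
cohort = pd.MultiIndex.from_product(
    [reg_dates, upg_dates], names=["registration_date", "upgrade_date"]
).to_frame(index=False)
cohort = cohort[cohort["upgrade_date"] >= cohort["registration_date"]]

# Feature computed from both keys
cohort["age"] = (cohort["upgrade_date"] - cohort["registration_date"]).dt.days

# Features joined by upgrade date (e.g. seasonality terms, special events)
cohort["dow"] = cohort["upgrade_date"].dt.dayofweek
cohort["is_black_friday"] = cohort["upgrade_date"] == pd.Timestamp("2019-11-29")

# Features joined by registration date (e.g. size of the cohort) would be merged in,
# e.g.: cohort = cohort.merge(cohort_sizes, on="registration_date", how="left")
```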

The beautiful thing is that we can apply any regression model to this type of data. Let’s consider a few.

Generalized Linear Models (GLM)

Pros: 1) interpretability & simplicity; 2) extrapolation of the trend when including time as a feature; 3) supporting non-normal distributions from the exponential family (Poisson, Gamma, Tweedie).

Cons: 1) manual feature engineering (good to use: splines for non-linearities, tensor products for interactions); 2) few Python packages offer all the distributions and regularization options; 3) sensitive to the choice of distribution and link function.

A Generalized Linear Model is a linear model that allows the target to have a distribution of errors other than normal. A GLM allows modeling the non-linear relation between features and the target via a link function. For example, we can assume that the conditional mean of our target follows a Poisson distribution and use a log-link function. The formula may look like the following:

log E[premiums | X] = β0 + β1·x1 + … + βk·xk, where premiums are assumed to follow a Poisson distribution and the log link connects the linear predictor to the conditional mean.

In Python, there are at least two good libraries for GLMs: statsmodels and scikit-learn. In R, there is the famous mgcv package and the glm method from stats. See below a code stub using statsmodels.

Code example for GLM (source: author’s GitHub)
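The original stub lives on the author’s GitHub; the following is only a minimal sketch in the same spirit, assuming a hypothetical cohort table with illustrative columns (premiums, cohort_size, and the dates needed to derive age and day of week).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical cohort table: one row per (registration_date, upgrade_date) pair,
# with a premiums count and a cohort_size column joined by registration date
cohort = pd.read_csv("cohort_table.csv", parse_dates=["registration_date", "upgrade_date"])
cohort["age"] = (cohort["upgrade_date"] - cohort["registration_date"]).dt.days
cohort["dow"] = cohort["upgrade_date"].dt.dayofweek

# Poisson GLM with the (default) log link; during validation the family could be
# swapped for sm.families.Gamma() or sm.families.Tweedie(var_power=1.5)
model = smf.glm(
    formula="premiums ~ np.log1p(age) + np.log1p(cohort_size) + C(dow)",
    data=cohort,
    family=sm.families.Poisson(),
).fit()

print(model.summary())
cohort["premiums_pred"] = model.predict(cohort)
```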

Gradient Boosted Machines (GBM)

Pros: 1) handling non-linearities and interactions by design; 2) little or no feature engineering is necessary; 3) supporting non-normal distributions from the exponential family (Poisson, Gamma, Tweedie); 4) less sensitive to the choice of distribution; 5) many good libraries to choose from (e.g. LightGBM, XGBoost, CatBoost).

Cons: 1) can’t extrapolate on unseen data, e.g. can’t extrapolate the trend;

Gradient Boosted Machines are powerful models that perform well in our case, particularly if the data has no trend or only a weak trend. That is the only drawback of GBM in our case: it can’t extrapolate on unseen data, which implies that it can’t extrapolate the trend either. The resulting error becomes more significant for longer forecast horizons.

There are several good libraries for GBM: LightGBM, XGBoost, CatBoost. See below a code stub for lightgbm.

Code example for GBM (source: author’s GitHub)
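Again, the original stub is on the author’s GitHub; below is a minimal hedged sketch with LightGBM on the same hypothetical cohort table and illustrative feature names.

```python
import lightgbm as lgb
import pandas as pd

# Hypothetical cohort table, same shape as in the GLM sketch above
cohort = pd.read_csv("cohort_table.csv", parse_dates=["registration_date", "upgrade_date"])
cohort["age"] = (cohort["upgrade_date"] - cohort["registration_date"]).dt.days
cohort["dow"] = cohort["upgrade_date"].dt.dayofweek

features = ["age", "cohort_size", "dow"]
train = cohort[cohort["upgrade_date"] < "2020-01-01"]
future = cohort[cohort["upgrade_date"] >= "2020-01-01"]

# The Poisson objective mirrors the GLM assumption; "gamma" or "tweedie"
# (with tweedie_variance_power) are the other candidates to validate
model = lgb.LGBMRegressor(objective="poisson", n_estimators=500, learning_rate=0.05)
model.fit(train[features], train["premiums"])

future = future.assign(premiums_pred=model.predict(future[features]))
```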

A Few Important Recommendations

Choose the right distribution

This is particularly important in the case of GLMs. The target is often very non-normal, with a pronounced right skew. The distribution often belongs to the exponential family: it can be Poisson, Gamma, or more generally Tweedie (with a variance power between 1 and 2). Try each of them through validation. As the figure below shows, the target remains highly skewed even after a log transformation, so the usual trick of normalizing the target with a log or Box-Cox transformation and assuming normality is not sufficient.

Figure 13. Distribution of target: raw and transformed. (Image by author)
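One way to run that comparison is a simple time-based validation loop over candidate families. The sketch below uses statsmodels with hypothetical column names, an illustrative formula, and an arbitrary validation split.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical cohort table; the last quarter of history is held out for validation
cohort = pd.read_csv("cohort_table.csv", parse_dates=["upgrade_date"])
train = cohort[cohort["upgrade_date"] < "2019-10-01"]
valid = cohort[cohort["upgrade_date"] >= "2019-10-01"]

families = {
    "poisson": sm.families.Poisson(),
    "tweedie_1.3": sm.families.Tweedie(var_power=1.3),
    "tweedie_1.7": sm.families.Tweedie(var_power=1.7),
    # Gamma requires a strictly positive target, so it only applies if zero rows are dropped
}

for name, family in families.items():
    fit = smf.glm("premiums ~ np.log1p(age) + C(dow)", data=train, family=family).fit()
    pred = fit.predict(valid)
    # Clip the denominator to avoid dividing by zero on rows without premiums
    mape = np.mean(np.abs(pred - valid["premiums"]) / np.clip(valid["premiums"], 1, None))
    print(f"{name}: validation MAPE = {mape:.3f}")
```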

Engineer features for the non-linear decay by age

Usually, “age” is the most important feature as it describes the non-linear decay of premiums — the main source of variation. As you can see in the figure below, there is an approximately linear relation between log(Premiums) and log(Age), hinting at a power-law relation between raw values.

Figure 14. Almost linear relation between log(Premiums) and log(Age), confirming the power-law relation between raw values. (Image by author)

This non-linearity can only be introduced into a linear model through a transformation of the raw feature of “age”. This is again particularly important for GLMs. GBMs will take care of this non-linearity automatically and the transformation of age may not necessarily help.

In the case of GLMs, if a log transformation is not sufficient (often it’s not), I would recommend using B-splines. Sometimes it’s even better to use a Generalized Additive Model (GAM) instead, because splines are part of the algorithm. In the figure below we have a set of multicolored B-splines, which are added together to form the bold red line and approximate the non-linear decay properly.

Figure 15. Model the non-linear decay with B-splines. (Image by author)

There are a few Python libraries offering B-splines: statsmodels, scikit-learn, patsy, pygam. The splines are added to the linear model as a set of features replacing the raw age. Each of these spline features gets a coefficient so that their sum forms a nice-looking shape like the red line in the figure above. The wiggliness should be controlled by imposing a penalty when fitting. In GAM models, it’s possible to impose penalties on derivatives too. See below how we can construct a set of B-splines with statsmodels.

Code example for B-splines (source: author’s GitHub)
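The original stub is on the author’s GitHub; the minimal sketch below shows one way to get spline features for age inside a statsmodels GLM formula (via patsy’s bs()). The degrees of freedom and column names are illustrative.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical cohort table with premiums, age and day-of-week columns
cohort = pd.read_csv("cohort_table.csv", parse_dates=["upgrade_date"])

# bs(age, df=8) replaces the raw age column with 8 spline basis columns; each basis
# gets its own coefficient, and their weighted sum approximates the decay curve
model = smf.glm(
    "premiums ~ bs(age, df=8, degree=3) + C(dow)",
    data=cohort,
    family=sm.families.Poisson(),
).fit()

print(model.params.filter(like="bs(age"))
# For an explicit wiggliness penalty, statsmodels also offers a GAM
# (GLMGam with a BSplines smoother and an alpha penalty term).
```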

Choose the right cut-off between old and recent cohorts

In this text, we have chosen the cut-off age between old and recent cohorts to be 365 days. This threshold should be treated as a hyper-parameter to calibrate. A good rule of thumb is to start by plotting multiple existing cohorts by age. You will get a picture similar to the one below.

Figure 16. Choosing the cut-off age between old and recent. (Image by author)

We want to find the spot where the dependency of premiums on age vanishes and becomes constant. Start with that value, then move left and right through a grid search to find the threshold that ensures the best separation between old and recent cohorts. The best separation is the one that yields the lowest MAPE for the aggregate model (old + recent + future). This is important because we apply different models to recent and old cohorts. A higher cut-off age designates more cohorts as recent and fewer as old; in other words, the weight of the recent model within the aggregate model increases as we increase the threshold.
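The grid search itself is straightforward. The sketch below is purely illustrative: fit_and_forecast() is a hypothetical helper standing in for the full old/recent/future pipeline, and the candidate thresholds are arbitrary.

```python
import numpy as np

def mape(actual, forecast):
    # Clip the denominator to avoid dividing by zero on days without premiums
    return np.mean(np.abs(forecast - actual) / np.clip(actual, 1, None))

candidate_cutoffs = [90, 180, 270, 365, 540, 730]  # days
scores = {}
for cutoff in candidate_cutoffs:
    # Hypothetical helper: split cohorts into old/recent/future at `cutoff`,
    # fit the three models, and return the aggregated forecast and the actuals
    forecast, actual = fit_and_forecast(cutoff_age=cutoff)
    scores[cutoff] = mape(actual, forecast)

best_cutoff = min(scores, key=scores.get)
print(best_cutoff, scores[best_cutoff])
```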

Forecast old cohorts using a time series model

Continuing the previous recommendation, there are cases when it’s better to forecast the old cohorts with a time series model. This may happen for two reasons:

  • Cohort-level features joined by registration date become insignificant for the model. For example, age becomes irrelevant (remember the flat tails), and other features like cohort size (the number of users registered) can be dropped without losing forecast precision.
  • There is a long history for old cohorts and the volume of data becomes too large to handle. Let’s say we have 10 years of history. The recent cohorts take one year of that, and the model for old cohorts takes 9 years. The table for old cohorts will then have about 5,397,255 rows: 9 years * 365 registration dates, with 1,643 upgrade dates on average each (an arithmetic sum, since the first registration date has 9*365 upgrade dates, while the last has only one).

Use rolling walk-forward validation

Even though we deal with tabular data, we must respect its time-series nature. We must avoid look-ahead bias, meaning that we must always train on the past and forecast into the future. A good approach is to perform a rolling walk-forward validation by upgrade date. This technique is illustrated in the figure below.

Figure 17. Rolling walk forward validation by upgrade date. (Image by author)
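A minimal sketch of such a loop, assuming a hypothetical cohort table and monthly validation folds (the split dates and the 90-day horizon are illustrative):

```python
import pandas as pd

# Hypothetical cohort table keyed by upgrade date
cohort = pd.read_csv("cohort_table.csv", parse_dates=["upgrade_date"])

folds = pd.date_range("2019-07-01", "2019-12-01", freq="MS")  # one fold per month
horizon = pd.Timedelta(days=90)

for split_date in folds:
    # Train strictly on the past, validate on the next 90 days of upgrade dates
    train = cohort[cohort["upgrade_date"] < split_date]
    valid = cohort[
        (cohort["upgrade_date"] >= split_date)
        & (cohort["upgrade_date"] < split_date + horizon)
    ]
    # model = fit_model(train)        # hypothetical fit routine
    # score = evaluate(model, valid)  # e.g. MAPE on totals by upgrade date
    print(split_date.date(), len(train), len(valid))
```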

Synthetic Data

Normally, you won’t find public historical data on a company’s premium subscriptions. This is financial data, meaning it’s sensitive and not made public. That is also why you won’t find many articles on how to forecast premium subscriptions. For all the figures in this article, I have used synthetic data, generated to mimic the main properties observed while working with my company’s data at Wix.com. You can generate the same synthetic data with this Python script.

Special thanks to my colleague Nicolas Marcille who initially started the work on cohort models at Wix.com.
