Predicting Customer Churn
A Modeling Exercise
We’d like to predict, for a given product, whether a user will churn (i.e., stop using it) in the ‘near future’. We can do a number of things with such a prediction:
- Estimate near-future revenue. This is especially useful when there is likely recurring revenue from continued use, such as in a subscription service, or from in-app purchases.
- Proactively take some “remedial action” for users likely to churn.
What should we use as predictors? The user’s recent usage history is a great candidate. While additional predictors can be considered, in this post we will focus on just this one.
Let’s see an example. Imagine the user just installed an iPhone App. Suppose the App logs the number of times each day it is launched. Such data is called a time series. Let f(d) denote the number of times the customer launched the app on day d.
Intuition suggests that a down-trending usage frequency (i.e., f(d) decreasing as d increases) is a sign of ‘losing interest’, and so is likely predictive of future churn. How do we test this intuition? More broadly, we might wonder: how well can churn be predicted from just this type of data? How much data do we need? Will only a few weeks suffice?
Often there are no definitive answers to such questions upfront. The best thing to do is to build models and evaluate — empirically — how well they perform.
Data Science Modeling
First we need data. In our example, this takes the form of f(1), f(2), …, f(k) where f(i) is the number of times the user launched the app on day i. Okay, so how do we predict churn from it? We need the following:
- Data for many users, so that we can learn common patterns across users that are predictive of churn. For example, one hopes to learn that a sufficiently decreasing f(d) as d increases is indeed predictive.
- For each user, the data needs to be labeled with whether the user churned or not.
In more detail, for every user u, we will have the daily usage time series. This will start on the first day the user downloaded the app. This may differ for different users. For our purposes, let’s assume the time series for every user will continue till the present day, even if the user stopped using the app months or years ago. This just means that there may be a long run of zeros on the right tail of a time series. Other than wasting computer memory, this has no detrimental effect.
Labeling The Data
Say the user has not used the product for a while. How long should “a while” be before we deem the user to have churned? Sometimes we just don’t know. So let’s encapsulate our ignorance into a free parameter. Let’s call it inactive duration.
So how do we use this parameter in our modeling? Perhaps we could try out different values (a week, two weeks, a month) and see what effect each has on how well we can predict.
Okay, moving to finer details, how do we use this parameter in our labeling? We label the user as churned if the daily usage count time series ends with at least inactive duration consecutive zeros, and as active if not.
Let’s see an example. Below are the daily usage time series of five users.
U1: 5, 6, 7, 8, 4, 3, 2, 3, 0
U2: 5, 2, 0, 0, 0, 0, 0, 0, 0
U3: 5, 2, 0, 6, 8, 3, 2, 1, 3, 0
U4: 3, 2, 1, 0, 1, 2, 0, 2, 0
U5: 7, 4, 3, 2, 0, 0, 0, 0, 0
Suppose inactive duration is 5 days. Under this setting, U2 and U5 have churned, since their series end in seven and five consecutive zeros respectively. The other three users have not: each of their series ends in a single zero.
So now we have turned our prediction problem into a binary classification one. This is a standard type of problem in machine learning, meaning there are out-of-the-box algorithms and tools to solve it. More on that later.
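The labeling rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above; the function names trailing_zeros and label_user are made up for this sketch and are not part of any library.

```python
def trailing_zeros(series):
    """Count the consecutive zeros at the right end of a usage series."""
    count = 0
    for value in reversed(series):
        if value != 0:
            break
        count += 1
    return count


def label_user(series, inactive_duration):
    """Label a user 'churned' if the series ends in at least
    inactive_duration consecutive zeros, and 'active' otherwise."""
    if trailing_zeros(series) >= inactive_duration:
        return "churned"
    return "active"


# The five example users from the text.
users = {
    "U1": [5, 6, 7, 8, 4, 3, 2, 3, 0],
    "U2": [5, 2, 0, 0, 0, 0, 0, 0, 0],
    "U3": [5, 2, 0, 6, 8, 3, 2, 1, 3, 0],
    "U4": [3, 2, 1, 0, 1, 2, 0, 2, 0],
    "U5": [7, 4, 3, 2, 0, 0, 0, 0, 0],
}

labels = {name: label_user(series, inactive_duration=5)
          for name, series in users.items()}
print(labels)
```

Because inactive duration is just an argument, trying a week, two weeks, or a month amounts to re-running the labeling with a different value.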
Our modeling is not yet done. We now know what to predict (churned or active) but not what to predict it from. You might ask: we have the user’s daily usage time series, isn’t that what we predict from? Well, yes, kind of. However, standard black-box algorithms don’t accept time series as predictors. They take their predictors from a so-called feature vector, which is a fixed-dimensional array of numbers. By contrast, our time series doesn’t have a fixed length: some users may have installed the product only in the past month, while others may have been using it for years.
The problem of going from the ‘raw’ input — in our case the daily usage time series — to a feature vector is called feature engineering. In practice this can be a complex endeavor involving experienced data scientists working in combination with domain experts.
Okay, let’s get back to our problem. What would be sensible features? One idea, discussed briefly earlier in this post, is to derive a feature that measures whether the daily usage time series is up-trending, down-trending, or flat. In the first two cases, the feature should also capture the extent of the trend.
The intuition is that a down-trending time series is a sign of ‘the user losing interest’ and thereby a potential predictor of churn.
This brings up another question: what should be the time scale on which trends are measured? Days? Weeks? This might heavily depend on the App type. Typical usage of a texting app may be multiple uses a day. Typical usage of a personal finance app might, on the other hand, be only once a week. Knowledge of the product type can thus guide us to the time scale, or scales, we should use. This would be an example of leveraging domain knowledge towards feature engineering.
In this post, we restrict modeling to a single suitable and known time scale. Domain knowledge of the App type, as discussed in the previous paragraph, might help with this choice. In a later blog post we will address the more general situation of modeling at multiple time scales.
The Specific Features
Let’s divide time into units 1, 2, 3, … These units may be mapped to any time scale by the choice of the unit width (e.g. three hours, one day, one week, …). Let x(1), x(2), …, x(n) denote the usage counts for any one user at the time scale being modeled.
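As a rough sketch of how the daily series might be mapped to a coarser time scale, the snippet below sums daily counts into units of a chosen width. The function name rebin and the decision to keep a trailing partial unit are assumptions made for illustration; the post itself does not prescribe either, and this sketch only covers units that are whole multiples of a day.

```python
def rebin(daily_counts, unit_width):
    """Sum daily counts into consecutive units of unit_width days.

    Assumption: a trailing partial unit is kept and sums only the
    days that are available, so it may understate usage.
    """
    if unit_width < 1:
        raise ValueError("unit_width must be at least 1")
    return [
        sum(daily_counts[start:start + unit_width])
        for start in range(0, len(daily_counts), unit_width)
    ]


daily = [5, 6, 7, 8, 4, 3, 2, 3, 0]   # U1 from the labeling example
x = rebin(daily, unit_width=3)         # x(1), x(2), x(3) at a 3-day scale
print(x)
```

With unit_width=1 the series is returned unchanged, which corresponds to modeling at the daily scale.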
We want one or more features that measure what the trend is (up, down, or flat) and, in the first two cases, its strength. First, let’s observe that for our purposes we don’t need to capture the fine trend, which might be quite elaborate, only the trend implied by connecting the early and late unit-time usage counts by a line. This is because we only care about predicting churned vs active.
Our specific features arise from a suitable formalization of early and late, combined with a robust measurement of the activity at these units. Early is clear: set it to time unit 1. What about late? That is not clear, so let’s capture it in a parameter and call it k. Okay, what next? What do we mean by “robust measurement”? To appreciate this, let’s first write out our family of features, parametrized by k, in non-robust form. This is
x(k) - x(1)
The idea is simple: a negative value of the feature is indicative of a downtrend. The problem is that this indicator is not robust. Consider the time series 4, 3, 4, 2, 0, 4, 3 and let k=5. The value of this feature is x(5)-x(1) = 0-4 = -4, which implies a downtrend at this k. Looking at neighboring values reveals, however, that this is not really a downtrend. How do we fix this? We replace x(i) by a neighborhood-weighted average. We formalize this in terms of a Kernel function K parametrized by a width parameter w. K(i,w) takes the values in the time series in the neighborhood of x(i) defined by w and returns a suitable value. In this notation, we would express our feature as
K(k,w) - K(1,w)
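To make the difference concrete, here is a minimal sketch that computes the feature K(k,w) - K(1,w) on the example series, using a plain unweighted average over the window as the Kernel. The names window_average and trend_feature are hypothetical, and skipping positions that fall outside the series is one possible convention, assumed here for simplicity.

```python
def window_average(x, i, w):
    """Unweighted average of x(i-w), ..., x(i+w), with i 1-based.

    Assumption: positions outside the series are simply skipped.
    """
    lo = max(1, i - w)
    hi = min(len(x), i + w)
    window = x[lo - 1:hi]          # convert 1-based bounds to a slice
    return sum(window) / len(window)


def trend_feature(x, k, w=0):
    """K(k, w) - K(1, w); with w=0 this reduces to x(k) - x(1)."""
    return window_average(x, k, w) - window_average(x, 1, w)


x = [4, 3, 4, 2, 0, 4, 3]
raw = trend_feature(x, k=5, w=0)      # 0 - 4 = -4
smooth = trend_feature(x, k=5, w=1)   # (2+0+4)/3 - (4+3)/2 = -1.5
print(raw, smooth)
```

Averaging over the neighbors pulls the feature from -4 up to -1.5, so the isolated zero at x(5) counts for much less.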
To appreciate the potential robustness we have gained from this, let’s see some sensible Kernel functions. Our first choice, K(i,w)=x(i), yields our non-robust estimate. Our second meta-choice, K(i,w), takes the average of x(i-w), x(i-w+1), …, x(i), x(i+1), …, x(i+w). Our third meta-choice generalizes “average” in our second meta-choice to a suitable weighted average. A sensible weighting of the contribution of x(j) to K(i,w) would decay exponentially with the distance |j-i| between i and j. A closely related smooth choice is a Gaussian Kernel, whose weights decay exponentially with the squared distance. There, the Kernel’s width parameter w gets re-purposed to become the Gaussian’s standard deviation.
How exactly is K(1,w) defined, given that the time series starts at x(1)? One idea is to always ignore those x-values needed in the Kernel calculation that are undefined. So K(1,w) in effect would be a suitable average of x(1) and its right-neighborhood of suitable width.
Of course, while introducing the notion of a Kernel function has gained us modeling power, it has also complicated our life a bit. Now we need to decide on the form of the Kernel function and also on its width. That said, there are reasonable choices that should be an improvement over not using a Kernel. For example, set w=1 and use exponentially-decaying weights.
Let’s end by summarizing the contents of this post. We addressed the problem of predicting whether or not a user would churn (i.e., stop using the product) based on the user’s recent usage. We modeled usage over time as a time series. We had to decide how to label the data, i.e., what minimum duration of inactivity we would deem indicative of churn. We discussed what features to extract from the time series that would help us predict churn. We reasoned that features that measure whether usage is down-trending or not ought to be predictive, down-trending being a sign of ‘losing interest’. We formalized how to measure trends in a robust way using the notion of Kernel functions to smooth rough edges. Think weighted moving averages.
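The weighted-average Kernel, together with the idea of ignoring undefined x-values, might look like the sketch below. The name exp_kernel, the decay factor of 0.5, and the renormalization of weights over the positions that exist are all assumptions chosen for illustration, not something the post fixes.

```python
def exp_kernel(x, i, w, decay=0.5):
    """Weighted average around 1-based position i over x(i-w), ..., x(i+w).

    The weight of x(j) is decay ** |j - i|, so it shrinks exponentially
    with distance. Positions outside the series are ignored and the
    weights are renormalized over the positions that remain.
    """
    total = 0.0
    norm = 0.0
    for j in range(i - w, i + w + 1):
        if 1 <= j <= len(x):
            weight = decay ** abs(j - i)
            total += weight * x[j - 1]
            norm += weight
    return total / norm


x = [4, 3, 4, 2, 0, 4, 3]
k_first = exp_kernel(x, 1, w=1)   # only x(1) and x(2) are defined
k_late = exp_kernel(x, 5, w=1)    # uses x(4), x(5), x(6)
feature = k_late - k_first
print(k_first, k_late, feature)
```

At position 1 the left neighbor does not exist, so K(1,w) becomes a weighted average of x(1) and its right neighborhood only.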
We discussed the need for modeling features at multiple time scales but left doing so for another post.