Estimating customer churn based on usage data

Haribabu Inuganti
Data Science at Microsoft
May 31, 2022

For any product or service to be successful in the long term, retaining customers is very important. This is because 1) There may be a ceiling in the number of new customers that can be acquired; 2) It might become costly to acquire new customers; and 3) Retained customers may also make repeat purchases, making themselves even more profitable. And of course, a retained and happy customer may become a product’s advocate.

In this article, I explore how to identify churned and retained customers from product usage data, how to understand churn patterns using Kaplan–Meier curves and heat maps, and how to use statistical testing to determine whether there is a churn problem. Let’s get started!

Identifying a churned customer

Churn can occur with the explicit cancellation of a subscription (such as in the cases of subscription-based services like Netflix or cloud services) or more passively by simply not using a product or service. While it might be straightforward to identify churn based on cancellation, it might be necessary to infer churn based on non-usage patterns if there is no explicit event like cancellation. Customers either not using a product or service or using it less frequently might be a concern for services where billing is based on usage, as lower usage implies lower revenue. Even for services where there is a fixed fee per period regardless of usage (such as with Spotify or Netflix), inactive customers might be more likely to cancel their subscriptions in the future because they are not deriving much utility out of the subscription.

So how is it possible to identify a churned customer from usage data? Usage churn, in simple terms, refers to customers who have stopped using the product. In practice, however, it is difficult to determine whether a customer has stopped using a product because different customers may have differing periods of inactivity. For example, some might simply stop using the service for a few days and then start consuming later. In these cases, non-usage would not imply churn. As a result, we want to answer the question “How many days/weeks/months of inactivity would imply churn?”

To answer this, we can review customers showing inactivity to determine the threshold number of days by which most customers, if they resume, start using the service again after a period of inactivity. This threshold can then be used to identify churned customers in the usage data. The following is example data of customers using a service over 180 days (about six months).

Example data of customers with activity by day from their first date, with 1 indicating usage and 0 non-usage.

The first customer has no activity from Day 3 to Day 7 but resumes using the service on Day 8. The last customer does not have any activity from Day 6 to Day 180 (and so looks like a churned customer). If we have data for a long enough period, such as the 180 days shown in the example, it is easy to tell whether a customer has churned. But we might also want to know the share of churn among very recent customers (such as customers from recent months with only 30 to 60 days of data) so that we can take some action to try to retain them.

To achieve this, we can analyze cohorts with full data (such as the 180 days shown in the example) to determine the period within which a customer who returns after inactivity is most likely to return. The following illustration depicts simulated data on the number of days of inactivity.

Distribution of customers by number of days with inactivity.

We can see from the distribution above that 80 percent of customers take fewer than two weeks to resume usage. We can therefore use this as a cut-off to define a churned customer. (A different distribution might suggest a different cut-off.) Using this cut-off, we can identify a customer with no usage for two weeks as churned and identify the rest of the customers as retained. While this cut-off–based categorization is not true for all customers, we know that it is true for more than 80 percent of customers.
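The cut-off selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names `inactivity_gaps` and `cutoff_days`, the `coverage` parameter, and the toy activity lists are all assumptions made for the example. It measures every inactive stretch that ended with renewed usage and picks the smallest gap length that covers the requested share of those stretches.

```python
import math
from typing import List


def inactivity_gaps(activity: List[int]) -> List[int]:
    """Lengths of inactive runs that were followed by renewed usage.

    `activity` is a per-day list of 1 (usage) / 0 (non-usage) starting at the
    customer's first day. A trailing run of zeros is ignored because the
    customer has not come back, so that gap is not a completed one.
    """
    gaps = []
    run = 0
    seen_usage = False
    for used in activity:
        if used:
            if seen_usage and run > 0:
                gaps.append(run)
            run = 0
            seen_usage = True
        else:
            run += 1
    return gaps


def cutoff_days(all_gaps: List[int], coverage: float = 0.8) -> int:
    """Smallest gap length g such that at least `coverage` of gaps are <= g."""
    ordered = sorted(all_gaps)
    if not ordered:
        raise ValueError("no completed inactivity gaps to analyze")
    idx = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(idx, 0)]


# Hypothetical toy data: three customers observed for ten days each.
customers = {
    "A": [1, 1, 0, 0, 0, 0, 0, 1, 1, 1],  # one 5-day gap, then resumes
    "B": [1, 0, 1, 0, 0, 1, 1, 1, 1, 1],  # gaps of 1 and 2 days
    "C": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # never resumes: no completed gap
}
all_gaps = [g for days in customers.values() for g in inactivity_gaps(days)]
threshold = cutoff_days(all_gaps, coverage=0.8)
```

On real data the same two functions would be run over a cohort with a long observation window, and the resulting threshold would then be applied to more recent cohorts.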

Customer base with new and churning customers

It is also worth noting that among the customers whom we have identified as churned based on the cut-off, some would come back and start using the service — but this share would be very low. We can identify these customers as rejoined customers. The following diagram captures this concept:

Identifying churn and retention from a cohort.

We observe a cohort of rejoined customers because the cut-off covers only a certain share of the customers who resume usage (80 percent in our example); the rest return after the cut-off and are therefore first labeled as churned. It is useful to keep a check on the share of rejoined customers over time: if it changes significantly, we might need to revisit our threshold.
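As a rough sketch of how the three groups could be assigned from daily activity, the following Python function applies a cut-off to one customer's usage history. The function name `classify`, the label strings, and the default of 14 days are illustrative assumptions; the input is assumed to start on the customer's first day of usage.

```python
from typing import List


def classify(activity: List[int], cutoff: int = 14) -> str:
    """Label a customer's daily 0/1 activity using an inactivity cut-off.

    'churned'  : the current (trailing) inactive streak is at least `cutoff` days
    'rejoined' : active again now, but an earlier gap reached the cut-off,
                 so the customer would have been labeled churned at that time
    'retained' : no gap has reached the cut-off
    """
    longest_completed_gap = 0
    run = 0
    for used in activity:
        if used:
            longest_completed_gap = max(longest_completed_gap, run)
            run = 0
        else:
            run += 1
    # After the loop, `run` is the length of the trailing inactive streak.
    if run >= cutoff:
        return "churned"
    if longest_completed_gap >= cutoff:
        return "rejoined"
    return "retained"


# Hypothetical examples with a two-week cut-off.
labels = {
    "short_break": classify([1, 0, 0, 1, 1]),
    "gone": classify([1] + [0] * 14),
    "came_back": classify([1] + [0] * 20 + [1, 1]),
}
```

Tracking the share of "rejoined" labels per cohort over time gives the check on the threshold mentioned above.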

Also, the above example is based on a service for which tracking usage at a daily level makes sense. For some products, it may be more appropriate to track usage at a weekly or monthly level, and the same concept extends to those products as well.

Understanding churn patterns

Once we establish the cut-off, we can use it to understand patterns of churn over time. The Kaplan–Meier (KM) estimator curve and retention heat maps are commonly used for this purpose. The KM curve shows the pattern of churn over the timeline from the acquisition or start date. The following is an example of a KM curve:

An example KM curve (source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3932959/).

At each event time on the x-axis, the curve drops by a percentage of its remaining height, where that percentage is the share of the at-risk population that churned at that time. Because the approach, by design, uses the at-risk population at each point, it works well when the base population for churn changes across the timeline, such as with censored data or when aggregating results over different acquisition cohorts. The KM curve also gives a point estimate as well as an interval for churn at any point in time during the period. This interval helps with estimation as well as with comparing churn between two categories.
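The "drop by a share of the remaining height" logic can be made concrete with a small, dependency-free sketch of the Kaplan–Meier product-limit estimate. The function name `kaplan_meier` and the toy durations are assumptions for illustration, and the confidence interval mentioned above is left out to keep the example short.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[int], churned: List[bool]) -> List[Tuple[int, float]]:
    """Return [(t, S(t))] at every time t with at least one churn event.

    durations[i]: days from acquisition until churn or until last observation
    churned[i]  : True if customer i churned at durations[i], False if the
                  customer was still active when observation ended (censored)
    """
    event_times = sorted({t for t, e in zip(durations, churned) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still under observation and not yet churned just before t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, churned) if e and d == t)
        # The curve drops by events/at_risk of its remaining height.
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy cohort: six customers, three of them censored.
durations = [5, 10, 10, 20, 30, 30]
churned = [True, True, False, True, False, False]
curve = kaplan_meier(durations, churned)
```

Censored customers never cause a drop themselves; they only leave the at-risk count after their last observed day, which is what lets cohorts with different observation lengths be combined.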

The following is an example of a retention heat map shown over several weeks from the time of acquisition. The heat map represents the percentage of customers retained over time.

Retention heat map.

We assign churn to the week in which the customer was last seen before churning, which is why churn values appear for Week 1 even though our threshold is two weeks. The heat map allows us to compare different acquisition cohorts to see whether churn is increasing or decreasing for a particular cohort.
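The numbers behind such a heat map can be produced with a short aggregation. The sketch below is an assumption about the data layout, not the author's pipeline: each customer is a pair of a cohort label and the week in which they were last seen before churning (`None` if they have not churned), and every cohort is assumed to have been observed for all `n_weeks`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def retention_table(
    customers: Iterable[Tuple[str, Optional[int]]], n_weeks: int
) -> Dict[str, List[float]]:
    """Percentage of each cohort retained at the end of weeks 1..n_weeks.

    A customer whose churn is assigned to week k counts as retained only for
    weeks before k; a customer with churn week None is retained throughout.
    """
    by_cohort = defaultdict(list)
    for cohort, churn_week in customers:
        by_cohort[cohort].append(churn_week)

    table = {}
    for cohort, weeks in by_cohort.items():
        size = len(weeks)
        table[cohort] = [
            100.0 * sum(1 for w in weeks if w is None or w > week) / size
            for week in range(1, n_weeks + 1)
        ]
    return table


# Hypothetical toy data: (acquisition cohort, churn week or None).
customers = [
    ("Jan 1", None), ("Jan 1", 1), ("Jan 1", 3), ("Jan 1", None),
    ("Jan 15", 2), ("Jan 15", None),
]
table = retention_table(customers, n_weeks=3)
```

Each row of `table` corresponds to one row of the heat map, so the result can be passed to any plotting library for coloring.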

While retention rates are lower for the week of January 15, the difference of two percentage points that we observe at Week 3 between the January 1 and January 15 acquisition cohorts might be either noise or a real drop. We can use statistical hypothesis testing to check whether this difference is statistically significant. As these are two independent samples with large sample sizes, we can use the z-test for proportions. Note that for small samples a continuity correction should be applied. The following shows the methodology of the test:

H0: The difference between the two proportions is zero

Ha: The difference between the two proportions is non-zero

Alpha = 0.05 for a two-tailed test

Z statistic = The difference in proportions / The standard error

The difference in proportions = 91% – 89% = 2 percentage points (0.02)

Pooled proportion P^ = 0.9, standard error under H0 SEo ≈ 0.004 (rounded)

Z = 0.02 / SEo ≈ 4.93 (using the unrounded standard error)

P(Z > 4.93) ≈ 0.00 for one tail; doubled for the two-tailed test, the p-value is still ≈ 0.00

We reject the null hypothesis as the p-value is less than the alpha of 0.05.
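The formula-based calculation can be reproduced with Python's standard library alone. The article does not state the cohort sizes, so the 11,000 customers per cohort used here is a hypothetical value chosen only to make the example runnable; with it, the z statistic lands close to, but not exactly at, the 4.93 reported above.

```python
import math

# Hypothetical cohort sizes (not given in the article).
n_jan1, n_jan15 = 11_000, 11_000
# Week 3 retention rates from the example.
p_jan1, p_jan15 = 0.91, 0.89

# Pooled proportion under H0 (both cohorts share one retention rate).
p_pooled = (p_jan1 * n_jan1 + p_jan15 * n_jan15) / (n_jan1 + n_jan15)

# Standard error of the difference under H0.
se0 = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_jan1 + 1 / n_jan15))

z = (p_jan1 - p_jan15) / se0

# Two-tailed p-value from the standard normal: 2 * (1 - Phi(|z|)).
p_value = math.erfc(abs(z) / math.sqrt(2))

reject_h0 = p_value < 0.05
```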

Python packages can be used to conduct the testing. The following shows example code. These packages readily give the p-value.

Z-test for proportion in Python.
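One package that offers this test is statsmodels, through `proportions_ztest`. The snippet below is a sketch rather than the article's original code, and the counts are hypothetical: 11,000 customers per cohort, with 91 percent and 89 percent retained at Week 3.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: customers retained at Week 3 and cohort sizes
# for the January 1 and January 15 acquisition cohorts.
retained = [10_010, 9_790]
cohort_size = [11_000, 11_000]

# Two-sided test of H0: the two retention proportions are equal.
z_stat, p_value = proportions_ztest(
    count=retained, nobs=cohort_size, alternative="two-sided"
)

reject_h0 = p_value < 0.05
```

By default the function estimates the variance from the pooled sample proportion, which matches the standard error used in the formula-based calculation.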

Unsurprisingly, the results from formula-based calculations and from the Python-based package are similar.

Next steps

Once we identify that there is a churn problem, we must investigate what is driving it. There can be many reasons for churn, so it helps to start the investigation with a list of the top drivers that commonly influence retention. To identify them, one can use a machine learning–based approach that ranks features by their influence on the outcome. SHAP values are a great choice for this exercise. Once we have the ranked list of features, we can check whether any of the top features have changed in a way that affects the churn rate. The ranked list is also useful for identifying areas where a team can invest to improve the retention rate: initiatives that affect top-ranked features are likely to give more ROI than initiatives that affect lower-ranked features. A model that produces churn probabilities for each customer can also be used by a marketing team to intervene and prevent churn among the most valuable customers.

References

Kaplan–Meier curves: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3932959/

Z-test for proportions: https://www.spss-tutorials.com/z-test-2-independent-proportions/

Haribabu Inuganti is on LinkedIn.

Data Scientist @ Microsoft. https://www.linkedin.com/in/haribabuinuganti/. Interests: Data Science, Personal Finance, Technology, Human Behaviour