What makes predicting customer churn a challenge?

Bahador Khaleghi
9 min read · Sep 15, 2017

Staying on top of customer churn is essential to a healthy and successful business. In particular, most companies with a subscription-based business model regularly monitor the churn rate of their customer base. Moreover, the cost of acquiring new customers is typically high, which makes predictive models of customer churn appealing: they enable companies to retain their existing customers at a higher rate. Although defining and predicting customer churn might appear straightforward at first, it involves several practical challenges. This article discusses some of these challenges based on our experience modelling customer churn at Tucows. Here at Tucows, gaining a deep understanding of our customers’ needs and priorities is a key aspect of our business. This is particularly the case for our Ting mobile customers, as reflected in our highly acclaimed customer service experience.

Characterizing customer churn

There are three main challenges here:

  • What constitutes a churn event? When is it appropriate to label a customer as a churner? For a subscription-based business, a churn event can be defined as the moment a subscription is terminated, either by the customer or by the company. For a non-subscription business, however, churn becomes a rather fuzzy concept. For instance, a customer can interact with an online store at any time, so what does it mean to say such a customer has churned? A common workaround is to consider a customer a churner if they have had no (purchase) interaction for, say, the last 30 days. Although this approach is typical, it does not work for customers with “burst behaviour”, i.e. customers whose interactions are sporadic yet bursty (see Figure 1). Methods like this are more appropriate for dealing with such customers.
Figure 1: Sample customer activity profiles (a) regular customer activity (b) bursty sporadic activity
  • How to compute (monthly) churn rate? Churn rate is meant to quantify the percentage of customers leaving a business relative to its base size. This is rather ill-defined: one has to ask over which time period customers are leaving, which means the customer base size itself can no longer be assumed to be static. There is an interesting article discussing a variety of methods for estimating churn rate while accounting for this challenge.
  • What type of churn? Although on the surface customer churn might seem like a single problem, in practice there are often different types of churn, each driven by a different set of motives. For instance, at Tucows Inc. we have two clearly distinct types of churn for our Ting customers, namely voluntary and involuntary. Voluntary churners leave by either cancelling or porting out to another carrier, whereas involuntary churners are terminated by us due to unpaid bills, fraudulent activity, etc. A single model would have a hard time capturing such complex patterns, so having a separate model for each churn type is preferred. As discussed in this interesting blog, the separate churn modelling approach can lead to significant improvements in prediction performance.
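As a toy illustration of the inactivity-window definition above, the sketch below labels a customer as a churner when their last purchase falls more than 30 days before a reference date. All customer names, dates, and the 30-day threshold are illustrative assumptions:

```python
from datetime import date, timedelta

# Hypothetical last-interaction dates per customer.
last_purchase = {
    "alice": date(2017, 9, 1),
    "bob": date(2017, 6, 15),   # inactive for three months
    "carol": date(2017, 8, 30),
}

def label_churners(last_seen, today, window_days=30):
    """Label a customer as churned (1) if inactive longer than the window."""
    cutoff = today - timedelta(days=window_days)
    return {cust: int(seen < cutoff) for cust, seen in last_seen.items()}

labels = label_churners(last_purchase, today=date(2017, 9, 15))
churn_rate = sum(labels.values()) / len(labels)
```

Note the threshold is a business decision, not a statistical one; as the bullet above points out, it mislabels sporadic-but-loyal customers.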

Data related challenges

The dataset used to model customer churn is typically in ({features}, label) form, where the features are a set of customer metrics and the label is set to 1 if the customer is considered a churner and 0 otherwise. Both the features and the labels entail several challenges:
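Concretely, one training example per customer might look like the following sketch; the feature names and values are made up for illustration:

```python
# Each example pairs a feature dict with a binary churn label.
dataset = [
    ({"tenure_months": 24, "avg_monthly_bill": 35.0, "support_calls": 1}, 0),
    ({"tenure_months": 3,  "avg_monthly_bill": 80.0, "support_calls": 6}, 1),
]

# Split into parallel feature/label sequences for model training.
features, labels = zip(*dataset)
```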

  • Messy data: the raw data tables in a company’s data warehouse are rarely in a format suitable for churn modelling. To bring this mess of data into a proper form, one has to perform the so-called ETL (extract-transform-load) process and feature engineering. This entails tasks such as identifying and selecting potentially useful features, developing SQL scripts to extract them from database tables, removing outlier records (e.g. customers with outlandish feature values), and applying various data transformations, e.g. the Box-Cox transformation to ensure data normality.

Data wrangling is typically a labour-intensive and crucial part of the modelling process, yet it unfortunately does not attract enough attention, especially in academia.

  • Low churn rate: customer churn is normally a relatively rare event, assuming the business is in good shape. This leads to the so-called class imbalance issue, where the number of churner customers is much smaller than the number of non-churner (majority) customers. A severe class imbalance can cause poor predictions by the churn model. This is because most machine learning models learn by maximising overall accuracy: under severe class imbalance, a churn model can achieve high accuracy simply by predicting all samples as the majority (non-churner) class, without really learning anything about the minority class. There are a number of approaches to alleviating the class imbalance problem, as discussed here. Simple down/up-sampling, as well as more advanced sampling methods such as SMOTE, are among them.
  • Churn event censorship: in theory, every customer eventually churns, i.e. given a long enough time horizon all dataset labels would be 1. So customers considered non-churners while training the churn model are in fact only partially observed, i.e. their churn event is censored (see Figure 2). Churn event censorship is problematic for conventional machine learning methods, which require dataset labels to be fully observable. Survival models are an attractive alternative in such scenarios, as discussed further in the next section.
Figure 2: Demonstration of (right) censorship of churn event data
  • Feature responsiveness: two types of features are used to model churn, namely aggregate features (e.g. average monthly bill) and time series features (e.g. data usage over the last six months). Aggregate features are generally easier to collect and model and are hence more commonly used. However, in our experiments we noticed a problem with simple linear averaging for aggregation: it assigns equal weight to all samples. This is inappropriate for churn modelling because customers may exhibit drastically different patterns close to their churn event, which simple averaging cannot capture. This is particularly the case for long-tenure customers. One way to alleviate this issue is to use moving-average methods, most importantly exponential moving averages, which assign a (tunable) higher weight to the most recent samples.
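The exponential moving average mentioned in the last bullet can be sketched in a few lines. The smoothing factor `alpha` is the tunable weight on the most recent sample, and the usage numbers below are made up to show the effect:

```python
def exp_moving_average(samples, alpha=0.5):
    """Exponentially weighted average: recent samples count more.

    ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}
    """
    ema = samples[0]
    for x in samples[1:]:
        ema = alpha * x + (1 - alpha) * ema
    return ema

# Six months of hypothetical data usage ending in a spike: the EMA
# tracks the recent jump, while the plain mean dilutes it.
usage = [1.0, 1.0, 1.0, 1.0, 1.0, 9.0]
plain_mean = sum(usage) / len(usage)        # ~2.33, spike mostly washed out
ema = exp_moving_average(usage, alpha=0.5)  # 5.0, spike clearly visible
```

A higher `alpha` makes the feature more responsive to the most recent month, at the cost of more noise; tuning it is part of the feature engineering.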

Churn modelling

There are two main challenges in modelling churn. First, one has to develop and validate an efficient churn prediction model using an appropriate method. Then, once the model is in production, one has to constantly monitor its performance over time and re-train or redevelop it if necessary.

Churn modelling approaches fall into three categories:

  1. Binary classification: this approach ignores the aforementioned churn event censorship and treats churn modelling simply as learning a binary classifier. Although it is used fairly often in the literature, its inability to deal with censorship, as well as its sensitivity to class imbalance, especially in low-churn-rate applications, makes it the inferior modelling choice. In our case, we experimented with two binary classifiers, namely a random forest (RF) and a wide-and-deep neural network (WD-NN). Both methods achieved very good performance (in terms of precision and recall) during the training and validation phases. However, when tested on predicting the subsequent month’s churners, their performance was far less satisfactory. We suspected three underlying reasons for this drop in performance. First, we examined the models for over-fitting. As expected, over-fitting was not an issue for the RF model (RF is generally robust to the choice of hyper-parameters). The WD-NN model, on the other hand, showed some over-fitting, which was subsequently alleviated by adding more regularisation. Second, the rather severe class imbalance in our dataset due to our low customer churn rate; this was alleviated using the SMOTE sampling method. Third, the inability of binary classifiers to capture the censorship of churn label data. This observation led us to the survival regression methods discussed next.
  2. Survival regression: survival analysis models are the well-known methods of choice for modelling time-to-event datasets. For instance, the Kaplan-Meier (KM) model is a popular non-parametric survival analysis approach. Given the duration until the churn event (or censorship) for a set of customers, the KM model provides their overall survival curve, i.e. the probability of survival over time. Survival regression models take this to the next level by incorporating features associated with customers (as covariates) into the modelling process. Different survival regression models assume different linear relationships between the covariates and the risk of a customer churn event. For instance, the Cox and Aalen models assume multiplicative and additive relationships, respectively. Survival regression models do not label customers as churners and non-churners; instead, they provide a survival curve, which can be used to compute the expected time to the churn event for each customer. A customer can then be regarded as a churner if their predicted time to churn is close (based on a preset threshold) to their current length of tenure. In our case, we experimented with the Cox method for churn modelling. As with the binary classifiers above, the model yielded good performance (in terms of the concordance index) during the training and validation phases, but its performance dropped when tested on predicting the subsequent month’s churners. We suspected two main reasons. First, the feature responsiveness issue discussed above, as our aggregate features were computed using simple linear averaging; we plan to experiment with more advanced aggregation methods such as exponential moving averages to improve their responsiveness. Second, the linear covariate-to-(churn)-risk assumption of the Cox model, which might not be appropriate in our case.
Our conjecture is that, as the US telecom market has evolved over time, the impact of various features on our Ting customers’ churn has varied as well. Accordingly, more advanced survival regression methods capable of capturing such non-linearity are required. This observation led us to the hybrid models discussed below.
  3. Hybrid models: recently, a number of methods have been proposed to deal with survival classification problems involving complex non-linear churn risk functions. These methods are generally developed by extending popular non-linear binary classification methods to censored survival data. RF-SRC and DeepSurv are two such hybrid methods, extensions of random forests and deep neural networks, respectively. We plan to experiment with modelling churn using these powerful methods in the future.
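To make the Kaplan-Meier idea from point 2 concrete, here is a minimal from-scratch sketch of the estimator on made-up tenure data; in practice one would use a survival analysis library rather than this toy. Each customer contributes a duration and an event flag (1 = churned, 0 = censored, i.e. still subscribed when observed):

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival curve.

    S(t) is the product over event times t_i <= t of (1 - d_i / n_i),
    where d_i customers churned at t_i out of n_i still at risk.
    """
    # Distinct times at which at least one churn event occurred.
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    curve, surv = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        churned = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        surv *= 1.0 - churned / at_risk
        curve.append((t, surv))
    return curve

# Tenure in months for five hypothetical customers; two are censored.
durations = [2, 4, 4, 6, 8]
events    = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)  # survival drops at t = 2, 4, 6
```

Note how censored customers still count in the at-risk denominator up to their observed tenure; this is exactly the information a plain binary classifier throws away.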

For the sake of completeness, we should mention an interesting recent method called WTTE-RNN, in which the author essentially turns the churn modelling strategy on its head: the proposal is to predict the time to the next non-churn event, as opposed to the churn event (see Figure 3). This clever reformulation yields a mathematically more appropriate treatment of the churn problem.

Figure 3: Predicting time till next non-churn event in WTTE-RNN

Concept drift

As mentioned earlier, once a good churn model has been developed, validated, and deployed in production, yet another challenge arises. It has to do with the dynamic nature of the churn problem and the notion of concept drift. Simply put, a churn model that works well today could cease to perform in the future due to changes in the customer behaviour patterns driving churn. For instance, the high cost of voice calls may be causing customers to churn today; however, as more and more voice calls are made over data, that cost becomes less and less relevant to churn. A review of concept drift methodologies is presented here.

Tackling concept drift is an important part of maintaining churn models in production, yet it is often not discussed in the churn modelling literature.
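One simple, pragmatic form of drift monitoring is to track the deployed model's monthly precision and raise a flag once it falls too far below the level measured at deployment time. The sketch below does exactly that; the baseline, tolerance, and monthly numbers are purely illustrative:

```python
def detect_drift(baseline, monthly_scores, tolerance=0.10):
    """Return the first month whose score drops more than `tolerance`
    below the baseline measured at deployment time, or None."""
    for month, score in monthly_scores:
        if baseline - score > tolerance:
            return month
    return None

# Hypothetical monthly precision of a deployed churn model.
history = [("Jan", 0.82), ("Feb", 0.80), ("Mar", 0.68), ("Apr", 0.65)]
drifted = detect_drift(baseline=0.81, monthly_scores=history)
# "Mar" is flagged: precision fell more than 10 points below baseline.
```

A flagged month is a prompt to investigate and potentially re-train, not proof of drift; more principled detectors from the concept drift literature test for distribution change directly.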

Next steps

By listening to the immediate feedback provided by our Ting customers, we have already managed to reduce churn. Nonetheless, an accurate churn prediction model will enable us to further strengthen customer feedback and accommodate their needs. To achieve better churn prediction results, we plan to enhance our training dataset as well as experiment with advanced hybrid modelling approaches. Our current training dataset mainly comprises aggregate features computed using simple linear averaging; a well-tuned exponential moving average can significantly enhance the “responsiveness” of these features. In addition, we plan to experiment with a dataset augmented with time series features, such as customers’ usage stats over their last six months. Churn modelling with such time series features should enable the discovery and incorporation of more complex customer patterns into churn prediction. The hybrid churn modelling methods discussed earlier might also prove useful, as they can capture non-linear relationships between features and churn risk. Of the two hybrid alternatives discussed, RF-SRC appears more appealing, as it would be easier to tune (like RF) than DeepSurv. Lastly, the recently proposed WTTE-RNN approach looks like a promising alternative, as it can handle both censorship and time series data and is based on a mathematically sound formulation of churn. We plan to experiment with this method in the future. If you find customer churn modelling an exciting journey and feel you have a knack for it, give us a shout. We are hiring! :)
