Strategic Customer Retention: A Comparative Analysis of Churn Prediction Models

mzanatta
Data Reply IT | DataTech
13 min read · May 14, 2024

Introduction

In recent years, markets have grown increasingly competitive, prompting businesses to find a balance between keeping existing customers and acquiring new ones. It’s widely acknowledged that retaining existing customers is more cost-effective than acquiring new ones [1]. Therefore, it’s crucial for businesses to implement reliable systems to forecast customer churn. Customer churn prediction means estimating the probability that a customer will leave a service or a product, opting instead for a competitor’s offering. The aim of this Machine Learning task is to foresee such decisions before they occur, allowing businesses to intervene promptly [2].

The customer churn prediction literature commonly focuses on industries such as mobile telecommunications, banking, insurance, retail, gaming, logistics, e-commerce, and media [4], where the market is highly competitive and the need to improve customer satisfaction, loyalty, and retention is high.

This article explores different predictive modeling techniques applied to churn prediction, comparing traditional binary classification models with advanced survival analysis and deep learning methods. First, we conduct the comparison from a theoretical point of view, describing the main differences between these classes of models. Then, we carry out a case study using the Telco Customer Churn dataset, where our aim is to evaluate the performance of several models and to discuss the implications of their results in a real-world business context.

Figure 1. Churn

Theoretical background

Machine learning is a prevalent approach for predicting customer churn. Traditionally, this has been achieved through classification algorithms that distinguish between the behavioral patterns of churners and non-churners. However, in recent years, survival analysis has gained prominence due to its capacity to provide probabilities that take into account the time until an event occurs, unlike binary classification models that simply predict whether the event occurs. Traditional survival methods usually take the individual hazard function as the main target and make assumptions about it in order to predict the probability of event occurrence at various time points. The Cox proportional hazards model (CoxPH) [8] is the most commonly used predictive model for survival analysis. It assumes that the ratio between an individual’s hazard function and the population’s baseline hazard function is a time-independent constant; this constant is called the hazard ratio and is the quantity the CoxPH model predicts.
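To make this assumption concrete, the CoxPH model writes the hazard of a customer with covariate vector x as (the standard textbook formulation, shown here only for reference):

h(t \mid x) = h_0(t) \, \exp(\beta^\top x)

where h_0(t) is the baseline hazard shared by the whole population and \exp(\beta^\top x) is the time-independent hazard ratio mentioned above.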

Among the various survival analysis methods, one branch, including the gradient boosting machine (GBM) [9] and the deep survival network (DeepSurv) [10], applies machine learning techniques to better represent the complex nonlinear relationship between the logarithmic hazard ratio and static covariates. Another branch instead aims at predicting the distribution of the First Hitting Time (FHT) rather than modeling the individual hazard function. Models such as DeepHit [11] and DRSA [12] apply deep neural networks and recurrent neural networks, respectively, to capture the latent relationship between static covariates and the FHT probability distribution. However, fitting deep learning models often requires a large number of training samples, careful hyper-parameter tuning, and iterative training, which can be very time-consuming. Moreover, a complex neural network is a black box with very poor interpretability, making it hard to identify the important churn-related prognostic factors, which is often required in churn prediction studies.

Despite the proliferation of DL-based survival methods, there remains a lack of comprehensive systematic reviews that consolidate these approaches [4].

Classification Methods for Churn Predictions

Classification methods are a cornerstone of the machine learning toolbox for predicting customer churn. These methods work by categorizing customers into discrete classes: churners and non-churners. The primary approach involves training a model on historical data, where the features may include customer demographics, usage patterns, service satisfaction levels, and other relevant metrics. Popular algorithms include Logistic Regression, Decision Trees, Support Vector Machines, Random Forests [13], and Naive Bayes [5] classifiers. Each model predicts whether or not a customer will churn, typically as a binary outcome. The models are generally evaluated using the confusion matrix, the ROC curve and ROC-AUC score, precision, recall, and the F1 score.
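As a concrete illustration of this workflow, the minimal sketch below trains one such classifier with scikit-learn and computes the metrics just listed. The feature matrix X and the binary churn labels y are assumed to be already prepared; the estimator and its parameters are purely illustrative.

# Minimal sketch: binary churn classification and its standard evaluation metrics.
# Assumes X (numeric feature matrix) and y (1 = churner, 0 = non-churner) already exist.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, roc_auc_score,
                             precision_score, recall_score, f1_score)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)              # hard 0/1 predictions
y_prob = clf.predict_proba(X_test)[:, 1]  # churn probabilities, used for ROC-AUC

print(confusion_matrix(y_test, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_test, y_prob))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))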

Survival Analysis in Churn Prediction

Survival analysis differs from traditional classification by not only predicting whether an event will happen (i.e., churn) but also when it might occur. This information makes it possible to identify the most probable time of churn and to take action to prevent it. For the same reason, it is important to understand which factors affect customer churn and to what extent they influence it. Though the phrase “survival analysis” evokes a medical study, its applications extend far beyond medicine. In fact, this approach is particularly suited to churn prediction as it can handle right-censored data, i.e., cases where the churn event has not yet occurred or is not observed within the study period [6]. In the context of churn prediction, the event time (T) and censoring time (C) are two critical concepts:

  • Event Time T: This is the actual time at which the churn occurs. For a customer, this is the moment they discontinue their services or switch to a competitor. It marks the point of interest in our study — when the customer churns.
  • Censoring Time C: Censoring occurs when the observation of the event is incomplete. This might happen if the customer has not yet churned by the end of the observation period, if they renew or upgrade their contract (which might reset the churn prediction clock), or if they are lost to follow-up (e.g., they opt out of data collection). Censored data are those for which we do not observe the event happening during the study period, but it doesn’t necessarily mean they will not churn after the study ends.
Figure 2. A representation of event time and censoring time.

Understanding the distinction between these two times is crucial. Event time gives us the exact duration after which churn happens, providing a straightforward, valuable outcome for analysis. In contrast, censoring time indicates the limitations of our data collection — telling us when our observation window for a customer closes without witnessing churn.
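In standard survival-analysis notation (a textbook convention, independent of the specific models used later), each customer contributes an observed pair

Y = \min(T, C), \qquad \delta = \mathbf{1}\{T \le C\}

where Y is the observed follow-up time and \delta is the event indicator: \delta = 1 if churn was observed and \delta = 0 if the observation was censored. This (duration, event) pair is exactly the encoding fed to the survival models in the case study below.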

In survival analysis, the choice of the loss function is crucial as it directly influences how well the model can estimate the time until an event. Commonly used loss functions include the Cox partial likelihood for Cox proportional hazards models, which is designed to handle censoring effectively. Other approaches might use adaptations of traditional loss functions to accommodate the specific structure of survival data, such as log-likelihood functions for parametric survival models [6].
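For reference, the Cox partial likelihood mentioned above has the standard form [6][8]

L(\beta) = \prod_{i:\, \delta_i = 1} \frac{\exp(\beta^\top x_i)}{\sum_{j \in R(t_i)} \exp(\beta^\top x_j)}

where the product runs over the customers whose churn is actually observed and R(t_i) is the risk set, i.e., the customers still under observation just before time t_i. Censored customers never appear in a numerator but do enter the denominators through the risk sets, which is how censoring is handled; fitting the model amounts to minimizing the negative logarithm of this expression.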

The accuracy of survival models is often assessed using metrics like the concordance index (C-index), which measures the model’s ability to correctly rank the survival times of pairs of subjects. It is defined as the proportion of concordant pairs divided by the total number of comparable pairs. Another metric is the Brier score, which measures how well the model’s predicted probabilities match the actual outcomes over time, taking censoring into account. It is computed as the mean squared difference between the predicted probability (a value between zero and one) and the actual outcome (which can only take the values 0 or 1).
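As a minimal illustration, the C-index can be computed with the lifelines library; the arrays below are small made-up examples, and the predicted scores follow the library’s convention that a higher score means longer expected survival.

from lifelines.utils import concordance_index

durations = [5, 12, 30, 45, 60]   # observed follow-up times (e.g. months)
predicted = [4, 10, 35, 40, 70]   # model scores: higher = expected to stay longer (illustrative)
events    = [1, 1, 1, 0, 0]       # 1 = churn observed, 0 = censored

# Proportion of comparable pairs that the model ranks correctly.
print(concordance_index(durations, predicted, events))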

Several models are prevalent in survival analysis:

  • Cox Proportional Hazards Model: a semi-parametric model that assumes the effect of the measured variables on the risk of churn is constant over time and computes hazards based on this assumption.
  • Kaplan-Meier Estimator: a non-parametric statistic used to estimate the survival function from lifetime data, suitable for calculating the probability of survival at given time points.
  • Parametric survival models: models that assume a specific distribution for the event times and can be more efficient than non-parametric methods when the assumptions hold true.
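A minimal sketch of the first two models using the lifelines library is shown below; it assumes a DataFrame df with one row per customer, a 'duration' column (observed follow-up time), an 'event' column (1 = churn observed, 0 = censored), and numeric covariates. Column names are illustrative.

from lifelines import KaplanMeierFitter, CoxPHFitter

# Kaplan-Meier: non-parametric estimate of the survival curve for the whole customer base.
kmf = KaplanMeierFitter()
kmf.fit(durations=df["duration"], event_observed=df["event"])
print(kmf.survival_function_)   # probability of still being a customer at each time point

# Cox proportional hazards: semi-parametric model relating covariates to the churn hazard.
cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
cph.print_summary()             # hazard ratios and significance for each covariate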

Deep Learning Models for Churn Prediction

Deep learning models represent a significant evolution in the field of survival analysis, particularly in their capacity to handle complex, non-linear relationships within large datasets, which traditional statistical models might struggle with. These models leverage the ability of neural networks to uncover intricate patterns and interactions that are not readily apparent or easily modeled through classical methods such as logistic regression or Cox proportional hazards models. DL-based survival methods can be classified in terms of three concepts related to model estimation. First, the model class describes which type of statistical survival technique forms the basis of the DL method, such as the Cox proportional hazards model or a parametric model. Second, the loss function is often a direct consequence of the model class (i.e., its negative log-likelihood). However, as is common in DL, some methods employ multiple losses for improved performance or multi-task learning. For instance, some DL-based survival methods compute a ranking loss, in addition to a standard survival loss, to improve the C-index performance measure. The final loss is usually computed as the (weighted) average of all losses applied. Third, the parametrization determines which model component is parametrized by a neural network; the standard parametrization is usually implied by the model class [4].
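To make the parametrization idea concrete, the sketch below shows a DeepSurv-style setup in PyTorch: a small neural network replaces the linear term of the Cox model and is trained with the negative Cox partial log-likelihood as the loss. This is a simplified illustration (no handling of tied event times, no mini-batching or regularization), not the reference implementation of [10].

import torch
import torch.nn as nn

# The network parametrizes the log-hazard ratio g(x) in h(t | x) = h0(t) * exp(g(x)).
class DeepSurvNet(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one log-risk score per customer

def neg_cox_partial_log_likelihood(log_risk, durations, events):
    # Simplified negative Cox partial log-likelihood (assumes no tied event times).
    order = torch.argsort(durations, descending=True)     # so each risk set is a cumulative prefix
    log_risk, events = log_risk[order], events[order]
    log_risk_sets = torch.logcumsumexp(log_risk, dim=0)   # log of the sum over each risk set
    return -torch.sum((log_risk - log_risk_sets) * events) / events.sum()

# Illustrative training loop on already-prepared tensors X, durations, events:
# model = DeepSurvNet(X.shape[1])
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for _ in range(200):
#     optimizer.zero_grad()
#     loss = neg_cox_partial_log_likelihood(model(X), durations, events)
#     loss.backward()
#     optimizer.step()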

Models comparison

In churn prediction, the choice of modeling approach is critical and should be tailored to the specifics of the dataset, the assumptions about data, and the desired outcome of the analysis. Here, we compare binary classification models, survival models, and deep learning survival models, focusing on theoretical considerations and underlying assumptions. It is important to note that a direct comparison of the performance of binary classification and survival analysis models is not possible because of the different types of output between the two model classes, and the consequent lack of a universal metric valid for both.

Figure 3. Summary table about the theoretical comparison between model classes.

Binary classification models are based on the assumption that outcomes can be distinctly classified into categories (churners vs. non-churners). They are highly effective for datasets where the primary goal is to identify the presence or absence of an event without a focus on the timing of the event. These models are generally simpler to understand and implement but may not capture complex behaviors or temporal dynamics present in the customer life-cycle. Unlike binary classification, survival models account for the time dimension, handling right-censored data where the event (churn) has not occurred by the end of the observation period. These models are preferable for datasets where the timing of the event is crucial, as they provide insights not just on if, but when an event is likely to occur. This makes them particularly useful in settings where customer follow-up is possible over extended periods. Different models within this class make strong assumptions about the distribution of the data or about the features’ independence. Deep learning survival models do not require specific distributional assumptions about the data and are capable of capturing complex, nonlinear relationships within large datasets [4].

Summary Theoretical Analysis

Binary classification models are particularly effective when the primary interest is in predicting whether an event (churn) will occur, without the need to understand when it will happen. They are best used when the dataset is large enough to avoid overfitting but not so large or complex that simpler models fail to capture important patterns. Survival models are superior when the timing of churn is as important as determining whether churn will occur. They are particularly useful for datasets where censoring is a significant factor. These models excel in longitudinal analysis where data is collected at multiple time points.

Case study: Telco Customer Churn

In this case study, we apply several predictive models to the Telco Customer Churn dataset to assess and compare their effectiveness in predicting customer churn. The models tested include a traditional binary classification model (Random Forest), two survival analysis models (Cox Proportional Hazards Model and LightGBM Survival), and an advanced deep learning model (DeepHit). The primary goal is to evaluate how well each model predicts churn, considering both the occurrence and timing of churn events.

Dataset and Pre-processing

The Telco Customer Churn dataset contains customer data from a telco company, including demographic details, account information, and churn status. We pre-process this data by encoding categorical variables, standardizing numerical inputs, and appropriately handling missing values. The target variable is ‘Churn’, which we use directly in the Random Forest model and convert into duration and event indicators for the survival models.
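As an illustration, a minimal version of this pre-processing could look as follows, assuming the widely used Kaggle/IBM release of the dataset with columns such as 'tenure', 'MonthlyCharges', 'TotalCharges', and 'Churn' (file and column names may differ slightly depending on the version).

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")   # Kaggle file name (assumed)

# 'TotalCharges' contains blanks for brand-new customers: coerce to numeric and fill with 0.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0.0)

# Binary target for the classifier, and (duration, event) pair for the survival models.
df["event"] = (df["Churn"] == "Yes").astype(int)
df["duration"] = df["tenure"]   # tenure in months as the observed follow-up time

# Standardize numeric inputs and one-hot encode the categorical variables.
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
X = pd.get_dummies(df.drop(columns=["customerID", "Churn", "event", "duration"]),
                   drop_first=True)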

Model Training and Evaluation

Each model is trained on the pre-processed dataset:

  • Random Forest: A binary classifier known for its robustness and ease of use.
  • Cox Proportional Hazards Model: A staple in survival analysis, used here to incorporate the timing aspect of churn.
  • LightGBM Survival: An implementation of gradient boosting for survival analysis, chosen for its efficiency and performance.
  • DeepHit: A deep learning approach specifically designed for survival data, capable of handling complex patterns.

The models are evaluated based on ROC-AUC for the Random Forest and the concordance index (C-index) for the survival models. These metrics help in understanding the predictive accuracy and the capability of the models to rank the churn times correctly.

Results and Visualizations

Our analysis focused on evaluating four different predictive models: Random Forest, Cox Proportional Hazards (Cox PH), Gradient Boosting Machine (LightGBM), and DeepHit. The first visualization presents a comparative overview of each model’s performance based on selected metrics.

Figure 4. Performance metric values for the Random Forest classifier (ROC-AUC) and the survival models (C-index)

From the bar chart, we observe that all models exhibit strong performance, with metrics nearing or surpassing the 0.8 mark, indicating robust predictive capabilities across the board. While Random Forest shows slightly lower performance, Cox PH, LightGBM, and DeepHit demonstrate nearly equivalent high performance. This suggests that while traditional machine learning models like Random Forest are effective, advanced models, especially those tailored for survival analysis, might provide more nuanced insights due to their ability to handle time-to-event data. It is important to note that the C-index used to evaluate the survival models is a rank-based metric, while the Random Forest is evaluated through the ROC-AUC score, which assesses the correctness of the predictions themselves.

To further understand the driving factors behind churn predictions, we analysed the feature importance derived from each model. The following visualizations detail which features contribute most significantly to predicting customer churn across the different modeling approaches.

Figure 5. Feature importance for the Random Forest and the survival models. DeepHit requires model-agnostic methods, such as permutation importance, to compute feature importance.

In the Random Forest model, the features related to payment methods, particularly electronic checks, are highly influential. This suggests that the choice of payment method significantly impacts customer churn decisions. Additionally, contract type (e.g., two-year contracts) and tech support availability also play crucial roles, indicating that longer-term commitments and customer support quality are key to retaining customers. Similarly, the Gradient Boosting model highlights payment methods and contract types as critical. However, it also shows that internet service type, specifically fiber optic, is a decisive factor, possibly due to its implications on service quality and customer satisfaction. The Cox model reveals a broader range of influential features, with tenure and total charges leading. This emphasizes the impact of customer loyalty and cumulative spending on churn risk, aligning with the notion that more engaged and higher-spending customers are less likely to churn. The DeepHit model has no built-in way to compute feature importance; analyzing it requires model-agnostic techniques such as SHAP or permutation importance to capture the influence of each feature, as sketched below.
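As an illustration of the model-agnostic route, the sketch below computes a simple permutation importance for any fitted survival model: each feature column is shuffled in turn and the resulting drop in C-index is recorded as that feature’s importance. The predict_score function is a hypothetical placeholder for whatever scoring the fitted model exposes (for DeepHit, for example, the negative predicted risk), and the C-index comes from lifelines.

import numpy as np
from lifelines.utils import concordance_index

def permutation_importance_cindex(predict_score, X, durations, events, n_repeats=5, seed=0):
    # predict_score: placeholder for a function mapping a DataFrame X to scores where
    # higher means longer expected survival. Returns the mean C-index drop per feature.
    rng = np.random.default_rng(seed)
    baseline = concordance_index(durations, predict_score(X), events)
    importances = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X_perm[col].values)
            drops.append(baseline - concordance_index(durations, predict_score(X_perm), events))
        importances[col] = float(np.mean(drops))
    return importances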

Conclusion

In this article we have explored various machine learning approaches to predict customer churn, illustrating the possible application of each model through both theoretical discussion and practical application. By comparing traditional binary classification models, survival models, and advanced deep learning models, we have uncovered insights that are critical to choosing the most effective approach for predicting customer churn.

The case study using the Telco Customer Churn dataset demonstrated that while binary classification models like Random Forest are straightforward and efficient for predicting whether a customer will churn, they lack the ability to provide insights into when the churn might occur. This limitation is significant in strategic business planning, where timing can be as crucial as the prediction itself. On the other hand, survival models such as the Cox Proportional Hazards Model and LightGBM Survival offer the advantage of handling censored data and providing estimates of the timing of churn. These models are invaluable for businesses that aim to intervene proactively to retain customers at risk of churn. They allow companies to schedule their marketing and retention strategies more effectively, potentially extending the lifetime value of their customers. Deep learning models, like DeepHit, are a valid option for inferring churners, but they lack interpretability.

In conclusion, the choice of model has to be evaluated against many factors, such as the input dataset, the business aim, and the technology available. A promising yet underexplored approach to enhancing churn prediction could involve creating an ensemble that combines binary classification models with survival models. By integrating these methodologies, it might be possible to achieve a more comprehensive and accurate prediction system that not only identifies potential churn but also predicts the timing of such events, offering a powerful tool for strategic customer retention planning.

References

  • [1] S. De, P. Prabu, J. Paulose. Effective ML techniques to predict customer churn. 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), IEEE, 895–902, 2021.
  • [2] A. K. Ahmad, A. Jafar, K. Aljoumaa. Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data 6, 28 (2019). https://doi.org/10.1186/s40537-019-0191-6
  • [3] https://survival-org.github.io/DL4Survival/
  • [4] S. Wiegrebe, P. Kopper, R. Sonabend. Deep learning for survival analysis: a review.
  • [5] Z. Liang. Predict Customer Churn based on Machine Learning Algorithms. 2023.
  • [6] G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, New York, 2013.
  • [7] S. X. Yang, B. Fu, P. Liu. HitBoost: Survival Analysis via a Multi-output Gradient Boosting Decision Tree Method. 2019.
  • [8] D. R. Cox. Regression Models and Life-Tables.
  • [9] J. H. Friedman. Greedy function approximation: A gradient boosting machine. 2001.
  • [10] J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, Y. Kluger. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network.
  • [11] C. Lee, W. R. Zame, J. Yoon, M. van der Schaar. DeepHit: A deep learning approach to survival analysis with competing risks.
  • [12] K. Ren, J. Qin, L. Zheng, Z. Yang, W. Zhang, L. Qiu, Y. Yu. Deep Recurrent Survival Analysis.
  • [13] H. Zhao, X. Zuo, Y. Xie. Customer Churn Prediction by Classification Models in Machine Learning.
