Customer retention (Part 2 of 2): Framework architecture

Yasmin Bokobza
Data Science at Microsoft
13 min read · Feb 13, 2024

By Yasmin Bokobza, Sharath Kumar Rangappa, Swarnim Narayan, and Kiran R

As described in Part 1 of this two-part article series, Customer Retention as a Service (CRaaS) is a generic Machine Learning–based framework implemented in Python. It produces completely automated churn predictions, determines the causes of churning behavior, and generates churn explanatory text that marketing teams can leverage to reduce customer churn and improve retention rate. In addition, this service can be tuned easily, reduces delivery time, enables quality monitoring of large volumes of data, amortizes costs, reduces maintenance efforts, and optimizes resources.

CRaaS provides multiple alternatives within a Machine Learning churn analysis pipeline, and end users and data scientists can configure these different parameters through a JSON file. Configurable parameters include the list of desired churn prediction windows, the upper and lower thresholds that divide accounts into churn risk buckets, methods for transforming categorical variables into continuous ones, feature selection methods, ML algorithms, and techniques for balancing the volumetrics of the two target classes (churn and active), among others. Automatic customer churn prediction is widely reviewed in the literature, and many methods have been adjusted for different business use cases. Our approach is driven both by the nature of the business retention tasks we have encountered at Microsoft and by the challenges involved in producing churn predictions at scale.
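To illustrate this configuration-driven design, the sketch below loads a hypothetical CRaaS-style JSON file. The parameter names here are our own invention for illustration; the real CRaaS schema may differ.

```python
import json

# A hypothetical CRaaS-style configuration. All keys are illustrative,
# not the actual CRaaS JSON schema.
config_text = """
{
  "prediction_windows_months": [3, 6],
  "churn_bucket_thresholds": {"low": 0.3, "high": 0.7},
  "categorical_encoding": "target_encoding",
  "feature_selection": {"enabled": true, "correlation_threshold": 0.9},
  "algorithm": "catboost",
  "class_balancing": "smote"
}
"""

config = json.loads(config_text)
print(config["prediction_windows_months"])  # → [3, 6]
```

Keeping every pipeline choice in one JSON document like this is what lets the same code base serve different churn models without code changes.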

The high-level architecture of the CRaaS service can be divided into four stages, as depicted in Figure 1 (from the bottom up: Feature Engineering, ML Data Preprocessing, ML Model Building, and Model Output Postprocessing). The CRaaS ML service can be deployed as a REST API using an Azure Machine Learning (AML) pipeline.

Figure 1: High-level architecture of the CRaaS as a REST API.

CRaaS Feature Engineering stage

The first stage extracts generic features curated from various sources and filters the data by time and entity for the different churn prediction models. Building a separate, central feature engineering stage prevents redundancy, allows flexibility in adding databases and features, and reduces soft outages and the operational cost of managing individual feature engineering pipelines for different models. It also enables ML scientists to create the desired aggregations and transformations on the generic features in a separate preprocessing step.

Figure 2: Components of the CRaaS feature engineering pipeline.

In addition, providing the start date and the desired number of historical months enables the pipeline to query the Feature Store to dynamically generate features with the necessary aggregations. The pipeline can be scheduled daily, weekly, or monthly based on specific requirements. The output of the pipeline is written to ADLS Gen2 storage for further ML processing. Figure 2 illustrates the data flow and components of the pipeline. Azure Data Factory (ADF) is used for orchestration, while Spark is employed to process large volumes of data.

CRaaS ML Data Preprocessing stage

Next, the ML Data Preprocessing stage is conducted. In this stage, we extract features from the raw data, ensure the quality and relevance of the input data, and apply the labeling mechanism that defines which customers in our database are still active and which are not. While there are several widely used methods for some of these preprocessing steps, we use methods that are relatively fast, simple, and efficient so that we can deal with the challenges involved in producing churn predictions at scale. The preprocessing steps can be divided into seven main actions:

1. Dataset splits: The data is split into distinct train, validation, and test sets. The division is aligned with the desired prediction windows, employing the rolling-window validation approach. This involves selecting one cutoff date for model tuning and another for evaluation, enabling a comprehensive assessment of the model's performance over time, where the time range for evaluating performance is defined by the size of the prediction window. Figure 3 depicts the rolling-window validation approach. Based on the cutoff date, a training window of the desired size is defined, and the churn predictions are evaluated against the actual status of customers in the prediction window. The number of cutoff dates and the lag between them can be configured.

Figure 3: Rolling-window validation approach.

2. Feature extraction: Features are extracted for the train, validation, and test sets by applying focused aggregations and transformations, driven by the selected cutoff dates, to the general features generated during the initial feature engineering stage. Missing values are also handled during this step.

3. Feature encoding: This step uses encoding methods such as One-Hot Encoding and Target Encoding to transform our categorical features into numerical format, making them compatible with the Machine Learning algorithm. We have found these encoding techniques instrumental in handling our diverse dataset effectively.

4. Outlier removal: Outliers introduce bias into the model parameter estimates, which means they can strongly influence predictions. In CRaaS we implemented Tukey's test for the identification and removal of outliers in our dataset. Tukey's test is a statistical method that defines an outlier as any observation outside the range [Q₁ − k * IQR, Q₃ + k * IQR], where Q₁ and Q₃ are the lower and upper quartiles, respectively, and IQR = Q₃ − Q₁. k is a constant; the greater its value, the more extreme an observation must be before it is removed. While there are several widely used outlier detection methods, we have found Tukey's test to be the fastest, simplest, and most efficient for the business retention tasks we have encountered.

5. Data labeling: The data labeling mechanism defines a customer’s status based on their usage behavior. The definition is derived from the practical use cases for businesses that we have encountered and the experiments we have conducted. The flexibility of this mechanism also allows incorporation of churn rules tailored to the specific business use case, configured via a JSON file.

6. Dimensionality reduction: Users can decrease dimensionality by activating feature selection in the configuration JSON file. One example feature selection method computes correlations between feature pairs; any pair whose correlation surpasses a predefined threshold is then evaluated against each feature's correlation to the target variable. In each such pair, the feature with the higher correlation to the target is retained, ensuring the selection of a feature set with the highest impact on the churn prediction.

7. Database balancing: Because datasets in churn prediction scenarios are typically imbalanced, the preprocessing phase incorporates a database balancing step. Data scientists can use data sampling methods such as SMOTE and ADASYN to rectify the class distribution imbalance. This transformation aims to improve model performance by ensuring a more balanced representation of classes, a critical aspect of churn prediction given the inherent class imbalance.
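To make the outlier removal of step 4 concrete, here is a minimal, dependency-free sketch of Tukey's fences. The quantile interpolation and helper names are our own choices for illustration, not the CRaaS implementation.

```python
def tukey_bounds(values, k=1.5):
    """Return the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only observations inside the Tukey fences."""
    lower, upper = tukey_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 is an obvious outlier
print(remove_outliers(data))  # → [10, 12, 11, 13, 12, 11]
```

A larger k widens the fences, so only more extreme observations are dropped, matching the behavior described above.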

CRaaS ML Model Building stage

After the feature engineering and ML data preprocessing stages, ML model building is conducted. In this stage the Optuna hyperparameter optimization framework is leveraged to fine-tune key parameters of the ML algorithm selected in the configuration JSON file. To streamline model building while reducing redundancy, we provide an easily extensible model hub that includes different Machine Learning models. The selection of algorithms is driven by the business tasks and the characteristics (including the size) of the data we have encountered, as well as each model's popularity.

In the quest for performance enhancement, we focus on fine-tuning a set of parameters that have the greatest effect on optimizing the model’s evaluation metrics. Furthermore, we incorporate cross-validation into our model building stage to ensure robustness and mitigate overfitting concerns. Cross-validation entails partitioning our dataset into subsets, using a portion for training and the remainder for validation. This iterative process provides a comprehensive assessment of the model’s performance across various folds, which ultimately contributes to a more reliable estimate of its effectiveness.
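CRaaS delegates the parameter search itself to Optuna; as a dependency-free illustration of just the cross-validation loop described above, the sketch below partitions sample indices into folds and averages a user-supplied metric. All function and variable names are hypothetical.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle range(n) and split it into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_score(train_and_eval, n_samples, k=5):
    """Average the metric returned by train_and_eval(train_idx, val_idx)
    over k folds, each fold serving once as the validation set."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for val_idx in folds:
        train_idx = [j for fold in folds if fold is not val_idx for j in fold]
        scores.append(train_and_eval(train_idx, val_idx))
    return sum(scores) / k

# Toy metric: just checks that every split covers all samples. A real
# objective would fit the configured model on train_idx and score on val_idx.
score = cross_val_score(lambda tr, va: len(tr) + len(va), n_samples=100, k=5)
print(score)  # → 100.0
```

In a hyperparameter search, an optimizer such as Optuna would call a loop like this once per candidate parameter set and keep the parameters with the best averaged score.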

To enable users to evaluate the model’s performance, we provide validation metrics that are typically tailored to the unique requirements of the business or the data scientist’s comprehension of the Machine Learning problem. Currently, we provide support for two types of metrics reporting for model evaluation:

1. Classification Report: Leveraging Scikit-learn, we furnish users with a report encompassing various metrics. Where applicable, this report provides both macro and weighted averages, delivering insights into precision, recall, F1 score, and overall accuracy on a per-class basis.

2. AUC: We include the ROC-AUC as part of our reporting, providing an additional performance indicator.
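As a minimal illustration of the second metric, ROC-AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative. This sketch is for intuition; a production pipeline would typically use Scikit-learn's `roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC via pairwise comparisons (Mann-Whitney U): the fraction of
    positive/negative pairs where the positive gets the higher score,
    counting ties as half a win. O(n^2), fine for a demonstration."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a model that ranks every churner above every non-churner.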

Model Output Postprocessing stage

As part of our comprehensive approach to customer churn prediction and effective communication with the marketing team, we categorize customers into low, medium, and high churn risk buckets based on their churn probability, using predefined thresholds specified in a JSON file. Additionally, we generate explanatory churn text using SHapley Additive exPlanations (SHAP) [1]. This process serves as a bridge between the predictive power of the ML models and insights for the marketing team. By segmenting customers into these churn risk buckets, we allow the marketing team to deploy targeted campaigns that resonate with specific customer profiles, ultimately leading to improved retention efforts.
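A minimal sketch of that bucketing logic, assuming the sample thresholds of 0.3 and 0.7 used in Figure 4; the boundary handling (strict `<`) is our assumption, not the documented CRaaS rule.

```python
def churn_bucket(probability, low=0.3, high=0.7):
    """Map a churn probability to a risk bucket using JSON-configured
    low/high thresholds. Boundary handling is an illustrative choice."""
    if probability < low:
        return "low"
    if probability < high:
        return "medium"
    return "high"

print([churn_bucket(p) for p in (0.12, 0.55, 0.91)])  # → ['low', 'medium', 'high']
```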

The integration of SHAP adds a layer of interpretability to our churn prediction models. SHAP is a powerful technique that provides insight into the contribution of individual features to the model's predictions. This means that we can not only predict which customers are likely to churn but also understand the reasons behind those predictions. SHAP values quantify the impact of each feature on a customer's churn prediction, whether it's their historical consumption behavior, service utilization, or other relevant attributes.

By leveraging the customer segmentation and SHAP, we generate explanatory text that outlines the key factors influencing a customer’s likelihood to churn. This text provides a clear and concise explanation that the marketing team can use to craft personalized messages and offers for each customer. For example, when a customer is identified as having a high probability of churning, the explanatory text emphasizes the churn indicators and their trends, such as increases or decreases.
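To illustrate how SHAP values can be turned into explanatory text, the sketch below ranks precomputed per-feature SHAP values by magnitude and renders a short sentence. The feature names, sign convention (positive pushes toward churn), and wording template are illustrative, not the exact CRaaS output.

```python
def churn_explanation(shap_values, top_n=3):
    """Build a short explanation from per-feature SHAP values (a dict of
    feature name -> SHAP value for one customer's prediction)."""
    drivers = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]
    parts = []
    for feature, value in drivers:
        direction = "increases" if value > 0 else "decreases"
        parts.append(f"{feature} {direction} churn risk")
    return "Top churn drivers: " + "; ".join(parts) + "."

# Hypothetical SHAP values for one high-risk customer.
shap_values = {
    "storage_utilization_pct_change": 0.42,
    "revenue_pct_change": 0.31,
    "avg_monthly_revenue": -0.18,
    "consumption_duration_months": -0.05,
}
print(churn_explanation(shap_values))
```

A template like this is what lets the marketing team receive a readable reason alongside each churn probability rather than a raw feature attribution vector.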

By utilizing the explanatory text, the marketing team gains deeper insights into the driving forces behind customer churn. This empowers them to tailor their messaging and strategies to address specific pain points and increase customer retention. Ultimately, this approach enhances the team’s ability to proactively connect with customers, mitigate churn, and foster a stronger and lasting relationship with customers.

The CRaaS outputs are the churn predictions of the most accurate model, presented in a fixed schema, as shown in the sample output in Figure 4, which assumes that the low and high churn bucket thresholds are 0.3 and 0.7, respectively. The schema provides the customer ID, the prediction window and granularity, the churn probability, the churn bucket, and the prediction explanation.

Figure 4: Sample output of the CRaaS.

CRaaS ML service as a REST API

The pipeline’s trigger for training and inference can also be wrapped as a REST endpoint to enable client-stack agnostic interfacing. For this API, Azure Data Lake Storage Gen2 (ADLS) serves as the data plane. This implies that all data inputs and outputs must lie on ADLS, as we shall see shortly.

The computes for the service are part of an Azure Machine Learning environment. We use the Azure Machine Learning pipeline abstraction to generate the endpoint, which also acts as the control plane. This enables authentication to be handled by Azure Machine Learning and therefore aligns our service with the standard practices of the Azure stack. The computes themselves run custom, performance-tested CRaaS Docker images that bundle the dependencies and the actual CRaaS training and inference code. The Docker images have been tested with a wide variety of CPU machine flavors (i.e., varying numbers of cores and memory amounts).

As part of performance testing the CRaaS image on various computes, we ensure that the code can latch on to multiple cores and use memory efficiently. The inputs to the service are the ADLS paths to the parameter JSON and the actual input data. The service outputs a prediction file, archives the model to ADLS, and links it to the job. In this flow, a model archive enables traceability and reproducibility of runs.

We summarize the flow as follows: Users first authenticate with the Azure ML workspace in which the service is deployed. The input file, along with the parameter JSON, is placed on ADLS. The service is then triggered with the paths to these. Job progress can be tracked through the Azure ML Pipeline SDK. The service writes the prediction file and model artifacts back to ADLS. Figure 5 illustrates the batch endpoints powered by Azure ML, where Azure Data Lake Storage (ADLS) serves as the data plane and the execution code is containerized.

Figure 5: Batch endpoints powered by Azure ML endpoints. ADLS is the data plane and execution code is containerized.

Results and validation

To assess the performance of CRaaS, we compare it to a churn model that utilizes the XGBoost algorithm [2]. To ensure a fair comparison of ML model performance between the churn XGBoost model and CRaaS, we utilized identical feature snapshots for the same group of customers during a defined period. The prediction window was chosen based on the requirements of the main use cases we deal with. Using the input features, we built two ML models. CRaaS was trained using the various available options, with hyperparameter tuning conducted using Optuna, employing five-fold cross-validation and a search space that depends on the algorithm. According to the results shown in Table 1, CRaaS outperformed the existing churn model, with the CatBoost algorithm yielding the best results.

Table 1: Performance of two models.

To evaluate the effectiveness of the CRaaS ML service, we use the cumulative gain curve, which shows how well CRaaS performs in identifying potential churn customers compared to the churn XGBoost model and a random selection model, as shown in Figure 6.

The results indicate that CRaaS outperforms the other models. For example, by selecting the top 20 percent of customers with the highest churn probabilities as determined by CRaaS, we capture approximately 86 percent of all customers who are likely to churn. This represents approximately 1.8 times and approximately 4.3 times the count of actual churning customers identified by the churn XGBoost model and the random model, respectively. That is, by using CRaaS and focusing on the top 20 percent of customers, we effectively capture a large portion of potential churn customers, which demonstrates the practical contribution of CRaaS to customer retention efforts.
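The cumulative gain statistic above can be reproduced on toy data with a short helper that ranks customers by predicted churn probability and measures the share of actual churners captured in a top fraction. The data and names here are illustrative.

```python
def cumulative_gain(y_true, y_score, fraction):
    """Share of all actual churners (y=1) found in the top `fraction` of
    customers when ranked by predicted churn probability, descending."""
    ranked = [y for _, y in sorted(zip(y_score, y_true), key=lambda t: -t[0])]
    cutoff = int(round(fraction * len(ranked)))
    return sum(ranked[:cutoff]) / sum(y_true)

# Toy example: 10 customers, 4 churners; a good model ranks churners high.
y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.65, 0.5, 0.4, 0.75, 0.3, 0.2, 0.1]
print(cumulative_gain(y_true, y_score, 0.4))  # → 0.75
```

Plotting this value over fractions from 0 to 1 produces the gain curve of Figure 6; a random model traces the diagonal (top 20 percent captures about 20 percent of churners).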

Figure 6: Comparison between the performance of CRaaS, the existing churn model, and a random selection model using the cumulative gain curve.

To gain a better understanding of the features influencing customer churn prediction, we employed a SHAP multiple-bar plot that presents a global summary of feature importance separately for both churned and active customer cohorts. Figure 7 depicts the mean absolute SHAP value for each feature column as a bar chart. Looking at the top three features with the most substantial effect on predicting each cohort, a clear distinction emerges.

For churned customers, the percentage change in the utilization of storage services, the percentage change in revenue, and the percentage change in overall consumption exhibit the highest impact on predictions, and their feature importance values differ significantly from those of active customers. For active customers, a distinct set of features (average revenue, consumption duration in months, and consumption slope) has the highest influence on the prediction, with feature importance values that also differ markedly from the churned customers. This underscores the notable divergence in feature importance between the two cohorts.

To assess the impact of the top features identified by SHAP values, we compared the mean values of these features for the two cohorts: churned and active customers. Our analysis revealed insights into the influence of these top features on customer churn. For example, focusing on the top feature for each cohort, we found a significant difference in the percentage change in storage services utilization per month. When examining customer usage during the observed period, we calculate the percentage change by comparing usage at the beginning of the period (e.g., over the initial 30 percent of the time) with usage at the end of the period.
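As a sketch of that percentage-change computation, the helper below compares average usage over the first 30 percent of months with the last 30 percent. The exact windowing used by CRaaS is not documented here, so this split is an illustrative assumption.

```python
def utilization_pct_change(monthly_usage, fraction=0.3):
    """Percentage change between average usage in the first `fraction`
    of months and the last `fraction` of months, relative to the start.
    The window definition is illustrative, not the CRaaS formula."""
    n = max(1, int(round(fraction * len(monthly_usage))))
    start = sum(monthly_usage[:n]) / n
    end = sum(monthly_usage[-n:]) / n
    return 100.0 * (end - start) / start

usage = [100, 100, 100, 80, 70, 60, 50, 50, 50, 50]  # monthly consumption units
print(utilization_pct_change(usage))  # → -50.0
```

A strongly negative value, as here, signals the kind of usage decline that the SHAP analysis associates with churned customers.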

Figure 7: SHAP values multiple-bar plot that presents a global summary of feature importance separately for both churned and active customer cohorts.

A lower value indicates a steeper decline in usage by the end of the period. Churned customers showed a reduced monthly percentage change in storage services utilization, measuring 137.26 consumption units, whereas active customers demonstrated a notably higher value of 2822.94 consumption units. Similarly, churned customers had a lower mean monthly revenue of $104.90, compared to active customers' higher mean monthly revenue of $8409.08. The comparison of mean values for the rest of the top features identified by SHAP values is shown in Table 2. Examining the mean values of these features underscores their importance in understanding customer churn prediction.

Table 2: Mean values for the top features identified by SHAP values.

Conclusion

In this article we discussed details of the CRaaS methodology, including how Microsoft Azure can be used in the feature engineering stage and the framework wrapped as a REST endpoint. In addition, we shared an evaluation result by leveraging real customer data. We hope this article, and the series it is part of, helps you with your own business problems. Please leave a comment to share your customer retention scenarios and the techniques you are using today.

We’d like to thank Casey Doyle for helping review the work.

References

1. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

2. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).

See the first article in this two-part series:
