End-to-end churn prediction on Google Cloud Platform
Churn prediction
Churn prediction is a classic Customer Relationship Management problem that consists of detecting the users who are most likely to cancel their subscription to a service.
This is the first of two posts describing an end-to-end approach to the churn prediction problem on Google Cloud Platform (GCP), covering data acquisition, data wrangling, modeling, model deployment and a business use case. It is based on a real project we worked on for six weeks for a US service company.
This first post covers an overview of the proposed solution architecture and tools leveraged for Exploratory Data Analysis and Feature Selection.
The next post will cover distributed model training and deployment on ML Engine.
Exploratory Data Analysis (EDA)
Mandatory in any data science project, Exploratory Data Analysis (EDA) provided important insights about the dataset we were handling before any modeling. Let's go through the tools that made it possible to dive deep into the data.
Dataprep
Google Dataprep was essential for fast data exploration. It can ingest data directly from Google BigQuery or Google Cloud Storage, and automatically generates visualizations of data distributions for each attribute, detects possibly mismatched values and highlights missing data.
With Dataprep, it was possible to quickly detect columns with unknown values (which the client was not aware of) and to perform some data cleansing tasks.
Even though Dataprep allows you to build a Dataflow transformation pipeline through its GUI, we decided to code our own pipeline using TensorFlow Transform because Dataprep doesn't support User Defined Functions (UDFs). Here is a list of Dataprep limitations.
Google Colaboratory
After a first pass with Dataprep, we explored the data further with Google Colab.
Google Colab is an online and currently free Jupyter notebook environment that requires no setup. The goal of using Colab in this project was to have a productive environment to code, share and discuss the EDA with the client during our daily meetings. With the Google BigQuery connector for pandas (and the seaborn data visualization library) we were able to run queries and easily generate visualizations, like simple distribution and histogram plots.
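As an illustration, a typical Colab cell looked like the sketch below; the project, table and column names are hypothetical placeholders, not our client's actual schema:

```python
import pandas as pd
import seaborn as sns
from google.colab import auth

# Authenticate the Colab runtime against GCP.
auth.authenticate_user()

# Query a sample of the raw data directly from BigQuery into a DataFrame.
query = """
    SELECT months_as_customer, monthly_charges, churned
    FROM `my-project.churn_dataset.customers`
    LIMIT 100000
"""
df = pd.read_gbq(query, project_id="my-project", dialect="standard")

# Simple distribution plot of a numeric feature, split by the churn label.
sns.histplot(data=df, x="monthly_charges", hue="churned", bins=50)
```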
Such results made it possible to have productive meetings with the client about data cleansing, imputation and transformation strategies.
BigQuery
BigQuery is a serverless, highly scalable data warehouse. It integrates fully with the GCP platform, including the tools we used for EDA (Google Colab and Dataprep).
For this project, the raw dataset was migrated from an on-premises DBMS to BigQuery.
Feature Selection
In data science projects, there are usually many features (attributes) available, while most of them do not add useful information to predictive models. If kept, those redundant features increase the model's complexity and may cause overfitting on the training set.
Our client had some churn prediction ML models in place, developed by their data scientists. Their models used a subset of features selected with linear correlation and mutual information statistics to remove possibly redundant columns.
Initially, we worked with the same features used by the client’s models, in order to perform apples-to-apples comparisons with their own ML models.
Afterwards, we performed our own feature selection over the 857 features available. To measure feature importance for the churn prediction model, we chose to train a tree ensemble model, Gradient Boosted Decision Trees, implemented in the XGBoost library, which naturally performs feature selection during training.
XGBoost
At the time the project was running, Google ML Engine only supported TensorFlow jobs for model training, so we used Google Compute Engine instances to train XGBoost. Fortunately, Google ML Engine now supports training jobs with the XGBoost and scikit-learn frameworks.
It is worth mentioning, and you'll see later, how much faster, and sometimes cheaper, it is to train models on ML Engine than on dedicated GCE instances.
A major advantage of TensorFlow over XGBoost and scikit-learn is its ability to scale and distribute training in big data scenarios. Unfortunately, XGBoost required compute instances with a lot of memory to train on the project data. But how large? To answer this question we faced lots of unpleasant OutOfMemory errors.
At the end of that trial-and-error process, an n1-highmem-96 instance (96 vCPUs, 624 GB RAM) was required.
In a nutshell, XGBoost is a decision tree based algorithm where trees are iteratively created to correct the errors made by previous trees until no further improvements can be made.
One advantage of tree-based models is their robustness to numerical data at different scales without any preprocessing. However, categorical features still need some preprocessing. In our case, one-hot encoding was used to transform categorical features (encoded as strings) before training the model. After training, we calculated the importance of each categorical feature by summing up the importance of its derived one-hot encoded columns. The chosen feature importance metric was the gain: "the improvement in accuracy brought by a feature to the branches it is on."
Some experiments were performed to evaluate the model behavior with different numbers of trees, tree depths, and numbers of examples and features sampled by each tree. Figure 1 presents the results for the model with the highest precision.
We decided to select the top (most important) 90 features discovered by the XGBoost training, for two reasons. First, beyond 90 features there was no significant improvement in gain; second, we hit a limitation in TensorFlow Transform that prevented us from using more than 90 features (this has probably been fixed by now).
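As an illustration, here is a hedged sketch of this feature-importance step; the file name, column names and hyperparameter values are hypothetical placeholders, not the exact configuration we used:

```python
import pandas as pd
import xgboost as xgb

# Hypothetical input: one row per customer, the raw feature columns plus the label.
df = pd.read_csv("churn_features.csv")
y = df.pop("churned")

# One-hot encode string categorical features; "=" joins column name and category.
categorical_cols = df.select_dtypes(include="object").columns
X = pd.get_dummies(df, columns=categorical_cols, prefix_sep="=")

model = xgb.XGBClassifier(
    n_estimators=300,      # number of trees
    max_depth=6,           # tree depth
    subsample=0.8,         # fraction of examples sampled per tree
    colsample_bytree=0.8,  # fraction of features sampled per tree
)
model.fit(X, y)

# Gain-based importance for each (possibly one-hot encoded) column.
gain = model.get_booster().get_score(importance_type="gain")

# Sum the gain of one-hot columns back into their original categorical feature.
importance = {}
for col, value in gain.items():
    original = col.split("=")[0]
    importance[original] = importance.get(original, 0.0) + value

# Keep the 90 most important original features.
top_90 = sorted(importance, key=importance.get, reverse=True)[:90]
```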
Dataset preprocessing pipeline
The Dataflow service was used to implement the data pipeline (data cleaning, imputation, normalization and export to TFRecord).
Dataflow is an Apache Beam runner for both streaming and batch jobs, with the ability to automatically provision and manage resources, balancing cost and throughput, without the need to spin up instances by hand.
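To make this concrete, below is a minimal, hedged sketch of a Beam pipeline submitted to the Dataflow runner that turns CSV shards on Cloud Storage into TFRecord shards. The project, bucket and toy two-column schema are hypothetical, and the real pipeline used TensorFlow Transform as described later in this post:

```python
import apache_beam as beam
import tensorflow as tf
from apache_beam.options.pipeline_options import PipelineOptions


def csv_line_to_example(line):
    """Parse one CSV line into a serialized tf.train.Example (toy schema)."""
    months, charges = line.split(",")[:2]
    example = tf.train.Example(features=tf.train.Features(feature={
        "months_as_customer": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(months)])),
        "monthly_charges": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(charges)])),
    }))
    return example.SerializeToString()


options = PipelineOptions(
    runner="DataflowRunner",          # run as a managed Dataflow job
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadCsvShards" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv",
                                                  skip_header_lines=1)
        | "ToExamples" >> beam.Map(csv_line_to_example)
        | "WriteTFRecords" >> beam.io.WriteToTFRecord(
            "gs://my-bucket/preprocessed/part")
    )
```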
Feature Engineering
As mentioned earlier, XGBoost was used to select the subset of the 90 most important features among the 857 available.
However, our client did not provide a data dictionary for the available features (a very common scenario in data science projects), as some of them came from third-party vendors. Thus, it was not clear from the feature names whether numeric features were categorical features encoded as integers or just counts.
To overcome this issue we used two powerful statistical tools: Benford's law and the Kullback–Leibler divergence.
The Benford’s law, or the first-digit law, states that the leading digit of any natural numerical distribution (e.g. counts) is likely to be small, in other words, for a given numerical distribution, the leading digit is more likely to be 1 than 9, for example. Figure 2 presents the expected distribution of the first digit according to the first-digit law, which can be observed to be followed in many different domains.
Our strategy was to detect whether a numeric feature is a count by computing the Kullback–Leibler (KL) divergence between the feature's first-digit distribution and the distribution expected by Benford's law. If the KL divergence is low, we assume that the feature is a count; otherwise, it is treated as a categorical feature encoded as integers. For the KL computation, we used the scipy entropy function.
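A minimal sketch of that check with numpy and scipy is shown below; the 0.1 decision threshold is purely illustrative, not the value we tuned on the project:

```python
import numpy as np
from scipy.stats import entropy


def looks_like_a_count(values, threshold=0.1):
    """Return True if a feature's first-digit distribution is close to Benford's law."""
    values = np.abs(np.asarray(values, dtype=float))
    values = values[values > 0]  # the leading digit is undefined for zero

    # Extract the first significant digit of each value (1..9).
    first_digits = (values / 10 ** np.floor(np.log10(values))).astype(int)
    observed = np.array([(first_digits == d).mean() for d in range(1, 10)])

    # Expected first-digit distribution under Benford's law: P(d) = log10(1 + 1/d).
    benford = np.log10(1 + 1 / np.arange(1, 10))

    # scipy's entropy(p, q) computes the KL divergence between p and q.
    return entropy(observed, benford) < threshold
```

Calling, for example, `looks_like_a_count(df["some_numeric_feature"])` on a hypothetical column then decides whether that feature is treated as a count or as a categorical feature downstream.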
After the KL Divergence test, we could infer all feature types and implement suitable strategies for data imputation and normalization.
For example, for categorical features we imputed a default string for missing values. After imputation, a vocabulary was created for each categorical feature and its values were converted to sequential numeric indices.
For most ML models, it is important to scale and normalize features before training, so that numeric optimization methods can perform well. For numerical features, we applied a normalization technique called z-norm, which is very common in ML models trained with gradient-descent-based optimizers.
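For reference, z-norm (standardization) simply rescales a feature to zero mean and unit standard deviation, as in this small sketch; in our pipeline this step was actually handled by TensorFlow Transform, shown next:

```python
import numpy as np


def z_norm(x):
    """Rescale a numeric feature to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```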
TensorFlow Transform
TensorFlow Transform (TFT) is a helper library for building a transformation pipeline in TensorFlow that is used for both training and serving your model. Without it, you would probably need to implement two separate pipelines (duplicated code, prone to inconsistencies). When models are deployed, the main point of an integrated data preparation pipeline is to ensure that all feature transformation steps are performed the same way for training and inference. Fortunately, TFT came to the rescue and ensured consistency in our transformation pipeline.
In a nutshell, TFT both transforms your dataset and creates a TensorFlow graph that can be appended to your model's graph for later deployment. This feature was pretty useful for us, but it had a limitation: only up to 90 features could be transformed. We found out that the issue was caused by TFT building the Apache Beam (version 2.4.0) graph locally before sending it to Dataflow (version 0.6.0), and with more than 90 features the graph becomes too large to be uploaded. A possible workaround is to build the computational graph remotely, but we did not have time to try that during the project.
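For illustration, a minimal preprocessing_fn in the spirit of our pipeline could look like the sketch below; the feature names are hypothetical, the vocabulary API shown is from a more recent TFT release than the one we used, and the missing-value imputation step is omitted for brevity:

```python
import tensorflow_transform as tft

NUMERIC_FEATURES = ["months_as_customer", "monthly_charges"]
CATEGORICAL_FEATURES = ["plan_type", "payment_method"]
LABEL = "churned"


def preprocessing_fn(inputs):
    """TFT preprocessing: z-norm numeric features, index categorical features."""
    outputs = {}

    # z-norm: zero mean and unit variance, computed over the full dataset.
    for name in NUMERIC_FEATURES:
        outputs[name] = tft.scale_to_z_score(inputs[name])

    # Build a vocabulary per categorical feature and map strings to integer indices.
    for name in CATEGORICAL_FEATURES:
        outputs[name] = tft.compute_and_apply_vocabulary(inputs[name])

    outputs[LABEL] = inputs[LABEL]
    return outputs
```

The same preprocessing_fn is both run by the Beam/Dataflow job that materializes the TFRecords and embedded as a TensorFlow graph alongside the trained model, which is what keeps training and serving transformations consistent.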
GCP Architecture Overview
To summarize the content of this post, Figure 3 presents the system architecture diagram used for this problem. The steps are the following:
- Dump BigQuery dataset into .csv shards
- Preprocess features and transform dataset into TFRecord shards
- Train machine learning model (details on Post #2)
- Serve machine learning model (details on Post #2)
The next post will cover modeling and deployment of a neural network on Google ML Engine.