End-to-end churn prediction on Google Cloud Platform — Part 2

This is the second post in the series on Churn Prediction on Google Cloud Platform. If you haven’t read the first one, feel free to do so.

Wide & Deep model

Among the features available for Churn Prediction, there were numerical (dense) features and some sparse categorical features with large cardinality (a large number of unique values). Methods like Linear and Logistic Regression (Wide models), trained with L1 regularization, usually deal very efficiently with highly sparse features, as most weights of unimportant features are pushed to zero. On the other hand, Neural Networks successfully combine dense features across a number of layers, but usually do not perform well with sparse features.

For this project we decided to test the Wide & Deep architecture proposed by Google, which jointly trains a Wide and a Deep model. Fortunately, an implementation is already available in the TensorFlow Estimator API. According to its creators, the Wide model is able to memorize patterns observed in the training set (like feature interactions), while the Deep model is better able to generalize to unseen data. Figure 1 presents the Wide & Deep model.

Figure 1: Wide and deep model.
Source: link

Although not mandatory, working with feature columns makes life easier when building models with the Estimator API. Think of a feature column as an intermediary between the raw data and the model itself. It helps the model interpret and transform the data that comes from the input function.

The following feature column types are available:

To create numerical columns, we used the tf.feature_column.numeric_column function, as shown in the snippet below.

Encoding numerical features

import tensorflow as tf

# numeric_features: list with the names of the numerical input features
numerical_features = []
for feature_name in numeric_features:
    numerical_features.append(
        tf.feature_column.numeric_column(feature_name))

The snippet above does not create any variables for the model; it only defines a schema for the data that will be fed in later, during the training step.

For categorical features, tf.feature_column.categorical_column_with_identity was used, since under the hood it automatically maps an input to a one-hot output. Note that this column type expects integer ids in the range [0, vocabulary size), so features stored as strings must be mapped to integer ids beforehand. It is also important to highlight that one needs to explicitly provide the vocabulary size; in other words, it is necessary to know a priori all possible values of a given categorical feature. If you are dealing with a high-cardinality feature, or you don’t know how many possible values it may have, consider using tf.feature_column.categorical_column_with_hash_bucket or tf.feature_column.embedding_column. The snippet below shows how we did it.

Encoding categorical features for the wide model

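# categorical_vocab_sizes (name assumed here): list of
# (feature_name, vocab_size) pairs, one per categorical feature
categorical_features = []
for feature_name, vocab_size in categorical_vocab_sizes:
    categorical_features.append(
        tf.feature_column.categorical_column_with_identity(
            feature_name, vocab_size))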

Feature crossing was also performed in an attempt to let the linear (wide) model learn relevant feature interactions for this prediction problem. As we didn’t have business knowledge about all variables, we crossed every pair of categorical variables. With this approach, the dimensionality of the problem exploded, and the training time of the wide model became a bottleneck. The output checkpoint file was over 1.9 GB, but since inference would be done in batch mode, model size and response time were not critical issues.

Configuring feature crossing for the wide model

import itertools

HASH_KEY = 23
# Cross every pair of categorical columns
full_pair_combination = itertools.combinations(categorical_features, 2)
crossed_columns = []
for pair in full_pair_combination:
    # One hash bucket per possible pair of values
    bucket_size = pair[0].num_buckets * pair[1].num_buckets
    crossed_columns.append(tf.feature_column.crossed_column(
        pair, hash_bucket_size=bucket_size, hash_key=HASH_KEY))
wide_columns = categorical_features + crossed_columns
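To get a feel for why the dimensionality explodes, the sketch below counts the crossed columns and hash buckets produced by pairing every categorical feature with every other one, using the same bucket rule as the snippet above. The feature names and vocabulary sizes are hypothetical and only serve as an illustration; they are not the ones from this project.

```python
import itertools

# Hypothetical vocabulary sizes, for illustration only.
vocab_sizes = {"plan": 12, "city": 5000, "device": 300, "channel": 40}


def crossing_footprint(sizes):
    """Return (number of crossed columns, total hash buckets) for all pairs."""
    pairs = list(itertools.combinations(sizes.values(), 2))
    # Same rule as above: one bucket per possible pair of values.
    total_buckets = sum(a * b for a, b in pairs)
    return len(pairs), total_buckets


n_crosses, n_buckets = crossing_footprint(vocab_sizes)
print(n_crosses, n_buckets)
```

With n categorical features there are n(n-1)/2 pairs, and each pair contributes the product of its two vocabulary sizes, so a few high-cardinality features dominate the total number of weights in the wide model.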

In the deep model, feature interactions are natively learned by the neural network layers. Categorical features were represented as feature embeddings for the deep model.

Creating embeddings for categorical features for the deep model

import math

embedding_features = []
EMB_CONST = 2

def get_embedding_size(const_mult, unique_val_count):
    # Embedding size grows with the fourth root of the cardinality
    return int(math.floor(const_mult * unique_val_count ** 0.25))

for feature in categorical_features:
    embedding_size = get_embedding_size(
        EMB_CONST, feature.num_buckets)
    embedding_features.append(tf.feature_column.embedding_column(
        feature, embedding_size))
deep_columns = numerical_features + embedding_features
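As a quick worked example of the sizing rule above, the sketch below applies get_embedding_size with the constant 2 to a few cardinalities. The cardinalities are hypothetical and chosen only to show how slowly the embedding size grows.

```python
import math


def get_embedding_size(const_mult, unique_val_count):
    # Same rule as in the snippet above: const_mult * cardinality ** 0.25.
    return int(math.floor(const_mult * unique_val_count ** 0.25))


# Hypothetical cardinalities, for illustration only.
sizes = {n: get_embedding_size(2, n) for n in (10, 50, 1000, 100000)}
for cardinality, size in sizes.items():
    print(cardinality, size)
```

Because the rule uses the fourth root, multiplying the cardinality by 10,000 only multiplies the embedding size by about 10, which keeps the deep model compact even for high-cardinality features.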

Finally, after defining all model features using feature columns, the Wide & Deep model was instantiated as follows. The wide model used an optimizer based on the efficient Follow-The-Regularized-Leader (FTRL) method, and the deep model used the Proximal Adagrad optimizer.

How to instantiate the Wide and Deep model

estimator = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    linear_optimizer=tf.train.FtrlOptimizer(
        learning_rate=FLAGS.linear_learning_rate,
        l1_regularization_strength=FLAGS.linear_l1_regularization,
        l2_regularization_strength=FLAGS.linear_l2_regularization),
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=deep_hidden_units,  # e.g. [100, 25, 5]
    dnn_dropout=FLAGS.deep_dropout,
    config=run_config,
    weight_column='weight',
    dnn_optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=FLAGS.deep_learning_rate,
        initial_accumulator_value=0.1,
        l1_regularization_strength=FLAGS.deep_l1_regularization,
        l2_regularization_strength=FLAGS.deep_l2_regularization))

Training

ML Engine

Google ML Engine, a serverless service for training and deploying machine learning models, was decisive throughout this six-week project, as we could run more than 2,000 experiments in parallel, totaling over 15,000 hours of training jobs.

Hypothesis driven development

In machine learning projects, we usually follow the scientific method, where a hypothesis is first stated, evaluation metrics are defined, experiments are performed, and results are analysed before the next cycle.

Some of the hypotheses we evaluated were:

Will usage of GPU decrease training time without increasing job cost?

Since we were dealing with tabular data rather than audio or images, we expected that the computational power of GPUs would not be beneficial enough to justify their cost per training step.

This hypothesis was measured in terms of US$/training step.

Some experiments were performed by varying the number of training steps and the number of workers provisioned.

In the end, we observed that GPUs brought, on average, a 1.2x speedup in total training time, but at a 3x higher cost. Thus, we concluded that GPUs were not worth it for this type of problem, and the following hypotheses were evaluated only on CPUs.
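The US$/training step metric can be sketched as follows. This is only a minimal illustration: the hourly prices, durations and step counts are made up and are not the measurements from this project.

```python
def cost_per_step(hourly_rate_usd, training_hours, training_steps):
    """Cost of a training job expressed in US$ per training step."""
    return hourly_rate_usd * training_hours / training_steps


# Hypothetical jobs running the same number of steps.
cpu_job = cost_per_step(hourly_rate_usd=1.0, training_hours=6.0,
                        training_steps=100_000)
gpu_job = cost_per_step(hourly_rate_usd=3.0, training_hours=5.0,
                        training_steps=100_000)

print(f"CPU: {cpu_job:.6f} US$/step")
print(f"GPU: {gpu_job:.6f} US$/step")
print(f"GPU/CPU cost ratio: {gpu_job / cpu_job:.2f}")
```

In this toy case the GPU job finishes 1.2x faster, but each training step still costs 2.5x more, because the speedup does not compensate for the higher hourly price.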

Would weighting positive samples (churned customers) more heavily improve Precision & Recall?

The dataset provided turned out to be highly unbalanced, as only 1% of the customers cancelled their subscription to the service (churn). Some approaches to deal with unbalanced datasets are: discarding samples from the majority class, oversampling the minority class, data augmentation, or weighting the loss function so that backpropagation gives more importance to errors made on the minority class.

We chose the last approach and evaluated how different weights affect the F1-Score.

We defined the weight W, applied to churned customers, as the complement of the churn rate. For example, if the churn rate in a given dataset is 1%, then W = 0.99 is applied to churned customers, and a weight of 0.01 is applied to non-churn customers.

This weighting scheme was compared with four alternatives that scale W by a factor: 10^-1 * W, 5 * 10^-1 * W, 5 * 10^0 * W and 10^1 * W (that is, 0.1 W, 0.5 W, 5 W and 10 W).
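A minimal sketch of this weighting scheme is shown below. It assumes binary labels (1 for churn, 0 for non-churn) and that only the weight of churned customers is scaled while non-churn customers keep a weight equal to the churn rate; the function name and the toy dataset are hypothetical.

```python
def sample_weights(labels, scale=1.0):
    """Per-sample weights: scale * W for churners, churn rate otherwise."""
    churn_rate = sum(labels) / len(labels)
    w = 1.0 - churn_rate          # W: complement of the churn rate
    positive_weight = scale * w   # weight for churned customers
    negative_weight = churn_rate  # assumption: not affected by the scale
    return [positive_weight if y == 1 else negative_weight for y in labels]


labels = [1] + [0] * 99  # toy dataset with a 1% churn rate
weights = sample_weights(labels)
print(weights[0], weights[1])

# The five variants compared: W scaled by 0.1, 0.5, 1 (unscaled), 5 and 10.
for scale in (0.1, 0.5, 1.0, 5.0, 10.0):
    print(scale, sample_weights(labels, scale)[0])
```

Values like these would be stored alongside each example in the column that the estimator reads through its weight_column argument ('weight' in the snippet above).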

After the experiments, the conclusion was that the higher the W, the higher the number of false positives and the lower the F1-Score. On the other hand, the lower the W, the lower the absolute number of positive predictions. Thus, from a business perspective, setting W without a scaling factor proved to be the most appropriate choice.
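To illustrate how extra false positives pull the F1-Score down even when recall improves, here is a small sketch that computes precision, recall and F1 from confusion counts. The counts are hypothetical and are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical outcomes for 100 real churners.
moderate = precision_recall_f1(tp=60, fp=40, fn=40)     # moderate weight
aggressive = precision_recall_f1(tp=80, fp=320, fn=20)  # very high weight

print("moderate:   P=%.2f R=%.2f F1=%.2f" % moderate)
print("aggressive: P=%.2f R=%.2f F1=%.2f" % aggressive)
```

In the second case the model finds more churners (higher recall), but the flood of false positives lowers precision so much that F1 drops.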

Other hypotheses also investigated in this project (not addressed in this post) are described below:

  • Will Deep model architectures strongly affect Precision & Recall compared to linear models?
  • Will the usage of XGBoost for feature selection be able to highlight a subset of features that will allow the training of less complex models, without a significant loss of precision?
  • Will the Wide & Deep model perform better (in terms of Precision & Recall) than Wide or Deep models alone?
  • Can tuning hyperparameters like learning rate, batch size and L2-regularization coefficient strongly affect Precision & Recall?

Scalability and Hyperparameter Tuning

As mentioned earlier, we managed to run hundreds of training jobs during a six-week project thanks to Google ML Engine, as it allows you to run several jobs in parallel painlessly. Variations of the same hypothesis were evaluated all at once, so that we could test each hypothesis within 2–3 days and discuss results with our client very often.

A powerful feature of ML Engine is Hyperparameter Tuning, where a smart grid search is performed through a parameter space to optimize some metric of interest. The smart part comes from Bayesian Optimization, which searches the hyperparameter space wisely. The Hyperparameter Tuning feature was used for some experiments, such as searching for the best deep architecture and tuning training aspects like learning rate and batch size.
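As a rough idea of what such a tuning job looks like, the sketch below writes a hyperparameter spec as a Python dict that mirrors the hyperparameters section of an ML Engine training job. The metric tag, trial budget, ranges and the batch_size parameter name are assumptions made for illustration; they are not the settings used in this project.

```python
# Sketch of a tuning spec; all values below are hypothetical.
hyperparameter_spec = {
    "goal": "MAXIMIZE",
    "hyperparameterMetricTag": "f1_score",  # assumed metric name
    "maxTrials": 30,
    "maxParallelTrials": 5,
    "params": [
        {
            # Searched on a log scale, as is common for learning rates.
            "parameterName": "deep_learning_rate",
            "type": "DOUBLE",
            "minValue": 0.0001,
            "maxValue": 0.1,
            "scaleType": "UNIT_LOG_SCALE",
        },
        {
            "parameterName": "batch_size",  # assumed flag name
            "type": "DISCRETE",
            "discreteValues": [128, 256, 512],
        },
    ],
}

tuned = [p["parameterName"] for p in hyperparameter_spec["params"]]
print(tuned)
```

Each parameter name is handed to the training code as a command-line argument on every trial, so it has to match a flag that the trainer already parses.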

Too good to be true

Throughout the project, different versions of the dataset (in TFRecords format) were created, as the ETL step was being continuously improved. At a given moment, some training jobs started reporting extremely good results and a red flag was raised, since such results were very unlikely to be correct. The rule of thumb is: “If results look too good to be true, check for a data leak!”

After some investigation, we detected and fixed a data leak in the ETL step: some instances from the training set had leaked into the test set.

This unfortunate situation drew our attention to the need for automated tests throughout all data pipeline steps, to guarantee not only the absence of data leakage but also the consistency of normalization steps. After that, all further TFRecords versions were automatically checked and we had no more headaches.
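One simple form such a check can take is sketched below. It assumes every example carries a unique identifier (for instance a customer id) and verifies that no identifier appears in both splits. The function names and ids are hypothetical, and this is not necessarily the test used in this project.

```python
def find_leaked_ids(train_ids, test_ids):
    """Return the ids present in both splits (an empty set means no leak)."""
    return set(train_ids) & set(test_ids)


def assert_no_leakage(train_ids, test_ids):
    leaked = find_leaked_ids(train_ids, test_ids)
    if leaked:
        raise AssertionError(f"{len(leaked)} ids appear in both splits")


clean_train, clean_test = ["c1", "c2", "c3"], ["c4", "c5"]
assert_no_leakage(clean_train, clean_test)  # passes silently

leaky_test = ["c3", "c4"]
print(find_leaked_ids(clean_train, leaky_test))  # {'c3'}
```

Running a check like this on every new dataset version, before any training job starts, turns a subtle modelling problem into an immediate pipeline failure.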

Conclusion

This post concludes the series on how we used Google Cloud Platform capabilities to build an end-to-end machine learning application to tackle a churn prediction problem. The series covered the main aspects of Exploratory Data Analysis, Feature Selection, ETL, Training and Deployment on Google ML Engine.

Some important lessons were learned throughout this project:

  • GPUs are not always cost-effective for machine learning problems. The trade-off between cost and speedup should be analyzed for your project.
  • It is of utmost importance to implement test routines that check for data leakage and integrity in the ETL pipeline.
  • TensorFlow is a cutting-edge framework that lets you move code seamlessly from experimentation to production.
  • Serverless services, such as Google ML Engine and Cloud Dataflow, simplify everyone’s work, since no time is spent on provisioning and configuring machines.
  • The parallelization power brought by cloud computing does accelerate hypothesis-driven development at every step of the data pipeline.