Gradient Boosting with High-level Tensorflow

Rebecca Nettleship
Cazoo Technology Blog
8 min read · Apr 29, 2021


As the demands of the data science team here at Cazoo have grown, we began analysing platforms that would help future-proof our Python machine-learning workflows.

Tensorflow Extended (TFX) seemed to suit most of our needs, providing the capabilities to build customisable pipeline components for data exploration, feature engineering, validation and monitoring. Being able to implement high-performing, pre-made estimators would be invaluable for model prototyping inside a pipeline, but one of the main downsides to TFX is that it doesn’t currently offer any support for the XGBoost or Scikit-Learn libraries. However, as well as its customisable APIs, Tensorflow offers many of its own out-of-the-box estimators, including Tensorflow Boosted Trees, with both regression and classification flavours.

tf.Estimators

The TFBT classes sit over the Estimator API as one of its many pre-made model functions:

Tensorflow 2.0 API map, Google Inc. 2019

The Estimator class aims to strike a balance between clear implementation, computational expense and extensibility for distributed learning. The methods it implements will be familiar to Scikit-Learn users: train, evaluate and predict.
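
As a minimal sketch of that shared interface (the toy dataset and feature name here are purely illustrative; a real project would use the input functions described later in this post):

```python
import tensorflow as tf

# Toy in-memory dataset, for illustration only
def input_fn():
    features = {"x": tf.constant([[1.0], [2.0], [3.0], [4.0]])}
    labels = tf.constant([1.1, 1.9, 3.2, 3.9])
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(4)

feature_columns = [tf.feature_column.numeric_column("x")]
est = tf.estimator.BoostedTreesRegressor(
    feature_columns=feature_columns,
    n_batches_per_layer=1,  # the whole batch contributes to each layer
)

est.train(input_fn, max_steps=10)          # fit the ensemble
metrics = est.evaluate(input_fn, steps=1)  # dict of evaluation metrics
preds = list(est.predict(input_fn))        # generator of per-example dicts
```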

Gradient Boosting

Gradient Boosting is a mainstay of ensemble machine learning. GBMs offer high accuracy, are robust to outliers, can handle sparse and categorical data and work with a range of loss functions.

In a gradient boosting algorithm, each new regression tree is fitted to the residual errors of the ensemble from the previous step, iteratively minimising the objective function of the ensemble model.

For Scikit-Learn’s Gradient Booster, the underlying optimisation is divided into two parts:

  1. find the negative gradient of the objective function with respect to the latest predictions (this just reduces to the residuals for a loss like Mean Squared Error)
  2. find the step length via a line search, which is then scaled by the learning rate (see the sketch just after this list)
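
In Friedman’s formulation (reference 4 below), with loss L and current ensemble F_{m-1}, step m of that optimisation looks like:

```latex
% 1. pseudo-residuals: negative gradient of the loss w.r.t. the latest predictions
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}

% 2. fit a regression tree h_m to the r_{im}, then line-search the step length
\gamma_m = \arg\min_{\gamma} \sum_i L\big(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)

% update the ensemble, scaled by the learning rate \nu
F_m(x) = F_{m-1}(x) + \nu\, \gamma_m\, h_m(x)
```

For Mean Squared Error the pseudo-residuals r_{im} reduce to the ordinary residuals y_i - F_{m-1}(x_i).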

XGBoost differs in that it incorporates the second-order derivative, or hessian, instead of the second step. This is essentially a measure of curvature, informing how the gradient step will change as we vary the input. Because it avoids the line search needed for the second part of the optimisation, XGBoost converges quickly.

The measure of how well a tree performs with respect to minimising the objective function is proportional to the sums of the gradients and hessians: the weight of each leaf node takes the value that minimises this sum.

Instead of computing this measure for every possible tree structure, the algorithm calculates the information gain for each binary split as it traverses each level.
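
Concretely, from the XGBoost paper (reference 2 below), with G_j and H_j the sums of gradients and hessians in leaf j, lambda the L2 regularisation term and gamma the per-leaf complexity cost, the optimal leaf weight and the information gain for a candidate split into left (L) and right (R) children are:

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda}

\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```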

Scikit-Learn’s GBM adopts an exact greedy algorithm: every possible split is considered at every node. For larger data sets, XGBoost has a setting to optimise with an approximate approach: it first builds quantile sketches of the original feature values, then uses the bucket boundaries to determine an optimal split.

Layer-by-Layer

TFBTs aim to provide the same improvements over GBMs, with a similar additive tree model that utilises quantile sketches of the feature values to determine optimal splits.

A second extension involves a novel form of boosting:

TFBT architecture: “TF Boosted Trees: A scalable TensorFlow based framework for gradient boosting” https://arxiv.org/pdf/1710.11555.pdf

This asynchronous model is straightforward to employ for distributed training with parameter servers, using a service like Google Cloud AI in conjunction with TFX. The work is distributed with the tf.distribute.Strategy API, which replicates the current graph variables to each processing unit, allowing for full utilisation in parallel.

  1. A batch of samples is sent to each worker: it calculates the local quantile sketch of the feature values and pushes it back to the parameter server
  2. The workers calculate the gradients and hessians for each bucket of values, which are combined and fed to an accumulator unit
  3. Once n_batches_per_layer batches have been accumulated, the current layer is built and added to the tree ensemble, and the process begins again

Workers use information from the ensemble to optimise the loss function for each batch, and their results are combined to give a statistical picture at a particular tree depth before the layer variables are updated. Once a layer is built, the best split for each feature is determined and the ensemble information is passed back to the workers. By computing statistics on every batch at each level, we’d expect a higher information gain at each split, effectively squeezing more information from our sample training set before we hit deeper nodes and leading to less overfitting.

Test set-up

We set up a few experiments to test the performance of the TFBTs on both a regression and a classification problem, using public datasets with minimal pre-processing:

  1. Regression: https://www.kaggle.com/camnugent/california-housing-prices
  2. Binary classification: https://archive.ics.uci.edu/ml/datasets/census+income

Things we care about for this comparison:

  1. Matching performance metrics within an acceptable range on our example data sets
  2. Ease of use
  3. Can we use SHAP and LIME libraries with the outputs?

Things we don’t care so much about:

  1. Training times. We’ll be using these models in TFX pipelines with infrastructure that drastically reduces training time, like the parameter-server example above.

Preparing the pipelines

For local projects, pipelines can be implemented very simply, in a few lines of code, with Scikit-Learn and, by extension, XGBoost (since its methods are compatible). We’ve defined our cross-validation generator function that feeds data in, and our Pipeline() transformer, which will call a fit_transform() method on the data for every component inside the pipeline:
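
A minimal sketch of what that looks like, assuming the Kaggle housing data has been saved locally as housing.csv:

```python
import pandas as pd
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("housing.csv")  # assumed local copy of the Kaggle data
X = df.drop(columns="median_house_value")
y = df["median_house_value"]

numeric = X.select_dtypes("number").columns.tolist()
categorical = ["ocean_proximity"]

# Every component inside the Pipeline gets fit_transform() called on the data
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# XGBoost's sklearn wrapper drops straight into the same Pipeline
model = Pipeline([
    ("preprocess", preprocess),
    ("gbm", xgb.XGBRegressor(n_estimators=100, learning_rate=0.1)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)  # our CV generator
scores = cross_val_score(model, X, y, cv=cv,
                         scoring="neg_mean_squared_log_error")
```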

At production scale, the equivalent workflows will be built with TensorflowTransform and Tensorflow in a TFX pipeline, ready for orchestration with Kubeflow. The pre-made Estimator input functions transform and batch the data and create the iterator which will feed data into the model as tensors, whilst the feature columns specify how the model should interpret the data.
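
A sketch of those two pieces for the housing data (again assuming a local housing.csv; the batching values are illustrative):

```python
import pandas as pd
import tensorflow as tf

df = pd.read_csv("housing.csv").dropna()
labels = df.pop("median_house_value")

def make_input_fn(features, y, batch_size, shuffle=True):
    """Builds an input_fn that batches the data and yields tensors."""
    def input_fn():
        ds = tf.data.Dataset.from_tensor_slices((dict(features), y))
        if shuffle:
            ds = ds.shuffle(len(features))
        return ds.batch(batch_size)
    return input_fn

# Feature columns tell the model how to interpret each raw input
feature_columns = [
    tf.feature_column.numeric_column(name)
    for name in df.columns if name != "ocean_proximity"
]
feature_columns.append(tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        "ocean_proximity", sorted(df["ocean_proximity"].unique()))))

train_input_fn = make_input_fn(df, labels, batch_size=len(df))
```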

Feature columns are analogous to the ColumnTransformer we’ve provided for the Scikit-Learn pipeline. Converting categorical and sparse data is costly, so luckily for us feature_columns abstracts a transformation which the Estimator will carry out under the hood. TFX pipelines will allow us to easily reuse the same transforms for predictions.

We have utilised the tf.data.Dataset.from_tensor_slices method to feed in a pandas DataFrame conveniently, but tf.data provides many other methods to help feed data into your models from memory. The feature_columns API also provides an easy way to define a custom normalisation, which will be performed on the tensors inside the input functions during batching.
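
For instance, numeric_column accepts a normalizer_fn that is applied to the batched tensors inside the input pipeline (the statistics here are hypothetical placeholders):

```python
import tensorflow as tf

# Hypothetical per-feature statistics computed on the training set
income_mean, income_std = 3.87, 1.90

income_column = tf.feature_column.numeric_column(
    "median_income",
    # runs on the tensors during batching, inside the input functions
    normalizer_fn=lambda x: (x - income_mean) / income_std,
)
```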

Comparing performance

As both our data sets fit into memory, we instantiate our TF models using batch_size = number_of_training_samples and n_batches_per_layer = 1. The estimator will then return the statistics for only a single batch per layer to be fed into the ensemble. This is equivalent to both the Scikit-Learn and XGBoost algorithms, so it seems like a fair comparison to begin with. We define all three models with the same number of trees and learning rate:
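
A sketch of those three definitions (reusing the feature columns and input functions from the earlier snippets):

```python
import tensorflow as tf
import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor

N_TREES, LEARNING_RATE = 100, 0.1

skl_model = GradientBoostingRegressor(n_estimators=N_TREES,
                                      learning_rate=LEARNING_RATE)
xgb_model = xgb.XGBRegressor(n_estimators=N_TREES,
                             learning_rate=LEARNING_RATE)

# batch_size equals the number of training samples, so a single batch
# (n_batches_per_layer = 1) supplies the statistics for each layer
tf_model = tf.estimator.BoostedTreesRegressor(
    feature_columns=feature_columns,
    n_batches_per_layer=1,
    n_trees=N_TREES,
    learning_rate=LEARNING_RATE,
)
```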

Regression: comparing estimators for House Prices in California

Cross-validation: Mean Squared Log Error over 5 variations of the California Housing data set; n_estimators = 100, learning_rate = 0.1
Average training times on 16036 samples. n_estimators = 100, learning_rate = 0.1

Without hyper-parameter tuning, XGBoost has the lowest mean squared log error, and the error for the TF model varies much more over the five folds.

We tested again with a higher number of ensemble estimators and L2 regularisation; this did not improve the MSLE and increased the training times by ~75%:

Cross-validation: Mean Squared Log Error over 5 variations of the California Housing data set; n_estimators = 200, learning_rate = 0.1
Average training times for n_estimators = 200, L2 regularisation for TF model

Classification: Predicting income > $50k with U.S. Census data

Cross-validation: Area Under ROC Curve over 5 variations of the Census Income data set; n_estimators = 100, learning_rate = 0.1
ROC curve with AUC for n_estimators = 100, learning_rate = 0.1

The ROC curve illustrates how the TF model performs over various classification thresholds; it has the highest average AUC of the three.

To take a closer look at the performance of our Tensorflow classifier, we varied the batching parameters with n_batches_per_layer = layer_size * (number_of_train_samples / BATCH_SIZE), with layer_size at 30%, 60% and 90% of the batch data:

Fit times for TF BoostedTreesClassifier on different percentages of 26049 training samples per layer

Training times decreased by up to ~15% as we were effectively just lowering the amount of training data seen by each layer, with little variance in the AUC.
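
The arithmetic behind those batching settings, as a sketch (the BATCH_SIZE here is an illustrative value, not the one we used):

```python
# 26049 training samples split into mini-batches
N_TRAIN, BATCH_SIZE = 26049, 256

for layer_frac in (0.3, 0.6, 0.9):
    # each layer accumulates statistics from this many batches,
    # i.e. roughly layer_frac of the training data per layer
    n_batches_per_layer = int(layer_frac * (N_TRAIN / BATCH_SIZE))
    print(f"{layer_frac:.0%} -> {n_batches_per_layer} batches per layer")
```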

We did try an experiment with multiple GPUs; however, support for distributed training with the Estimator class itself is still limited, and with the current iteration of Tensorflow (2.4) we encountered an error trying to replicate the dataset across multiple GPUs using MirroredStrategy() in the estimator configuration. The Keras API still seems to be the better choice for out-of-the-box distributed training on a single machine.

Explainability

Tensorflow Tree Estimators have their own experimental methods for obtaining feature importances, but as standard we currently use SHAP and LIME to better understand how different aspects of our models have impacted the output. Both work with the Keras API, and in a similar way with Estimators we can use a wrapper function with KernelExplainer:
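
A sketch of that wrapper, assuming a fitted BoostedTreesClassifier named clf and a training DataFrame X_train (both names are illustrative):

```python
import numpy as np
import pandas as pd
import shap
import tensorflow as tf

def predict_positive(data: np.ndarray) -> np.ndarray:
    """Wraps the Estimator's predict generator so KernelExplainer can call it."""
    df = pd.DataFrame(data, columns=X_train.columns)
    input_fn = lambda: tf.data.Dataset.from_tensor_slices(dict(df)).batch(len(df))
    return np.array([p["probabilities"][1] for p in clf.predict(input_fn)])

# KernelExplainer only needs a small background sample to estimate expectations
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(predict_positive, background)
shap_values = explainer.shap_values(X_train.iloc[:100])
shap.summary_plot(shap_values, X_train.iloc[:100])
```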

TF BoostedTreesClassifier demo of summary plot using just a few background samples

And similarly, we’ve tested this method with LIME on a smaller multi-class dataset:
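
A sketch with LIME’s tabular explainer, this time returning the full probability matrix (dataset, feature and class names are again illustrative, and the features are assumed numeric):

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from lime.lime_tabular import LimeTabularExplainer

def predict_all(data: np.ndarray) -> np.ndarray:
    """Returns an (n_samples, n_classes) probability matrix for LIME."""
    df = pd.DataFrame(data, columns=X_train.columns)
    input_fn = lambda: tf.data.Dataset.from_tensor_slices(dict(df)).batch(len(df))
    return np.array([p["probabilities"] for p in clf.predict(input_fn)])

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=X_train.columns.tolist(),
    class_names=["class_a", "class_b", "class_c"],  # placeholder labels
    mode="classification",
)

# Explain a single prediction in terms of its most influential features
explanation = explainer.explain_instance(X_train.values[0], predict_all,
                                         num_features=5)
explanation.show_in_notebook()
```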

Final thoughts

Built with the Tensorflow framework, TFBTs bring some extra benefits, like being able to create model variable checkpoints with session hooks and easier coding of custom loss functions. The Estimator API as a whole offers more flexibility for model customisation, and with the full integration of the Keras package in TF 2.0 everything plays nicely together, including a ready-made helper to switch your Keras models directly over to the Estimator API. Some key functionality has also been introduced to Tensorflow since 2019, including eager execution capabilities and overall performance enhancements. And finally, all of this can simply be wrapped up inside our TFX components for production-scale ML.

TFBT canned estimators function well as an alternative to Scikit-Learn and XGBoost, and I’m sure we’ll be exploring a wide range of applications for both the Keras and Estimator APIs in the coming year.

References:

  1. TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/18d86099a350df93f2bd88587c0ec6d118cc98e7.pdf
  2. XGBoost: A Scalable Tree Boosting System: https://arxiv.org/pdf/1603.02754.pdf
  3. TF Boosted Trees: A scalable TensorFlow based framework for gradient boosting: https://arxiv.org/pdf/1710.11555.pdf
  4. Greedy Function Approximation: A Gradient Boosting Machine: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
  5. tf.Distribute.MirroredStrategy API reference: https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
  6. Understanding TFX Pipelines: https://www.tensorflow.org/tfx/guide/understanding_tfx_pipelines
