Building Ensembles: AWS AutoGluon

Going deeper into the algorithms behind AWS’s new competition-winning AutoML Library

Evan Brown
7 min read · Dec 24, 2021

The key to producing state-of-the-art results with your models is building ensembles correctly. Ensembles turn many weak learners into a strong learner by combining the predictive signals from different models to get a more accurate estimate.

Before getting deeper into AWS AutoGluon and building ensembles, it is best to start off with an understanding of a few important concepts:

Stacking

Model stacking builds hierarchically on top of model predictions: the predictions of lower-layer models become input features for higher-layer models. Stacking works “vertically”, creating multiple layers of models rather than expanding the component models themselves. Choosing models with synergy is important; the best ensembles combine models that balance each other out by performing well in uncorrelated cases.
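
As a concrete illustration (not AutoGluon’s internal implementation), here is a minimal stacking sketch with scikit-learn on a synthetic dataset: a diverse first layer feeds its out-of-fold predictions to a logistic regression in the layer above.

# Illustrative stacking sketch with scikit-learn (not AutoGluon's internal implementation).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # synthetic data
# Layer 1: diverse base learners whose errors are ideally uncorrelated.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
]
# Layer 2: a model trained on the base learners' out-of-fold predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X, y)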

Gradient boosting

“Gradient Boosting” originates from the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman.

The key here is fitting a sequence of models to the errors in predictions of the ensemble. The final prediction is the sum of predictions from the sequence, producing a predictive model from an ensemble of weak predictive models.

Predictions from the sequence are combined additively. In practice, a gradient boosting ensemble is usually implemented by fitting shallow decision trees to the residual error of the current collection of models, repeating until an early stopping condition is reached. Squared-error loss gives exactly this residual-fitting view; for classification, the logistic loss function is also common.
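
To make the residual-fitting idea concrete, here is a from-scratch sketch under squared-error loss; the function names and hyperparameters are illustrative, not taken from any library.

# From-scratch sketch of gradient boosting with squared-error loss (names are illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())   # start from a constant model
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)               # each weak learner fits the current errors
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def boosted_predict(y_mean, trees, X, learning_rate=0.1):
    # The final prediction is the sum of the predictions from the sequence.
    return y_mean + learning_rate * sum(tree.predict(X) for tree in trees)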

Boosting ensembles are the “horizontal” counterpart to stacking ensembles: rather than layering new models on top, they iteratively refine the predictions of the existing ensemble. Indeed, both techniques are often used together in one meta-ensemble.

Early stopping is the key regularization technique needed to prevent these models from overfitting. The popular Extreme Gradient Boosting (XGBoost) library implements important optimizations of this algorithm that make it practical to train on billions of samples in a parallel, distributed environment.
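
For example, XGBoost exposes early stopping against a validation set. A minimal sketch, assuming the xgboost package and a synthetic binary-classification dataset:

# Minimal XGBoost sketch with early stopping on a validation split (assumes the xgboost package).
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)  # synthetic data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 6}
booster = xgb.train(params, dtrain,
                    num_boost_round=1000,
                    evals=[(dval, "validation")],
                    early_stopping_rounds=20)  # stop when validation loss stops improving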

Blending

Aggregating the outputs of different models trained to predict the same variable is called “blending”. Most often a linear model is chosen to blend the predictions, but more complex blending functions can be used to fine-tune ensemble predictions. As with stacking, the predictions of the last layer are used as inputs for learning the blender’s weights.
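
A minimal blending sketch, assuming scikit-learn and a synthetic dataset: base models are fit on one split, and a logistic regression learns weights over their held-out predicted probabilities.

# Blending sketch: a linear model learns weights over held-out base-model predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # synthetic data
X_train, X_blend, y_train, y_blend = train_test_split(X, y, test_size=0.3, random_state=0)

base_models = [RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]
for model in base_models:
    model.fit(X_train, y_train)

# Each column holds one base model's predicted probability on the held-out blend split.
blend_features = np.column_stack([m.predict_proba(X_blend)[:, 1] for m in base_models])
blender = LogisticRegression().fit(blend_features, y_blend)  # learns one weight per base model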

Random Forests

The key to the Random Forest (RF) algorithm is growing many different binary trees independently, each on a random sample of the training data. The Classification and Regression Tree (CART) algorithm splits the instances at each node by evaluating combinations of a feature and a threshold value, considering only a randomly selected subset of features at each split. Split quality is measured by impurity, that is, how well the split separates the classes in the children nodes.

The training time complexity of a Random Forest is roughly O(n·m·log m) per tree, where n is the number of features, m is the number of samples, and the logarithm is base 2. Prediction is also very fast, at O(log m) per tree.
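
A quick scikit-learn sketch of the knobs described above (the dataset here is synthetic and purely illustrative):

# Random Forest sketch: many independently grown CART trees (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # m samples, n features
forest = RandomForestClassifier(
    n_estimators=200,     # number of trees, grown independently (n_jobs parallelizes them)
    max_features="sqrt",  # random subset of features evaluated at each split
    criterion="gini",     # impurity measure used to score candidate splits
    n_jobs=-1,
    random_state=0,
)
forest.fit(X, y)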

Entropy vs. Gini Impurity

Entropy comes from Shannon’s information theory, where it measures the average information content of a message. Entropy is zero when all messages are identical. In Random Forests, the key is that entropy is zero when a node contains instances of a single class.

For the i-th node, entropy is H_i = −Σ_k p_{i,k} · log2(p_{i,k}), where the sum is calculated over only the classes with non-zero probabilities (p_{i,k} ≠ 0).

Gini impurity performs very similarly to entropy with slightly faster computation. Where Gini impurity tends to isolate the most frequent class in its own branch of the tree, entropy tends to produce slightly more balanced trees.

Gini impurity for the i-th node is G_i = 1 − Σ_k p_{i,k}², where p_{i,k} is the ratio of class k instances among the training instances in the i-th node. Importantly, Gini impurity is also zero when all instances in the node belong to a single class.
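
A couple of illustrative helper functions make the two measures concrete; they take a node’s class proportions and return its impurity.

# Illustrative impurity helpers: both return zero for a pure node.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # sum only over classes with non-zero probability
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # both zero: the node is pure
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 and 0.5: maximally mixed two-class node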

LightGBM

XGBoost and LightGBM are both gradient-boosted tree libraries and are very similar. The key difference is that LightGBM grows trees leaf-wise, compared to the level-wise growth of XGBoost. “Growth” here means adding additional nodes/leaves to a tree.

(Diagrams: XGBoost’s level-wise growth vs. LightGBM’s leaf-wise growth.)
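
A minimal LightGBM sketch, assuming the lightgbm package and synthetic data; num_leaves is the main parameter that caps its leaf-wise growth.

# Minimal LightGBM sketch (assumes the lightgbm package); num_leaves caps leaf-wise growth.
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)  # synthetic data
model = lgb.LGBMClassifier(
    n_estimators=500,
    num_leaves=31,       # maximum leaves per tree, the main leaf-wise growth control
    learning_rate=0.05,
)
model.fit(X, y)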

CatBoost

CatBoost uses a permutation-driven alternative to the classic boosting algorithm. Notably, “Categorical Boosting” performs well on categorical features and small datasets.

While unconstrained gradient boosting algorithms quickly overfit, CatBoost uses an effective regularization method based on an ordering principle: target-based statistics with a prior (TBS). The TBS value for each example depends only on its observed “history”; to create that notion of time, CatBoost introduces an artificial random permutation σ of the training examples.

The key is capturing high-order data dependencies, such as the co-occurrence of customer region and ad topic in a customer response prediction task. CatBoost creates new engineered features from combinations of categorical features.
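
A minimal CatBoost sketch, assuming the catboost package; the toy DataFrame below is hypothetical and only stands in for the customer-response example above. Categorical columns are passed via cat_features and encoded internally.

# Minimal CatBoost sketch (assumes the catboost package); the toy DataFrame is hypothetical.
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "region":    ["US", "EU", "US", "APAC", "EU", "US"],
    "ad_topic":  ["cars", "travel", "travel", "cars", "cars", "travel"],
    "clicks":    [3, 0, 1, 5, 2, 0],
    "responded": [1, 0, 0, 1, 1, 0],
})

model = CatBoostClassifier(iterations=100, depth=4, verbose=False)
# Categorical columns are encoded internally (ordered target statistics) and
# can be combined into new engineered features by CatBoost itself.
model.fit(df.drop(columns="responded"), df["responded"],
          cat_features=["region", "ad_topic"])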

Extremely Randomized Trees

Extra-Trees is much faster to train than a regular Random Forest because it uses random threshold values when splitting each node. This significantly reduces the number of candidate splits to evaluate, speeding up training. The technique trades more bias for lower variance, in other words a looser fit to the training data. That makes it an important addition to an ensemble of heavily fitted models: intuitively, Extra-Trees provides a counter-balance for when the training data is not a great representation of the test set.
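
A quick scikit-learn comparison of the two (synthetic data, illustrative settings only):

# Extra-Trees vs. Random Forest on synthetic data (scikit-learn, illustrative settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
extra = ExtraTreesClassifier(n_estimators=200, n_jobs=-1, random_state=0)     # random split thresholds
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)  # searched split thresholds
print("Extra-Trees  :", cross_val_score(extra, X, y, cv=5).mean())
print("Random Forest:", cross_val_score(forest, X, y, cv=5).mean())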

k-Nearest Neighbors

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point. The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.
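
A minimal scikit-learn sketch (synthetic data; the value of k is chosen arbitrarily for illustration):

# k-Nearest Neighbors sketch: majority vote among the k closest training points (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # synthetic data
knn = KNeighborsClassifier(n_neighbors=15)  # larger k smooths noise but blurs class boundaries
knn.fit(X, y)
print(knn.predict(X[:5]))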

Deep Neural Networks

Feed-forward neural networks are best suited to tabular data. Though convolutional networks are state-of-the-art in image processing and audio transduction, tabular datasets do not have strong local patterns along the feature dimension. Where it may be helpful to measure image features such as curvature over patches of pixels, these local patterns do not exist to the same extent in tabular (spreadsheet) data, where the ordering of features is much more arbitrary.

Dense (feed-forward) neural networks connect every input neuron to every output neuron of a layer, so long-distance relationships can be captured in a single layer of connections. All spatial information is discarded and, of course, dense matrix multiplication requires far more compute than sparse operations. Still, neural networks are an important addition to an ensemble: their predictions are relatively uncorrelated with those of Random Forests, making them especially valuable in ensembles composed mostly of tree-based models.

AWS uses a special architecture proven to work better for tabular data that is mixed (i.e. includes both categorical and numeric features).

(Figure: AutoGluon-Tabular’s neural network architecture. Source: AWS AutoGluon team, 31 Mar. 2020.)

Categorical and numerical features are embedded separately before being concatenated. This builds the datatype distinction into the architecture as a prior and forces the model to treat categorical and numerical features differently. Another important detail is the use of residual (skip) connections, so that signals learned in lower layers can propagate intact to later layers with less information loss.
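
As a rough illustration of that pattern (and emphatically not AutoGluon’s actual network), here is a minimal PyTorch sketch with per-column categorical embeddings, concatenation with numeric features, and residual dense blocks; all layer sizes and shapes are made up.

# Illustrative PyTorch sketch of the pattern described above (not AutoGluon's actual network).
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    def __init__(self, cat_cardinalities, n_numeric, emb_dim=8, hidden=128):
        super().__init__()
        # One embedding table per categorical feature.
        self.embeddings = nn.ModuleList(nn.Embedding(card, emb_dim) for card in cat_cardinalities)
        in_dim = emb_dim * len(cat_cardinalities) + n_numeric
        self.input_layer = nn.Linear(in_dim, hidden)
        # Dense blocks wrapped with residual (skip) connections.
        self.block1 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.output_layer = nn.Linear(hidden, 1)

    def forward(self, x_cat, x_num):
        # Embed categorical features separately, then concatenate with the numeric features.
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        h = torch.relu(self.input_layer(torch.cat(embedded + [x_num], dim=1)))
        h = torch.relu(h + self.block1(h))  # residual connection keeps lower-layer signal intact
        h = torch.relu(h + self.block2(h))
        return self.output_layer(h)

net = TabularNet(cat_cardinalities=[10, 4], n_numeric=6)
logits = net(torch.randint(0, 4, (32, 2)), torch.randn(32, 6))  # random batch, made-up shapes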

AutoGluon-Tabular

Training a model with AutoGluon-Tabular is accomplished in three simple steps:

1) ‘Hello World’: install required packages

!python3 -m pip install -U pip
!python3 -m pip install -U setuptools wheel
!python3 -m pip install -U "mxnet<2.0.0"
!python3 -m pip install autogluon

2) Load dataset

dataset = 'dataset'
!kaggle competitions download -p {dataset} -q otto-group-product-classification-challenge
!unzip -d {dataset} {dataset}/otto-group-product-classification-challenge.zip
!rm {dataset}/otto-group-product-classification-challenge.zip
from autogluon import TabularPrediction as task
train_data = task.Dataset(file_path=f'{dataset}/train.csv').drop('id', axis=1)

3) Fit AutoGluon Model

label_column = 'target'    # the column we want to predict
savedir = 'saved_models/'  # where to save trained models
predictor = task.fit(train_data=train_data,
                     label=label_column,
                     output_directory=savedir,
                     eval_metric='log_loss',
                     auto_stack=True,
                     verbosity=2,
                     visualizer='tensorboard')

Monitor training

tensorboard --logdir saved_models/

Now point your browser to http://0.0.0.0:6006/

Next, make predictions and evaluate performance.

test_data = task.Dataset(file_path=f'{dataset}/test.csv')  # test split from the Kaggle download above
pred_probabilities = predictor.predict_proba(test_data, as_pandas=True)

Summary

We learned about the correct way to use gradient boosting ensembles and the AWS AutoGluon-Tabular library for producing an ensemble predictor.

We covered key concepts for working with tabular data: “boosting”, “stacking”, “blending”, squared-error (MSE) and logistic loss functions, entropy, Gini impurity, and the learning algorithms that have proven most successful:

Loss Functions and Impurity Measures

  • Entropy
  • Gini Impurity
  • MSE
  • Logistic Loss

Prediction Algorithms

  • Random Forests
  • Extremely Randomized Trees
  • k-Nearest Neighbors
  • LightGBM (boosted trees)
  • CatBoost (boosted trees)
  • Deep Neural Networks

To read more articles like this, follow me on Twitter, LinkedIn, or my Website.


Evan Brown

Professional Computer Vision Engineer and Data Scientist