AI/ML Security Pro Tips: The Power of Ensembling

Keith Kenemer
AI/ML at Symantec
Mar 6, 2019

Machine learning models have been used with great success in many application domains, including image classification, natural language processing, and security. In this blog post, we will explore ensembling, a collection of techniques for combining models in order to create a more powerful model. Ensembling methods have proven valuable in both academic and industry settings; here we review some background and selected methods, with a focus on machine learning classification. An understanding of ensembling concepts is not only useful in itself, but also (in my view) a necessary prerequisite for confronting many complex real-world classification problems.

Universal Approximation

Some widely used models, such as random forests and neural networks, are known to be universal approximators, meaning they have the capacity to learn any function mapping between inputs and outputs, although this may only hold asymptotically. So, what exactly does this mean? If we can approximate any mapping from input characteristics (features) to our classification labels, then shouldn't we be able to solve any classification problem perfectly? Initially it may seem so, but fitting a function to a training dataset is only part of the picture. As soon as we clarify our true goal (revealed in the next section), it will become apparent that limitations do arise despite the great power of universal approximation.

No Free Lunch

There will always be an inherent tension between learning specific patterns in the training data and generalizing to unseen data. For classification problems, the goal is to learn a mapping from input data to an output class label that also generalizes well to new data not seen during training. Unfortunately, the no free lunch theorem tells us there is no model creation method that generalizes well across all possible input data distributions. One problem that arises when training machine learning models is overfitting, which occurs when the model learns the training set very well but fails to generalize to new data: the model starts paying attention to irrelevant details (i.e. noise) in the data. This problem is mitigated to a large extent by regularization, which adds an extra penalty during training to encourage the model to limit its complexity, with the end result of better generalization. The use of validation data, which measures how well the model performs on data outside the training set, also helps by essentially checking for overfitting during training. The no free lunch theorem simply states that no single procedure finds a model that works well across all problems (there is no magic solution for everything); it certainly does not imply that a good model does not exist, nor does it prevent us from finding one for our specific problem.
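The overfitting and regularization story above can be made concrete with a toy sketch (my own illustration, not from the post): a high-degree polynomial fit to noisy samples achieves low training error but a larger validation error, while an added L2 (ridge) penalty limits the coefficients and, with them, the model's complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying function, plus a held-out validation set.
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * x) + rng.normal(0, 0.2, 30)
x_val = rng.uniform(-1, 1, 30)
y_val = np.sin(2 * x_val) + rng.normal(0, 0.2, 30)

deg = 12  # far more capacity than the underlying function needs

def design(x):
    return np.vander(x, deg + 1)  # polynomial feature matrix

def fit(x, y, alpha=0.0):
    """Least squares with an optional L2 (ridge) penalty alpha."""
    A = design(x)
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

def mse(w, x, y):
    return float(np.mean((design(x) @ w - y) ** 2))

w_over = fit(x, y)             # unregularized: chases the noise
w_reg = fit(x, y, alpha=1e-2)  # regularized: smoother, more general fit

print("unregularized train/val MSE:", mse(w_over, x, y), mse(w_over, x_val, y_val))
print("regularized   train/val MSE:", mse(w_reg, x, y), mse(w_reg, x_val, y_val))
```

The gap between training and validation error for the unregularized fit is exactly the overfitting signal that validation data is meant to expose.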

Bagging, Boosting, Stacking

Some classes of models are themselves already a hierarchy of simpler models. For example, one can think of neural networks as a collection of logistic regression models (loosely speaking, depending on the activation functions used) and random forests are collections of decision trees. We can therefore think of ensembling techniques as extending the hierarchy further in order to partially overcome the limitations of the constituent models. There are many different ways to combine models together — we will review some of the common techniques including bagging, boosting, and stacking, and provide some insight as to how they operate.

Bagging, shorthand for “bootstrap aggregating”, is a method which trains multiple model instances of the same algorithm type (in classical bagging) on random subsets of the training data. How are these random subsets obtained? Usually by sampling with replacement, which means each sample can be drawn multiple times. The predictions from each model are combined, typically by averaging or majority voting. The end result is a more robust model with less variance in the predictions. The random forest is an example of bagging which randomly subsamples both training data and input features. Bagging generally provides better results overall (due to the variance reduction) than training a single model on the entire dataset in one shot. A detailed review of the bias-variance trade-off in machine learning is a discussion for another day…

Figure 1 — Bagging Method of Ensembling
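As a toy sketch of this recipe (my own illustration, assuming a depth-1 decision stump as the base learner and synthetic Gaussian data), the bootstrap-then-vote loop looks like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes in 2 dimensions.
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit_stump(X, y):
    """Best single-feature threshold classifier (a depth-1 decision tree)."""
    best, best_err = (0, 0.0, 1), 1.0
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = (pol * (X[:, f] - t) > 0).astype(int)
                err = np.mean(pred != y)
                if err < best_err:
                    best, best_err = (f, t, pol), err
    return best

def predict_stump(stump, X):
    f, t, pol = stump
    return (pol * (X[:, f] - t) > 0).astype(int)

# Bagging: train each stump on a bootstrap sample (drawn with replacement) ...
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))
    stumps.append(fit_stump(X[idx], y[idx]))

# ... then combine the predictions by majority vote.
def bagged_predict(X):
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes > 0.5).astype(int)

acc = np.mean(bagged_predict(X) == y)
print(f"bagged training accuracy: {acc:.2f}")
```

Because each stump sees a different bootstrap sample, their individual errors differ, and the vote averages that variance away.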

Boosting takes a different approach: a strong model is built sequentially from a cascade of weaker models, each of which is tuned to learn the samples that proved difficult for the previous stages. Each model potentially sees all of the samples in the training set, but the samples are weighted according to the errors made by the preceding stages of the cascade. Additionally, each model does not necessarily need to use the same algorithm. The combining function can be arbitrary and varies with the particular method. For example, with AdaBoost (a well-known boosting technique) the output is a weighted sum of model outputs, where the weights are computed from each model's weighted error via a specific update rule, whereas gradient boosting (another well-known technique) fits each new model to the gradient of the loss with respect to the current ensemble's predictions, a form of gradient descent in function space.

Figure 2 — Boosting Method of Ensembling
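To make the weight-update idea concrete, here is a minimal AdaBoost sketch (my own toy example, using decision stumps on 1-D data that no single stump can classify well):

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D data where class +1 lives inside an interval: one threshold is not enough.
X = rng.uniform(-3, 3, 200)
y = np.where(np.abs(X) < 1.5, 1, -1)

def fit_stump(X, y, w):
    """Weighted decision stump sign(pol * (x - t)) minimizing weighted error."""
    best, best_err = (0.0, 1), np.inf
    for t in np.unique(X):
        for pol in (1, -1):
            pred = np.where(pol * (X - t) > 0, 1, -1)
            err = np.sum(w[pred != y])
            if err < best_err:
                best, best_err = (t, pol), err
    return best, best_err

def predict_stump(stump, X):
    t, pol = stump
    return np.where(pol * (X - t) > 0, 1, -1)

# AdaBoost: each round re-weights the samples the previous stumps got wrong.
w = np.full(len(X), 1 / len(X))
stumps, alphas = [], []
for _ in range(20):
    stump, err = fit_stump(X, y, w)
    err = max(err, 1e-10)                  # guard against a perfect stump
    alpha = 0.5 * np.log((1 - err) / err)  # AdaBoost model weight
    pred = predict_stump(stump, X)
    w *= np.exp(-alpha * y * pred)         # up-weight misclassified samples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X):
    # Final output: sign of the alpha-weighted sum of stump votes.
    score = sum(a * predict_stump(s, X) for a, s in zip(alphas, stumps))
    return np.sign(score)

acc = np.mean(boosted_predict(X) == y)
print(f"boosted training accuracy: {acc:.2f}")
```

The cascade of stumps learns the two interval boundaries that no single stump can represent on its own.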

What if you already have some classifiers, but want to improve results? Or perhaps you only have access to the outputs of other classifiers? Stacking to the rescue… Stacking utilizes a collection of models based on different learning algorithms, each trained on the entire training set, and then combines their predictions using a combining function which is itself learned in an additional training step. The combiner can be an arbitrary learning algorithm, but logistic regression is frequently used because it is simple and tends to work well for many problems.

Figure 3 — Stacking Method of Ensembling
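A minimal stacking sketch along these lines (my own toy example): we only have the score outputs of two fixed level-0 models, here hand-crafted per-feature scorers standing in for pre-trained classifiers, and we learn a logistic-regression combiner on top of them.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two Gaussian classes; each level-0 model only looks at one feature.
X = np.vstack([rng.normal(-1, 1, (150, 2)), rng.normal(1, 1, (150, 2))])
y = np.array([0] * 150 + [1] * 150)

def base_scores(X):
    # Per-feature score: distance to the class-0 mean minus distance to the
    # class-1 mean (larger means "looks more like class 1").
    return np.column_stack([np.abs(X[:, f] + 1) - np.abs(X[:, f] - 1)
                            for f in range(X.shape[1])])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Level-1 combiner: logistic regression on the level-0 outputs,
# trained with plain gradient descent.
A = np.column_stack([base_scores(X), np.ones(len(X))])  # scores + bias column
w = np.zeros(A.shape[1])
for _ in range(500):
    p = sigmoid(A @ w)
    w -= 0.1 * A.T @ (p - y) / len(y)

acc = np.mean((sigmoid(A @ w) > 0.5).astype(int) == y)
print(f"stacked training accuracy: {acc:.2f}")
```

The learned weights tell us how much the combiner trusts each base model, which is exactly the appeal of a simple linear meta-learner.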

Performance Considerations

Employing ensembling techniques is not a trivial task — there is no universal guarantee that the ensemble will perform better than any constituent model across the metrics-of-interest. One important consideration is how each model provides complementary information to the meta-learner or combiner. In the bagging method, each model instance uses the same algorithm but sees different data samples. With boosting, each model can utilize different algorithms and sees the data according to its own weighting in the cascade. Stacking also utilizes model instances with different algorithms each of which can partition the input space using different mechanisms. One of the big challenges for ensembling strong models is that they already perform well, which by definition implies correlated outputs.

We can illustrate this idea with a simple classification problem using decision trees trained on 20-dimensional synthetic Gaussian mixture data, comparing the ROC (receiver operating characteristic) curves of the individual decision trees vs. the ensemble. The ensemble predictions are computed by averaging the predictions of the individual models (bagging). The ROC curve shows how a classifier's discriminative ability varies as a function of the classification threshold and gives a good view into the overall performance of the classifier. A perfect classifier would produce an ROC curve approaching the (TPR, FPR) = (1, 0) coordinate, while random guessing yields a line with slope 1 (on a linear scale) connecting (0, 0) to (1, 1).

In this example, we take five simple decision trees which are constrained to utilize only two features when constructing decision thresholds and have a maximum depth of two levels. In this case the ensembled ROC (black curve) generally outperforms the individual models since it aggregates information from many features while the individual models only see a small subset of features.

Figure 4 — Ensembling Simple Decision Trees

Now, if we increase the complexity of each decision tree by allowing more features (sixteen) when constructing decision thresholds and increasing the maximum depth to four levels, then each tree becomes more of an ‘expert’ and the variance across the curves decreases as the trees are mostly in consensus. The ensembled ROC (black curve) no longer outperforms all models across the domain and in some regions approaches the average. Note that these examples are for illustration only; much better absolute performance can be achieved with more sophisticated models such as random forests, deep networks, etc. We can see that it is more difficult to realize an improvement when ensembling models with highly correlated output predictions than when the models make different errors and so provide complementary information to the meta-learner/combiner.

Figure 5 — Ensembling ‘Expert’ Decision Trees
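The experiment behind Figures 4 and 5 can be approximated with a short script; the data generator, seeds, and exact settings below are my own stand-ins for the post's setup, using scikit-learn trees with max_depth=2 and max_features=2 as in the weak-tree case:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

def make_data(n):
    # 20-dimensional Gaussian mixture: class means differ slightly per dimension.
    X = np.vstack([rng.normal(0.0, 1.0, (n, 20)), rng.normal(0.4, 1.0, (n, 20))])
    return X, np.array([0] * n + [1] * n)

X_train, y_train = make_data(1000)
X_test, y_test = make_data(1000)

# Five weak trees: depth 2, two candidate features per split,
# each trained on its own bootstrap sample (bagging).
probas, individual_aucs = [], []
for seed in range(5):
    idx = rng.integers(0, len(X_train), len(X_train))
    tree = DecisionTreeClassifier(max_depth=2, max_features=2, random_state=seed)
    tree.fit(X_train[idx], y_train[idx])
    p = tree.predict_proba(X_test)[:, 1]
    probas.append(p)
    individual_aucs.append(roc_auc_score(y_test, p))

# Ensemble score: average the predicted probabilities, as in the post.
ensemble_auc = roc_auc_score(y_test, np.mean(probas, axis=0))
print("individual AUCs:", np.round(individual_aucs, 3))
print(f"ensemble AUC:    {ensemble_auc:.3f}")
```

Plotting the full ROC curves (e.g. with sklearn.metrics.roc_curve) instead of summarizing by AUC reproduces the comparison shown in the figures.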

Conclusion

Ensembling is a powerful and proven set of methods for improving the classification performance and robustness of machine learning models. We should keep in mind the effects of using weak vs. strong learners and the fundamental limitations of generalization to guide our model choices and set expectations for performance. In practice, our data may be limited and noisy, and it may be difficult to determine which ensemble method to adopt for a given problem, but in principle ensembling should provide benefit, even if there are diminishing returns when the constituent models are highly correlated. By joining the predictive strengths of many models, ensembling gives us an effective way to scale our capabilities to complex real-world classification problems.
