An introduction to model ensembling

Model ensembling represents a family of techniques that help reduce generalization error in machine learning tasks. In this article, I will share some ways that ensembling has been employed and some basic intuition on why it works.


The terminology in the literature varies because ensembling is used across multiple disciplines. For consistency, this article uses the following terms:

Rank ensembling: involves combining the predictions of various models on the test set. This is the most basic and convenient way to ensemble as the original models do not need to be retrained.

Stacking (or stacked generalization): was introduced by Wolpert in a 1992 paper, 2 years before the seminal Breiman paper “Bagging Predictors”. The basic idea here is to use a pool of base predictors, and then use another predictor to combine the base predictions. This technique was popularized by the Netflix Prize competition.

Blending: very similar to stacking, but it sets aside a small holdout set (say 10%) of the training set. The stacker model then trains on this holdout set only.
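To make the blending idea concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the helper names (`holdout_split`, `fit_blend_weight`), the two hypothetical base models whose holdout predictions are hard-coded, and the deliberately tiny "stacker", which only learns a single mixing weight by grid search. A real stacker would usually be a proper model.

```python
import random

def holdout_split(rows, holdout_fraction=0.1, seed=0):
    """Shuffle a copy of rows and split it into (base_train, holdout)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def fit_blend_weight(preds_a, preds_b, targets, steps=100):
    """Toy stacker: pick w in [0, 1] minimizing the holdout MSE of
    w * preds_a + (1 - w) * preds_b."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]
        err = mse(blended, targets)
        if err < best_err:
            best_w, best_err = w, err
    return best_w

# The base models would be fitted on base_train only (not shown here).
base_train, holdout = holdout_split(list(range(100)))

# Hypothetical holdout predictions from two already-fitted base models.
holdout_targets = [1.0, 2.0, 3.0, 4.0]
model_a_preds = [1.2, 2.2, 3.2, 4.2]  # overshoots by 0.2
model_b_preds = [0.8, 1.8, 2.8, 3.8]  # undershoots by 0.2

blend_weight = fit_blend_weight(model_a_preds, model_b_preds, holdout_targets)
print(blend_weight)  # 0.5
```

Because the stacker only ever sees the holdout set, the base models' training data never reaches the second stage, which is the main appeal of blending over plain stacking.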


In real-world cases, training models to generalize on a dataset can be a very challenging problem as it could contain many underlying distributions. Certain models will do well in modelling one aspect of this data while others will do well in modelling another. Ensembling provides a solution where we can train these models and make a composite prediction whose final accuracy can be better than that of any individual model.

To illustrate this point, let’s assume you have 3 binary classifiers (A, B, C), each with 70% accuracy, whose errors are independent of one another. With “class 1” as the ground truth, each classifier outputs “class 1” 70% of the time and “class 0” the other 30% of the time. For a majority vote with 3 models, there are 4 possible scenarios:

Scenario 1: all three models are correct
= 0.7 * 0.7 * 0.7 = 0.343

Scenario 2: two models are correct
= (0.7 * 0.7 * 0.3) + (0.7 * 0.3 * 0.7) + (0.3 * 0.7 * 0.7) = 0.441

Scenario 3: two models are wrong
= (0.3 * 0.3 * 0.7) + (0.3 * 0.7 * 0.3) + (0.7 * 0.3 * 0.3) = 0.189

Scenario 4: all three models are wrong
= 0.3 * 0.3 * 0.3 = 0.027

From this we see that a majority vote corrects an error ~44% of the time (Scenario 2, where the one wrong model is outvoted). More importantly, this majority vote will be correct ~78% (0.343 + 0.441 = 0.784) of the time. This illustrates the fact that even though each base model has an accuracy of 70%, the ensemble has an accuracy of ~78%!
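The same calculation generalizes to any odd number of voters through the binomial distribution. The sketch below assumes, as the example does, that every classifier has the same accuracy and that their errors are independent; the function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, votes for the right class."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )

print(round(majority_vote_accuracy(0.7, 3), 3))  # 0.784
print(round(majority_vote_accuracy(0.7, 5), 3))  # 0.837
```

Adding more independent voters keeps pushing the accuracy up, but the independence assumption is doing most of the work here, and real models rarely satisfy it.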

A good ensemble contains high-performing models that are only weakly correlated with each other.

Using the same example as before, where the ground truth is always “class 1” (represented as 1), three highly correlated models produce an ensemble with no improvement:

1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy

The majority vote ensemble produces:
1111111100 = 80% accuracy

Now compare the results of 3 uncorrelated models:

1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy

The majority vote ensemble produces:
1111111101 = 90% accuracy
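Both toy examples can be reproduced with a few lines of Python. This is only a sketch: predictions are kept as strings of 0s and 1s exactly as written above, and the helper names `majority_vote` and `accuracy` are invented for the illustration.

```python
def majority_vote(predictions: list[str]) -> str:
    """Per-position majority vote over equal-length strings of '0'/'1'."""
    if len(predictions) % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    voted = []
    for votes in zip(*predictions):
        ones = votes.count("1")
        voted.append("1" if ones > len(votes) / 2 else "0")
    return "".join(voted)

def accuracy(prediction: str) -> float:
    """Fraction of correct positions; the ground truth is always '1' here."""
    return prediction.count("1") / len(prediction)

correlated = ["1111111100", "1111111100", "1011111100"]
uncorrelated = ["1111111100", "0111011101", "1000101111"]

for name, models in [("correlated", correlated), ("uncorrelated", uncorrelated)]:
    ensemble = majority_vote(models)
    print(name, ensemble, accuracy(ensemble))
# correlated 1111111100 0.8
# uncorrelated 1111111101 0.9
```

The uncorrelated models are individually weaker, yet their mistakes rarely land on the same sample, so the vote can repair them.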

Ensembling techniques

The following gist outlines how a typical ensembling script would be structured and helped me understand the fine points:

Another excellent example which implements ensembling with a dataset can be found HERE.

Rank ensembling techniques

Voting: When you have a good collection of large, well-performing models, a voting ensemble works well, as illustrated in the ‘intuition’ section above. A sample voting script will help walk through a practical example.

Averaging: Works well on a wide range of problems and metrics (AUC, MSE, LogLoss). Here, an average of predictions from all base models is used to make a final prediction. A sample averaging script will help walk through the details.
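As a rough illustration of plain averaging, the sketch below takes the element-wise mean of several models' predictions for the same samples. The prediction values and the function name `average_predictions` are made up for the example.

```python
def average_predictions(model_predictions):
    """Element-wise mean of several models' predictions for the same samples."""
    n_models = len(model_predictions)
    return [sum(sample) / n_models for sample in zip(*model_predictions)]

# Hypothetical predicted probabilities from three models on four samples.
preds = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.7, 0.5],
    [0.7, 0.3, 0.5, 0.3],
]
averaged = average_predictions(preds)
print([round(p, 2) for p in averaged])  # [0.8, 0.2, 0.6, 0.4]
```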

Rank averaging: Similar to averaging, but instead of giving every model an equal weight, each model is weighted by its normalized validation score, which can help improve generalization. A sample rank averaging script will help walk through the details.
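Following the description above, a score-weighted average might look like the sketch below. It assumes validation scores where higher is better and all values are positive (accuracy or AUC, say), and it normalizes them so the weights sum to 1. The function name and the numbers are hypothetical. Note that elsewhere the term "rank averaging" is also used for averaging rank-transformed predictions, which this sketch does not do.

```python
def weighted_average(model_predictions, validation_scores):
    """Average predictions with weights proportional to validation scores."""
    if len(model_predictions) != len(validation_scores):
        raise ValueError("need exactly one validation score per model")
    total = sum(validation_scores)
    weights = [score / total for score in validation_scores]
    return [
        sum(w * p for w, p in zip(weights, sample))
        for sample in zip(*model_predictions)
    ]

# Two hypothetical models that disagree completely on two samples.
preds = [
    [1.0, 0.0],  # model with validation score 0.9
    [0.0, 1.0],  # model with validation score 0.6
]
blended = weighted_average(preds, validation_scores=[0.9, 0.6])
print([round(p, 2) for p in blended])  # [0.6, 0.4]
```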

Historical averaging: Rank averaging requires a validation set. However, if you only want to predict for a single new sample, a solution could be to use historical ranks. This involves storing old test set predictions together with their ranks. Then, when you want to make a new prediction, you find the closest old prediction and take its historical rank.
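One way the lookup could work is sketched below. The class name `HistoricalRanker` and the choice to scale ranks to the range [0, 1] are assumptions made for this illustration; the text only specifies storing old predictions with their ranks and returning the rank of the closest one.

```python
from bisect import bisect_left

class HistoricalRanker:
    """Maps a new prediction to the rank of the closest stored prediction."""

    def __init__(self, old_predictions):
        if not old_predictions:
            raise ValueError("need at least one stored prediction")
        # Sort once; a prediction's rank is its position in sorted order,
        # scaled to [0, 1] (an assumption of this sketch).
        self._sorted = sorted(old_predictions)
        n = len(self._sorted)
        self._ranks = [i / (n - 1) if n > 1 else 0.0 for i in range(n)]

    def rank(self, prediction):
        i = bisect_left(self._sorted, prediction)
        if i == 0:
            return self._ranks[0]
        if i == len(self._sorted):
            return self._ranks[-1]
        # Pick whichever neighbour is closer to the new prediction.
        before, after = self._sorted[i - 1], self._sorted[i]
        closest = i if after - prediction < prediction - before else i - 1
        return self._ranks[closest]

ranker = HistoricalRanker([0.1, 0.4, 0.35, 0.8, 0.9])
print(ranker.rank(0.42))  # 0.5, because the closest stored prediction is 0.4
print(ranker.rank(0.95))  # 1.0, above every stored prediction
```

Keeping the stored predictions sorted makes each lookup a binary search, so scoring a single new sample stays cheap even with a large history.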

Stacking techniques

Feature weighted linear stacking: This stacks engineered meta-features together with model predictions. The idea here is that the stacking model learns which base model is the best predictor for samples with a certain feature value. Full implementation details and the intuition behind this method are outlined in this paper by Sill et al.
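In symbols, the idea can be written as follows. The notation is chosen for this article and may not match the paper exactly: g_i(x) is the prediction of base model i, f_j(x) is meta-feature j, and v_ij are the coefficients the stacker learns. Each model's blending weight is itself a linear function of the meta-features, so the final prediction is linear in the products of meta-features and model predictions.

```latex
\[
b(x) = \sum_{i} w_i(x)\, g_i(x), \qquad
w_i(x) = \sum_{j} v_{ij}\, f_j(x)
\quad\Longrightarrow\quad
b(x) = \sum_{i,j} v_{ij}\, f_j(x)\, g_i(x)
\]
```

Because b(x) is linear in the v_ij, the coefficients can be fitted with ordinary linear regression on the product features.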

Quadratic weighted stacking: This is similar to the feature weighted linear stacking method, but it creates non-linear combinations of model predictions as features for the second stage model.
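The text does not pin down which non-linear combinations are used. One plausible reading, assumed here purely for illustration, is to pass the second stage model the base predictions together with all of their pairwise products (squares included). The function name `quadratic_features` is hypothetical.

```python
from itertools import combinations_with_replacement

def quadratic_features(base_predictions):
    """Base predictions for one sample plus all degree-2 products of them."""
    linear = list(base_predictions)
    pairwise = [a * b for a, b in combinations_with_replacement(linear, 2)]
    return linear + pairwise

# Predictions of two hypothetical base models for a single sample.
print(quadratic_features([2.0, 3.0]))  # [2.0, 3.0, 4.0, 6.0, 9.0]
```

With m base models this produces m + m(m + 1)/2 features per sample, so the second stage grows quickly as models are added.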

StackNet: This meta-modelling framework resembles a feedforward neural network and uses Wolpert’s stacked generalization on multiple levels. However, rather than being trained through backpropagation like a traditional neural network, StackNet is built iteratively one layer at a time (using stacked generalization), and each layer uses the final target as its target. Details of StackNet can be found in its GitHub repo.

Concluding thoughts

Even though there are many advantages to ensembling, here are some pitfalls one should consider:

  • Sharply increasing training times and computational requirements. This greatly slows down iteration and ultimately reduces the number of experiments you can run.
  • Increased demand on infrastructure to maintain and update these models.
  • Greater chance of data leakage between models and/or stages in the ensemble.

However, even though these monster ensembles have their issues, here are some advantages you should consider:

  • You can often beat state-of-the-art academic benchmarks that were established with a single model.
  • These ensemble models provide insight about the data, and what they learn can be transferred to a simpler, shallower model to decrease overall complexity.
  • Not all base models necessarily need to finish in time. In that regard, ensembling introduces a form of graceful degradation: loss of one model is not fatal for creating good predictions.
  • Finally, large ensembles guard against overfitting and add a form of regularization, without needing to fiddle around and tune individual models.

Happy ensembling!

Weights and Biases

Teaching machines to be intelligent

Written by Jovan Sardinha
