Bagging: Quantity, Not Quality.
Bootstrap aggregating with an example

msoczi
7 min read · Jun 14, 2024


Ensemble methods — quantity, not quality

Ensemble methods in machine learning refer to techniques that combine multiple models to improve predictive performance. The basic idea is to aggregate the predictions of several base models (often called weak learners) to produce a final prediction that is typically more accurate and robust than that of any individual model.

Personally, I usually live by the principle of Quality over quantity. However, in this case, it turns out that the opposite principle works just as well.

Ensemble methods typically fall into two categories:

  1. Bagging:
    This approach involves training multiple instances of the same base learning algorithm on different subsets of the training data. Each model in the ensemble learns independently, and then their predictions are combined, often by averaging or voting, to make the final prediction.
  2. Boosting:
    Boosting works by sequentially training a series of weak learners, where each subsequent model focuses on the examples that the previous ones struggled with. The final prediction is usually a weighted sum of the predictions made by each weak learner.

Common ensemble methods include Random Forest, AdaBoost, Gradient Boosting Machines, and Extreme Gradient Boosting. These methods are widely used in various machine learning tasks due to their ability to improve predictive accuracy and generalization.

In this article, we will take a closer look at the first of these methods: bagging.

What is bagging? — short introduction

Bagging (bootstrap aggregating) is a technique in machine learning where we create multiple copies of a model and train each copy on a different subset of the training data. These subsets are created by randomly selecting samples with replacement (this is where the “bootstrap” part comes in). After training each model, their predictions are combined in some way to make a final prediction.
Bagging helps to reduce the variance of the predictions by averaging or voting over multiple models, leading to more stable and accurate predictions than a single model alone.

Imagine you’re trying to guess the number of candies in a jar. If you ask just one friend for their guess, they might be off by a lot. But if you ask several friends, each with different perspectives and ways of guessing, and then you average out their guesses, you’re likely to get a much more accurate estimate.

How to reduce variance?

As we mentioned earlier, bagging is a method of reducing variance. It is based on a simple observation.
Let’s assume we have n independent random variables X_1, …, X_n, each with the same variance σ^2. Each of these variables corresponds to the prediction of one weak learner. What if we average the results across all weak learners?
Let X̄ be the average of all weak learners. Let’s see how the variance of such a random variable behaves.

So, we want to find the variance of

Var(X̄) = Var( (X_1 + … + X_n) / n )

Let’s do some calculations. From the properties of variance (a multiplicative constant comes out squared)

Var(X̄) = (1/n^2) · Var(X_1 + … + X_n)

Since the variables X_i are independent, the variance of a sum is the sum of the variances, so we can write:

Var(X̄) = (1/n^2) · ( Var(X_1) + … + Var(X_n) )

But all random variables X_1, …, X_n have the same variance σ^2, so:

Var(X̄) = (1/n^2) · n · σ^2

Therefore

Var(X̄) = σ^2 / n

Averaging a set of observations reduces variance. Therefore, a natural way to decrease variance, and thus increase the accuracy of prediction of a given learning method, is to draw multiple training sets from the population, build a separate predictive model using each training set, and average the resulting predictions.
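
To see this numerically, here is a minimal simulation sketch (the numbers, 10 learners each with standard deviation 2, are illustrative assumptions of mine) showing that the variance of the average is roughly σ^2 / n:

import numpy as np

rng = np.random.default_rng(0)

n = 10          # number of independent "weak learners"
sigma = 2.0     # each learner's predictions have variance sigma^2 = 4
trials = 100_000

# Each row contains n independent predictions with variance sigma^2
X = rng.normal(loc=0.0, scale=sigma, size=(trials, n))

print(np.var(X[:, 0]))         # variance of a single learner, ~4.0
print(np.var(X.mean(axis=1)))  # variance of the average, ~4.0 / 10 = 0.4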

How does bagging work?

In short, we build K models (f_1, …, f_K) using K different training sets (S_1, …, S_K). Each model f_i is trained on a different set S_i. Then, we average the results obtained from all K models to obtain a single statistical model with low variance:

f̂(x) = (1/K) · ( f_1(x) + … + f_K(x) )

Unfortunately, in practice it is usually difficult to obtain so many different training sets. We often struggle with the problem of having too little data. What can we do in such a situation? As you probably guessed, the name “bootstrap aggregating” reveals how we can deal with the problem of obtaining different training data sets: perform bootstrap sampling.

Bootstrap sampling is nothing more than random sampling with replacement.

Random sampling with replacement is a process where items are randomly selected from a dataset, and after each selection, the item is put back into the dataset. This means that the same item can be selected multiple times during the sampling process.
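
For example, a single bootstrap draw in NumPy could look like this (the toy array is made up for illustration):

import numpy as np

rng = np.random.default_rng(42)
data = np.array([10, 20, 30, 40, 50])

# Draw len(data) items with replacement: some values can repeat,
# others may not appear at all
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)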

Therefore, in the bagging method, or more precisely bootstrap aggregating, we build K models using K bootstrap samples, and then we average the results obtained from all models to obtain a single prediction.

Side effect — the “out of bag” set

It turns out that, on average, only about 2/3 of the observations are used in the bootstrapping process to construct each tree. Observations that were not used during the construction of a given tree are called its out-of-bag (OOB) observations.

So, if we perform the bootstrap K times, then for each observation, on average about K/3 trees did not use that observation. We can predict that observation using exactly these trees and take the average of their errors. The total error estimated by OOB (the average error across all observations) is a good approximation of the test error.
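
scikit-learn can compute this OOB estimate for us. Below is a minimal sketch, assuming scikit-learn >= 1.2 (where the parameter is called estimator) and a made-up toy dataset:

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy data, only to make the snippet self-contained
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = X.ravel() * np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)

model = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),
    n_estimators=100,
    oob_score=True,  # score each observation only with trees that did not see it
    random_state=0,
)
model.fit(X, y)

print(model.oob_score_)           # OOB R^2 score, an approximation of test performance
print(model.oob_prediction_[:5])  # OOB predictions for the first few observations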

Why are 1/3 of observations not utilized in the tree construction process?

Let’s imagine we have n observations. Then, in a single draw, the probability of not selecting a particular observation is equal to

1 − 1/n

If we draw with replacement n times, the probability that this observation is never selected is

(1 − 1/n)^n

which in the limit (or, in practice, already for large n) gives approximately

(1 − 1/n)^n → 1/e ≈ 0.368 ≈ 1/3
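
A quick numerical check of this limit:

import math

for n in [10, 100, 1_000, 10_000]:
    print(n, (1 - 1 / n) ** n)   # approaches 1/e as n grows

print(math.exp(-1))              # 1/e ≈ 0.3679, roughly 1/3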

Example

Let’s start by generating an artificial dataset.

The data is one-dimensional and represents a function on the interval [0, 10] described by the equation y = x·sin(x), to which some random noise has been added.
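
Such a dataset could be generated, for example, as follows (a sketch: the sample size and noise level are assumptions of mine; the names data_train and target_train match those used in the implementation section below):

import numpy as np

rng = np.random.default_rng(0)

n_samples = 100  # assumed sample size
# One-dimensional inputs on [0, 10], kept as a column vector for scikit-learn
data_train = np.sort(rng.uniform(0, 10, size=(n_samples, 1)), axis=0)
# y = x * sin(x) plus Gaussian noise (assumed noise level)
target_train = data_train.ravel() * np.sin(data_train.ravel())
target_train += rng.normal(scale=0.5, size=n_samples)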

First, we need to determine how many estimators (models) we want to build (i.e., we need to specify the number K). Let’s assume K = 3.
Now, we will demonstrate how the bootstrap works.
In bootstrapping, the goal is to randomly sample observations with replacement. In our case, we draw a training sample three times, once for each model.
Because we are drawing with replacement, some observations may not be drawn at all, while others may be drawn multiple times. The more intense the yellow color, the more times a given observation has been drawn.
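
Continuing with the data_train and target_train arrays from the sketch above, drawing the three bootstrap samples could look like this:

import numpy as np

rng = np.random.default_rng(1)
K = 3  # number of bootstrap samples / models

n_samples = len(data_train)
# For each model, draw n_samples indices with replacement
bootstrap_indices = [
    rng.choice(n_samples, size=n_samples, replace=True) for _ in range(K)
]
# Each pair (X_boot, y_boot) is the training set of one model
bootstrap_sets = [
    (data_train[idx], target_train[idx]) for idx in bootstrap_indices
]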

Now we train a separate estimator (in our case, a regression tree) on each of the samples. This way, we obtain three regression tree models.
We will present them all on one plot to better visualize the difference between them.

By aggregating the results, i.e., taking the average prediction value from all 3 trees, we obtain the final model.
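
A sketch of this step, continuing with bootstrap_sets from above (max_depth=3 mirrors the tree depth used in the sklearn snippet later on):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fit one regression tree per bootstrap sample
trees = []
for X_boot, y_boot in bootstrap_sets:
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X_boot, y_boot)
    trees.append(tree)

# The bagged prediction is simply the average of the individual trees' predictions
grid = np.linspace(0, 10, 200).reshape(-1, 1)
bagged_prediction = np.mean([tree.predict(grid) for tree in trees], axis=0)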

For example, let’s look at the predictions of each tree built on different bootstrap samples for the value 𝑥=8:
Tree 0: 4.54997803
Tree 1: 5.64685022
Tree 2: 5.79985777
The final model is simply the average of the results of the individual components, therefore:

(4.54997803 + 5.64685022 + 5.79985777) / 3 ≈ 5.3322

Implementation

Fortunately, we don’t have to do all of this manually. We are assisted by the implementation in the popular sklearn module.
We only need to choose the base estimator (in our case, a regression tree) and the number of estimators, i.e., how many models we want to build and then average.

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# In scikit-learn >= 1.2 the parameter is called `estimator`
# (older versions used `base_estimator`)
bagged_trees = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),
    n_estimators=3,
)
bagged_trees.fit(data_train, target_train)

The results using BaggingRegressor from sklearn are as follows:
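
For instance, we can compare the fitted model’s prediction at x = 8 with the manual average computed earlier (the exact value will vary between runs, because the bootstrap draws inside BaggingRegressor are random unless random_state is set):

# Prediction of the bagged model at x = 8
print(bagged_trees.predict([[8.0]]))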

Until next time! 😁
