A Comprehensive Guide to Ensemble Techniques: Bagging and Boosting

Abhishek Jain
6 min read · Sep 13, 2024


In machine learning, ensemble techniques are powerful methods that combine the predictions of multiple models to improve accuracy, reduce variance, and enhance generalization to unseen data. Instead of relying on a single model, ensemble methods leverage the collective power of several models to make more robust predictions. Two of the most popular ensemble techniques are bagging and boosting, both of which are widely used to enhance the performance of models, especially decision trees.

In this blog, we will dive into what ensemble techniques are, explore the concepts of bagging and boosting, and use examples to illustrate how they work.

What Are Ensemble Techniques?

Ensemble techniques aim to improve the predictive performance of machine learning models by combining the outputs of several base models, known as weak learners. A weak learner is a model that performs slightly better than random guessing, and when many of them are combined, the result is a strong learner that significantly improves accuracy.

There are two main types of ensemble techniques:

  1. Bagging (Bootstrap Aggregating)
  2. Boosting

Why Use Ensemble Techniques?

  • Reduce Overfitting: By averaging multiple models, ensemble techniques can smooth out irregularities that may cause overfitting in a single model.
  • Improve Accuracy: Combining multiple weak models leads to a more accurate prediction.
  • Increase Stability: Ensembles are more robust to variations in the training data and less sensitive to noise.

1. Bagging (Bootstrap Aggregating)

Bagging is an ensemble technique that aims to reduce variance and prevent overfitting by training multiple models independently and then averaging their predictions. Bagging works by generating different training datasets through bootstrapping and training a separate model on each of these datasets.

Steps to Generate Training Datasets Using Bootstrapping:

  1. Original Dataset: Begin with the original training dataset of size N, containing N data points (samples).
  2. Sampling with Replacement: For each bootstrapped dataset, randomly select N samples from the original dataset. Since this is sampling with replacement, each selected sample is returned to the pool after it is drawn, so it can be picked again.
  • Some samples will be selected multiple times.
  • Other samples might not be selected at all in a particular bootstrapped dataset.
  3. Create Multiple Bootstrapped Datasets: Repeat the above process to generate M different bootstrapped datasets, where M is the number of models (weak learners) you plan to train in your bagging ensemble. Each dataset has the same number of samples (N), but the mix of data points in each differs because of the random sampling.
  4. Train Models: Train a separate model (e.g., a decision tree) on each of the bootstrapped datasets.
  5. Aggregate Predictions: After all models are trained, combine their predictions (majority voting for classification, averaging for regression) to make the final prediction. A from-scratch sketch of these steps follows this list.
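
The snippet below is a minimal from-scratch sketch of steps 1–5, not a production implementation: it assumes scikit-learn decision trees as the weak learners and a synthetic dataset from make_classification, and all parameter values (N, M, depths, seeds) are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

N = len(X)   # size of the original dataset
M = 10       # number of bootstrapped datasets / weak learners
rng = np.random.default_rng(42)

models = []
for _ in range(M):
    # Steps 2-3: sample N indices with replacement to form a bootstrapped dataset
    idx = rng.integers(0, N, size=N)
    X_boot, y_boot = X[idx], y[idx]

    # Step 4: train a separate model on each bootstrapped dataset
    tree = DecisionTreeClassifier(max_depth=3).fit(X_boot, y_boot)
    models.append(tree)

# Step 5: aggregate predictions by majority vote (classification)
all_preds = np.array([m.predict(X) for m in models])  # shape (M, N)
final_pred = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), axis=0, arr=all_preds
)
print("Ensemble training accuracy:", (final_pred == y).mean())
```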

Example of Bootstrapping:

Let’s say you have a small dataset with 5 data points:

Original Dataset={A,B,C,D,E}

Using bootstrapping to generate a new dataset of size 5, you might sample:

  • Sample 1: {B,C,A,A,E} (Notice that A appears twice, and D is missing)
  • Sample 2: {A,D,C,C,B} (Here C appears twice, and E is missing)
  • Sample 3: {B,B,D,E,C}

Each of these samples will be used to train a separate model.
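
In code, a bootstrapped sample is simply sampling with replacement. A quick way to reproduce samples like the ones above (the exact contents will differ from run to run unless you fix a random seed):

```python
import random

original = ["A", "B", "C", "D", "E"]

# Draw three bootstrapped datasets of the same size as the original,
# sampling with replacement each time.
for i in range(3):
    sample = random.choices(original, k=len(original))
    print(f"Sample {i + 1}: {sample}")
```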

How Bagging Works:

  1. Bootstrap Sampling: Randomly sample data from the training set with replacement to create multiple subsets (bootstrap samples). Some data points may be repeated, while others might be omitted in each sample.
  2. Train Models: A separate model (usually a weak learner like a decision tree) is trained on each bootstrap sample.
  3. Aggregate Predictions: For classification tasks, the predictions from all the models are combined using majority voting. For regression tasks, the models’ predictions are averaged.

(Figure: general workflow of the bagging technique)
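
In practice you rarely write this loop by hand; scikit-learn packages the whole workflow. A minimal sketch, assuming the same kind of synthetic data as above and illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 50 decision trees, each trained on a bootstrap sample of the training set.
# Note: the base model is passed as `estimator` in scikit-learn >= 1.2
# (older versions use `base_estimator`).
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,  # sample with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))
```

The best-known specialisation of this idea is the random forest, which additionally gives each tree a random subset of features to further decorrelate the models.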

Advantages of Bagging:

  • Reduces variance: By averaging the predictions of multiple models, bagging reduces the model’s variance, making it less likely to overfit.
  • Parallelizable: Since each model is trained independently, bagging can be parallelized easily.

Disadvantages of Bagging:

  • Less effective on bias: Bagging primarily reduces variance but does not significantly reduce bias if the individual models are weak.

What Does Bias Mean in the Context of Bagging?

When we say bagging reduces variance but does not significantly reduce bias, we’re highlighting the distinction between two types of errors a model can have:

  1. Bias (Systematic Error):
  • Bias is the error due to overly simplistic assumptions in the model. If the individual models (e.g., decision trees) are weak and overly simplistic, they will not capture the full complexity of the data. As a result, even after bagging, the ensemble model may still have high bias, meaning it systematically makes wrong predictions.
  • For example, if the base model is a decision tree that has been pruned too aggressively (a shallow tree), it might not capture important interactions between features, leading to inaccurate predictions even on the training data.
  2. Variance (Sensitivity to Data):
  • Variance refers to the sensitivity of a model to fluctuations in the training data. Models with high variance (such as deep decision trees) can overfit the training data, capturing noise and generalizing poorly to unseen data. Bagging, by combining multiple models trained on different subsets of the data, smooths out this overfitting and thus reduces variance.
  • A good model should have low variance (i.e., it should not be overly sensitive to small changes in the training data), because high variance leads to overfitting: the model performs well on the training data but poorly on unseen data, failing to generalize.

Bias in Weak Learners

In the context of bagging, weak learners are models that have high bias but low variance. For example, a shallow decision tree may perform poorly because it doesn’t model the data in a detailed enough way (high bias). When bagging combines many of these weak learners, the variance of the ensemble is reduced because the models are aggregated, but since all the individual models are systematically biased, bagging will not correct this bias.
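
One way to observe this is to bag very shallow trees (decision stumps) and compare the ensemble against a single stump: the scores typically stay close together, because every stump shares the same systematic bias. A hedged sketch (the exact numbers depend on the dataset and seed, which are synthetic and illustrative here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10, random_state=1)

stump = DecisionTreeClassifier(max_depth=1)  # high bias, low variance
bagged_stumps = BaggingClassifier(estimator=stump, n_estimators=100, random_state=1)

print("Single stump :", cross_val_score(stump, X, y, cv=5).mean())
print("Bagged stumps:", cross_val_score(bagged_stumps, X, y, cv=5).mean())
# The two scores tend to be similar: bagging averages away variance,
# but it cannot fix the stumps' shared underfitting (bias).
```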

2. Boosting

Boosting is another ensemble technique that focuses on reducing both bias and variance by training models sequentially, where each subsequent model attempts to correct the errors of the previous ones. Unlike bagging, where models are trained independently, boosting builds models iteratively.

At prediction time, the test data goes to each of the models in the ensemble. For regression, the final output is the average of the values returned by the individual models; for classification, voting takes place (technically, a majority voting classifier) and the class with the most votes is declared the final result. In boosting, these averages and votes are weighted by how well each model performed during training, rather than counted equally.

How Boosting Works:

  1. Train Weak Learners Sequentially: Models are trained one after another, and each model pays more attention to the data points that were misclassified by previous models.
  2. Adjust Weights: In each iteration, the weights of incorrectly predicted samples are increased, so subsequent models focus more on those difficult-to-predict samples.
  3. Aggregate Predictions: The final prediction is made by taking a weighted average or sum of the individual model predictions.
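
AdaBoost is the classic algorithm that implements this re-weighting loop. A minimal scikit-learn sketch, again on synthetic data with illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new stump focuses on the samples the previous stumps got wrong,
# and the final prediction is a weighted vote over all stumps.
# (As with bagging, the base model is `estimator` in scikit-learn >= 1.2.)
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
boosting.fit(X_train, y_train)
print("Test accuracy:", boosting.score(X_test, y_test))
```

Gradient boosting (and libraries such as XGBoost and LightGBM) follows the same sequential idea but fits each new model to the residual errors rather than re-weighting samples.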

Example of Boosting:

Imagine a boosting model is trained to predict house prices. The first weak learner might underperform on large houses, but the second weak learner focuses on improving predictions for these large houses. The process continues, with each new model focusing on areas where previous models made errors.
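
For a regression flavour of this house-price example, gradient boosting fits each new tree to the errors left by the trees before it. The sketch below uses an entirely synthetic "house" dataset, invented purely for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic houses: [area in sq. ft, bedrooms] -> price (illustrative only)
rng = np.random.default_rng(0)
area = rng.uniform(500, 5000, size=1000)
bedrooms = rng.integers(1, 6, size=1000)
price = 50_000 + 120 * area + 10_000 * bedrooms + rng.normal(0, 20_000, size=1000)

X = np.column_stack([area, bedrooms])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

# Each of the 200 shallow trees is fit to the residual errors of the ensemble so far.
gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
gbr.fit(X_train, y_train)
print("R^2 on held-out houses:", gbr.score(X_test, y_test))
```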

Advantages of Boosting:

  • Reduces bias and variance: Boosting reduces both the bias and variance of the model, resulting in highly accurate predictions.
  • Works well with weak learners: Even weak models, like shallow decision trees, can be combined through boosting to create a powerful predictor.

Disadvantages of Boosting:

  • Sensitive to outliers: Since boosting focuses on correcting errors, it can place too much emphasis on noisy data points or outliers.
  • Computationally expensive: Boosting is a sequential process, making it slower than parallelizable methods like bagging.
