Ensemble Learning: Boosting, Bagging, Stacking (Part 1)

Aymen Boukhari
6 min read · Aug 1, 2024


In this article, we are going to dive deep into ensemble learning methods, including boosting, bagging, and stacking.

In this part, we are going to see:

  1. What ensemble learning is
  2. How to Combine Base Learners?
  3. Ensemble learning methods (Boosting)

1. What is Ensemble Learning?

Ensemble learning is a technique used in machine learning that combines multiple models to perform a specific task, rather than relying on a single model such as an SVM or a decision tree. This approach can be divided into two categories: parallel and sequential ensembles.

  • Sequential Ensembles: In this category, base classifiers are trained iteratively. The model at each iteration learns to correct the errors made by the previous model.
  • Parallel Ensembles: These train different base classifiers independently and then combine their results to obtain a final result, as in Random Forests. Parallel ensembles can be further divided into homogeneous and heterogeneous ensembles:
      • Homogeneous Ensembles: Use the same base classifier, meaning the same machine learning algorithm is applied to every base model.
      • Heterogeneous Ensembles: Use different base models.
Block diagram of parallel ensemble learning
Block diagram of sequential ensemble learning

2. How to Combine Base Learners?

After training our base classifiers, it’s time to make predictions and combine the results of these classifiers. There are several methods for doing this, but the most common ones are majority voting and weighted majority voting.

2.1 Majority Voting:

Majority voting is the simplest and most common method used in ensemble learning. In a classification task, we select the class predicted most often by the base models. In a regression task, we take the average of the predictions from the base models.

Mathematically, suppose we have T classifiers indexed by t = 1, …, T and C classes indexed by c = 1, …, C, and let d_{t,c} ∈ {0, 1} indicate whether the t-th classifier votes for class c. The final predicted class after majority voting is:

c* = argmax_{c = 1, …, C} ∑_{t=1}^{T} d_{t,c}

i.e. the class that receives the most votes from the T classifiers.
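To make this concrete, here is a minimal NumPy sketch (illustrative code, not from the original article) of the vote over the indicator values d_{t,c}:

```python
import numpy as np

# Predictions of T = 3 classifiers on 4 samples, with C = 3 classes (0, 1, 2).
preds = np.array([
    [0, 1, 2, 1],   # classifier 1
    [0, 1, 1, 1],   # classifier 2
    [1, 1, 2, 0],   # classifier 3
])
n_classes = 3

# d[t, i, c] = 1 if classifier t votes for class c on sample i, else 0.
d = np.eye(n_classes, dtype=int)[preds]   # shape (T, n_samples, C)
votes = d.sum(axis=0)                     # total votes per sample and class
y_pred = votes.argmax(axis=1)             # class with the most votes

print(y_pred)   # [0 1 2 1]
```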

However, this method assumes that all base classifiers are equally good, even though individual base models in an ensemble often do not have equal performance. This can lead to suboptimal results if some models are significantly better than others, which motivates a second method: weighted majority voting.

2.2 Weighted Majority Voting:

It is similar to majority voting, but it assigns a weight w_t to every classifier, reflecting that classifier’s performance. Generally, these weights are normalized so that they sum to one, ∑_{t=1}^{T} w_t = 1, and the final class is the one with the largest weighted vote: c* = argmax_c ∑_{t=1}^{T} w_t · d_{t,c}.
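As a sketch, scikit-learn’s VotingClassifier supports this directly; the base models and the weights below are illustrative assumptions (in practice the weights are usually derived from each classifier’s validation performance):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("tree", DecisionTreeClassifier(max_depth=3)),
    ],
    voting="hard",              # hard voting = majority vote on predicted labels
    weights=[0.5, 0.3, 0.2],    # w_t for each classifier, normalized to sum to 1
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```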

3. Ensemble Learning Methods:

3.1 Boosting:

Boosting is a technique that combines weak learners to create a strong learner with high accuracy. It works by training a weak learner on the input data, computing its predictions, and identifying the misclassified training samples. The next weak learner is then trained on an adjusted training set that emphasizes these misclassified instances from the previous round. This process continues iteratively until a predefined number of weak learners (a hyperparameter) is reached. The results of the learners are then weighted and combined to produce the final prediction. Examples of this technique include AdaBoost, Gradient Boosting, XGBoost, and others.
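For example, a minimal usage sketch with scikit-learn (assumed typical usage, not the article’s own code), where n_estimators is the predefined number of weak learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting rounds, i.e. weak learners
    learning_rate=0.1,   # shrinks the contribution of each weak learner
    max_depth=3,         # each weak learner is a small tree
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```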

One issue with this approach is that it forces the algorithm to focus more on noisy data, which can lead to overfitting.

3.1.1 AdaBoost:

AdaBoost (Adaptive Boosting) is one of the most well-known boosting algorithms. Its process begins with training a base classifier, often a decision stump (a simple, shallow decision tree), on the input data. After each classifier makes predictions, AdaBoost assigns higher weights to the misclassified data points. This adjustment helps the subsequent classifiers focus more on the previously misclassified instances.

The algorithm iteratively trains a series of classifiers, with each new classifier emphasizing the mistakes made by the previous ones. After training, the final model combines the predictions of all classifiers, weighting them according to their performance. This approach helps in improving the overall accuracy and robustness of the model.
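A minimal usage sketch with scikit-learn’s AdaBoostClassifier and a depth-1 decision tree (a stump) as the base learner; this is illustrative usage, and note that older scikit-learn releases name the estimator parameter base_estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stump as weak learner
    n_estimators=50,                                # number of boosting rounds
    random_state=0,
)
ada.fit(X, y)
print(ada.predict(X[:5]))
```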

From a mathematical point of view, suppose we have m labelled training instances S = {(x_1, y_1), …, (x_i, y_i), …, (x_m, y_m)}, where each label y_i ∈ {−1, +1} (binary classification). The initial weight D_1(i) of sample x_i and the weight update D_{t+1}(i) are then computed recursively as:

D_1(i) = 1/m

D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t

where h_t is the weak learner trained at round t, α_t is its weight (defined below), and Z_t = ∑_i D_t(i) · exp(−α_t · y_i · h_t(x_i)) is a normalization factor that keeps D_{t+1} a probability distribution.

Don’t worry! It’s quite simple. Let’s clarify these formulas by examining each step of the AdaBoost algorithm:

  1. In the first step, we calculate D_1(i) for each sample i. This is simply 1/m, where m is the number of samples, which means that all samples initially have equal weights. For example, if we have 4 samples, then D_1(i) = 1/4.

  2. Next, we train a classifier h_t(x_i) (often a decision stump) and use it to predict the label of each sample. We then compare these predicted labels to the true labels.

After making predictions, we calculate ϵ_t, which is the sum of the weights of the misclassified samples. This measure tells us how well the classifier performed.

We then compute α_t, which indicates the importance of the classifier. It is calculated from ϵ_t and is used to update the weights of the samples. The updated weights D_2(i) are adjusted so that misclassified samples get higher weights, making them more influential in the next round of training.

Here’s a step-by-step breakdown (a short code sketch of this loop follows the list):

  • Train classifier h_t(x_i): use a decision stump to classify each sample.

  • Compare predictions: check how the predicted labels match the true labels.

  • Calculate ϵ_t: sum the weights of the misclassified samples:

ϵ_t = ∑_{i : h_t(x_i) ≠ y_i} D_t(i)

  • Compute α_t: determine the weight of the classifier based on ϵ_t:

α_t = (1/2) · ln((1 − ϵ_t) / ϵ_t)

  • Update weights: adjust the weights of the samples so that misclassified ones are given more importance in the next iteration (labels are encoded as y_i ∈ {−1, +1}, i.e. class 0 is mapped to −1 and class 1 to +1):

D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t
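Putting these steps together, here is a from-scratch training-loop sketch (an illustrative implementation that assumes binary labels encoded as −1/+1 and scikit-learn decision stumps; it is not the exact code behind the figures):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train up to T decision stumps with AdaBoost; y must hold labels -1 and +1."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # step 1: D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)         # train weak learner h_t on weighted data
        pred = stump.predict(X)
        eps = D[pred != y].sum()                 # epsilon_t: weighted error
        if eps <= 0 or eps >= 0.5:               # degenerate round: stop this simple sketch
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t: importance of this classifier
        D = D * np.exp(-alpha * y * pred)        # misclassified samples get larger weights
        D = D / D.sum()                          # normalize (the Z_t factor)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)
```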

After completing the iterations, we can predict the labels as follows:

H(x) = sign( ∑_{t=1}^{T} α_t · h_t(x) )
AdaBoost summary (source: paper below)
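Continuing the sketch above, the final prediction is the sign of the α-weighted sum of the weak learners’ outputs:

```python
import numpy as np

def adaboost_predict(X, stumps, alphas):
    """H(x) = sign( sum_t alpha_t * h_t(x) ), with stumps/alphas from adaboost_fit above."""
    scores = np.zeros(len(X))
    for alpha, stump in zip(alphas, stumps):
        scores += alpha * stump.predict(X)
    return np.sign(scores)

# Example (hypothetical data with labels in {-1, +1}):
# stumps, alphas = adaboost_fit(X_train, y_train, T=50)
# y_hat = adaboost_predict(X_test, stumps, alphas)
```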
