ML Series 5: Ensemble Learning

Junlin Liu · Published in AIGuys · Aug 12, 2021 · 5 min read

🔥The most popular category of learning algorithms

The idea of ensemble learning, also referred to as a multi-classifier system or committee-based learning, is to build a prediction model by combining the strengths of a collection of simpler models. As the old saying goes, “Unity is strength!”🥊.

In general, when we mix something of good quality with something of bad quality, we usually get something in between: better than the bad one, worse than the good one. That is the usual 1 + 1 < 2 case. To get 1 + 1 > 2, we need to combine “1”s that have low correlation with each other, which means combining learners that are both accurate and diverse.

There are three major ways to combine models:

  • Bagging: homogeneous learners, trained in parallel
  • Boosting: homogeneous learners, trained sequentially
  • Stacking: heterogeneous learners, trained in parallel

Bagging

Bagging considers homogeneous weak learners, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process.

A typical example of the bagging method is Random Forest, a strong learner composed of many deep decision trees.

Deep trees have low bias but high variance, so they are a natural choice for the bagging method, which is mainly focused on reducing variance.

In a random forest, two parts are random:

  • Bootstrapping samples (randomly drawing samples from the dataset with replacement)
  • Keeping only a random subset of features when building each tree (both knobs appear in the sketch below)
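As a concrete illustration, here is a minimal scikit-learn sketch of a random forest; the synthetic dataset, hyperparameter values, and variable names are placeholder choices, not something from the original article, and scikit-learn is assumed to be available.

```python
# Minimal bagging example: a random forest on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # many deep trees, trained independently in parallel
    bootstrap=True,       # each tree sees a bootstrap sample (randomness #1)
    max_features="sqrt",  # each split considers a random feature subset (randomness #2)
    n_jobs=-1,
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```

Averaging the votes of these decorrelated trees is exactly the deterministic combination step described above.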

Why does drawing with replacement miss about 1/e of the whole data set? The standard argument is sketched below.
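This is a well-known result rather than anything specific to this article; in LaTeX form:

```latex
% Each of the n bootstrap draws picks a fixed example with probability 1/n,
% so the probability that this example is never drawn in n independent draws is
P(\text{missed}) = \left(1 - \tfrac{1}{n}\right)^{n}
  \longrightarrow e^{-1} \approx 0.368 \quad (n \to \infty),
\qquad
P(\text{included}) = 1 - \left(1 - \tfrac{1}{n}\right)^{n} \approx 0.632.
```

This is the same “63%” fact that appears in the interview questions at the end.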

Boosting

In short, the boosting method converts a set of weak learners into a strong learner. After learning a base learner, it updates the weights of the training samples to highlight the misclassified ones, then trains the next base learner on the reweighted sample. After repeating this T times, it combines the T models with weights into a single strong learner.

I am going to introduce three popular boosting methods: AdaBoost, Gradient Boosting, and XGBoost.

Discrete AdaBoost

[Algorithm: Discrete AdaBoost, from Wikipedia.] Reweighting in this way allows one to bound the training error by an exponentially decreasing function.

AdaBoost is the first practical boosting algorithm, proposed by Freund and Schapire in 1996. It minimizes the exponential loss under the constraint that each base learner h only outputs -1 or 1. As we can see from the algorithm, the more accurate a classifier, the larger its weight. If a sample is correctly predicted, its weight is lowered for the next round, while misclassified samples get higher weights. AdaBoost typically uses tree stumps as base learners.

When calculating the weight for the t-th classifier: for any classifier with accuracy higher than 50%, the weight is positive, and the more accurate the classifier, the larger the weight. For a classifier with less than 50% accuracy, the weight is negative, which means we use its prediction with the sign flipped. For example, a classifier with 40% accuracy becomes a 60%-accurate one once we flip the sign of its predictions, so even a classifier that performs worse than random guessing still contributes to the final prediction. The only classifier we do not want is one with exactly 50% accuracy, which adds no information and contributes nothing to the final prediction.
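To make the sign behaviour concrete, here is a small sketch of the usual discrete AdaBoost classifier weight, α_t = ½ · ln((1 − ε_t) / ε_t), where ε_t is the weighted error rate; the helper name below is made up for illustration.

```python
import numpy as np

def adaboost_alpha(error_rate):
    # Discrete AdaBoost classifier weight: 0.5 * ln((1 - eps) / eps)
    return 0.5 * np.log((1 - error_rate) / error_rate)

for eps in [0.1, 0.4, 0.5, 0.6, 0.9]:
    print(f"error = {eps:.1f} -> alpha = {adaboost_alpha(eps):+.3f}")

# error = 0.1 -> large positive weight (accurate classifier counts a lot)
# error = 0.4 -> small positive weight
# error = 0.5 -> weight is exactly 0 (no information)
# error = 0.6 -> negative weight (its prediction is effectively flipped)
```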

Gradient Boost

While AdaBoost minimizes the exponential loss, gradient boosting works with any differentiable loss function, so it can be used for both classification and regression. When the weak learner is a decision tree, the resulting algorithm is called gradient boosted trees, which usually outperforms random forests.

[Algorithm: gradient boosting, from Wikipedia.] It usually has an additional learning rate ν (nu) in Step 2.4 to reduce the effect each tree has on the final prediction, which improves accuracy in the long run.

To make these steps clearer, below are side notes using gradient tree boosting for regression as an example (a runnable sketch follows the list):

  • Regression often uses the squared loss.
  • Step 1.1 uses the mean of y as the starting prediction.
  • Step 2.1 is just the residual (observed − predicted).
  • After Step 2.2 trains a tree on the residuals, Step 2.3 computes the leaf value gamma; for regression it is the average residual of each terminal node.
  • Step 2.4 forms the new prediction as the previous prediction + learning rate × the new tree’s output.
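The steps above map almost directly onto a few lines of code. Below is a minimal sketch of gradient tree boosting for regression with squared loss, written against scikit-learn’s DecisionTreeRegressor; the toy data, tree depth, and learning rate are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # the shrinkage factor nu mentioned above
n_trees = 100

# Step 1.1: start from a constant prediction, the mean of y (squared loss)
prediction = np.full_like(y, y.mean())

for _ in range(n_trees):
    # Step 2.1: pseudo-residuals = observed - predicted (squared loss)
    residuals = y - prediction
    # Steps 2.2-2.3: fit a small tree to the residuals; each leaf predicts
    # the average residual of the samples that fall into it
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    # Step 2.4: new prediction = previous prediction + learning rate * tree output
    prediction += learning_rate * tree.predict(X)

print("training MSE:", np.mean((y - prediction) ** 2))
```

In practice you would reach for sklearn.ensemble.GradientBoostingRegressor, which implements this same loop with many more options.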

XGBoost

XGBoost stands for Extreme Gradient Boosting. Like LightGBM, it is a specific implementation of gradient boosting that uses more accurate approximations to find the best tree model. It employs a number of nifty tricks that make it exceptionally successful, particularly with structured data. The most important are:

1.) Computing second-order gradients, i.e. second partial derivatives of the loss function (similar to Newton’s method), which provide more information about the shape of the loss and how to reach its minimum. While regular gradient boosting fits each new tree using only the first-order gradient of the loss, XGBoost also uses the second-order derivative in its approximation.

2.) Advanced regularization (L1 and L2), which improves model generalization.

XGBoost has additional advantages: training is very fast and can be parallelized/distributed across clusters.
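A minimal usage sketch, assuming the xgboost Python package is installed; the hyperparameter values are arbitrary and only meant to show where the regularization and parallelism knobs live.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_alpha=0.0,   # L1 regularization on leaf weights
    reg_lambda=1.0,  # L2 regularization on leaf weights
    n_jobs=-1,       # parallel tree construction
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```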

Stacking

Stacking is more of a training technique than an algorithm like bagging or boosting. The idea is to train several models, usually of different algorithm types (the base learners), on the training data, and then, rather than picking the best one, aggregate all of them with another model (the meta-learner) that makes the final prediction. The inputs to the meta-learner are the prediction outputs of the base learners.

We can think of each base model’s prediction as a feature fed into the meta-learner.

When we make a prediction on the test data, we pass it through the M base learners, obtain M predictions, and send those M predictions through the meta-learner as inputs to get the final prediction.
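Here is a minimal sketch of stacking with scikit-learn’s StackingClassifier; the particular base learners and meta-learner are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base learners; their cross-validated predictions become
# the features that the meta-learner (logistic regression here) is trained on.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```

At prediction time this does exactly what the paragraph above describes: the test data goes through the M base learners, and their outputs are fed to the meta-learner.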

Interview Questions

  • What is Random Forest / AdaBoost / Gradient Boosting in one sentence?
  • What is the difference between Bagging, Boosting, and Stacking?
  • What does “random” refer to in “Random Forest”?
  • Prove that in the bagging method only about 63% of the original training examples appear in any given bootstrap sample.
  • Compare Random Forest and Gradient Boosting Decision Trees.

Thanks for reading the article! Hope this is helpful. Please let me know if you need more information.
