# ML Series5: Ensemble Learning

đź”ĄThe most popular category of learning algorithms

The idea of Ensemble learning also referred to as a *multi-classifier system*, *and committee-based learning* is to build a prediction model by combining the strengths of a collection of simpler models. Like the old saying, â€śUnity is strength!â€ťđźĄŠ.

In general, when mixing a good quality with bad quality, we usually get something in between, quality better than the bad one, and worse than the good one. Thatâ€™s usually the 1 + 1 < 2 case. However, in order to get a 1 + 1 > 2, we need to combine â€ś1â€ťs having a low correlation with each other. That means combing learners with *accuracy* and *diversity.*

There are three major ways to combine models

**Bagging, homogeneous, parallelly****Boosting, homogeneous,****sequentially****Stacking, heterogeneous**,**parallelly**

# Bagging

Bagging considers **homogeneous** weak learners, learns them independently from each other in **parallel,** and combines them following some kind of deterministic averaging process.

A typical example of the bagging method is **Random Forest**, which is a strong learner composed of multiple deep tree-based models.

In a random forest, there are two parts that are random:

- Bootstrapping samples (randomly draw samples from the dataset with replacements)
- Keep only a random subset of features to build the tree.

**Proof** that drawing with replacement will miss 1/e of the whole data set

# Boosting

In short, the boosting method converts a set of weak learners into strong learners. After learning a base learner, it updates the weight of sample data to highlight misclassified samples, then uses the updated sample to train the next base learner; After repeating T times, it then adds all these T models together with weights to form a strong learner.

I am going to introduce three kinds of popular boosting methods: Adaboost, Gradient boost, and XGBoost.

## Discrete Adaboost

Adaboost is the first practical boosting algorithm proposed by Freund and Schapire in 1996. it solves this equation for the exponential loss function under the constraint that h only outputs -1 or 1. As we can see from the algorithm, the more accurate the classifier, the larger the weight. If a sample is correctly predicted, it will get a lower weight next round, and misclassified ones will get a higher weight. Adaboost tends to use tree stumps.

When calculating the weight for t_th classifier, For any classifier with accuracy higher than 50%, the weight is positive. The more accurate the classifier, the larger the weight. While for the classifer with less than 50% accuracy, the weight is negative. It means that we combine its prediction by flipping the sign. For example, we can turn a classifier with 40% accuracy into 60% accuracy by flipping the sign of the prediction. Thus even the classifier performs worse than random guessing, it still contributes to the final prediction. We only donâ€™t want any classifier with exact 50% accuracy, which doesnâ€™t add any information and thus contributes nothing to the final prediction.

## Gradient Boost

While Adaboost minimizes exponential loss, gradient boost can be used to solve differentiable loss functions, thus can be used for both classification and regression. When a decision tree is a weak learner, the resulting algorithm is called gradient boosted trees, which usually outperforms random forest.

To make these steps more clear, below are the side notes using **gradient tree boost for regression** as an example:

- Regression often use squared loss
- Step1.1 use the mean of Y as a starter prediction
- Step 2.1 is just residual of (observed â€” predicted)
- After Step 2.2 trains a tree, Step 2.3 computes the leaf node value gamma. For regression, it should be the average of each terminal node.
- Step 2.4 makes a new prediction based on the previous tree + learning rate * new tree

## XGBoost

XGBoost stands for Extreme Gradient Boosting; Same as Lightgbm, it is a **specific implementation of the Gradient Boosting** method which uses more accurate approximations to find the best tree model. It employs a number of nifty tricks that make it exceptionally successful, particularly with structured data. The most important are

1.) computing **second-order gradients, i.e. second partial derivatives** of the loss function (similar to **Newtonâ€™s method**), which provides more information about the direction of gradients and how to get to the minimum of our loss function. While regular gradient boosting uses the loss function of our base model (e.g. decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses the 2nd order derivative as an approximation.

2.) And advanced **regularization** (L1 & L2), which improves model generalization.

XGBoost has additional advantages: training is very fast and can be parallelized/distributed across clusters.

# Stacking

Stacking is more of a training technique than an algorithm like bagging or boosting. The idea is to train several models, usually with different algorithm types (aka base-learners), on the **train data**, and then rather than picking the best model, all the models are aggregated/fronted using another model (meta learner), to make the final prediction. The inputs for the meta-learner are the prediction outputs of the base learners.

When we make a prediction on the **test data**, we need to pass it through the M base-learners and get the M number of predictions and send those M predictions through the meta-learner as inputs, to get a final prediction.

# Interview Questions

- What is Random Forest/ Adaboost/ Gradient Boost in one sentence?
- What is the difference between Bagging, Boosting, and Stacking?
- What does random refer to in â€śRandom Forestâ€ť?
- Prove that in the Bagging method only about 63% of the total original examples (total training set) appear in any of sampled bootstrap datasets.
- Compare Random Forest and Gradient Boosting Decision Tree

**Thanks for reading the article! Hope this is helpful. Please let me know if you need more information.**