Published in AIGuys

# ML Series 5: Ensemble Learning

🔥 The most popular category of learning algorithms

The idea of ensemble learning, also referred to as a multi-classifier system or committee-based learning, is to build a prediction model by combining the strengths of a collection of simpler models. As the old saying goes, "Unity is strength!" 🥊

In general, when mixing good quality with bad quality, we usually get something in between: better than the bad one and worse than the good one. That is the usual 1 + 1 < 2 case. To get 1 + 1 > 2 instead, we need to combine "1"s that have low correlation with each other. In other words, we want to combine learners that are both accurate and diverse.

There are three major ways to combine models:

• Bagging: homogeneous learners, trained in parallel
• Boosting: homogeneous learners, trained sequentially
• Stacking: heterogeneous learners, trained in parallel

# Bagging

Bagging considers homogeneous weak learners, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process.
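As a toy illustration of this bootstrap-then-average recipe, here is a minimal bagging sketch in plain Python. The weak learner (a 1-D least-squares line), the data, and all function names are made up for the example; they are not from any particular library:

```python
import random

def fit_line(xs, ys):
    """The 'weak learner' here: least-squares slope and intercept for 1-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:  # degenerate bootstrap sample (all x identical): fall back to the mean
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    return slope, my - slope * mx

def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    """Train each learner on a bootstrap sample, then average their predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]  # draw with replacement
        s, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(s * x_new + b)
    return sum(preds) / len(preds)  # the deterministic averaging step

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # roughly y = 2x
print(bagged_predict(xs, ys, 10.0))   # each model varies; the average is close to 20
```

Each bootstrap model sees a slightly different data set, so their individual errors partially cancel in the average.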

A typical example of the bagging method is Random Forest, a strong learner composed of many deep decision trees.

In a random forest, there are two parts that are random:

• Bootstrap sampling (randomly draw samples from the dataset with replacement)
• Keep only a random subset of features to build the tree.

Proof that drawing with replacement misses about 1/e of the whole data set: with n examples, the probability that a given example is never chosen in n draws is (1 − 1/n)^n, which tends to e^(−1) ≈ 0.368 as n grows. So only about 63.2% of the examples appear in any given bootstrap sample.
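We can also check the 1/e claim empirically. The sketch below (function name and trial counts are my own choices) draws bootstrap samples and measures the fraction of distinct examples:

```python
import random

def unique_fraction(n, trials=200, seed=0):
    """Average fraction of distinct examples in a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
        total += len(sample) / n
    return total / trials

print(unique_fraction(1000))  # close to 1 - 1/e ≈ 0.632
```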

# Boosting

In short, boosting converts a set of weak learners into a strong learner. After training a base learner, it updates the sample weights to emphasize misclassified samples, then uses the reweighted sample to train the next base learner. After repeating this T times, it combines all T models in a weighted sum to form a strong learner.

I am going to introduce three popular boosting methods: AdaBoost, gradient boosting, and XGBoost.

AdaBoost is the first practical boosting algorithm, proposed by Freund and Schapire in 1996. It fits base learners by minimizing the exponential loss under the constraint that each base learner h outputs only −1 or 1. The more accurate a classifier, the larger its weight. If a sample is correctly predicted, it gets a lower weight in the next round, while misclassified samples get higher weights. AdaBoost typically uses decision stumps as base learners.

When calculating the weight of the t-th classifier, any classifier with accuracy higher than 50% gets a positive weight, and the more accurate the classifier, the larger the weight. A classifier with accuracy below 50% gets a negative weight, meaning we combine its prediction by flipping the sign. For example, we can turn a classifier with 40% accuracy into one with 60% accuracy by flipping the sign of its predictions. Thus even a classifier that performs worse than random guessing still contributes to the final prediction. The only classifier we don't want is one with exactly 50% accuracy, which adds no information and contributes nothing to the final prediction.
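This behaviour falls out of the standard AdaBoost weight formula, alpha_t = ½ · ln((1 − error_t) / error_t), where error_t is the t-th learner's weighted error rate. A few evaluations make the sign flip concrete:

```python
import math

def alpha(error):
    """Standard AdaBoost weight for a learner with weighted error rate `error`."""
    return 0.5 * math.log((1 - error) / error)

print(alpha(0.1))  # 90% accuracy -> large positive weight
print(alpha(0.4))  # 60% accuracy -> small positive weight
print(alpha(0.6))  # 40% accuracy -> negative weight: its vote is flipped
print(alpha(0.5))  # exactly random -> weight 0, contributes nothing
```

Note that alpha(0.4) and alpha(0.6) are exact negatives of each other: a 40%-accurate classifier carries as much information as a 60%-accurate one, just with the sign reversed.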

While AdaBoost minimizes the exponential loss, gradient boosting works with any differentiable loss function, and can therefore be used for both classification and regression. When the weak learner is a decision tree, the resulting algorithm is called gradient boosted trees, which usually outperforms random forest.

To make these steps clearer, here are side notes using gradient tree boosting for regression as an example:

• Regression often uses squared loss
• Step 1.1 uses the mean of y as the initial prediction
• Step 2.1 computes the residual (observed − predicted)
• After Step 2.2 trains a tree on the residuals, Step 2.3 computes each leaf node value gamma; for regression under squared loss, it is the average of the residuals in that terminal node
• Step 2.4 makes the new prediction: previous prediction + learning rate × new tree
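The steps above can be sketched in plain Python. This is a toy version only, assuming regression stumps (one-split trees) as the weak learners and squared loss; the data, function names, and hyperparameters are invented for the example:

```python
def fit_stump(xs, residuals):
    """One-split regression stump: pick the threshold minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)  # Step 2.3: leaf = mean residual
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def gradient_boost(xs, ys, n_trees=50, lr=0.1):
    base = sum(ys) / len(ys)  # Step 1.1: start from the mean of y
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]          # Step 2.1: observed - predicted
        stump = fit_stump(xs, residuals)                        # Step 2.2: fit a tree to residuals
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]  # Step 2.4: add scaled new tree
        trees.append(stump)
    return lambda x: base + lr * sum(t(x) for t in trees)

xs = [1, 2, 3, 4, 5]
ys = [1.0, 1.2, 3.0, 3.1, 3.2]
model = gradient_boost(xs, ys)
print(model(1), model(5))  # close to the observed 1.0 and 3.2
```

Each round fits the next stump to what the current ensemble still gets wrong, which is exactly the residual-chasing loop the side notes describe.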

## XGBoost

XGBoost stands for Extreme Gradient Boosting. Like LightGBM, it is a specific implementation of the gradient boosting method that uses more accurate approximations to find the best tree model. It employs a number of nifty tricks that make it exceptionally successful, particularly with structured data. The most important are:

1.) Computing second-order gradients, i.e., second partial derivatives of the loss function (similar to Newton's method), which provide more information about the direction of the gradient and how to reach the minimum of the loss function. While regular gradient boosting fits each base model (e.g., a decision tree) to the first-order gradient alone, XGBoost also uses the second-order derivative in its approximation.

2.) Advanced regularization (L1 and L2), which improves model generalization.

XGBoost has additional advantages: training is very fast and can be parallelized/distributed across clusters.

# Stacking

Stacking is more of a training technique than an algorithm like bagging or boosting. The idea is to train several models, usually of different algorithm types (the base learners), on the training data, and then, rather than picking the best one, aggregate all of them with another model (the meta-learner) to make the final prediction. The inputs to the meta-learner are the prediction outputs of the base learners.

When we make a prediction on the test data, we pass it through the M base learners to get M predictions, then feed those M predictions into the meta-learner as inputs to get the final prediction.
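That two-stage flow can be sketched as follows. This is a simplified toy, with two invented base learners (a least-squares line and a constant predictor) and a least-squares meta-learner; note that in practice the meta-learner is usually trained on out-of-fold base predictions to avoid leakage, whereas this sketch reuses the training data for brevity:

```python
def fit_linear(xs, ys):
    """Base learner 1: least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: s * x + (my - s * mx)

def fit_mean(xs, ys):
    """Base learner 2: a constant predictor (a different model family)."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_meta(P, ys):
    """Meta-learner: least-squares weights over the two base predictions (2x2 normal equations)."""
    s11 = sum(p[0] * p[0] for p in P)
    s12 = sum(p[0] * p[1] for p in P)
    s22 = sum(p[1] * p[1] for p in P)
    s1y = sum(p[0] * y for p, y in zip(P, ys))
    s2y = sum(p[1] * y for p, y in zip(P, ys))
    det = s11 * s22 - s12 * s12
    a = (s1y * s22 - s2y * s12) / det
    b = (s11 * s2y - s12 * s1y) / det
    return lambda p: a * p[0] + b * p[1]

xs = [1, 2, 3, 4, 5, 6]
ys = [1.9, 4.1, 6.0, 8.2, 9.9, 12.0]
bases = [fit_linear(xs, ys), fit_mean(xs, ys)]
P = [[m(x) for m in bases] for x in xs]  # base-learner outputs become the meta features
meta = fit_meta(P, ys)

def stacked_predict(x):
    return meta([m(x) for m in bases])   # pass through base learners, then the meta-learner

print(stacked_predict(7.0))  # the data is roughly y = 2x, so this lands near 14
```

The meta-learner effectively learns how much to trust each base learner; here it discovers the linear model is far more useful than the constant one.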

# Interview Questions

• What is the difference between Bagging, Boosting, and Stacking?
• What does random refer to in â€śRandom Forestâ€ť?
• Prove that in the bagging method, only about 63% of the original training examples appear in any given bootstrap sample.
• Compare Random Forest and Gradient Boosting Decision Tree