do YOU know the ensemble

quick and deep dive into ensembling techniques

Rrohan.Arrora · Published in AI n U · Apr 6, 2020


— What are ensembles

Consider a situation where you want to start a company and are looking for people who are highly qualified in their domains. You would definitely want to start your company with highly qualified people so that it delivers satisfactory results. Now, replace that bunch of highly qualified people with a set of machine learning models.

Thus, an ensemble consists of multiple base models combined together. The more the models differ from one another, the better they combine. Ensembles are very powerful, high-performing models and are very useful in the real world.

— Various ensembling techniques

  • Bagging or Bootstrap Aggregation
  • Boosting
  • Stacking Models
  • Cascading

Let us understand various techniques in detail.

— Bagging or Bootstrap Aggregation

Let us build an intuition behind the bagging technique. Consider a large dataset of points D = {xᵢ, yᵢ} consisting of 10k points. Rather than building one model on the complete dataset and then using its results, what if we train a large number of models, each on a subset of the data points?

D = {xᵢ, yᵢ}
→ sample m points with replacement → D₁´ → train a model on the sample → M₁ → y₁
→ sample m points with replacement → D₂´ → train a model on the sample → M₂ → y₂
→ sample m points with replacement → D₃´ → train a model on the sample → M₃ → y₃
→ sample m points with replacement → D₄´ → train a model on the sample → M₄ → y₄

At last, we combine all of M₁, M₂, M₃ …. Mₖ into one big model M.

  • D₁´, D₂´, D₃´, D₄´… are known as Bootstrap samples.
  • The process of combining the models M₁, M₂, M₃ …. Mₖ into one big model is known as Aggregation.
  • For classification, we use a majority vote at the aggregation step and the mean/median for regression (see the sketch below).
  • The models Mᵢ are built using the same sample size but on different samples, so each model has seen a different subset of the data.
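
To make the flow above concrete, here is a minimal from-scratch sketch of bootstrap aggregation for classification, with majority voting at the aggregation step. The dataset, the number of models k and the sample size m are illustrative assumptions; scikit-learn decision trees serve as the base models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (assumption): 10k points, as in the example above.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
k, m = 25, 5_000          # k base models, m points sampled with replacement
models = []

for _ in range(k):
    # Bootstrap sample D'_i: m rows drawn with replacement from D
    idx = rng.integers(0, len(X), size=m)
    tree = DecisionTreeClassifier(max_depth=None)   # deep tree: low bias, high variance
    tree.fit(X[idx], y[idx])
    models.append(tree)

# Aggregation: majority vote across the k models (use mean/median for regression)
votes = np.stack([m_.predict(X) for m_ in models])        # shape (k, n_samples)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged model:", (y_pred == y).mean())
```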

Bootstrapped base models are generally low-bias, high-variance models. So, basically, the whole point of bootstrap aggregation is to combine low-bias, high-variance models in such a way that the final model has both low bias and low variance. In other words, bagging reduces the variance of the models without impacting the bias. Low-bias, high-variance models are obtained when the depth of the base models is considerably large.

Random Forest is the most widely used and most popular bagging algorithm. In fact, people often think that RF is the only bagging algorithm, but that is not true: bagging is the technique, and Random Forest is built using that technique.

A little about Random Forests

The "Random" in Random Forests comes from the random sampling of points, and "Forests" from the large number of decision trees (DT) that are aggregated to create one large model.

Random Forests=DT(base learners of reasonable depth) + Bagging(row sampling with replacement, RS) + Feature Bagging(column sampling, CS) + aggregation

In the case of random forests, not only the rows but also the columns are sampled. The more the sampled datasets differ, both in the points and in the features they contain, the more the base models differ and the more accurate the aggregated results tend to be. It also means that if we remove some unwanted data points from the original dataset, only a few models are affected, because thanks to row and column sampling only a small number of models will have seen those points.

Random Forests do not perform well on high-dimensional data or on categorical variables with many categories.

Out of bag points

Let D be the dataset. After RS + CS, let D´ be the set of data points used to train the model M´. The remaining points, i.e. D - D´, are used for cross-validating the model M´ and are known as out-of-bag points.
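
As a small illustration (synthetic data, assumed parameter values), scikit-learn's random forest can report this out-of-bag estimate directly via oob_score=True:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Each tree is trained on a bootstrap sample; the points it never saw (D - D')
# are its out-of-bag points and give a free cross-validation-style estimate.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy estimate:", rf.oob_score_)
```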

Hyper-parameters in RF

When we talk of RF, we generally discuss the bias-variance trade-off. RF consists of low-bias and high-variance base learners.

M = aggregation(M₁, M₂, M₃ …. Mₖ),

  • As K, the number of base learners, increases, the variance of the overall result decreases, while the bias of the combined model M stays roughly the same as that of the base learners.
  • As the row sampling rate (RSR) and the column sampling rate (CSR) decrease, the variance also decreases. But we should not reduce the sampling rates to the point where the bias of the model increases. We can choose K from a performance-vs-K plot computed on the cross-validation dataset.

So, K, RSR, and CSR are the major hyper-parameters.
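
In scikit-learn terms, these roughly map to n_estimators (K), max_samples (RSR) and max_features (CSR); the sketch below, on made-up data, shows one way to trace the performance/K curve with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3_000, n_features=20, random_state=0)

for k in (10, 50, 100, 300):                      # K: number of base learners
    rf = RandomForestClassifier(
        n_estimators=k,
        max_samples=0.7,      # row sampling rate (RSR)
        max_features=0.5,     # column sampling rate (CSR)
        random_state=0,
    )
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"K={k:4d}  CV accuracy={score:.3f}")   # pick K from this performance/K curve
```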

A little about Extremely Randomized Trees

In the case of a feature consisting of real values,

  • if your model consists of regular DTs, we first sort the values and then consider each value of the feature as a candidate threshold for making decisions. We pick as the threshold the point that gives the maximum information gain, i.e. the largest overall reduction in entropy (H). The point to notice is that every value of the feature is considered as a threshold exactly once.
  • In the case of extremely randomized trees, instead of considering every point as a threshold, we randomly select a few of the possible values and then carry on with the process. In these trees, randomization is a way to reduce the variance.

Extremely Randomized Trees=DT(base learners of reasonable depth) + Bagging(row sampling with replacement, RS) + Feature Bagging(column sampling, CS) + aggregation + randomized “𝜏”
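
For a quick, hedged comparison, scikit-learn ships extremely randomized trees as ExtraTreesClassifier, which draws the thresholds 𝜏 at random instead of searching all of them; the data and settings below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3_000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)   # exhaustive threshold search
et = ExtraTreesClassifier(n_estimators=100, random_state=0)     # random thresholds "𝜏"

print("Random Forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("Extra Trees   CV accuracy:", cross_val_score(et, X, y, cv=5).mean())
```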

— Boosting

Boosting is the second ensembling technique.

Bagging = low bias and high variance base models + randomization + aggregation

Boosting = high bias and low variance base models + additive combination of the models in order to reduce the bias. High bias basically means high training error.

Consider a situation where you are training a model to predict 12 as the output. You train the model once and it predicts 8. You then train it again with some changes, such as regularization or a more complex architecture, and this time it predicts 10. You could keep adjusting the model in this way, trying to push its accuracy higher and higher.

NOW, consider the same situation again. At first, your model predicted 8. Rather than tweaking the model further, what if you train a second model on the residual error, in this case 12 - 8 = 4? So we train the next model to predict 4 rather than 12. Let us assume it predicts 3 in the second run. The overall error is now 12 - (8 + 3) = 1. We train one more model to predict 1 in the final pass and, hurrah, we get a very accurate combined result. This is all there is to the boosting technique: we additively combine models fitted to the residual errors to get the best possible result.
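
The toy arithmetic above can be replayed directly; the snippet below just walks through the 12 → 8 → 4 → 1 example with constant stage outputs, to show how the additive corrections shrink the error.

```python
target = 12.0

stage_predictions = []                   # each stage predicts the remaining residual
for stage_output in (8.0, 3.0, 1.0):     # toy outputs of models M0, M1, M2
    stage_predictions.append(stage_output)
    combined = sum(stage_predictions)    # additive combination of all stages so far
    residual = target - combined         # what the next stage should learn to predict
    print(f"combined prediction = {combined:>5.1f}, remaining residual = {residual:>4.1f}")
# combined prediction =   8.0, remaining residual =  4.0
# combined prediction =  11.0, remaining residual =  1.0
# combined prediction =  12.0, remaining residual =  0.0
```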

Now, let us understand the above concept in more technical terms, the base of which we have understood above.

  • Stage 0

Let Dₜᵣₐᵢₙ = {xᵢ, yᵢ} (i = 1…n) be the training dataset, and let M₀ (say y = h₀(x)) be the high-bias, low-variance model (like a shallow decision tree) at stage 0.

Training data for the model = {xᵢ, yᵢ}

Error for the above model (y⁰ₑᵣᵣₒᵣ) = yᵢ - h₀(xᵢ) (assuming the simplest loss function)

  • Stage 1

Let M₁ (say y = h₁(x)) be the high-bias, low-variance model (like a shallow decision tree) at stage 1.

Training data for the model = {xᵢ, y⁰ₑᵣᵣₒᵣ}

The model obtained at stage 1 = α₀ * h₀(xᵢ) + α₁ * h₁(xᵢ)

Error after stage 1 (y¹ₑᵣᵣₒᵣ) = yᵢ - (α₀ * h₀(xᵢ) + α₁ * h₁(xᵢ))

  • Stage 2

Let M₂ (say y = h₂(x)) be the high-bias, low-variance model (like a shallow decision tree) at stage 2.

Training data for the model = {xᵢ, yᵢ - (α₀ * h₀(xᵢ) + α₁ * h₁(xᵢ))}, i.e. the models are combined additively to obtain the residual error.

The model obtained at stage 2 = α₀ * h₀(xᵢ) + α₁ * h₁(xᵢ) + α₂ * h₂(xᵢ)

Error after stage 2 (y²ₑᵣᵣₒᵣ) = yᵢ - (α₀ * h₀(xᵢ) + α₁ * h₁(xᵢ) + α₂ * h₂(xᵢ))

Now, we can say that the model obtained at the Kᵗʰ stage is

Fₖ(x) = Σᵢ₌₀ᵏ αᵢ * hᵢ(x) = Fₖ₋₁(x) + αₖ * hₖ(x)

Thus, as the training error reduces with each stage, the bias also reduces.
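
Here is a minimal from-scratch sketch of this stage-wise procedure for regression with squared loss, using shallow scikit-learn trees as the high-bias base learners; the toy data, the number of stages and the constant stage weight α are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2_000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=2_000)   # toy regression target

n_stages, alpha = 50, 0.1          # K stages, constant stage weight for simplicity
F = np.zeros_like(y)               # F_0(x) = 0, the initial model
stages = []

for k in range(n_stages):
    residual = y - F                               # y_err at stage k (squared loss)
    h = DecisionTreeRegressor(max_depth=2)         # shallow tree: high bias, low variance
    h.fit(X, residual)
    stages.append(h)
    F = F + alpha * h.predict(X)                   # F_k = F_{k-1} + alpha * h_k
    if k % 10 == 0:
        print(f"stage {k:2d}: training MSE = {np.mean((y - F) ** 2):.4f}")
```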

Now, assume that we are trying to solve a regression problem and the loss we are considering is the square loss.

So, the low-residual-error model obtained at stage k = Fₖ(x)

⇒ residual for stage k+1 (eₖ₊₁) = yᵢ - Fₖ(xᵢ)

⇒ input data for the model at stage k+1 = {xᵢ, eₖ₊₁}

In terms of square loss, the residual error would be defined as

L(yᵢ, Fₖ(xᵢ)) = (yᵢ - Fₖ(xᵢ))²  ⇒  ∂L/∂Fₖ(xᵢ) = -2 * (yᵢ - Fₖ(xᵢ))

Wow, did you observe anything!!!

We can write the residuals in terms of the loss function: the gradient of our square loss can be expressed through the residual error obtained at stage k+1. The most important thing to notice is that this works for any loss function, as long as the loss function is differentiable.

The negative gradient of the loss function, known as the pseudo-residual, plays the role of (and for the square loss is proportional to) the residual error.

Thus, earlier the training data for the model at stage k+1 was {xᵢ, yᵏₑᵣᵣₒᵣ}.

Now, the training data can be taken as {xᵢ, -∂L/∂Fₖ(xᵢ)}.

This whole idea forms the basis of the Gradient Boosting.

In the case of RF, we cannot minimize an arbitrary loss such as the hinge loss, since the trees are grown by minimizing an impurity measure like entropy. But with Gradient Boosting, we can minimize any loss as long as the loss function is differentiable.

Just as we showed how the square loss leads to the plain residual error as its pseudo-residual, you can work out the pseudo-residuals for the log loss.
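
As a hedged sketch of that claim: the pseudo-residual is the negative gradient of the loss with respect to the current prediction Fₖ(x). For the square loss it is (up to a factor of 2) the plain residual, and for the log loss, with F read as a log-odds score, it works out to y - sigmoid(F). The helper functions below are illustrative, not part of any library:

```python
import numpy as np

def pseudo_residual_square(y, F):
    # negative gradient of (y - F)^2 with respect to F, i.e. 2 * (y - F)
    return 2.0 * (y - F)

def pseudo_residual_logloss(y, F):
    # F is a raw score (log-odds); the negative gradient of the log loss is y - sigmoid(F)
    return y - 1.0 / (1.0 + np.exp(-F))

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.5, 2.0, -1.0])
print(pseudo_residual_square(y, F))    # proportional to the plain residuals
print(pseudo_residual_logloss(y, F))   # pseudo-residuals gradient boosting would fit for log loss
```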

This is just so awesome and so powerful!!!

If you want more insight into the gradient boosting algorithm, see the pseudocode on the Wikipedia page in the references [1], where M denotes the total number of base learners.

One extension of GB is GBDT where base learners are either decision stumps or shallow decision trees.

Regularization and shrinkage parameter

Now, when we are using GB, one primary concern is to figure out how many base learners to use. As the number of base learners increases, the variance of the whole model increases and the model starts to overfit. Thus, we need to take care of the number of models, i.e. K. You can choose this value using cross-validation.

An important part of the gradient boosting method is regularization by shrinkage, which introduces a learning-rate parameter as shown below.

Model at stage k: Fₖ(x) = Fₖ₋₁(x) + ν * αₖ * hₖ(x), where the parameter ν (0 < ν ≤ 1) is called the "learning rate".

It has been found that using small learning rates such as ν < 0.1 yields dramatic improvements in the model's generalization ability over gradient boosting without shrinkage (ν = 1).
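
A small illustrative experiment (synthetic data, assumed settings) with scikit-learn's gradient boosting shows how ν is exposed as learning_rate:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2_000, n_features=20, noise=10.0, random_state=0)

for nu in (1.0, 0.1, 0.05):        # learning rate (shrinkage parameter)
    gb = GradientBoostingRegressor(n_estimators=300, learning_rate=nu, random_state=0)
    score = cross_val_score(gb, X, y, cv=5, scoring="r2").mean()
    print(f"learning rate = {nu:>4}: CV R^2 = {score:.3f}")
```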

The XGBoost implementation of the GB technique is one of the best implementations and is well worth checking out.

— Stacking Models

This is the third ensembling technique. It is somewhat similar to the bagging technique, with a few differences. It combines very different kinds of models, such as SVM, RF, GBDT, KNN, LR, etc.

  • In this technique, the various models are trained in parallel and independently. The models used in the stacking process are well-trained, strong models with a good bias-variance trade-off.
  • The more the models differ from one another, the better the stacked model works.
  • The meta-classifier used at the end is also a well-trained model, such as logistic regression or linear regression (see the sketch below).
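
A hedged sketch using scikit-learn's StackingClassifier, with a few deliberately different base models and logistic regression as the meta-classifier; the model choices and data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=3_000, n_features=20, random_state=0)

# Diverse, independently trained base models
base_models = [
    ("svm", SVC(probability=True, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
]

# Logistic regression as the meta-classifier on top of the base model predictions
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
print("stacked model CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```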


In general, stacking techniques are used in Kaggle competitions, where a 0.01 difference in accuracy makes a whole lot of difference. In the real world, we rarely use these techniques because they are computationally expensive.

— Cascading Models

Cascading models are generally used where the cost of making a mistake is very high, as in credit card fraud detection, cancer prediction, etc.

It generally consists of models arranged in a cascade, where the later models are trained on the data points that are not confidently predicted by the earlier models, based on if-else conditions.

Input → M₁ (if p(ŷ) > 0.99 then ŷ = 0, else) → M₂ (if p(ŷ) > 0.90 then ŷ = 0, else) → M₃ (if p(ŷ) > 0.95 then ŷ = 0, else) → M₄

There are many ways to cascade the models based on if-else conditions.
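
One simple way to realize such an if-else cascade is sketched below; cascade_predict is a hypothetical helper (not a library API), and the thresholds and the "class 0 = not fraud" convention are assumptions for illustration.

```python
import numpy as np

def cascade_predict(models, thresholds, X):
    """Pass each point down the cascade until some model is confident it is negative.

    models:     fitted binary classifiers with predict_proba (assumption: class 0 = not fraud)
    thresholds: confidence required at each stage to stop and predict 0
    X:          NumPy array of inputs
    """
    y_pred = np.full(len(X), -1)              # -1 = not decided yet
    pending = np.arange(len(X))               # indices still flowing through the cascade

    for model, t in zip(models, thresholds):
        if len(pending) == 0:
            break
        p_negative = model.predict_proba(X[pending])[:, 0]   # P(y = 0)
        confident = p_negative > t
        y_pred[pending[confident]] = 0        # confidently "not fraud": stop here
        pending = pending[~confident]         # the rest go on to the next model

    y_pred[pending] = 1                       # whatever survives every stage is flagged
    return y_pred
```

Whatever no model is confident about falls through to the final stage, which in practice might be an expensive, highly accurate model or even a human reviewer.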

Wind-ups

That's all on ensembling techniques. I would suggest that readers explore these techniques further; they are among the most commonly used techniques in practice.

References:

  1. https://en.wikipedia.org/wiki/Gradient_boosting
