A brief idea on Ensemble Models-II (Gradient Boosting Decision Tree)

A comprehensive guide to Ensemble model for Boosting technique in Machine learning.

Bhanu
Analytics Vidhya
5 min read · Aug 20, 2020


Created with MS PowerPoint

In this blog, we shall understand Boosting, which is one of the strategies for building an Ensemble model. In my previous blog, I discussed the Bagging strategy in brief; please go through it if you haven’t. Here is the link.

In a simple way, we can relate the Boosting technique to one of the famous quotes or lessons of our life: “Learn from your Mistakes”

Mistake + Correction = Learning

This is the core idea of how Boosting works. Let’s dive deep to understand better.

Understanding Bias:

The idea of this model is to reduce bias. Bias errors are the errors made due to simplifying assumptions, for example, assuming that a plane would separate the positive and negative points when a curve (non-linear surface) would actually separate them. Such modelling assumptions lead to a high-bias model, which under-fits the data. Hence, we can intuitively take high bias as high training error.

How it works:

In this blog, I will give you an intuition of how Boosting works, with less mathematics and more theory. We take our base models to be low-variance, high-bias models (trees of shallow depth, typically 1 or 2; trees of depth 1 are called Decision Stumps).

Step 1:

In the first step, we train a model on the whole of the training data, i.e., we fit a function to the data and obtain the predicted output. Now that we have actual and predicted values, a simple difference between the actual and the predicted value gives the error. This error is used in the next stage so that the model fits better and reduces the error further.
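To make Step 1 concrete, here is a minimal sketch of fitting a decision stump and computing the stage-1 errors; the toy data, variable names (X, y, h0, error_0) and library choice are illustrative, not from the original post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data, just for illustration
X = np.random.rand(100, 3)
y = np.random.rand(100)

# Step 1: fit a shallow, high-bias base model (a decision stump) on all of the data
h0 = DecisionTreeRegressor(max_depth=1)
h0.fit(X, y)

# Error of the first stage: actual minus predicted
error_0 = y - h0.predict(X)
```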

Step 2:

Instead of training on the whole data points (Xi, Yi), we now train on Xi and the error obtained in the first stage (Step 1), i.e., we train the model on (Xi, error_i). So we are trying to fit a function to the errors instead of to Yi. Hence, the model at the end of this stage would look like,
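The original equation is an image; a plausible reconstruction of it in LaTeX notation (using the same h0, h1 and gamma symbols) is:

```latex
F_1(x) = \gamma_0 \, h_0(x) + \gamma_1 \, h_1(x)
```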

The weighted sum of two base models h0 and h1

The values of gamma are obtained by minimizing a loss function (please refer to the Wikipedia article for this).

Step 3:

Now we again train on Xi and the error obtained in Step 2 (the previous stage) and fit a model on top of it. Likewise, we repeat the steps m times, each time training on Xi and the errors left at the end of the previous stage (the residual errors), until we get a low residual error. As the number of stages increases (this equals the number of base learners, because at each stage we train one base model), we fit to the error at every stage, so the residual error reduces from stage to stage. This error is the bias. Hence we are able to achieve the task of reducing bias.

While solving the loss minimization problem, the residual at each stage m turns out to be the negative gradient of the loss function with respect to the model of the previous stage. Hence we replace the residuals with pseudo-residuals, which lets us minimize any loss function as long as it is differentiable. These pseudo-residuals are the core idea of Gradient Boosting. Please refer to the Gradient boosting article on Wikipedia for deeper insight into the mathematics. In this blog, I have mainly concentrated on the theory of how it works.
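Putting Steps 1 to 3 together, here is a minimal sketch of the boosting loop under squared loss, where the pseudo-residuals reduce to simple residuals; the function names, the choice of M and the omission of the per-stage gamma/shrinkage factor are simplifications of mine, not the exact algorithm from the post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=100):
    """Fit M shallow trees, each on the residual errors of the previous stage."""
    models = []
    F = np.zeros(len(y))                        # current ensemble prediction
    for m in range(M):
        residual = y - F                        # pseudo-residual for squared loss
        h = DecisionTreeRegressor(max_depth=1)  # decision stump as base model
        h.fit(X, residual)                      # fit the base model on the errors
        F += h.predict(X)                       # update the ensemble prediction
        models.append(h)
    return models

def gbdt_predict(models, X):
    # The final prediction is the sum of all base model outputs
    return sum(h.predict(X) for h in models)
```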

Hence the final model would look like,
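The original equation is again an image; a plausible LaTeX reconstruction, consistent with the two-stage model above, is:

```latex
F_m(x) = \sum_{i=0}^{m} \gamma_i \, h_i(x)
```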

m is the number of base learners

The above equation can be loosely compared to Logistic Regression: gamma plays a role similar to the weights in Logistic Regression, except that here it is a constant chosen to minimize the loss function at each stage.

Overfit:

We often overfit in GBDT (Gradient Boosted Decision Trees) as the number of base learners increases, because each stage fits a function to the errors of the previous stage, so the model becomes more and more accurate on the training data. This results in high variance and an overfit model. To avoid this, there is a concept called shrinkage, whose value typically lies between 0 and 1.

Smaller values of shrinkage scale down the contribution of each newly added base model, so each stage corrects the previous model’s output only partially. Hence this controls overfitting of the model.
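The original update rule is shown as an image; a plausible LaTeX reconstruction, with v denoting the shrinkage, is:

```latex
F_m(x) = F_{m-1}(x) + v \, \gamma_m \, h_m(x), \qquad 0 < v \le 1
```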

value of v controls overfit

Hyper-parameters: the number of base models and the shrinkage (v)

Code: sklearn.ensemble.GradientBoostingClassifier
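Below is a usage sketch of sklearn’s GradientBoostingClassifier; the dataset and the hyper-parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GradientBoostingClassifier(
    n_estimators=200,   # number of base learners (stages)
    learning_rate=0.1,  # shrinkage v
    max_depth=2,        # shallow, high-bias base trees
    random_state=42,
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```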

XGBoost:

Is there a way to combine the best of the Bagging technique (row sampling and column sampling) and Boosting (GBDT)? Yes, XGBoost is an implementation of GBDT with which we can achieve higher performance on the same data when compared to GBDT alone.
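As a rough sketch (assuming the xgboost package is installed; the parameter values are illustrative), row and column sampling can be enabled alongside boosting like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,     # shrinkage v
    max_depth=2,           # shallow base trees
    subsample=0.8,         # row sampling, borrowed from Bagging
    colsample_bytree=0.8,  # column sampling, borrowed from Bagging
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```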

GBDTs are not easily parallelized because the base models are sequentially related: the residuals from the previous stage are needed to train the next stage.

This is a brief intuition of how GBDTs work. They can easily be used for low-latency applications because each decision tree is of shallow depth, so storing and running them is cheap.

Please check the references below for a deeper look at the mathematics. Thank you :)

Please refer to this link for the next blog on Ensemble models, which is about Stacking.

References:

https://en.wikipedia.org/wiki/Gradient_boosting
