Gradient Boosted Decision Tree — Clearly Explained

Ruchi
9 min read · Dec 21, 2023

Source: Brilliantly Wrong

The primary concept behind boosting is that a group of weak learners can become strong when combined. Boosting has its origins in the 1990s, when Robert Schapire and Yoav Freund conducted the first experiments with AdaBoost. Around the same time, Jerome Friedman introduced Gradient Boosted Decision Trees, also known as GBDT, to the world. These continuing efforts have led to one of the most powerful and robust families of machine learning algorithms: boosting. Boosting, in essence, is about learning from your own mistakes.

In this blog, we aim to break down the fundamentals of Boosting Algorithms and understand what sets them apart. We will focus on one specific boosting algorithm, GBDT, and learn about its end-to-end training process.

This blog assumes that you are familiar with Decision Trees and the math behind them. For simplicity, we will use a hypothetical regression problem throughout the blog.

What is Boosting?

Boosting is a kind of ensemble technique, i.e. a technique that combines the predictions of several base learners, each individually weak, into a final prediction. Time and again, boosting algorithms have proven their versatility on a wide range of datasets. They have also been able to handle issues such as overfitting, class imbalance and missing values. One of the most common boosting algorithms in practice is GBDT, which is why I felt the need to write this blog and dive deep into what boosting has to uniquely offer.

To keep it simple, most boosting algorithms follow a similar pattern of training:

  1. Initialise a leaf with a constant prediction R₀
  2. Compute the residual (observed value minus predicted value) for all data points
  3. Create a regression tree F₁(xᵢ) to predict these residuals, where i indexes the datapoints
  4. Calculate new predictions across all datapoints (R₀ + F₁(xᵢ))
  5. Repeat steps 2 to 4

Note that this only holds true when the loss function is L = ½ * [yᵢ - F(xᵢ)]². We’ll talk about this in detail later in the blog. A minimal code sketch of this training loop is shown below.
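
The sketch below assumes a squared-error loss and uses scikit-learn’s DecisionTreeRegressor for the residual trees. The function name fit_gbdt and the defaults n_trees, lr and max_depth are illustrative choices, not part of the original example; lr is the shrinkage (learning rate) factor discussed later in the blog.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, lr=0.1, max_depth=2):
    base = float(np.mean(y))                  # Step 1: constant prediction R0
    pred = np.full(len(y), base)              # current prediction for every datapoint
    trees = []
    for _ in range(n_trees):
        residual = y - pred                   # Step 2: observed minus predicted
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                 # Step 3: regression tree on the residuals
        pred = pred + lr * tree.predict(X)    # Step 4: update the predictions
        trees.append(tree)                    # Step 5: repeat with the new residuals
    return base, trees

The rest of the blog walks through one iteration of exactly this loop by hand.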

Let’s go through each of these steps in depth and understand how they are implemented in a GBDT.

Gradient-Boosted Decision Trees (GBDT)

Let’s assume we have a dataset with the features gender, weight and height, and the target feature (yᵢ) is age.

Step 1

The initial prediction for a GBDT = Mean(yᵢ) = mean of the ‘age’ column = 39. So currently the tree has only one root node with value 39, and the predicted value for all the data points is 39. Let’s call this first tree F₀(xᵢ); it has only one node.

Step 2

The residual is calculated as the difference between the actual value and the predicted value for every data point.

The predicted value from the 0th tree = Value of the root node = 39

df['prediction_0'] = df['age'].mean()   # base prediction = 39 for every row
df['residual_0'] = df['age'] - df['prediction_0']

Step 3

The regression tree F₁(xᵢ) is created to predict the residuals across all i datapoints, i.e. the next tree is built on the errors of the previous tree. This tree is built like any other regression tree: each split is chosen to minimise a criterion such as the squared error (variance reduction), and the algorithm searches for the optimum threshold at each node.

Once you reach a leaf node, its final output becomes the average of all the residuals in that leaf. Since 3 out of 4 leaf nodes have only one observation, their values remain unchanged. For the node with residuals 11, -17 and -14, the output = average(11, -17, -14) = -6.67.

Note that it is only for the specified loss function that the value of a leaf node becomes the average of all observations in that leaf.
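
A sketch of this step, assuming the blog’s table is already loaded in a pandas DataFrame df (with gender numerically encoded) and that prediction_0 and residual_0 have been computed as above; the names X and tree_1 and the depth of 2 are assumptions for illustration.

from sklearn.tree import DecisionTreeRegressor

X = df[['gender', 'weight', 'height']]        # feature columns from the example table

tree_1 = DecisionTreeRegressor(max_depth=2)   # F1(x): a shallow regression tree
tree_1.fit(X, df['residual_0'])               # fit on the residuals, not on age

# With squared-error splits, each leaf's value is the mean of the residuals that
# land in it, e.g. average(11, -17, -14) = -6.67 for the leaf shared by three rows.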

Step 4

The new predictions are calculated across all i datapoints, where every new prediction is the sum of the previous prediction and the prediction of the current tree. In essence, every new prediction is an attempt to be better than the previous one.

Hence the new output (prediction_1) in this case is:

prediction_1 = R₀ + output(F₁(xᵢ))
prediction_1 = prediction_0 + output(F₁(xᵢ))
prediction_1 = 39 + output(F₁(xᵢ))

Let’s take a look at prediction_1 for i = 2

The red arrows show the path that i=2 will traverse
prediction_1 = 39 + output(F₁(xᵢ))
prediction_1 = 39 + 5
prediction_1 = 44

Hence the new prediction for i=2 is 44, which is exactly equal to the target age that the model was supposed to predict in the first place. Clearly, the model is overfitting. The thing with boosting is that the model can quickly fit, and then overfit, the training dataset. Therefore, we add a ‘learning rate’ parameter, also known as shrinkage, to allow the model to take smaller steps in the right direction. The smaller the learning rate, the smaller these steps.

Let’s assume learning rate, lr = 0.1

prediction_1 = R₀ + (lr*output(F₁(xᵢ)))
prediction_1 = prediction_0 + (lr*output(F₁(xᵢ)))
prediction_1 = 39 + (0.1*output(F₁(xᵢ)))

Therefore the new prediction for i=2

prediction_1 = 39 + (0.1*output(F₁(xᵢ)))
prediction_1 = 39 + (0.1*5)
prediction_1 = 39.5

We can see that we have taken a small step in the right direction, making the prediction closer to the target value of 44 and better than the previous prediction of 39. We calculate this for all the other datapoints.

The predictions after building tree F₁(xᵢ) are stored in the ‘prediction_1’ column.
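
Continuing the same sketch, the update with the learning rate might look like this (tree_1 and X are the assumed objects from the snippet above):

lr = 0.1                                                # learning rate / shrinkage

df['prediction_1'] = df['prediction_0'] + lr * tree_1.predict(X)
df['residual_1'] = df['age'] - df['prediction_1']       # fed to the next tree in Step 5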

It might seem odd, because the tree should be built on what we are trying to predict, i.e. age, and not on the residuals. But this is the model’s way of learning from its own mistakes, so that every subsequent tree is better than the previous one; hence, boosting.

We need to understand that after every step the model is trying to quantify how much it needs to improve in order to be better than the previous prediction, and it quantifies this by looking at how wrong it was the last time it made a prediction.

Step 5

We repeat Step 2 and compute the new residuals for all the datapoints based on the latest prediction_1.

We can see that the new predictions are not very close to the target values in the ‘age’ column, but they are certainly better than prediction_0. We now build the next tree on the residuals of the new prediction, residual_1, and repeat the other steps until we reach the maximum number of trees or the residuals barely change.

prediction_n = R₀ + (lr*output(F₁(xᵢ))) + ... + (lr*output(Fₙ(xᵢ)))
where every Fₖ(xᵢ) is built on the residuals of the (k-1)th prediction
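
As a code sketch, this accumulated prediction can be written as below, assuming base is R₀ and trees is a list of fitted residual trees such as the one returned by the earlier fit_gbdt sketch:

import numpy as np

def predict_gbdt(base, trees, X, lr=0.1):
    # prediction_n = R0 + lr*output(F1(x)) + ... + lr*output(Fn(x))
    pred = np.full(X.shape[0], base, dtype=float)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred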

But where have we computed the gradient in this entire process? The Real GBDT Math

The gradient is computed as the derivative of the loss function. The term ‘residual’ that we have used for convenience of calculation is nothing but the negative gradient when we take the loss function as L = ½ * [yᵢ - F(xᵢ)]².

Let’s see how.

To reiterate the actual steps involved in GBDT:

Step 1

Choose a differentiable loss function, L(y, f(x)). The derivative of this loss function is the term ‘gradient’ in gradient boosting.
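
For the squared-error loss used throughout this blog, a small helper pair (the function names are illustrative) makes the relationship explicit:

import numpy as np

def squared_loss(y, pred):
    # L(y, F) = 1/2 * (y - F)^2
    return 0.5 * (y - pred) ** 2

def negative_gradient(y, pred):
    # dL/dF = -(y - F), so the negative gradient is y - F, i.e. the residual
    return y - pred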

Step 2

Initialise a leaf with a constant prediction, or base prediction. The value of this constant should be the number that minimises the loss across all i, and we know that the loss is at its minimum when the derivative of the loss function is set to 0. Remember that we took the mean of the ‘age’ column as the base prediction? Let’s see why.

Let's call the base prediction R₀:

-> Total loss = L(y₁, R₀) + L(y₂, R₀) + ... + L(yᵢ, R₀)

Let's assume L = ½ * [yᵢ - F(xᵢ)]²
-> Total loss = ½ * [y₁ - R₀]² + ½ * [y₂ - R₀]² + ... + ½ * [yᵢ - R₀]²

Since we have to minimise the total loss, we take its derivative wrt R₀
and equate it to 0 to find the value of R₀

For eg. derivative(½ * [y₁ - R₀]²) wrt R₀ = -[y₁ - R₀] = -(residual wrt y₁)

Setting the derivative of the total loss to 0 (and multiplying through by -1):
-> [y₁ - R₀] + [y₂ - R₀] + ... + [yᵢ - R₀] = 0
-> R₀*i = y₁ + y₂ + ... + yᵢ
-> R₀ = (y₁ + y₂ + ... + yᵢ) / i
-> R₀ = mean of the target column

Hence, for L = ½ * [yᵢ - F(xᵢ)]², the value of R₀ comes out to be the mean of the target column, in our case Mean(df[‘age’]), which we had taken as 39. The base prediction will be different for any other differentiable loss function.

We also see that the value of the negative gradient for this loss function is equal to the residual.
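
A quick numerical sanity check of this result, assuming the df DataFrame from earlier (the grid of candidate constants is arbitrary):

import numpy as np

y = df['age'].to_numpy()
candidates = np.linspace(y.min(), y.max(), 1001)            # candidate base predictions
total_loss = [np.sum(0.5 * (y - c) ** 2) for c in candidates]
print(candidates[np.argmin(total_loss)], y.mean())          # minimiser ~ mean of 'age'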

Step 3

Once you have the prediction, compute the negative derivative (negative gradient) of the loss function L(yᵢ, f(xᵢ)) wrt f(x) for all data points i. As we saw above, for L = ½ * [yᵢ - F(xᵢ)]² the negative of the iᵗʰ derivative equals the residual. For example,

When i = 1
Loss = ½ * [y₁ - R₀]²
derivative(Loss) wrt R₀ = -[y₁ - R₀], so the negative derivative = [y₁ - R₀] = residual wrt y₁

Hence, at every step we were calculating the residuals, i.e. the negative gradients, and building the next tree of the gradient boosted model over them.

In more general terms, every tree in GBDT is built to understand the extent of the error in the last prediction. That extent can be quantified by the derivative of the loss function, i.e. by how much higher or lower the prediction should be in order to minimise the loss. This derivative is what we call the gradient, and the negative gradient for Loss = ½ * [yᵢ - F(xᵢ)]² is [yᵢ - F(xᵢ)], i.e. the residual.

Step 4

Fit a regression tree to predict the negative gradients calculated in Step 3. Let’s call this first tree F₁(xᵢ). Once the tree is built, evaluate the output of every leaf node, since leaf nodes can have more than one gradient bucketed in them.

The output of every leaf should be the number that minimises the loss with respect to all the observations that fall in that leaf. This ensures that each leaf node individually contributes to minimising the loss, and the same holds for every other leaf.

This minimum is achieved when the derivative of the loss function with respect to the leaf output, summed over the observations in that leaf, is set to 0. Note that this is exactly the same as what we did in Step 2.

Remember that we took the mean of the observations in a leaf node as the output of that leaf? For the node with residuals 11, -17 and -14, the output was average(11, -17, -14) = -6.67. Let’s see why.


Let's assume L = ½ * [yᵢ - F(xᵢ)]²

-> Total loss = L(y₁, prediction_1₁) + L(y₂, prediction_1₂) + ... + L(yᵢ, prediction_1ᵢ)
where prediction_1ₖ = value of the prediction after the 1st tree for the kth observation

Since we only need to minimise the loss for the observations in that leaf
-> L_leaf = ½ * [y₁ - prediction_1₁]²
          + ½ * [y₂ - prediction_1₂]²
          + ½ * [y₆ - prediction_1₆]²
where L_leaf = total loss for a leaf containing the 1st, 2nd and 6th observations

We also know that,
prediction_1 = prediction_0 + γ
where γ = output(F₁(xᵢ)) for that leaf (the learning rate is ignored in this derivation)

-> L_leaf = ½ * [y₁ - (prediction_0₁ + γ)]²
          + ½ * [y₂ - (prediction_0₂ + γ)]²
          + ½ * [y₆ - (prediction_0₆ + γ)]²
Note that for the first tree prediction_0₁ = prediction_0₂ = prediction_0₆ = prediction_0

Since we have to minimise L_leaf, we take its derivative wrt γ
and equate it to 0

For eg. derivative(½ * [y₁ - (prediction_0₁ + γ)]²) wrt γ
= -[y₁ - (prediction_0₁ + γ)] = -(residual_0₁ - γ) = -(11 - γ)

Setting the derivative of L_leaf to 0 (and multiplying through by -1):
-> [y₁ - (prediction_0 + γ)] + [y₂ - (prediction_0 + γ)] + [y₆ - (prediction_0 + γ)] = 0
-> [residual_0₁ - γ] + [residual_0₂ - γ] + [residual_0₆ - γ] = 0
-> γ*3 = residual_0₁ + residual_0₂ + residual_0₆
-> γ = (residual_0₁ + residual_0₂ + residual_0₆) / 3
-> γ = mean of all the residuals in that leaf
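
The same sanity check works for the leaf output, using only the three residuals quoted above (the candidate grid is arbitrary):

import numpy as np

leaf_residuals = np.array([11, -17, -14])
gammas = np.linspace(-20, 20, 4001)                          # candidate leaf outputs
leaf_loss = [np.sum(0.5 * (leaf_residuals - g) ** 2) for g in gammas]
print(gammas[np.argmin(leaf_loss)], leaf_residuals.mean())   # both ~ -6.67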

Step 5

Calculate the new prediction after the 1st tree, i.e. prediction_0 + (lr*output(F₁(xᵢ))), for all i datapoints.

Repeat Step 3 to Step 5 for all subsequent trees.
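
For reference, scikit-learn’s GradientBoostingRegressor implements this same procedure with a squared-error loss by default; a minimal sketch, reusing the assumed X and df from the earlier snippets:

from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2)
gbdt.fit(X, df['age'])
df['prediction_sklearn'] = gbdt.predict(X)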

Hope this helps you intuitively understand the role of Gradients in learning of a Boosting algorithm.

