Gradient Boosting | What? Why? How?

Rishabh Agrawal
DS/AIwithRishabh

First we will talk about what Gradient Boosting is and how the algorithm works; at the end we will look at what the algorithm is actually doing and why it does it that way.

What is Gradient Boosting ?

Gradient Boosting is a machine learning ensemble technique that builds a strong predictive model by combining the predictions of several weaker models (typically decision trees) using the concept of Additive Modelling.

Additive Modelling: F(x) = h1(x) + h2(x) + h3(x) + …

For more detail about Additive Modelling, you can read this blog.

  • The method works by iteratively adding models to the ensemble, with each new model trained to correct the mistakes made by the models already in the ensemble. Here, "weaker models" means models that on their own do not perform well on the training data.
  • Gradient Boosting is a very powerful and flexible method that can be used for both regression and classification.
  • It is particularly effective when the data has complex, non-linear relationships and it often performs well even with little hyperparameter tuning.
  • Its popularity in real-world applications and machine-learning competitions is a testament to its effectiveness, and it has an implementation in most major machine-learning libraries, such as scikit-learn, XGBoost, LightGBM, and CatBoost (a minimal scikit-learn example is shown just below this list).
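
To give a feel for how little code one of those libraries needs, here is a minimal sketch using scikit-learn's GradientBoostingRegressor. The synthetic dataset and the hyperparameter values are purely illustrative:

```python
# Minimal sketch: gradient boosting for regression with scikit-learn.
# The synthetic data and hyperparameter values below are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingRegressor(
    n_estimators=100,   # M: number of weak trees added one by one
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # keeps each individual tree a "weak" learner
)
model.fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))
```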

Before we start learning about the whole algorithm in detail, let me clarify one thing: the mathematics and notation around this algorithm are a little tricky, so be patient and take your time to read and understand it.

Gradient Boosting Algorithm

Let’s see Step 1

Step 1: F₀(x) = argmin_γ Σᵢ L(yᵢ, γ), i.e. initialize the model with a single constant prediction γ that minimizes the total loss.

So, first we are trying to get an initial prediction value to start with. To get it, we need to go back to the idea we used in linear regression: our main motive is to minimize the loss function, right?

Generally, when we were learning Gradient Descent or OLS, we were calculating the best parameter values b0, b1, b2, … such that the loss function is minimized.

In that case it was a parametric machine learning algorithm: we assumed our mathematical function to be linear, y(predicted) = b0 + b1*x1 + b2*x2 + b3*x3 + …. The loss was minimized there in parameter space, as we were searching for the parameters in an (N+1)-dimensional space.

But when it comes to the gradient boosting algorithm, we are not making any linear assumption about our data. We have no idea what the mathematical function could be, as the data we deal with can be more complex and is not necessarily linear.

So here, instead of minimizing the loss in parameter space, we will minimize the loss in function space (if you want to know more about function space vs parameter space, read this blog). Minimizing the loss in function space means that our mathematical function itself is the variable, and it can be any function. We will start with a single fixed value γ (gamma) as the prediction for all the N data points in our dataset, and then find the direction that minimizes the loss function to get the initial prediction γ.

Further on, we will see that the direction and magnitude in which the loss function can be minimized is given by −(slope), just as we saw in gradient descent in parameter space. Here, in function space, it comes out to be residual = −(slope).

Here is the calculation of the first prediction F₀(x):

γ = F₀(x) = mean of y
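
Here is a short sketch of that calculation, assuming the squared-error loss L(y, γ) = ½(y − γ)², which is the loss the later steps also implicitly use. We set the derivative of the total loss with respect to γ to zero:

```latex
F_0(x) = \arg\min_{\gamma}\sum_{i=1}^{n}\tfrac{1}{2}(y_i - \gamma)^2,
\qquad
\frac{\partial}{\partial\gamma}\sum_{i=1}^{n}\tfrac{1}{2}(y_i - \gamma)^2
  = \sum_{i=1}^{n}(\gamma - y_i) = 0
\;\Longrightarrow\;
\gamma = \frac{1}{n}\sum_{i=1}^{n} y_i
```

So the best constant prediction under squared-error loss is simply the mean of y, which is exactly the F₀(x) written above.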

Now we have the initial prediction, and we also know the reason why the initial prediction is taken as the mean of y.

Overview of the algorithm

Step 2(a)

For each of the M decision trees we go through four steps; step (a) is this:

We are calculating residuals rᵢ𝑚 by taking the derivative of the loss function with respect to the previous prediction F𝑚-₁ and multiplying it by −1. As you can see from the subscript index, rᵢ𝑚 is computed for each single sample i (each row). Some of you might be wondering why we are calling this rᵢ𝑚 residuals. This value is actually the negative gradient, which gives us guidance on the direction (+/−) and the magnitude in which the loss function can be minimized. You will see why we are calling it residuals shortly. By the way, this technique, where you use a gradient to minimize the loss of your model, is very similar to the gradient descent technique which is typically used to optimize neural networks. (In fact, they are slightly different from each other. If you are interested, please look at this article detailing that topic.)

Let’s compute the residuals here. F𝑚-₁ in the equation means the prediction from the previous step. In this first iteration, it is F₀. We are solving the equation for residuals rᵢ𝑚.

Solving this with the squared-error loss leaves us with rᵢ𝑚 = yᵢ − F𝑚-₁(xᵢ). You might now see why we call it residuals. This also gives us the interesting insight that the negative gradient, which provides the direction and the magnitude in which the loss is minimized, is actually just the residuals.
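
To make step 2(a) concrete, here is a tiny sketch in code, assuming squared-error loss so that the negative gradient reduces to y − F𝑚-₁ (the toy numbers and the name F_prev are mine):

```python
import numpy as np

# Step 2(a): residuals = negative gradient of the loss w.r.t. the previous prediction.
# For squared-error loss L = 0.5 * (y - F)^2, dL/dF = -(y - F), so the
# negative gradient is simply y - F_prev, i.e. the ordinary residuals.
def negative_gradient(y, F_prev):
    return y - F_prev

y = np.array([3.0, 5.0, 8.0, 10.0])   # toy targets (illustrative only)
F_prev = np.full_like(y, y.mean())    # F0: the initial constant prediction
residuals = negative_gradient(y, F_prev)
print(residuals)                      # [-3.5 -1.5  1.5  3.5]
```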

Step-2(b)

In step (b) we fit a regression tree to the residuals rᵢ𝑚 from step (a). In the notation, j represents a terminal node (i.e. a leaf) of that tree, m denotes the tree index, and capital J means the total number of leaves.
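
As a sketch of this step, assuming scikit-learn's DecisionTreeRegressor as the weak learner (the toy data and max_depth value are only illustrative), the terminal region Rⱼ𝑚 that each sample xᵢ lands in can be read off with tree.apply():

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (illustrative only): one feature, four samples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
residuals = np.array([-3.5, -1.5, 1.5, 3.5])   # from step 2(a)

# Step 2(b): fit a small regression tree to the residuals.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, residuals)

# leaf_ids[i] is the index of the terminal region R_jm that sample x_i falls into.
leaf_ids = tree.apply(X)
print(leaf_ids)
```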

Step-2(c)

We are searching for the γⱼ𝑚 that minimizes the loss function on each terminal node j (i.e. for each terminal region). Σxᵢ∈Rⱼ𝑚 L means we are aggregating the loss over all the samples xᵢ that belong to the terminal node Rⱼ𝑚 (i.e. belong to the j-th terminal region). Let's plug the loss function into the equation.

Please note that nⱼ means the number of samples in the terminal node j. This means the optimal γⱼ𝑚 that minimizes the loss function is the average of the residuals rᵢ𝑚 in the terminal node Rⱼ𝑚. In other words, γⱼ𝑚 is just the regular prediction value of a regression tree, which is the average of the target values (in our case, residuals) in each terminal node.
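
Continuing the same toy example, step 2(c) for squared-error loss amounts to nothing more than averaging the residuals inside each leaf (the leaf indices below are illustrative):

```python
import numpy as np

# From the step 2(b) sketch: each sample's leaf index and its residual.
leaf_ids = np.array([1, 1, 2, 2])              # illustrative leaf assignments
residuals = np.array([-3.5, -1.5, 1.5, 3.5])

# Step 2(c): gamma_jm = mean of the residuals that fell into terminal region R_jm.
gamma = {leaf: residuals[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}
print(gamma)   # {1: -2.5, 2: 2.5}  -> leaf index mapped to its leaf value gamma_jm
```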

Step-2(d)

In the final step, we update the prediction of the combined model F𝑚. γⱼ𝑚 · 1(x ∈ Rⱼ𝑚) means that we pick the value γⱼ𝑚 if a given x falls in the terminal node Rⱼ𝑚. As all the terminal nodes are exclusive, any given x falls into exactly one terminal node, and the corresponding γⱼ𝑚 is added to the previous prediction F𝑚-₁ to give the updated prediction F𝑚.

There is also a learning rate you can use here, which ranges between 0 and 1 and controls the degree of contribution of the additional tree prediction γⱼ𝑚 to the combined prediction F𝑚. A smaller learning rate reduces the effect of each additional tree, but it also reduces the chance of the model overfitting to the training data.
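
Putting step 1 and steps 2(a) to 2(d) together, here is a minimal from-scratch sketch of gradient boosting for regression, assuming squared-error loss and scikit-learn's DecisionTreeRegressor as the weak learner; the class name, variable names, and default hyperparameters are mine:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class SimpleGradientBoosting:
    """Minimal gradient boosting for regression with squared-error loss."""

    def __init__(self, n_trees=100, learning_rate=0.1, max_depth=3):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        # Step 1: F0(x) is the constant minimizing the squared loss, i.e. mean of y.
        self.F0 = y.mean()
        F = np.full(len(y), self.F0)
        self.trees = []
        for m in range(self.n_trees):
            # Step 2(a): residuals = negative gradient of the loss = y - F_{m-1}.
            residuals = y - F
            # Step 2(b): fit a regression tree to the residuals.
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # Step 2(c): with squared-error loss, the tree's leaf values are already
            # the mean residual of each terminal region, i.e. the optimal gamma_jm.
            # Step 2(d): F_m = F_{m-1} + learning_rate * gamma_jm(x).
            F = F + self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        F = np.full(X.shape[0], self.F0)
        for tree in self.trees:
            F = F + self.learning_rate * tree.predict(X)
        return F
```

Note that with squared-error loss, step 2(c) needs no extra work in this sketch, because a regression tree's leaf value is already the mean of the residuals in that leaf; for other loss functions the leaf values γⱼ𝑚 would have to be recomputed.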

And now we can see how the concept of additive modelling helps us get a strong prediction out of the predictions of M weak decision trees.

I hope you now have some understanding of what gradient boosting is, how it works, and why some of the seemingly weird mathematics in the algorithm is there, even if parts of it still feel confusing.

Here are a few sites you can go through to strengthen your understanding of this algorithm:

  1. Gradient Boosting in detail with geometric intuition
  2. Different view to understand gradient boosting

Follow me on LinkedIn and this page for more such content

