Mathematics behind Gradient Boosting for Regression
It generally performs better than AdaBoost in practice.
Now let's understand how it works using a dataset.
Here we will use 3 base models
In gradient boosting, if we are working on a regression problem, our first model is nothing but the mean of the output column, which here is salary. It is not a machine learning model.
It means that whatever the input is, it will always predict the average of 3, 4, 8, 6, 3, which is 4.8.
Now we need to check the performance of model 1, and for that we need a loss function.
With squared-error loss, the quantity we work with is the PSEUDO-RESIDUAL (the negative gradient of the loss), which is nothing but ACTUAL VALUE − PREDICTED VALUE.
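These two steps can be checked with a few lines of code (the salary values are the ones from the dataset above):

```python
# Toy dataset from the walkthrough: the output column (salary) is 3, 4, 8, 6, 3.
salary = [3, 4, 8, 6, 3]

# Model 1 is simply the mean of the output column -- no learning involved.
model1_pred = sum(salary) / len(salary)
print(model1_pred)                            # 4.8

# Pseudo-residuals for squared-error loss: actual - predicted.
res1 = [round(y - model1_pred, 1) for y in salary]
print(res1)                                   # [-1.8, -0.8, 3.2, 1.2, -1.8]
```

The column res1 is what the next model will be trained to predict.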
Our model 2 will be a decision tree (we can use other algorithms too, but decision trees are the standard choice in practice as they tend to give better results).
Now, in our model 2 the input features will be iq and cgpa, but the output column will be res1, not salary.
In other words, we are asking model 2 to predict the mistakes made by model 1.
Now let's say the decision tree for model 2 looks like this:
Now we calculate pred2 using the decision tree above:
Now, if we had only 2 base models, our salary prediction would be model 1 output + model 2 output.
So here, for the data point (iq = 90, cgpa = 8), model 1's output is 4.8 and model 2's output is −1.8, giving 4.8 + (−1.8) = 3. Similarly, for the data point (iq = 100, cgpa = 7) the prediction is 4.8 − 0.8 = 4, which is the same as the actual output.
As we can see, the predicted values are exactly the same as the actual values, i.e. the model is overfitting the training data. To avoid this we use the concept of a learning rate.
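A quick check of the no-shrinkage arithmetic above (the leaf outputs −1.8 and −0.8 are the model-2 tree outputs for those two data points):

```python
model1_out = 4.8
# Leaf outputs of the model-2 tree for the two data points in the text.
for model2_out, actual in [(-1.8, 3.0), (-0.8, 4.0)]:
    pred = model1_out + model2_out      # no learning rate applied yet
    print(round(pred, 1) == actual)     # True -- an exact fit on training data
```

An exact fit on the training points is precisely the overfitting the learning rate is meant to prevent.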
So here the predicted value will actually be model1 output + alpha1*(model2 output).
Let's keep the value of alpha1 as 0.1.
So now for data point (iq = 90 and cgpa = 8) the predicted output will be 4.8 + (0.1)*(-1.8) = 4.62
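The shrunken update, and the new residual it produces, in code:

```python
alpha1 = 0.1                  # learning rate
model1_out = 4.8
model2_out = -1.8             # model-2 leaf output for iq=90, cgpa=8
pred2 = model1_out + alpha1 * model2_out
print(round(pred2, 2))        # 4.62
res2 = round(3 - pred2, 2)    # new pseudo-residual: actual - predicted
print(res2)                   # -1.62
```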
Here we can see that res2 is smaller than res1. Ideally our residual would be 0, because it is nothing but the actual value minus the predicted value.
Now, for our model 3 the input will be iq and cgpa, and the output column will be res2.
Let's assume our model 3 decision tree looks like this:
Now we calculate pred3 with the help of the decision tree above:
At this point our y_pred formula will be:
y_pred = m1output + (alpha1)*(m2output) + (alpha2)*(m3output)
Here alpha1 and alpha2 (the learning rates) are the same for every model, which here is 0.1.
Now, for a student with (iq = 60 and cgpa = 4.9), his predicted salary will be
4.8 + (0.1)(−1.8) + (0.1)(−1.62) = 4.458 ≈ 4.5
The values −1.8 and −1.62 here are the leaf outputs of the decision trees for model 2 and model 3 respectively.
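Checking that three-term sum in code (note the exact value is 4.458):

```python
alpha = 0.1                                  # shared learning rate
m1_out, m2_out, m3_out = 4.8, -1.8, -1.62    # outputs for iq=60, cgpa=4.9
y_pred = m1_out + alpha * m2_out + alpha * m3_out
print(round(y_pred, 3))                      # 4.458
```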
AdaBoost vs Gradient Boost
1. Maximum leaf nodes
For AdaBoost: 2 (decision stumps are used)
For Gradient Boost: 8–32 (full decision trees are used)
2. Learning rate/weights
For AdaBoost: each model is assigned a weight, and those weights decide the importance given to each model
For Gradient Boost: the learning rate is the same for every model
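The whole procedure can be sketched in a few lines, assuming scikit-learn is available; the iq/cgpa feature values below are made up for illustration, only the salary column comes from the walkthrough:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features (iq, cgpa); the salary column is from the walkthrough.
X = np.array([[90, 8], [100, 7], [110, 6], [120, 9], [80, 5]], dtype=float)
y = np.array([3, 4, 8, 6, 3], dtype=float)

alpha = 0.1      # same learning rate for every model
n_trees = 2      # models 2 and 3 in the walkthrough

pred = np.full_like(y, y.mean())    # model 1: the mean of the output column
trees = []
for _ in range(n_trees):
    residual = y - pred                       # pseudo-residuals (squared error)
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residual)
    pred = pred + alpha * tree.predict(X)     # shrunken additive update
    trees.append(tree)

def predict(x):
    # y_pred = mean + alpha * (sum of tree outputs)
    return y.mean() + alpha * sum(t.predict(x) for t in trees)
```

Each pass fits a tree to the current residuals and nudges the prediction toward the targets by a factor of alpha, exactly as in the worked example.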