Mathematics behind Gradient Boosting for Regression
It generally performs better than AdaBoost in practice.
Now let's understand how it works using a dataset.
Here we will use 3 base models
In gradient boosting, if we are working on a regression problem, our first model is nothing but the mean of the output column, which here is salary. It is not a machine learning model.
It means that whatever the input is, it will always predict the average of 3, 4, 8, 6, 3, which is 4.8.
Now we need to check the performance of model 1, and for that we need a loss function.
With squared-error loss, the quantity we work with is the PSEUDO-RESIDUAL (the negative gradient of the loss), which is nothing but ACTUAL VALUE − PREDICTED VALUE.
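These two steps can be checked with a few lines of code (the salary values are the ones from the dataset above):

```python
# Toy dataset from the walkthrough: the output column (salary) is 3, 4, 8, 6, 3.
salary = [3, 4, 8, 6, 3]

# Model 1 is simply the mean of the output column -- no learning involved.
model1_pred = sum(salary) / len(salary)
print(model1_pred)                            # 4.8

# Pseudo-residuals for squared-error loss: actual - predicted.
res1 = [round(y - model1_pred, 1) for y in salary]
print(res1)                                   # [-1.8, -0.8, 3.2, 1.2, -1.8]
```

The column res1 is what the next model will be trained to predict.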
Our model 2 will be a decision tree (we can use other algorithms too, but decision trees are the standard choice in practice as they tend to give better results).
Now, in our model 2 the input features will be iq and cgpa, but the output column will be res1, not salary.
In other words, we are asking model 2 to predict the mistakes made by model 1.
Now let's say the decision tree for model 2 looks like this:
Now we calculate pred2 using the decision tree above:
Now, if we had only 2 base models, our salary prediction would be model 1 output + model 2 output.
So here, for the data point (iq = 90, cgpa = 8), model 1's output is 4.8 and model 2's output is −1.8, giving 4.8 + (−1.8) = 3. Similarly, for the data point (iq = 100, cgpa = 7) the prediction is 4.8 − 0.8 = 4, which is the same as the actual output.
As we can see, the predicted values are exactly the same as the actual values, i.e. the model is overfitting the training data. To avoid this we use the concept of a learning rate.
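A quick check of the no-shrinkage arithmetic above (the leaf outputs −1.8 and −0.8 are the model-2 tree outputs for those two data points):

```python
model1_out = 4.8
# Leaf outputs of the model-2 tree for the two data points in the text.
for model2_out, actual in [(-1.8, 3.0), (-0.8, 4.0)]:
    pred = model1_out + model2_out      # no learning rate applied yet
    print(round(pred, 1) == actual)     # True -- an exact fit on training data
```

An exact fit on the training points is precisely the overfitting the learning rate is meant to prevent.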
So here the predicted value will actually be model1 output + alpha1*(model2 output).
Let's keep the value of alpha1 as 0.1.
So now for data point (iq = 90 and cgpa = 8) the predicted output will be 4.8 + (0.1)*(-1.8) = 4.62
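The shrunken update, and the new residual it produces, in code:

```python
alpha1 = 0.1                  # learning rate
model1_out = 4.8
model2_out = -1.8             # model-2 leaf output for iq=90, cgpa=8
pred2 = model1_out + alpha1 * model2_out
print(round(pred2, 2))        # 4.62
res2 = round(3 - pred2, 2)    # new pseudo-residual: actual - predicted
print(res2)                   # -1.62
```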
Here we can see that res2 is smaller than res1. Ideally our residual would be 0, because it is nothing but the actual value minus the predicted value.
Now, for our model 3 the input will be iq and cgpa, and the output column will be res2.
Let's assume our model 3 decision tree looks like this:
Now we calculate pred3 with the help of the decision tree above:
At this point our y_pred formula will be:
y_pred = m1output + (alpha1)*(m2output) + (alpha2)*(m3output)
Here alpha1 and alpha2 (the learning rates) are the same for every model, which here is 0.1.
Now, for a student with (iq = 60 and cgpa = 4.9), his predicted salary will be
4.8 + (0.1)(−1.8) + (0.1)(−1.62) = 4.458 ≈ 4.5
The values −1.8 and −1.62 here are the leaf outputs of the decision trees for model 2 and model 3 respectively.
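Checking that three-term sum in code (note the exact value is 4.458):

```python
alpha = 0.1                                  # shared learning rate
m1_out, m2_out, m3_out = 4.8, -1.8, -1.62    # outputs for iq=60, cgpa=4.9
y_pred = m1_out + alpha * m2_out + alpha * m3_out
print(round(y_pred, 3))                      # 4.458
```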
AdaBoost vs Gradient Boost
1. Maximum leaf nodes
For AdaBoost: 2 (decision stumps are used)
For Gradient Boost: 8–32 (full decision trees are used)
2. Learning rate/weights
For AdaBoost: each model is assigned a weight, and those weights decide the importance given to each model
For Gradient Boost: the learning rate is the same for every model
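The whole procedure can be sketched in a few lines, assuming scikit-learn is available; the iq/cgpa feature values below are made up for illustration, only the salary column comes from the walkthrough:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features (iq, cgpa); the salary column is from the walkthrough.
X = np.array([[90, 8], [100, 7], [110, 6], [120, 9], [80, 5]], dtype=float)
y = np.array([3, 4, 8, 6, 3], dtype=float)

alpha = 0.1      # same learning rate for every model
n_trees = 2      # models 2 and 3 in the walkthrough

pred = np.full_like(y, y.mean())    # model 1: the mean of the output column
trees = []
for _ in range(n_trees):
    residual = y - pred                       # pseudo-residuals (squared error)
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residual)
    pred = pred + alpha * tree.predict(X)     # shrunken additive update
    trees.append(tree)

def predict(x):
    # y_pred = mean + alpha * (sum of tree outputs)
    return y.mean() + alpha * sum(t.predict(x) for t in trees)
```

Each pass fits a tree to the current residuals and nudges the prediction toward the targets by a factor of alpha, exactly as in the worked example.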