Intuition and Implementation of Gradient Boost Part-1

Arun Amballa
Published in Analytics Vidhya
Jul 2, 2020 · 8 min read


Understanding the Math Intuition and Implementation of Gradient Boost for a Regression Problem…!

Boosting

💥Gradient Boosting Algorithm:

Gradient Boosting, or GBM (Gradient Boosting Machine), is another ensemble machine learning algorithm that works for both regression and classification problems. GBM uses the boosting technique, combining a number of weak learners to form a strong learner. The models in GBM are built sequentially, and each subsequent model tries to reduce the error of the previous model. But how does each model reduce the errors of the previous model? It does so by building each new model on the errors, or residuals, of the previous model.
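For readers who just want to try GBM before diving into the math, here is a minimal sketch using scikit-learn's built-in estimators; the toy data and parameter values below are my own placeholders, not from this article.

```python
# A minimal sketch (toy data, illustrative parameters) showing that GBM handles
# both regression and classification with scikit-learn's built-in estimators.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)

# Trees are added sequentially; each new tree is fit to the errors (residuals)
# of the ensemble built so far.
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1).fit(X_reg, y_reg)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X_clf, y_clf)

print(reg.predict(X_reg[:3]))
print(clf.predict(X_clf[:3]))
```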

Gradient Boosting works for both Regression and Classification Problems.

🪐Compare and Contrast AdaBoost and Gradient Boost:

Data Set

If we want to use the above measurements to predict weight, AdaBoost starts by building a very short tree, called a stump, from the training data, and the amount of say (amount of influence) that the stump has on the final output is based on how well it compensated for the previous errors. Then AdaBoost builds the next stump based on the errors that the previous stump made, and it continues to make stumps in this fashion until it has made the number of stumps you asked for or it has a perfect fit.

In contrast, Gradient Boost starts by making a single leaf instead of a tree or stump. This leaf represents an initial guess for the weights of all the samples. When trying to predict a continuous value like weight, the first guess is the average value. Then Gradient Boost builds a tree. Like AdaBoost, this tree is based on the errors made by the previous tree, but unlike AdaBoost, it is usually larger than a stump.

Gradient Boost still restricts the size of each tree. In the example below we will build trees with up to four leaves, but no larger; in practice, however, people often set the maximum number of leaves to be between 8 and 32. Thus, like AdaBoost, Gradient Boost builds fixed-sized trees on the previous tree's errors, but unlike AdaBoost, each tree can be larger than a stump. Then Gradient Boost builds another tree based on the errors made by the previous tree, and this process continues until it has made the number of trees you asked for or it has a perfect fit.
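In scikit-learn, for example, this size restriction is exposed directly through max_leaf_nodes; the values below are only illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Each boosting stage grows a small regression tree; max_leaf_nodes caps its size.
# Four leaves mirrors the worked example below; 8-32 is the more common range in practice.
gbm_small = GradientBoostingRegressor(n_estimators=50, max_leaf_nodes=4)
gbm_typical = GradientBoostingRegressor(n_estimators=50, max_leaf_nodes=32)
```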

💫Intuition behind Gradient Boosting for Regression:

Here we will understand how Gradient Boosting Algorithm works for Regression Problem.

We will use the below data set to understand the intuition behind GBM, where we have the height measurements of six people, their favorite color, and their gender, and we want to predict their weights.

📢NOTE: When Gradient Boost is used to predict a continuous value, like weight, we say that we are using Gradient Boost for regression.

DATA SET

💛STEP 1: Calculate the Average of the Target Variable (Weight):

The first thing we do is calculate the average weight. This is the first attempt at predicting everyone's weight. In other words, if we stopped right now, we would predict that everyone weighs 71.2 kg. However, Gradient Boost doesn't stop here. As I said, Gradient Boost starts by making a single leaf instead of a tree or stump. This leaf represents an initial guess for the weights of all the samples, and when trying to predict a continuous value like weight, the first guess is the average value.

Leaf
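To make this step concrete, here is a tiny Python sketch; the weight values are hypothetical stand-ins for the table shown in the image above, and only their average (roughly 71.2) matters here.

```python
import numpy as np

# Hypothetical weights standing in for the article's table (shown as an image above);
# only their average matters for this step.
observed_weight = np.array([88.0, 76.0, 56.0, 73.0, 77.0, 57.0])

# STEP 1: the initial "leaf" predicts the same value for every sample - the average weight.
initial_prediction = observed_weight.mean()
print(round(initial_prediction, 1))  # roughly 71.2
```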

💙STEP 2: Build a Tree Based on the Errors of the Previous Tree

The next thing we do is build a tree based on the Errors of the first tree. The Errors that the previous tree made are the differences between the Observed weights and the predicted weights.

Error (or Residual) = Observed Weight − Predicted Weight. Since the predicted weight from the first tree (the leaf) is 71.2, which is the same for all the samples, let's plug in the observed weights and the predicted weight to get the error, or pseudo residual, in a new column.

📢NOTE: Use the observed weights as given in the original data set and 71.2 as the predicted weight.

📢NOTE: The term pseudo residual is based on Linear Regression, where the difference between the observed values and the predicted values results in residuals. The "pseudo" part of the pseudo residual is a reminder that we are doing Gradient Boost, not Linear Regression.
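Continuing the sketch from Step 1 (reusing the illustrative observed_weight and initial_prediction), the pseudo residuals are simply element-wise differences:

```python
# STEP 2 (first part): pseudo residual = observed weight - current prediction.
# observed_weight and initial_prediction come from the Step 1 sketch above.
pseudo_residuals = observed_weight - initial_prediction
print(pseudo_residuals)  # e.g. for a sample weighing 88: 88 - 71.2 = 16.8 (approximately)
```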

Now we will build a tree, as shown below, using Height, Favorite Color, and Gender as the independent variables and the residual as the dependent variable.

📢NOTE: The output of this tree is predicted residuals, not predicted weights.

Tree

In this example we are allowing only four leaves, but with a large data set it is common to allow anywhere from 8 to 32. By restricting the total number of leaves, we get fewer leaves than residuals. As a result, the third and the sixth rows of data go to the same leaf (the first leaf in the above image), so we replace these residuals with their average. Similarly, the fourth and fifth rows of data go into the same leaf (the third leaf in the above image). After replacing the residuals with their averages, the tree looks as below.
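A small sketch of this step, continuing the toy example above: the feature values here are hypothetical placeholders for Height, Favorite Color, and Gender (the real table is in the image), and scikit-learn's DecisionTreeRegressor with max_leaf_nodes=4 behaves like the article's tree, storing the average residual in each leaf.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical, one-hot-style encodings standing in for Height, Favorite Color and
# Gender (the real values are in the image above).
X = pd.DataFrame({
    "height":        [1.6, 1.6, 1.5, 1.8, 1.5, 1.4],
    "color_blue":    [1, 0, 1, 0, 0, 1],
    "gender_female": [0, 1, 0, 0, 1, 1],
})

# Fit a small tree to the pseudo residuals (NOT the weights). max_leaf_nodes=4 mirrors
# the four-leaf restriction; when several residuals land in the same leaf, the leaf
# automatically stores their average.
residual_tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
residual_tree.fit(X, pseudo_residuals)
print(residual_tree.predict(X))  # predicted residuals, one per sample
```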

Now we combine the Original leaf and the new tree to make a prediction of an individual’s weight from the Training Data.

🧡STEP 3: Make Predictions on the Complete Training Data

Let’s try to predict the weight of the below given record using the Independent variables.

We start with the initial prediction, 71.2, then we run the data down the tree and we get 16.8. So the predicted weight equals 71.2 + 16.8 = 88, which is the same as the observed weight.

Is this awesome? No. The model fits the training data too well, but it fails to predict well when an unseen tuple is tested on it. In other words, we have low bias but probably high variance. Gradient Boost deals with this problem by using a learning rate to scale the contribution from the new tree. The learning rate is a value between 0 and 1. In this case we will set the learning rate to 0.1, and we use the same learning rate for all the trees.

Now if we pass the same tuple through the model, the predicted weight will be 71.2 + (0.1 × 16.8) = 72.9. With the learning rate set to 0.1, the new prediction isn't as close to the observed value as the prediction made without the learning rate, but it's a little better than the prediction made with just the original leaf, which predicted that all samples would weigh 71.2. In other words, introducing the learning rate results in a small step in the right direction.
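Continuing the same sketch, the scaled update looks like this (initial_prediction, X, and residual_tree come from the earlier snippets; the learning rate of 0.1 matches the article):

```python
# STEP 3: new prediction = initial leaf + learning_rate * tree's predicted residual.
learning_rate = 0.1
predicted_weight = initial_prediction + learning_rate * residual_tree.predict(X)
print(predicted_weight)
# For a sample whose predicted residual is 16.8: 71.2 + 0.1 * 16.8 = 72.9 (approximately)
```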

📢NOTE: According to Jerome Friedman, the dude who invented Gradient Boost, empirical evidence shows that taking lots of small steps in the right direction results in better predictions on the testing data, i.e. lower variance.

So, let's build another tree so we can take another small step in the right direction. Just like before, we calculate the pseudo residuals: the difference between the observed weights and our latest predictions.

For example, for the first record:

Residual = Observed Weight − Predicted Weight

= 88 − (71.2 + (0.1 × 16.8))

= 88 − 72.9, where 72.9 is the latest prediction

= 15.1

Similarly, calculate the residuals for all tuples in the training data.

📢NOTE: Here the predicted weights are the values obtained after the sample is passed through the leaf and the tree.

📢NOTE: The residuals shown in the image below are the original residuals, from when our prediction was simply the overall average weight, and the residuals shown in the image above are the residuals after adding the new tree. The new residuals (above) are all smaller than the original residuals (below), so we have taken a small step in the right direction.

Now repeat STEP 2 and STEP 3 until the residuals become 0 or the number of trees you asked for is reached.

Now we will build a tree based on Independent Variables Height, Favorite Color and Gender and Dependent Variable Residual.

📢NOTE: Here the residuals are the new residuals, obtained by subtracting the latest predictions from the observed weights.

After Performing STEP 2 for the above data we get the tree as shown below.

After Performing STEP 3 we get the below Predicted Weights using below tree.

Let’s try to predict the weight of the below given record using the Independent variables.

Now we get the predicted value= 71.2+(0.1*16.8) + (0.1*15.1)

= 74.4

📢NOTE: I have calculated the predicted value for only one record, but we must do this for the complete training data.

Now we calculate the residuals and repeat STEP 2 and STEP 3 until the residuals become 0 or the number of trees you asked for is reached.

💥Implementation Using Python
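As a sketch of the procedure described above (start from the average, repeatedly fit a small tree to the pseudo residuals, and add its scaled predictions), here is a minimal from-scratch implementation; the class name, parameter defaults, and structure are my own choices, not necessarily the original post's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class SimpleGradientBoostingRegressor:
    """Bare-bones gradient boosting for regression (squared-error loss).

    X can be a NumPy array or a pandas DataFrame of numeric features; y is a 1-D target.
    """

    def __init__(self, n_trees=100, learning_rate=0.1, max_leaf_nodes=8):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_leaf_nodes = max_leaf_nodes
        self.trees = []

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.trees = []

        # STEP 1: the initial prediction is simply the average of the target.
        self.initial_prediction = y.mean()
        current_prediction = np.full(len(y), self.initial_prediction)

        for _ in range(self.n_trees):
            # STEP 2: fit a small tree to the pseudo residuals of the current model.
            residuals = y - current_prediction
            tree = DecisionTreeRegressor(max_leaf_nodes=self.max_leaf_nodes)
            tree.fit(X, residuals)
            self.trees.append(tree)

            # STEP 3: take a small step in the right direction, scaled by the learning rate.
            current_prediction += self.learning_rate * tree.predict(X)
        return self

    def predict(self, X):
        prediction = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            prediction += self.learning_rate * tree.predict(X)
        return prediction
```

On the toy data from the earlier sketches, something like SimpleGradientBoostingRegressor(n_trees=100, learning_rate=0.1, max_leaf_nodes=4).fit(X, observed_weight).predict(X) reproduces the same "add a small scaled step per tree" behaviour walked through in STEPS 1 to 3 above.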

Let's discuss in the comments if you find anything wrong in the post or if you have anything to add…
Thanks.

Credits and Sources:

  1. StatQuest
  2. www.analyticsvidhya.com
