Introduction To Gradient Boosting Classification

Rajat Panchotia · Published in Analytics Vidhya · 6 min read · Dec 24, 2020

Boosting

Boosting is an ensemble method that combines several weak learners into a single strong learner. The predictors are trained sequentially, each one trying to correct the errors of its predecessor.

Gradient Boosting

Gradient Boosting combines the ideas of gradient descent and boosting. Each new model is fit so as to reduce the loss left by its predecessor, moving in the direction of the negative gradient of the loss function. This procedure continues until the estimate of the target variable stops improving or a fixed number of iterations is reached.

Unlike ensemble techniques such as bagging, where the trees are built independently, gradient boosting builds a series of trees in which each successive tree tries to correct the mistakes of the one before it.

To build intuition for gradient boosting, think of a golfer hitting a ball towards the hole, covering some distance with every shot. He hits the ball repeatedly, working his way towards the hole by reassessing the direction and magnitude of each successive shot.

After each stroke, the golfer determines the appropriate adjustment by computing the difference between the actual hole position and where the ball currently lies. This difference is known as the residual. Gradient boosting works the same way: each new model nudges the prediction a little closer to the target variable.

Components of Gradient Boosting

Loss function

The objective of the gradient boosting algorithm is to minimize a loss function, i.e. a measure of the difference between the actual and the predicted values. Classification problems typically use the logarithmic loss, whereas regression problems typically use the squared error. Many standard loss functions are supported by gradient boosting, with the only condition that the loss function must be differentiable.
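As a quick illustration, here is a minimal sketch (using scikit-learn's metrics and made-up labels and predictions, purely for illustration) of the two loss functions mentioned above:

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

# Hypothetical binary labels and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.4])

# Logarithmic loss: the usual choice for classification
print(log_loss(y_true, p_pred))                    # ≈ 0.34

# Squared error: the usual choice for regression
y_reg_true = np.array([3.0, 1.5, 2.2])
y_reg_pred = np.array([2.8, 1.7, 2.0])
print(mean_squared_error(y_reg_true, y_reg_pred))  # ≈ 0.04
```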

Weak Learners

Typically, shallow decision trees are used as the weak learners. Individually they classify the data poorly, i.e. they have a relatively high error rate and are insufficient on their own to make accurate predictions.
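As a small illustration (using scikit-learn and a synthetic dataset, purely for the sake of example), a depth-1 decision tree, or "stump", is the classic weak learner:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data, purely illustrative
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# A single depth-1 tree ("stump") is a typical weak learner:
# on its own it underperforms a deeper tree or an ensemble.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
print(cross_val_score(stump, X, y, cv=5).mean())
```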

Additive Component

Trees are added to the model one at a time, and the outputs of all trees are summed to form the prediction. Each new tree is fit so that adding its output reduces the loss, in a procedure analogous to gradient descent, so the loss decreases iteratively.
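In symbols (this is the standard additive-model formulation rather than anything specific to this article), the model after m rounds is

F_m(x) = F_(m-1)(x) + ν · h_m(x)

where h_m is the new weak learner fit to the negative gradient of the loss at the current predictions (for squared-error loss, simply the residuals) and ν is the learning rate.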

STEPS TO GRADIENT BOOSTING CLASSIFICATION

[Figure: Gradient Boosting Model]

STEP 1: Fit a simple linear regression or a decision tree on data [𝒙 = 𝒊𝒏𝒑𝒖𝒕, 𝒚 = 𝒐𝒖𝒕𝒑𝒖𝒕]

STEP 2: Calculate error residuals by subtracting the predicted target value from the actual target value. [𝒆𝟏 = 𝒚𝒕𝒓𝒖𝒆 − 𝒚𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅𝟏]

STEP 3: Fit a new model on the error residuals as the target variable, keeping the input variables the same. [𝒆𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅𝟏]

STEP 4: Add the predicted residuals to the previous predictions. [𝒚𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅𝟐 = 𝒚𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅𝟏 + 𝒆𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅𝟏]

STEP 5: Fit the next model on the remaining residuals. [𝒆𝟐 = 𝒚𝒕𝒓𝒖𝒆 − 𝒚𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅𝟐]

Repeat steps 2 to 5 until the model starts overfitting or the sum of the residuals stops changing. A minimal code sketch of this loop is shown below.
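The steps above translate almost line for line into code. Below is a minimal sketch, assuming scikit-learn, a synthetic regression-style target, squared-error loss (whose negative gradient is exactly the residual), and the mean of y as the initial prediction in place of the simple first model; it is illustrative rather than a production implementation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: x = inputs, y = output
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_rounds = 100

# STEP 1: start from a simple initial prediction (here, the mean of y)
prediction = np.full(len(y), y.mean())
trees = []

for _ in range(n_rounds):
    # STEP 2: residuals = actual target value - current prediction
    residuals = y - prediction

    # STEP 3: fit a new weak tree on the residuals, inputs unchanged
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    trees.append(tree)

    # STEP 4: add the (scaled) predicted residuals to the previous predictions
    prediction += learning_rate * tree.predict(X)

    # STEP 5: the next iteration fits a tree on the remaining residuals

print("final mean squared error:", np.mean((y - prediction) ** 2))
```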

Example

Let us try to understand gradient boosting classification with a sample dataset. It has several input features 𝑥1, 𝑥2, 𝑥3, and the goal is to predict the target variable 𝑦, which is a binary output.

Step 1: Make an initial guess using the log of the odds of the target variable.

To turn this log-odds value into a probability for classification, we apply the logistic (sigmoid) function (the softmax reduces to it in the two-class case).
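As a small numeric sketch (the class counts below are hypothetical, chosen only to make the arithmetic visible):

```python
import numpy as np

# Hypothetical class counts, for illustration only: 4 positives, 2 negatives
n_pos, n_neg = 4, 2

# Step 1: initial prediction = log of the odds of the target variable
log_odds = np.log(n_pos / n_neg)          # log(4/2) ≈ 0.69

# Convert the log odds into a probability with the logistic (sigmoid) function
prob = 1.0 / (1.0 + np.exp(-log_odds))    # 4 / (4 + 2) ≈ 0.67
print(log_odds, prob)
```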

Step 2: Calculate the error residuals (pseudo-residuals) by subtracting the predicted probability from the observed value (0 or 1) for each training example.

Step 3: Fit a tree to the residuals.

The example tree here has only two leaves, but gradient boosting trees typically have more, and a single leaf can contain several residuals. Because the residuals live on the probability scale while the model's running prediction lives on the log-odds scale, gradient boosting applies a transformation to each leaf's value before using it.

The output value of each leaf is therefore given by the following transformation.
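For the logarithmic loss, the standard transformation (the general formula, stated here rather than taken from the article's figure) gives leaf j the output

γ_j = ( sum of the residuals in leaf j ) / ( Σ p_i · (1 − p_i) )

where p_i is the previously predicted probability for each training example that falls into that leaf.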

Step 4: Make the prediction.

The learning rate defines the contribution of the new tree: the updated log-odds prediction is the previous one plus the learning rate times the leaf output. The new log odds can then be converted back into a probability using the logistic (sigmoid) function.

In this example, the predicted probability decreases relative to the initial estimate.
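Putting step 4 into numbers (the initial log odds and leaf output below are placeholders, not the values from the article's figures):

```python
import numpy as np

learning_rate = 0.1

# Hypothetical values, for illustration only
initial_log_odds = 0.69   # initial prediction from step 1
leaf_output = -1.5        # transformed output of the leaf this example falls into

# Step 4: update the log-odds prediction with the scaled leaf output
new_log_odds = initial_log_odds + learning_rate * leaf_output   # 0.69 - 0.15 = 0.54

# Convert the updated log odds back into a probability
new_prob = 1.0 / (1.0 + np.exp(-new_log_odds))
print(new_prob)   # ≈ 0.63, down from the initial ≈ 0.67
```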

Step 5: Repeat these steps until the model starts overfitting or the sum of the residuals stops changing.

Hyperparameter Tuning

One of the main features of gradient boosting is that it exposes several tunable parameters. However, finding an optimal combination of parameters can be challenging and is often time-consuming. The most common hyperparameters are described below, followed by a short tuning sketch.

Learning rate

It controls the contribution of each tree and how quickly the algorithm converges. Smaller values reduce the chance of overfitting, but the algorithm needs more trees (and more time) to converge; larger values speed things up but may never reach an optimal fit.

Number of trees

The objective is to choose the number of trees that minimizes the loss on held-out data, typically via cross-validation; early stopping can be used to determine it automatically.

Depth of trees

The number of levels of splits in a tree is known as its depth (d). The algorithm often works well with d = 1; larger values can be used, but depths are typically kept below 10.

Subsampling

It specifies the fraction of training instances used to fit each tree. If subsample = 0.20, a randomly selected 20% of the training instances is used to fit each tree.
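A minimal tuning sketch, assuming scikit-learn and a synthetic dataset (the parameter grid below is illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The four hyperparameters discussed above
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [100, 300],      # number of trees
    "max_depth": [1, 2, 3],          # depth of each tree
    "subsample": [0.5, 0.8, 1.0],    # fraction of instances used per tree
}

gbc = GradientBoostingClassifier(
    random_state=0,
    n_iter_no_change=10,  # early stopping once the validation score stops improving
)

search = GridSearchCV(gbc, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```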

Advantages:

Flexibility

Gradient boosting supports various loss functions and offers a number of hyperparameter tuning options, which makes it very flexible.

Little data preprocessing

It works well with numerical features as they are (no scaling needed), and several implementations (e.g., LightGBM, CatBoost, scikit-learn's HistGradientBoostingClassifier) can also handle categorical features natively.

Handles missing data

Modern implementations such as XGBoost, LightGBM, and scikit-learn's HistGradientBoostingClassifier handle missing values natively, so no data imputation is required; see the sketch below.
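For example, a minimal sketch (using scikit-learn's HistGradientBoostingClassifier on a synthetic dataset with artificially introduced missing values) of training without any imputation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly 10% of the entries set to NaN; no imputation is performed
rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X[rng.rand(*X.shape) < 0.10] = np.nan

# HistGradientBoostingClassifier handles the NaN values natively
clf = HistGradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```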

Disadvantages:

1. Overfitting

Because the algorithm keeps fitting new trees to the remaining errors, it is prone to overfitting; cross-validation and early stopping should be used to counteract this.

2. Computationally expensive

The technique often requires many trees, so it can be time- and memory-intensive.

Conclusion

Gradient boosting is one of the most powerful ensemble algorithms and is well suited to both regression and classification tasks. It is prone to overfitting, but the methods discussed above can be used to deal with that.

It is more computationally demanding than many other machine learning algorithms, but a working knowledge of how gradient boosting operates is essential for any data scientist or machine learning engineer.
