Gradient Boost Implementation = pytorch optimization + sklearn decision tree regressor

In order to understand the Gradient Boosting Algorithm, effort has been made to implement it from first principles using pytorch to perform the necessary optimizations (minimize loss function) and calculate the residuals (partial derivatives with respect to predictions) of the loss function and decision tree regressor from sklearn to create the regression decision trees.

An updated version of the medium story can be found in the following kaggle notebooks :
https://www.kaggle.com/code/igtzolas/inventing-gradient-boosting-regression
https://www.kaggle.com/code/igtzolas/inventing-gradient-boosting-classification

Algorithm Outline

The outline of the algorithm can be seen in the following figure:

We will try to implement this step by step and also try to understand why the steps the algorithm takes make sense along the way.

Explanations + Code snippets

Loss function minimization

What is crucial is that the algorithm tries to minimize a loss function, be it square distance in the case of regression or binary cross entropy in the case of classification.

The loss function parameters are the predictions we make for the training examples : L(prediction_for_train_point_1, prediction_for_train_pont_2, …., prediction_for_train_point_m).

pytorch is used to minimize the loss function as follows:

Calculating the residuals

What are the so called residuals?

They are the first order partial derivatives of the loss function with respect to the current predictions for thedata point examples.

In plain words the residuals are vectors (directions + norms) of where should the value of the prediction up to now change so that the loss function gets minimized. In our case the direction is provided by the sign of the residual (since it is a one dimensional scalar). It makes sense that data points with have same loss function minimization tendencies are grouped and treated in the very same way (into the same treatment bucket as the leaf of the tree)! This is the most important conceptual intuition of the algorithm. Why the decision trees are being constructed on the partial gradients with respect to current predictions of the loss function/objective.

The code to calculate the derivatives/residuals is the following:

Constructing the decision regression trees

Decision tree regressor is used to create the regression trees. The decision paths are kept because a single prediction will be calculated for the data that fall within each one of the decision paths/leaf buckets.

Putting all the pieces together and creating an sklearn like class

The algorithm implementation is provided below making use of the already defined code. At each step, the residuals are fed to a regression tree and then the predictions for the data points get updated based on the value that minimizes the loss function on the data of the same bucket (leaf).

The code in its entirety can be found in the following github link.

Using the custom implementation in an example and Comparison to vanilla algorithm results

The following example makes use of the custom and the vanilla sklearn implementations to perform a classification task and compare the results:

The output of the program is :

Custom Implementation : Accuracy score for training data : 0.7661290322580645
Custom Implementation : Accuracy score for testing data : 0.625

Vanilla Implementation : Accuracy score for training data : 0.7580645161290323
Vanilla Implementation : Accuracy score for training data : 0.625

Thank you very much for taking the time to read through this thread!