Gradient Boosting for Regression from Scratch

Explanation and Implementation of Gradient Boosting in Python

Okan Yenigün
5 min read · Jul 29, 2022

Gradient boosting is one of the ensemble machine learning techniques. Like other boosting methods, it combines weak learners in a sequence to produce a strong model.

It is a flexible and powerful technique that can be used for both regression and classification problems. Even with minimal tuning, good results can be achieved. It can handle a large number of features and is not biased towards any particular feature type.

On the other hand, it is more sensitive to overfitting than other machine learning methods and can be slow to train, especially on large datasets.

Despite its disadvantages, gradient boosting is a popular method for many machine learning tasks, due to its flexibility, power, and relatively good performance.

In this blog post, I will examine the application of the gradient boosting technique to regression problems. In another article, I will address the topic of classification problems.

Before proceeding, it might be beneficial to refresh your understanding of decision trees and another ensemble technique, AdaBoost.

As you may recall, AdaBoost uses decision trees with a depth of 1, called stumps. Each new stump increases or decreases the weights of the observations according to the previous stump's errors.

Gradient Boosting, on the other hand, starts with a single leaf, an initial guess, and then builds trees on top of it. Unlike AdaBoost, these trees are usually larger than a stump; decision trees with 8 to 32 leaves are typical. And again unlike AdaBoost, where each stump gets its own weight in the final prediction, Gradient Boosting scales every tree by the same amount, the learning rate.
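For reference, such a restricted tree can be requested from scikit-learn via the max_leaf_nodes parameter (a small illustration; the exact value here is just an example):

from sklearn.tree import DecisionTreeRegressor

# a base learner limited to at most 32 leaves, the upper end of the typical 8-32 range
base_learner = DecisionTreeRegressor(max_leaf_nodes=32)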

Pseudocode

To begin with, we have a dataset with observations x and a target feature y. In addition, we have a differentiable loss function, which we use to evaluate our estimates.

The loss function is just the squared residual, as you may recall from linear regression. For a single observation it is

$$L\bigl(y_i, F(x_i)\bigr) = \tfrac{1}{2}\bigl(y_i - F(x_i)\bigr)^2$$

where the factor of 1/2 is there only to make the derivative cleaner.

Step 1: We begin with a single leaf, which means that the model is initialized with a constant:

$$F_0(x) = \operatorname*{argmin}_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

The above equation means that we sum the loss over all observations. The argmin over gamma means that we must choose the constant prediction that minimizes this sum. For the squared-error loss, the math works out so that the best prediction in this first round is simply the average of all the y values.
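To see why, set the derivative with respect to gamma to zero (using the 1/2 squared-error loss defined above):

$$\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} \tfrac{1}{2}(y_i - \gamma)^2 = -\sum_{i=1}^{n} (y_i - \gamma) = 0 \quad\Longrightarrow\quad \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i$$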

Imagine that we have a dummy dataset whose target values average 75. In this case, our initial prediction will be 75.

Step 2: For m = 1 to M, we build M trees in a loop.

Step 2-A: Compute the residuals. For each observation, take the negative derivative of the loss function with respect to the current prediction (this derivative is the gradient that gives the technique its name):

$$r_{im} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

For the squared-error loss this negative gradient is simply $y_i - F_{m-1}(x_i)$, the ordinary residual.
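A quick numerical illustration with made-up target values whose mean is 75 (hypothetical numbers, not the original dummy table):

import numpy as np

y = np.array([53.0, 82.0, 90.0, 80.0, 70.0])  # hypothetical targets, mean = 75
F_prev = np.full(len(y), y.mean())            # current model: the initial leaf, 75
residuals = y - F_prev                        # negative gradient of 1/2*(y - F)^2
print(residuals)                              # [-22.   7.  15.   5.  -5.]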

Step 2-B: In this step, we build a base learner (a decision tree in our case). The target for this tree is not the original target column; instead, it is the residuals computed in the previous step.

Step 2-C: For each leaf of the new tree, compute the gamma value that minimizes the summation below. It takes the previous prediction into account and only considers the samples that end up in that leaf:

$$\gamma_{jm} = \operatorname*{argmin}_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr)$$

In other words, we calculate a separate gamma value for each leaf, and only the observations sitting in the leaf in question enter the calculation.
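For the squared-error loss this minimization has a closed form: each leaf's gamma is simply the average of the residuals of the samples that fall into it,

$$\gamma_{jm} = \frac{1}{|R_{jm}|} \sum_{x_i \in R_{jm}} \bigl(y_i - F_{m-1}(x_i)\bigr)$$

which is exactly the value a regression tree stores in a leaf when it is fit on the residuals.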

Step 2-D: Update the predictions:

$$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$$

Let's read this formula as follows: for each sample, we take the previous prediction and add the gamma value of the leaf that the sample falls into. The gamma value is multiplied by the learning rate $\nu$ (to avoid overfitting by taking many small steps instead of one large one).

For example, let's make up the values found in the previous step (learning rate = 0.1). If the first sample's previous prediction is 75 and the gamma of its leaf is -22, then

$$F_1(x_1) = 75 + 0.1 \times (-22) = 72.8$$

In this way, we have taken a small step towards a better result. We repeat these steps M times.
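Putting steps 1 through 2-D together, here is one boosting round in Python on the made-up values from above (learning rate 0.1). With these numbers the first sample ends up alone in its leaf, so its updated prediction is exactly the 72.8 computed by hand:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])            # a single toy feature
y = np.array([53.0, 82.0, 90.0, 80.0, 70.0])                 # targets, mean = 75

F = np.full(len(y), y.mean())                                # Step 1: constant prediction, 75
residuals = y - F                                            # Step 2-A: pseudo-residuals
tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)  # Step 2-B: fit a tree to the residuals
F = F + 0.1 * tree.predict(X)                                # Steps 2-C/2-D: add the scaled leaf values
print(F[0])                                                  # 72.8 = 75 + 0.1 * (-22)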

Python Code
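The training script further down calls a custom GradientBooster class. Here is a minimal sketch of such a class, assuming the squared-error loss, regression trees as base learners, 1000 boosting rounds, a learning rate of 0.1, and tree settings mirroring the scikit-learn comparison at the end; the train and predict signatures match the way the class is used in the script, but the exact hyper-parameters are assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBooster:
    # Minimal gradient boosting regressor with squared-error loss.
    # The hyper-parameter defaults below are assumptions, not taken from the original code.
    def __init__(self, n_estimators=1000, learning_rate=0.1, max_depth=8,
                 min_samples_split=5, min_samples_leaf=5, max_features=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features

    def train(self, X, y):
        y = np.asarray(y).ravel()
        # Step 1: initialize the model with a constant, the mean of y
        pred_0 = np.mean(y)
        F = np.full(len(y), pred_0)
        models, losses = [], []
        for _ in range(self.n_estimators):
            # Step 2-A: pseudo-residuals (negative gradient of 1/2*(y - F)^2)
            residuals = y - F
            # Step 2-B: fit a regression tree to the residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth,
                                         min_samples_split=self.min_samples_split,
                                         min_samples_leaf=self.min_samples_leaf,
                                         max_features=self.max_features)
            tree.fit(X, residuals)
            # Steps 2-C/2-D: for squared error the leaf values are the mean residuals,
            # so scale the tree's predictions by the learning rate and update F
            F = F + self.learning_rate * tree.predict(X)
            models.append(tree)
            losses.append(np.mean((y - F) ** 2))  # track the training loss per round
        return models, losses, pred_0

    def predict(self, models, y_train, X_test):
        # start from the same constant used in training (the mean of y_train)
        F = np.full(X_test.shape[0], np.mean(y_train))
        for tree in models:
            F = F + self.learning_rate * tree.predict(X_test)
        return F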

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
#READ DATA
data = pd.read_csv("housing.csv")
data.fillna(0,inplace=True)
#X,y: drop the target and the categorical column, keep the numeric features
X = data.iloc[:,:-2]
y = data['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)
#scaling: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = np.array(y_train).reshape(X_train.shape[0],1)
y_test = np.array(y_test).reshape(X_test.shape[0],1)
#TRAIN
G = GradientBooster()
models, losses, pred_0 = G.train(X_train,y_train)

Let’s plot losses;

sns.set_style('darkgrid')
ax = sns.lineplot(x=range(len(losses)), y=losses)  # one loss value per boosting round
ax.set(xlabel='Epoch', ylabel='Loss', title='Loss vs Epoch')

Now let’s predict the test data;

y_pred = G.predict(models, y_train, X_test)
print('RMSE:',np.sqrt(mean_squared_error(y_test,y_pred)))
#RMSE: 49396.079511786884

Sklearn

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=1000, criterion='squared_error',
                                  max_depth=8, min_samples_split=5,
                                  min_samples_leaf=5, max_features=3)
model.fit(X_train, y_train.ravel())
y_pred = model.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
#RMSE: 48744.67210701868

The RMSE is in line with the result of the manual implementation.

Thanks for reading. If you have any questions or comments, please feel free to write me!


