Gradient Boosting for Regression from Scratch

Explanation and Implementation of Gradient Boosting in Python

Okan Yenigün
5 min read · Jul 29, 2022

Gradient boosting is one of the ensemble machine learning techniques. Like other boosting methods, it combines weak learners in a sequence to produce a strong model.

It is a flexible and powerful technique that can be used for both regression and classification problems. Even with minimal tuning, good results can be achieved. It can handle a large number of features and is not biased towards any particular feature type.

On the other hand, it is more sensitive to overfitting than other machine learning methods and can be slow to train, especially on large datasets.

Despite its disadvantages, gradient boosting is a popular method for many machine learning tasks, due to its flexibility, power, and relatively good performance.

In this blog post, I will examine the application of the gradient boosting technique to regression problems. In another article, I will address the topic of classification problems.

Before proceeding, it might be beneficial to refresh your understanding of decision trees and another ensemble technique, AdaBoost.

As you may recall, AdaBoost uses decision trees with a depth of 1, called stumps. Each new stump increases or decreases the weights of the observations according to the previous stump's errors.

Gradient Boosting, on the other hand, starts with a single leaf, an initial guess, and then builds trees on top of it. Unlike AdaBoost, these trees are usually larger than a stump; decision trees with 8 to 32 leaves are typical. And again unlike AdaBoost, where each stump gets its own weight in the final prediction, Gradient Boosting scales every tree by the same amount, the learning rate.
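For reference, such a restricted tree can be requested from scikit-learn via the max_leaf_nodes parameter (a small illustration; the exact value here is just an example):

from sklearn.tree import DecisionTreeRegressor

# a base learner limited to at most 32 leaves, the upper end of the typical 8-32 range
base_learner = DecisionTreeRegressor(max_leaf_nodes=32)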

Pseudocode

To begin with, we have a dataset with observations x and a target feature y. In addition, we have a differentiable loss function, which we use to evaluate our estimates.

The loss function is just the squared residual, as you may recall from linear regression. For a single observation it is

$$L\bigl(y_i, F(x_i)\bigr) = \tfrac{1}{2}\bigl(y_i - F(x_i)\bigr)^2$$

where the factor of 1/2 is there only to make the derivative cleaner.

Step 1: We begin with a single leaf, which means that the model is initialized with a constant:

$$F_0(x) = \operatorname*{argmin}_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

The above equation means that we sum the loss over all observations. The argmin over gamma means that we must choose the constant prediction that minimizes this sum. For the squared-error loss, the math works out so that the best prediction in this first round is simply the average of all the y values.
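To see why, set the derivative with respect to gamma to zero (using the 1/2 squared-error loss defined above):

$$\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} \tfrac{1}{2}(y_i - \gamma)^2 = -\sum_{i=1}^{n} (y_i - \gamma) = 0 \quad\Longrightarrow\quad \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i$$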

Imagine that we have a dummy dataset whose target values average 75. In this case, our initial prediction will be 75.

Step 2: For m = 1 to M, we build M trees in a loop.

Step 2-A: Compute the residuals. For each observation, take the negative derivative of the loss function with respect to the current prediction (this derivative is the gradient that gives the technique its name):

$$r_{im} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

For the squared-error loss this negative gradient is simply $y_i - F_{m-1}(x_i)$, the ordinary residual.
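A quick numerical illustration with made-up target values whose mean is 75 (hypothetical numbers, not the original dummy table):

import numpy as np

y = np.array([53.0, 82.0, 90.0, 80.0, 70.0])  # hypothetical targets, mean = 75
F_prev = np.full(len(y), y.mean())            # current model: the initial leaf, 75
residuals = y - F_prev                        # negative gradient of 1/2*(y - F)^2
print(residuals)                              # [-22.   7.  15.   5.  -5.]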

Step 2-B: In this step, we build a base learner (a decision tree in our case). The target for this tree is not the original target column; instead, it is the residuals computed in the previous step.

Step 2-C: For each leaf of the new tree, compute the gamma value that minimizes the summation below. It takes the previous prediction into account and only considers the samples that end up in that leaf:

$$\gamma_{jm} = \operatorname*{argmin}_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr)$$

In other words, we calculate a separate gamma value for each leaf, and only the observations sitting in the leaf in question enter the calculation.
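For the squared-error loss this minimization has a closed form: each leaf's gamma is simply the average of the residuals of the samples that fall into it,

$$\gamma_{jm} = \frac{1}{|R_{jm}|} \sum_{x_i \in R_{jm}} \bigl(y_i - F_{m-1}(x_i)\bigr)$$

which is exactly the value a regression tree stores in a leaf when it is fit on the residuals.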

Step 2-D: Update the predictions:

$$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$$

Let's read this formula as follows: for each sample, we take the previous prediction and add the gamma value of the leaf that the sample falls into. The gamma value is multiplied by the learning rate $\nu$ (to avoid overfitting by taking many small steps instead of one large one).

For example, let's make up the values found in the previous step (learning rate = 0.1). If the first sample's previous prediction is 75 and the gamma of its leaf is -22, then

$$F_1(x_1) = 75 + 0.1 \times (-22) = 72.8$$

In this way, we have taken a small step towards a better result. We repeat these steps M times.
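Putting steps 1 through 2-D together, here is one boosting round in Python on the made-up values from above (learning rate 0.1). With these numbers the first sample ends up alone in its leaf, so its updated prediction is exactly the 72.8 computed by hand:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])            # a single toy feature
y = np.array([53.0, 82.0, 90.0, 80.0, 70.0])                 # targets, mean = 75

F = np.full(len(y), y.mean())                                # Step 1: constant prediction, 75
residuals = y - F                                            # Step 2-A: pseudo-residuals
tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)  # Step 2-B: fit a tree to the residuals
F = F + 0.1 * tree.predict(X)                                # Steps 2-C/2-D: add the scaled leaf values
print(F[0])                                                  # 72.8 = 75 + 0.1 * (-22)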

Python Code
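The training script further down calls a custom GradientBooster class. Here is a minimal sketch of such a class, assuming the squared-error loss, regression trees as base learners, 1000 boosting rounds, a learning rate of 0.1, and tree settings mirroring the scikit-learn comparison at the end; the train and predict signatures match the way the class is used in the script, but the exact hyper-parameters are assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBooster:
    # Minimal gradient boosting regressor with squared-error loss.
    # The hyper-parameter defaults below are assumptions, not taken from the original code.
    def __init__(self, n_estimators=1000, learning_rate=0.1, max_depth=8,
                 min_samples_split=5, min_samples_leaf=5, max_features=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features

    def train(self, X, y):
        y = np.asarray(y).ravel()
        # Step 1: initialize the model with a constant, the mean of y
        pred_0 = np.mean(y)
        F = np.full(len(y), pred_0)
        models, losses = [], []
        for _ in range(self.n_estimators):
            # Step 2-A: pseudo-residuals (negative gradient of 1/2*(y - F)^2)
            residuals = y - F
            # Step 2-B: fit a regression tree to the residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth,
                                         min_samples_split=self.min_samples_split,
                                         min_samples_leaf=self.min_samples_leaf,
                                         max_features=self.max_features)
            tree.fit(X, residuals)
            # Steps 2-C/2-D: for squared error the leaf values are the mean residuals,
            # so scale the tree's predictions by the learning rate and update F
            F = F + self.learning_rate * tree.predict(X)
            models.append(tree)
            losses.append(np.mean((y - F) ** 2))  # track the training loss per round
        return models, losses, pred_0

    def predict(self, models, y_train, X_test):
        # start from the same constant used in training (the mean of y_train)
        F = np.full(X_test.shape[0], np.mean(y_train))
        for tree in models:
            F = F + self.learning_rate * tree.predict(X_test)
        return F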

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
#READ DATA
data = pd.read_csv("housing.csv")
data.fillna(0,inplace=True)
#X,y: drop the target and the categorical column, keep the numeric features
X = data.iloc[:,:-2]
y = data['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)
#scaling: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = np.array(y_train).reshape(X_train.shape[0],1)
y_test = np.array(y_test).reshape(X_test.shape[0],1)
#TRAIN
G = GradientBooster()
models, losses, pred_0 = G.train(X_train,y_train)

Let’s plot losses;

sns.set_style('darkgrid')
ax = sns.lineplot(x=range(len(losses)), y=losses)  # one loss value per boosting round
ax.set(xlabel='Epoch', ylabel='Loss', title='Loss vs Epoch')

Now let’s predict the test data;

y_pred = G.predict(models, y_train, X_test)
print('RMSE:',np.sqrt(mean_squared_error(y_test,y_pred)))
#RMSE: 49396.079511786884

Sklearn

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=1000, criterion='squared_error',
                                  max_depth=8, min_samples_split=5,
                                  min_samples_leaf=5, max_features=3)
model.fit(X_train, y_train.ravel())
y_pred = model.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
#RMSE: 48744.67210701868

The RMSE is in line with the result of the manual implementation.

Thanks for reading. If you have any questions or comments, please feel free to write me!


