AdaBoost vs. Gradient boosting (Classification) in Python

Little Dino
11 min read · Jul 3, 2022


Introduction

In the previous article, we talked about how to use AdaBoost to build a boosting ensemble model. Generally speaking, the weight of each sample is modified at each iteration to reduce the prediction error, and the weight of each tree is different when making the final classification.

AdaBoost, introduced by Freund and Schapire in 1995, was the first practical boosting algorithm. Gradient boosting, however, is a more robust and flexible boosting tree algorithm than AdaBoost.

In this article, we’ll first walk through the workflow of gradient boosting. Then, we’ll use the Titanic dataset as an example for illustration and comparison.

⚡ The code will be provided in the last section of this article.

Gradient Boosting

First of all, gradient boosting can be used in regression, binary classification, and multi-class classification (One-vs-all). In this article, we’ll focus on gradient boosting in Binary classification.

In particular, the gradient boosting classifier applies the concepts of logistic regression. It uses log-odds to make a prediction, converts the log-odds to probabilities through the logistic function, then makes a classification based on a self-defined threshold.

Say we want to predict whether a person is obese (binary classification). The outcome “Obese” is represented by 1; “Not obese” is represented by 0. Given a threshold of 0.5, if the probability of a person being obese is 0.7, he/she will be classified as “Obese” (0.7 > threshold 0.5).

Second, gradient boosting, just like AdaBoost, is a boosting algorithm that starts with a Weak learner (a shallow decision tree), builds new learners based on the Errors of previous models, and makes predictions based on All the learners.

Before we look at the workflow of gradient boosting, let’s discuss the difference between gradient boosting and AdaBoost.

To illustrate, gradient boosting is more flexible in terms of loss functions, tree depth, and usage. As for the algorithms, AdaBoost learns from the misclassified samples by re-weighting them, whereas gradient boosting learns from samples with large pseudo-residuals. Moreover, each tree in gradient boosting is weighted equally (by the learning rate) when making the final classification, while AdaBoost weights each tree differently.

Alright, let’s dive into gradient boosting! Suppose we want to predict whether a person is obese (binary classification) using his/her height, age, cholesterol level, and gender. Specifically, we have 10 people in the training set: 3 of them are obese, and the remaining 7 are not.

1. Assign equal log-odds to samples

As in AdaBoost, we need to make an initial guess. In gradient boosting, we use log-odds. In fact, the log-odds is simply log(number of positive cases / number of negative cases).

In this example, we have 3 obese people (positive cases) and 7 non-obese people (negative cases). Thus, the log-odds is log(3/7) ≈ -0.847. Then, we assign this log-odds to each person in the training set.

⚡ The log-odds are NOT the predicted probabilities.
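To make this concrete, here is a minimal sketch of step 1; the labels are made up to match the example of 3 obese and 7 non-obese people:

import numpy as np

# Hypothetical labels: 1 = obese, 0 = not obese (3 positives, 7 negatives)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# Initial prediction: the same log-odds for every sample
initial_log_odds = np.log(y.sum() / (len(y) - y.sum()))  # log(3/7) ≈ -0.847
print(initial_log_odds)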

2. Transform log-odds back to probabilities

As in logistic regression, we need to convert log-odds back to probabilities using the logistic function. The formula is: Probability = e^(log-odds) / (1 + e^(log-odds)).

Given log-odds -0.847, the probability of being obese is e^(-0.847) / (1 + e^(-0.847)) ≈ 0.30. Since we assign the same log-odds to everyone in the training set, they all have a predicted probability (of being obese) of 0.30.
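As a small sketch, the conversion is a one-line logistic (sigmoid) function applied to the log-odds from step 1:

import numpy as np

def log_odds_to_probability(log_odds):
    # Logistic function: e^x / (1 + e^x)
    return np.exp(log_odds) / (1 + np.exp(log_odds))

print(log_odds_to_probability(-0.847))  # ≈ 0.30, the initial probability for everyone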

3. Calculate Pseudo-residual for each sample

Pseudo-residuals are nothing special, just intermediate error terms; they are “pseudo” because the predicted values are temporary/intermediate. Hence, a pseudo-residual is calculated as (Actual value - Predicted probability).

In this example, 3 people are obese (represented by 1), so their pseudo-residuals are (1 - 0.3) = 0.7. The remaining 7 people are not obese (represented by 0), so their pseudo-residuals are (0 - 0.3) = -0.3.
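In code, this is just an element-wise subtraction; the labels below reuse the made-up example from step 1:

import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # 1 = obese, 0 = not obese
predicted_prob = np.full(len(y), 0.30)          # initial probability from step 2

pseudo_residuals = y - predicted_prob           # 0.7 for obese, -0.3 for non-obese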

4. Build a decision tree based on Pseudo-residuals

Then, we can build our first tree. As in a normal decision tree, we split on the attribute with the best split quality in terms of Friedman mean squared error or ordinary mean squared error, computed on the pseudo-residuals. By convention, we typically limit the number of leaves to between 8 and 32.

⚡ The target attribute is the pseudo-residuals!
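GradientBoostingClassifier handles this step internally, but a rough sketch of the idea looks like this; the feature matrix is a random stand-in, since the original features aren’t given:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                    # stand-in for height, age, cholesterol, gender
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
pseudo_residuals = y - 0.30                     # residuals from step 3

# A regression tree fitted on the pseudo-residuals, not on the original labels
tree = DecisionTreeRegressor(criterion='friedman_mse', max_leaf_nodes=8)
tree.fit(X, pseudo_residuals)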

5. Compute the output value of each leaf in log-odds

As a matter of fact, we use residuals of probabilities to build the tree and its leaves. However, the predictions are in log-odds (step 1). Hence, we need to transform the output value of each leaf into log-odds. The most commonly-used formula is: Output value = (sum of the residuals in the leaf) / (sum of [previous probability × (1 - previous probability)] over the samples in that leaf).

Say we have a leaf containing 2 obese people and 1 non-obese person. The residual of each obese person is 0.7; the residual of the non-obese person is -0.3 (step 3). The previous probability for everyone is 0.3 since we assigned equal log-odds in step 1. Therefore, the output value of this leaf is (0.7 + 0.7 + (-0.3)) / [(0.3 × (1 - 0.3)) + (0.3 × (1 - 0.3)) + (0.3 × (1 - 0.3))] = 1.1 / 0.63, which is about 1.746.

⚡ The previous probabilities for different samples should be different. They are the same here because it’s our First tree.
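As a quick check of this arithmetic:

# Residuals and previous probabilities of the 3 people in this leaf
residuals = [0.7, 0.7, -0.3]
prev_probs = [0.3, 0.3, 0.3]

numerator = sum(residuals)                           # 1.1
denominator = sum(p * (1 - p) for p in prev_probs)   # 3 * 0.21 = 0.63
leaf_output = numerator / denominator                # ≈ 1.746 (in log-odds)
print(leaf_output)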

6. Update the previous predicted log-odds

The output values we calculated in step 5 are used to update the predicted log-odds (step 1). The formula is: New log-odds = Previous log-odds + learning rate × output value of the leaf the sample falls into.

As in AdaBoost, we can assign a learning rate. If the learning rate is high, each new tree contributes a lot to the updated prediction, and the probability of overfitting increases.

Nevertheless, if the learning rate is low, the model improves slowly, and we will need more trees to achieve high accuracy. Therefore, we often consider the learning rate and the number of trees as a trade-off.

Given a learning rate of 0.1, the new log-odds of a person who falls into the example leaf in step 5 will be -0.847 (previous log-odds) + 0.1 × 1.746 (output value from tree 1), which is about -0.672.

⚡ We take a baby step at a time by setting a low learning rate. Even if the direction of change is not right at the moment (e.g., an obese person’s log-odds of being obese decreases), we will be able to correct this as the number of iterations increases.
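In code, the update for a person who falls into the example leaf is a one-liner:

learning_rate = 0.1
previous_log_odds = -0.847      # initial prediction from step 1
leaf_output = 1.746             # output value of the leaf from step 5

new_log_odds = previous_log_odds + learning_rate * leaf_output   # ≈ -0.672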

7. Repeat Step2 to Step6

Now that we have a new set of log-odds, we need to convert them back to probabilities before building the next tree. Thus, we repeat step 2 to step 6 until the required number of trees is built.

⚡ We can terminate the training early when an extra tree doesn’t improve the performance on the validation set, which mitigates the issue of overfitting.

Before we move on, how exactly does gradient boosting make a classification?

First, we use the formula in step 6 to calculate the log-odds for a new sample: we start from the initial log-odds (step 1) and, for every tree, add the learning rate times the output value of the leaf the sample falls into. Each tree has equal weight (the same learning rate).

Then, we convert the log-odds back to a probability using the formula in step 2 and compare this probability with our threshold!

If the log-odds of a person is 0.6, the predicted probability of that person being obese is 0.646. Given a threshold of 0.6, that person will be predicted as “Obese” since 0.646 is higher than 0.6.
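Putting the pieces together, here is a hedged sketch of a prediction done by hand; the leaf output values are made-up numbers purely for illustration:

import numpy as np

initial_log_odds = -0.847
learning_rate = 0.1
# Hypothetical output values of the leaves the new sample falls into, one per tree
leaf_outputs = np.array([1.746, 1.2, 0.9, 1.5])

log_odds = initial_log_odds + learning_rate * leaf_outputs.sum()
probability = np.exp(log_odds) / (1 + np.exp(log_odds))
prediction = int(probability > 0.5)   # 1 = "Obese", 0 = "Not obese" with threshold 0.5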

That’s it! We’ve learned the workflow of gradient boosting, and we can use GradientBoostingClassifier in Python. Moreover, there are a lot of parameters that can be tuned (We’ll only talk about a few of them).

First, gradient boosting has most of the parameters of a decision tree (criterion, max_depth, max_leaf_nodes, and more!). Setting these parameters defines how each tree in the ensemble is trained, and you can find a detailed explanation of these parameters here.

⚡ The criterion in decision tree can be “gini”, “entropy”, or “log_loss”, while the criterion in gradient boosting is either ‘friedman_mse’ or ‘squared_error’.

Second, gradient boosting has a lot more parameters of its own, including loss, learning_rate, and n_estimators; a short example follows the list.

  • loss: The loss function to be optimized through gradient descent. In fact, the machine learning algorithm is trained to minimize the loss function.
  • n_estimators: Number of trees in the gradient boosting, which is 100 by default.
  • learning_rate: Determines the shrinkage of the contribution of a tree at each boosting iteration. When the learning rate is high, gradient boosting tends to overfit the data; when the learning rate is low, we need more iterations to increase accuracy. Thus, n_estimators and learning_rate can be viewed as a trade-off.
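For instance, a hypothetical instantiation that touches these knobs could look like the following (loss='log_loss' is called 'deviance' in older scikit-learn versions):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    loss='log_loss',       # loss function minimized by the ensemble
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    n_estimators=100,      # number of boosting iterations (trees)
    max_depth=3,           # decision-tree parameter, as discussed above
)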

Example

Let’s move on to the Titanic example. To quickly recap, we want to predict whether a passenger survived the Titanic disaster. We pre-processed the dataset before and selected the independent variables we need (X): Pclass, Sex, Age, SibSp, Parch, and Fare, as well as the dependent variable (y): Survived.

Next, we split the data into a training and a testing set in an 80/20 ratio. Namely, we use 80% of the data to train the model and 20% of the data to evaluate it.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=65)

Further, we compared the performance of random forest and AdaBoost, and we found that AdaBoost achieved the highest accuracy score when learning_rate is 1.1 (see this article).

Now, we want to find out whether gradient boosting can outperform AdaBoost. For comparison, I’ll leave all the parameters at their defaults, except for the learning_rate. By doing so, we can easily see how gradient boosting performs compared with AdaBoost through a line plot.

First, we create a variable learning_rate_range to store the learning rates we want to try: from 0.1 to 2 with an interval of 0.1. Then, we store the training and testing scores of AdaBoost in the lists train_ada and test_ada, and the training and testing scores of gradient boosting in the lists train_gb and test_gb.

# Stores the accuracy scores
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

learning_rate_range = np.arange(0.1, 2, 0.1)

# AdaBoost
test_ada = []
train_ada = []
for lr in learning_rate_range:
    ada = AdaBoostClassifier(learning_rate=lr, n_estimators=100)
    ada.fit(x_train, y_train)
    train_ada.append(ada.score(x_train, y_train))
    test_ada.append(ada.score(x_test, y_test))

# Gradient boosting
test_gb = []
train_gb = []
for lr in learning_rate_range:
    gb = GradientBoostingClassifier(learning_rate=lr, n_estimators=100)
    gb.fit(x_train, y_train)
    train_gb.append(gb.score(x_train, y_train))
    test_gb.append(gb.score(x_test, y_test))

Then, we can draw a line plot to compare the model performance.

import matplotlib.pyplot as plt

# AdaBoost vs. Gradient Boosting
fig = plt.figure(figsize=(8, 6))
plt.plot(learning_rate_range, test_gb, c='red', label='Gradient boosting')
plt.plot(learning_rate_range, test_ada, c='blue', label='AdaBoost')
plt.xlabel('Learning rate')
plt.xticks(learning_rate_range)
plt.ylabel('Accuracy score')
plt.ylim(0.6, 1)
plt.legend(prop={'size': 12})
plt.title('Accuracy score vs. Learning rate of AdaBoost/Gradient Boosting', size=14)
plt.show()

As you can see, gradient boosting has the best model performance (accuracy 0.839) when the learning rate is 0.2, which is higher than the best performance of AdaBoost (accuracy 0.825). This result is unsurprising since gradient boosting is essentially a more robust algorithm and should achieve higher accuracy.

However, that’s not the end of the story. We need to check whether the model overfits since boosting algorithms are prone to overfitting.

# Figure size
fig = plt.figure(figsize=(15, 5))
# AdaBoost
fig.add_subplot(121)
plt.plot(learning_rate_range, train_ada, c='orange', label='Training')
plt.plot(learning_rate_range, test_ada, c='m', label='Testing')
plt.xlabel('Learning rate')
plt.xticks(learning_rate_range)
plt.ylabel('Accuracy score')
plt.ylim(0.6, 1)
plt.legend(prop={'size': 12})
plt.title('Accuracy score vs. Learning rate of AdaBoost', size=16)
# Gradient boosting
fig.add_subplot(122)
plt.plot(learning_rate_range, train_gb, c='orange', label='Training')
plt.plot(learning_rate_range, test_gb, c='m', label='Testing')
plt.xlabel('Learning rate')
plt.xticks(learning_rate_range)
plt.ylabel('Accuracy score')
plt.ylim(0.6, 1)
plt.legend(prop={'size': 12})
plt.title('Accuracy score vs. Learning rate of Gradient Boosting', size=16)
plt.show()

Wow. Look at how much gradient boosting overfits.

Overfitting is the problem where the model learns too much irrelevant detail from the training data, so its performance on unseen data will be Unstable. In other words, even if we find that the model performs well on this particular testing data, we cannot be sure that it will keep performing this way.

AdaBoost, on the other hand, has almost no overfitting issue when the learning rate is 0.4. In this case, even if the accuracy score of AdaBoost is lower than that of gradient boosting, we’d still prefer AdaBoost since we can TRUST it consistently. At the end of the day, we want a stable model we truly understand, instead of an uncertain one with occasionally high performance.

Having said that, we are not ready to give up on gradient boosting. To reduce the problem of overfitting, we can Regularize the model. When we impose regularization on a model, we restrict its capability.

Generally speaking, we can reduce the number of iterations (n_estimators) and the learning rate (learning_rate), or increase the minimum number of samples in a leaf (min_samples_leaf) and the tolerance for early stopping (tol), so the model cannot fit the training data as closely.

Specifically, increasing the tolerance for early stopping essentially decreases the number of iterations: when a new tree doesn’t improve the previous model by a certain amount (tol), we terminate the training early.
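Note that in scikit-learn, tol only takes effect when n_iter_no_change is set, so early stopping needs both. A hedged sketch with arbitrary values:

from sklearn.ensemble import GradientBoostingClassifier

# Hold out 10% of the training data as a validation set and stop when the
# validation loss fails to improve by at least `tol` for 5 consecutive trees
gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound; training may stop earlier
    learning_rate=0.1,
    validation_fraction=0.1,
    n_iter_no_change=5,
    tol=1e-3,
)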

In this example, we’ll only set n_estimators to 50 and search over a few values of min_samples_leaf to reduce the issue of overfitting. In the real world, we can use grid search and K-fold cross-validation to find the best combination of parameters (see this article).

Further, we can tell from the line plot that the performance of gradient boosting generally decreases after a learning rate of 0.7. This implies the learning rate is too high for the model to steadily capture the general trend of our dataset. Therefore, we change the variable learning_rate_range to go from 0.01 to 0.7 with an interval of 0.05.

learning_rate_range = np.arange(0.01, 0.7, 0.05)
fig = plt.figure(figsize=(14, 12))
idx = 1  # index of the subplot
for leaf in range(5, 45, 10):
    train = []
    test = []
    for lr in learning_rate_range:
        gb = GradientBoostingClassifier(learning_rate=lr, n_estimators=50, min_samples_leaf=leaf)
        gb.fit(x_train, y_train)
        train.append(gb.score(x_train, y_train))
        test.append(gb.score(x_test, y_test))
    fig.add_subplot(2, 2, idx)
    idx += 1
    plt.plot(learning_rate_range, train, c='orange', label='Training')
    plt.plot(learning_rate_range, test, c='m', label='Testing')
    plt.xlabel('Learning rate')
    plt.xticks(learning_rate_range)
    plt.ylabel('Accuracy score')
    plt.ylim(0.6, 1)
    plt.legend(prop={'size': 12})
    title = "Leaf:" + str(leaf)
    plt.title(title, size=16)
plt.show()

Yes, I know. It’s not perfect. We still have the issue of overfitting, but the problem is mitigated by a noticeable degree. In particular, when the min_samples_leaf is 25, the model performance seems acceptable and the issue of overfitting can be considered mild.

Let’s do the comparison again using the new learning_rate_range (np.arange(0.01, 0.7, 0.05)) and the new parameters (min_samples_leaf=25, n_estimators=50)!
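Here is a sketch of how that rerun could look, mirroring the earlier code block and reusing x_train, x_test, y_train, and y_test from before:

# Re-run the comparison with the regularized gradient boosting settings
learning_rate_range = np.arange(0.01, 0.7, 0.05)
test_ada, test_gb = [], []
for lr in learning_rate_range:
    ada = AdaBoostClassifier(learning_rate=lr, n_estimators=100)
    ada.fit(x_train, y_train)
    test_ada.append(ada.score(x_test, y_test))

    gb = GradientBoostingClassifier(learning_rate=lr, n_estimators=50, min_samples_leaf=25)
    gb.fit(x_train, y_train)
    test_gb.append(gb.score(x_test, y_test))

fig = plt.figure(figsize=(8, 6))
plt.plot(learning_rate_range, test_gb, c='red', label='Gradient boosting (regularized)')
plt.plot(learning_rate_range, test_ada, c='blue', label='AdaBoost')
plt.xlabel('Learning rate')
plt.ylabel('Accuracy score')
plt.legend(prop={'size': 12})
plt.show()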

As you can see, gradient boosting outperforms AdaBoost most of the time, and it achieves the highest performance when the learning rate is 0.46. This example shows the power of gradient boosting in terms of its Flexibility, not to mention that we left a lot of parameters at their defaults.

However, we should always pay extra attention to the issue of overfitting. Gradient boosting is indeed a robust algorithm, but that comes with a cost: it can learn the training data in too much detail and fail to capture the general trend. Luckily, we can always tune the parameters to restrict its learning ability until we find the degree of overfitting acceptable.

💛 If you like this article, make sure to follow me! It really encourages me and motivates me to keep sharing. Thank you so much.


Coding

