XGBoost (Classification) in Python

Little Dino
11 min read · Jul 6, 2022

Introduction

In the previous articles, we introduced Decision tree, compared decision tree with Random forest, compared random forest with AdaBoost, and compared AdaBoost with Gradient boosting. It has been quite a journey. Unfortunately, everything has an end.

This article will end the tree algorithm series. In particular, we’ll first look at XGBoost, which stands for Extreme gradient boosting. As the name suggests, XGBoost is an extreme, advanced version of gradient boosting, in that it is more flexible (i.e., regularization parameters) and more efficient (i.e., parallel computation). Moreover, we’ll use the Titanic dataset as an example for illustration.

⚡ The code will be provided in the last section of this article.

XGBoost

First of all, XGBoost can be used in regression, binary classification, and multi-class classification (One-vs-all). In this article, we’ll focus on Binary classification.

Second, XGBoost and gradient boosting have a lot in common, including the log-odds and logistic functions. If you are unfamiliar with these concepts, go check out this article or this video (StatQuest).

So, without further ado, let’s dive into XGBoost! Suppose we want to predict whether a person is obese (binary classification) using his/her Gender and Cholesterol level. Specifically, we have 10 people in the training set — 3 of them are obese, and the remaining 7 are not.

💦 Friendly reminder — The process might seem tediously long, but the concept is actually simple and straightforward!

1. Assign the same initial prediction (0.5) to all samples

As a start, we assign an initial prediction to each sample. Since we haven’t trained any model or learned anything from the data, we can simply set 0.5 as the initial prediction for every sample. Namely, a person has a 50% chance of being obese.

0.5 is the default value in the XGBoost module. We can change it to whatever value we like as long as it’s in the range [0, 1].
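
To my understanding, this initial prediction corresponds to the base_score parameter of the xgboost scikit-learn wrapper. A minimal sketch:

import xgboost as xgb

# base_score is the initial prediction assigned to every sample
# before any tree is built; 0.5 is the default
model = xgb.XGBClassifier(base_score=0.5)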

2. Calculate Pseudo-residual for each sample

Pseudo-residuals are nothing special but an intermediate error term — the predicted values they are based on are only temporary/intermediate. Hence, a pseudo-residual is calculated as (Actual value − Predicted value).

In this example, 3 people are obese (represented by 1), so their pseudo-residuals are (1 − 0.5) = 0.5. The remaining 7 people are not obese (represented by 0), so their pseudo-residuals are (0 − 0.5) = −0.5.

⚡ The predicted values of all the samples are the same since it’s our FIRST tree. After the first iteration, the predicted values are likely to be different.
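
As a quick sanity check, here is the pseudo-residual arithmetic for our 10-person example as a short sketch (plain NumPy, not XGBoost internals):

import numpy as np

# 3 obese people (1) and 7 non-obese people (0)
actual = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
initial_prediction = 0.5

# pseudo-residual = actual value - predicted value
residuals = actual - initial_prediction
print(residuals)  # [ 0.5  0.5  0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5]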

3. Build a decision tree based on Pseudo-residuals

Now, we can build our first tree. Like a normal decision tree, we split on the attribute with the best quality. Unlike a normal decision tree, the quality is NOT measured by the mean squared error of the pseudo-residuals; XGBoost has a unique method to choose the attribute.

To be more specific, we calculate a Similarity score for each node. Then, the Gain of a split is the similarity score of the left child node + the similarity score of the right child node − the similarity score of the parent node itself. Finally, we select the split with the highest gain.

I know it sounds pretty abstract, let’s look at the formulas to get a more concrete idea.

The similarity score formula of XGBoost is similar to the formula used to calculate the output leaf value in gradient boosting. However, we take the square of the summation of the residuals. For binary classification, the similarity score of a node is (Σ residuals)² / (Σ [Prob * (1 − Prob)] + λ).

⚡ We sum up the residuals before we take the square. Thus, positive residuals and negative residuals can cancel each other out!

Then, we have a lambda (λ) in the denominator, which is a regularization parameter. You might already know that boosting algorithms are prone to overfitting, and XGBoost is no exception. Therefore, we need to limit the model’s ability to learn. By adding λ, both the similarity score and the gain will be lower, which leads to our next step — Tree pruning.
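
To make this concrete, here is a small sketch of the similarity score and gain calculations described above, using the residuals and previous probabilities from our 10-person example. The helper functions are my own illustration, not part of the xgboost API:

import numpy as np

def similarity_score(residuals, prev_probs, lam=1.0):
    # (sum of residuals)^2 / (sum of p*(1-p) + lambda)
    return np.sum(residuals) ** 2 / (np.sum(prev_probs * (1 - prev_probs)) + lam)

def gain(left_res, right_res, left_probs, right_probs, lam=1.0):
    root_res = np.concatenate([left_res, right_res])
    root_probs = np.concatenate([left_probs, right_probs])
    return (similarity_score(left_res, left_probs, lam)
            + similarity_score(right_res, right_probs, lam)
            - similarity_score(root_res, root_probs, lam))

# candidate split that puts the 3 obese people in the left leaf
left_res, right_res = np.array([0.5, 0.5, 0.5]), np.full(7, -0.5)
left_probs, right_probs = np.full(3, 0.5), np.full(7, 0.5)
print(gain(left_res, right_res, left_probs, right_probs))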

4. Prune the tree — Both leaves and nodes

To reduce the issue of overfitting, we can set our own pruning thresholds. Let’s start with the node pruning. γ (gamma) in XGBoost is the minimum acceptable node gain. If a node has lower gain than γ, that node will be pruned and the tree will be shallower. Hence, if we set a high γ or add a large λ when calculating similarity scores, the tree will be more limited in depth.

Moreover, XGBoost can prune leaves using Cover. We can think of cover as the minimum child node weight allowed. The formula of cover is simply the denominator of the similarity score without λ — Σ [Prob * (1 − Prob)]. If only 1 child node has a higher weight than the cover threshold, we still remove both leaves, since the split is only legitimate when both leaves satisfy the cover constraint.

⚡ We prune the tree from the bottom up. If a lower node has a gain higher than γ, we stop; even if a node above it has a gain lower than γ, we don’t prune it.
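
Continuing the sketch, cover for a leaf in binary classification is just Σ [Prob * (1 − Prob)], and a split is only kept when both leaves reach the cover threshold (min_child_weight). Again, this is my own illustration rather than library code:

import numpy as np

def cover(prev_probs):
    # denominator of the similarity score without lambda
    return np.sum(prev_probs * (1 - prev_probs))

min_child_weight = 1.0         # minimum cover allowed in a child node
left_probs = np.full(3, 0.5)   # previous probabilities in the left leaf
right_probs = np.full(7, 0.5)  # previous probabilities in the right leaf

# the split is only kept if BOTH leaves satisfy the cover constraint
split_is_valid = (cover(left_probs) >= min_child_weight
                  and cover(right_probs) >= min_child_weight)
print(cover(left_probs), cover(right_probs), split_is_valid)  # 0.75 1.75 False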

5. Compute the output value of each leaf in log-odds

As a matter of fact, we use residuals of Probabilities to build the tree and its leaves. However, the predictions should be in Log-odds (XGBoost inherits this concept from logistic regression). Hence, we need to transform the output value of each leaf into log-odds, and the most commonly-used formula is: Output value = Σ residuals / (Σ [Prob * (1 − Prob)] + λ).

⚡ The formula is almost exactly the same as in gradient boosting, except that we now have an additional regularization parameter — λ.

As before, the λ here reduces the prediction’s sensitivity to isolated observations (see StatQuest) and reduces the chance of overfitting. Say λ is 1 and we have a leaf containing 2 obese people and 1 non-obese person. The residual of each obese person is 0.5; the residual of the non-obese person is −0.5 (step 2). The previous probability for everyone is 0.5 since we assigned equal predictions in step 1. Therefore, the output value of this leaf is (0.5 + 0.5 + (−0.5)) / [(0.5 * (1 − 0.5)) + (0.5 * (1 − 0.5)) + (0.5 * (1 − 0.5)) + 1], which is 0.286.

⚡ The previous probabilities of different samples will generally be different. They are only all the same in the First tree.
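
To double-check the arithmetic, here is the same leaf output calculation as a short sketch:

import numpy as np

residuals = np.array([0.5, 0.5, -0.5])  # 2 obese people + 1 non-obese person
prev_probs = np.full(3, 0.5)            # everyone started at probability 0.5
lam = 1.0

# output value = sum of residuals / (sum of p*(1-p) + lambda)
output_value = np.sum(residuals) / (np.sum(prev_probs * (1 - prev_probs)) + lam)
print(round(output_value, 3))  # 0.286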

6. Update the previous predicted probabilities

The output values we calculated in step 5 are used to update the predicted probabilities (step 1). The formula is: New log-odds = Previous log-odds + eta * Output value.

As in gradient boosting, we can assign a learning rate. Well, in XGBoost, the learning rate is called eta.

If eta is high, each new tree contributes a lot to the prediction, and the probability of overfitting increases. Nevertheless, if eta is low, the model improves slowly, and we will need more trees to achieve high accuracy. Therefore, eta is also a regularization parameter that is used to prevent the algorithm from overfitting.

Before we plug values into the formula, we have one more thing to do. In step 1, we predicted that every person has a 50% chance of being obese, but that’s a probability, not a log-odds value. We need to convert the predicted probability to log-odds, which can be calculated by log(p/(1−p)).

Specifically, the log-odds of probability 0.5 is log(0.5/(1 − 0.5)), which is 0. Therefore, given eta = 0.3, the new log-odds of a person who belongs to the example leaf in step 5 is 0 (previous log-odds) + 0.3 * 0.286 (output value of tree 1), which is 0.086.

⚡ We take a baby step at a time by setting a low eta. Even if the direction of change is not right for now (i.e., an obese person gets a lower log-odds of being obese), we will be able to correct it as the number of iterations increases.
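
The same update, written out as a short sketch:

import numpy as np

prev_prob = 0.5
prev_log_odds = np.log(prev_prob / (1 - prev_prob))  # log(0.5 / 0.5) = 0

eta = 0.3
leaf_output = 0.286  # output value of tree 1 from step 5

# new log-odds = previous log-odds + eta * leaf output value
new_log_odds = prev_log_odds + eta * leaf_output
print(round(new_log_odds, 3))  # 0.086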

7. Repeat Steps 2 to 6

Now that we have a new set of log-odds, we need to convert them back to probabilities before building the second tree, using this formula: Probability = e^(log-odds) / (1 + e^(log-odds)).

Afterwards, we repeat steps 2 to 6 until the required number of trees is built or the residuals are small (i.e., the predicted values are super close to the actual values).

Before we move on, how exactly does XGBoost classify a new sample?

First, we use the formula in step 6 to calculate the log-odds for the new sample. Then, we convert the log-odds back to a probability using the formula in step 7 and compare this probability with our threshold!

If the log-odds of a person is 0.6, the predicted probability of that person being obese is 0.646. Given a threshold of 0.6, that person will be predicted as “Obese” since 0.646 is higher than 0.6.
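
Written out as a sketch, the conversion and the final decision look like this:

import numpy as np

log_odds = 0.6
prob = 1 / (1 + np.exp(-log_odds))  # equivalently e^x / (1 + e^x)
threshold = 0.6

prediction = "Obese" if prob > threshold else "Not obese"
print(round(prob, 3), prediction)  # 0.646 Obese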

That’s it! You might notice that I copied and pasted some content from the gradient boosting article, and yes, I did. In fact, gradient boosting and XGBoost have a lot in common; it’s just that XGBoost is more flexible and more efficient.

Now that we’ve learned the workflow of XGBoost, we can use the xgboost library in Python. Furthermore, there are a LOT of parameters that can be tuned (we’ll only talk about a few of them).

  • max_depth (default — 6): The maximum depth of a tree.
  • n_estimators (default — 100): Number of trees (boosting rounds) in the XGBoost model.
  • eta (default — 0.3): Learning rate; determines the shrinkage of the contribution of a tree at each boosting iteration. When the learning rate is high, XGBoost tends to overfit the data; when the learning rate is low, we need more iterations to increase accuracy. Thus, n_estimators and learning_rate can be viewed as a trade-off.
  • gamma (default — 0): Minimum gain required to split a node.
  • reg_lambda (default — 1): L2 regularization parameter used to reduce the prediction’s sensitivity to isolated observations.

xgboost is NOT part of the sklearn library; it is an independent library. You need to install it first before importing it.

# Install
!pip install xgboost
# Import
import xgboost as xgb
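
To show where the parameters listed above go, here is an illustrative instantiation. The values are simply the defaults discussed above, not tuned recommendations:

# an illustrative configuration; tune these values for your own data
xgb_classifier = xgb.XGBClassifier(
    max_depth=6,       # maximum depth of each tree
    n_estimators=100,  # number of trees (boosting rounds)
    eta=0.3,           # learning rate
    gamma=0,           # minimum gain required to split a node
    reg_lambda=1,      # regularization on the leaf output values
)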

Example

Let’s move on to the Titanic example. To quickly recap, we want to predict whether a passenger survived the Titanic disaster. We pre-processed the dataset before and selected the independent variables we need, X (Pclass, Sex, Age, SibSp, Parch, Fare), as well as the dependent variable y (Survived). The dataset looks like this —

Next, we split the data into training and testing set in 80/20 ratio. Namely, we use 80% of data to train the model, 20% of data to evaluate the model.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=65)

Now, we want to find out how XGBoost performs. First, we create a variable learning_rate_range to store the learning rates we want to try — from 0.01 to 1 with an interval of 0.05. Then, we store the training and testing scores of XGBoost in the lists train_XG and test_XG.

# XGBoost (different learning rates)
import numpy as np

learning_rate_range = np.arange(0.01, 1, 0.05)
test_XG = []
train_XG = []
for lr in learning_rate_range:
    xgb_classifier = xgb.XGBClassifier(eta=lr)
    xgb_classifier.fit(x_train, y_train)
    train_XG.append(xgb_classifier.score(x_train, y_train))
    test_XG.append(xgb_classifier.score(x_test, y_test))

Then, we can draw a line plot to see how XGBoost performs.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 7))
plt.plot(learning_rate_range, train_XG, c='orange', label='Train')
plt.plot(learning_rate_range, test_XG, c='m', label='Test')
plt.xlabel('Learning rate')
plt.xticks(learning_rate_range)
plt.ylabel('Accuracy score')
plt.ylim(0.6, 1)
plt.legend(prop={'size': 12}, loc=3)
plt.title('Accuracy score vs. Learning rate of XGBoost', size=14)
plt.show()

XGBoost did an okay job. It reaches 81.1% accuracy when the learning rate is 0.11. However, the issue of overfitting is quite noticeable. Overfitting is a problem where the model learns too much irrelevant detail from the training data, so its performance on unseen data will be Unstable. In other words, even if the model performs well on this particular testing data, we cannot be sure that it will keep performing this way.

Thus, we need to tackle this issue by Regularizing the model. When we impose regularization on a model, we restrict its capability.

Generally speaking, we can reduce the number of trees (n_estimators) and the learning rate (learning_rate), or increase the minimum gain required to split a node (gamma) and the regularization parameter (reg_lambda), so the model can’t learn the features of the training set too well.

In this example, we’ll only set reg_lambda to 1 and grid-search for the optimal min_child_weight to reduce the issue of overfitting. In the real world, we can use grid search and K-fold cross validation to find the best combination of parameters (see this article).
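
For reference, here is a minimal sketch of what such a grid search with K-fold cross validation could look like; the parameter grid is only an illustration:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'min_child_weight': [1, 2, 3],
    'reg_lambda': [1, 5, 10],
}

# 5-fold cross validation over every parameter combination
grid = GridSearchCV(xgb.XGBClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)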

Also, we can see the training score is super close to 1 when the learning rate is higher than 0.5, while the testing score keeps dropping, which implies overfitting. Thus, we change learning_rate_range to a new range — from 0.01 to 0.5 with an interval of 0.05.

# new learning rate range
learning_rate_range = np.arange(0.01, 0.5, 0.05)
fig = plt.figure(figsize=(19, 17))
idx = 1
# grid search for min_child_weight
for weight in np.arange(0, 4.5, 0.5):
    train = []
    test = []
    for lr in learning_rate_range:
        xgb_classifier = xgb.XGBClassifier(eta=lr, reg_lambda=1, min_child_weight=weight)
        xgb_classifier.fit(x_train, y_train)
        train.append(xgb_classifier.score(x_train, y_train))
        test.append(xgb_classifier.score(x_test, y_test))
    fig.add_subplot(3, 3, idx)
    idx += 1
    plt.plot(learning_rate_range, train, c='orange', label='Training')
    plt.plot(learning_rate_range, test, c='m', label='Testing')
    plt.xlabel('Learning rate')
    plt.xticks(learning_rate_range)
    plt.ylabel('Accuracy score')
    plt.ylim(0.6, 1)
    plt.legend(prop={'size': 12}, loc=3)
    title = "Min child weight:" + str(weight)
    plt.title(title, size=16)
plt.show()

It works! As you can see, the overfitting issue is serious when min_child_weight is 0, while the training and testing scores become generally closer as min_child_weight increases. Specifically, the overfitting issue seems minor when min_child_weight is 3.5, so let’s zoom in on that graph.

Cool! Although we did not eliminate overfitting entirely, we did make the gap between the training score and the testing score smaller. In particular, XGBoost reaches its best accuracy of 0.846 when the learning rate is 0.16.

Isn’t it fascinating that we reach better performance after we impose some regularization parameters?

Yes, that happens from time to time. Regularization limits the model’s capability to learn the specific quirks of the training set, NOT the general trend. Sometimes a boosting model can learn better when it is regularized.

This example shows the power of XGBoost and its flexibility in terms of parameter tuning. Further, if you run the algorithm on your machine, you’ll find it’s actually fast due to its parallel computing nature.

However, we should always pay extra attention to the issue of overfitting. Boosting algorithms, including XGBoost, can learn the training data too well and in too much detail, and fail to capture the general trend. Luckily, we can always tune the parameters to restrict their learning ability until we find the degree of overfitting acceptable.

💛 If you like this article, make sure to follow me! It really encourages me and motivates me to keep sharing. Thank you so much.

References

  1. https://xgboost.readthedocs.io/en/stable/parameter.html
  2. https://analyticsindiamag.com/top-xgboost-interview-questions-for-data-scientists/

Coding
