Introduction to XGBoost

Taylor Miller · Apprentice Journal · Feb 18, 2019

Boosting is an approach used to increase the predictive power of classical decision and regression tree models through an iterative learning technique that focuses more heavily on observations that were misclassified in earlier rounds. In other words, it is a family of algorithms that convert weak learners into strong learners. Another ensemble technique that is often compared to boosting is bagging, or bootstrap aggregation. As an ensemble technique, bagging also creates a set of weak learners and combines them into a strong learner. However, bagging builds each learner on a subset of the training data chosen at random, rather than weighting observations by performance as boosting does. Boosting aims to reduce errors and bias, whereas bagging’s goal is to minimize variance.

Boosting algorithms follow this general pattern (a minimal sketch in code follows the list):

  1. The base learner takes all observations and assigns equal weight to each.
  2. If observations are misclassified, boosting increases their weight so the next base learning algorithm pays more attention to them.
  3. Repeat Step 2 until the desired accuracy is reached or the maximum number of iterations is exhausted.
  4. After all base learning iterations have been completed, the outputs are combined to create a strong learner.
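
A minimal sketch of this reweighting loop, in the spirit of AdaBoost with decision stumps as the weak learners (the function name boost and the choice of stumps are illustrative assumptions, not part of any particular library):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    # Assumes y is a NumPy array of -1/+1 labels
    n = len(y)
    weights = np.full(n, 1 / n)               # Step 1: equal weight per observation
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.sum(weights[pred != y])      # weighted error of this weak learner
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        weights *= np.exp(-alpha * y * pred)  # Step 2: up-weight misclassified observations
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    # Step 4: combine the weak learners into a strong learner
    return lambda X_new: np.sign(sum(a * l.predict(X_new) for a, l in zip(alphas, learners)))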

XGBoost enhances the typical boosting procedure by introducing gradient descent. Gradient boosting uses a gradient descent algorithm when iterating over the prediction errors to create a strong learner. That is, rather than fitting each new model on raw residuals, it fits on the gradients of the loss (cost) function. XGBoost, specifically, is a powerful member of the gradient boosting family because of its fast computation, and therefore scalability, and its strong predictive performance. In addition, XGBoost handles missing values and weighted data better than most other tree-based algorithms. As a result, XGBoost has become one of the most popular machine learning techniques.
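
To make the idea of fitting on gradients concrete, here is a minimal sketch of gradient boosting for regression with squared-error loss, where the negative gradient of the loss is simply the residual (this is the general gradient boosting recipe, not XGBoost’s optimized implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    prediction = np.full(len(y), np.mean(y))  # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        gradients = y - prediction            # negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, gradients)                # fit the next tree to the gradients
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees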

It should be noted before going further that boosting techniques do not learn or predict on their own; they are simply used to add muscle to your original model.

XGBoost in Python

XGBoost is available through the xgboost package in Python.

import xgboost as xgb

Hyperparameters

There are numerous hyperparameters available in the xgboost package. However, some of them are more essential and/or useful when tuning your model than others. The list below goes into a little more detail on these specific hyperparameters.

General Hyperparameters:

  • booster: Used to define which type of booster you want to use: a tree-based model (gbtree or dart) or linear functions (gblinear).

Tree Booster Hyperparameters:

  • learning_rate (eta): Shrinks the feature weights after each boosting step, making the boosting process more conservative and helping to prevent overfitting. The range is [0, 1], with a default of 0.3.
  • min_split_loss (gamma): The minimum reduction in loss required for a node to split. A higher value leads to fewer splits and a more conservative model.
  • max_depth: Determines how deep each tree is allowed to grow during any boosting round. The higher the value, the more likely the model is to overfit.
  • subsample: The fraction of training samples used per boosting iteration. Subsampling too little data can result in underfitting, while using too much may result in overfitting.
  • colsample_bytree: The percentage of your total features used per tree. Using too many features can lead to overfitting.
  • n_estimators: The total number of trees you want to build.

Linear Booster Hyperparameters:

  • alpha: L1 regularization term on the weights. The model becomes more conservative (more strongly regularized) as the value increases.
  • lambda: L2 regularization term on the weights, which is smoother than L1 regularization. The model becomes more conservative (more strongly regularized) as the value increases.

Adjusting these hyperparameters will have an impact on the overall performance of your model, so be careful when setting them not to overfit your model.
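
A minimal sketch of setting a few of these hyperparameters through the scikit-learn wrapper (in that wrapper, alpha and lambda are spelled reg_alpha and reg_lambda, and eta is spelled learning_rate; the values below are illustrative, not recommendations):

import xgboost as xgb

tree_model = xgb.XGBClassifier(booster='gbtree',
                               learning_rate=0.1,   # eta
                               gamma=0,             # min_split_loss
                               max_depth=6,
                               subsample=0.8,
                               colsample_bytree=0.8,
                               n_estimators=100)

linear_model = xgb.XGBClassifier(booster='gblinear',
                                 reg_alpha=0.1,     # alpha (L1)
                                 reg_lambda=1.0)    # lambda (L2)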

Basic Example using XGBoost

Now let’s look at a basic example of how we can use this powerful technique.

Before beginning the boosting process, we first need to split our data into training and testing sets. The sklearn package has a splitting function we can use:

  • X: The input features
  • y: The output target variable. (This is your variable of interest).
  • test_size: The portion of your data the test set should contain. In the example below we place 20% of the overall data in the test set.
  • random_state: Used for repeatability
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

Now that we have our training and testing sets, let’s run the XGBoost algorithm on our training set. XGBClassifier() is one of the estimators provided by the xgboost package for creating a model. XGBClassifier is where you set the hyperparameters for the model; a few example settings are shown below.

Once you have specified the model, calling the .fit method of XGBClassifier will fit the XGBoost model to the training set, using the hyperparameters you have specified.

It may be beneficial to run the model with the defaults first to get a baseline performance score to tune your model against.
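
For example, a minimal baseline sketch (assuming the X_train, X_test, y_train, y_test variables from the split above):

baseline = xgb.XGBClassifier()                # all hyperparameters left at their defaults
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))         # mean accuracy on the held-out set

With a baseline in hand, you can then fit a model with explicit hyperparameter settings: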

model = xgb.XGBClassifier(learning_rate=0.01,
                          colsample_bytree=0.4,
                          subsample=0.8,
                          n_estimators=1000,
                          max_depth=4)
model.fit(X_train, y_train)

Now that we have our model, let us put it to the test and run some predictions on our testing set.

y_pred = model.predict(X_test)

Results and Evaluation

There is no one correct way of setting each hyperparameter. There may be best practices, but it is up to you as the data scientist to test which hyperparameter settings are optimal for your data. Thus, after running the XGBoost model on a testing set and gathering predictions, evaluating and understanding the performance of the model is a crucial next step.

A good starting point in checking the model’s performance is testing its accuracy.

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)

However, looking exclusively at the accuracy of the model will not tell the whole story of its performance. Using the sklearn package, you can build a confusion matrix and evaluate your model’s ability to identify the true positives, TP/(TP+FN), also known as the sensitivity rate, and the true negatives, TN/(TN+FP), also known as the specificity rate.

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

Example Confusion Matrix
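
For a binary problem, sensitivity and specificity can be read straight off that matrix; a minimal sketch, assuming the positive class is labeled 1:

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate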

Your model could, for example, have very high accuracy and sensitivity, but have a very low specificity. It is important to understand as much about your model as you can in order to adjust your hyperparameters and develop the strongest model.

In addition to comparing the performance of the XGBoost model under different hyperparameter settings, it can also be beneficial to see how different types of models perform. Below we compare the accuracy scores of several models using the SDSS (Sloan Digital Sky Survey) data example from Kaggle. The goal of this analysis is to classify observations of space as stars, galaxies, or quasars. We can see just how much higher XGBoost scores in accuracy than the other models.

Accuracy Comparison
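
A minimal sketch of that kind of comparison (the candidate models here are illustrative choices, not the exact set behind the chart above, and the train/test variables are assumed from the earlier split):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

candidates = {'logistic regression': LogisticRegression(max_iter=1000),
              'random forest': RandomForestClassifier(n_estimators=100),
              'xgboost': xgb.XGBClassifier()}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))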

Once you have a fitted model it is also interesting to evaluate which features are driving the predictive power of your model. The fitted model’s feature_importances_ attribute (exposed through the scikit-learn estimator interface) returns the importance score of each feature. Using plot_importance from xgboost will also allow you to plot and visualize those importances.

features = model.feature_importances_   # importance score per feature
from xgboost import plot_importance
import matplotlib.pyplot as plt
plot_importance(model)
plt.show()

Conclusion

We have only just begun to scratch the surface of XGBoost’s capabilities. There are other models within the xgboost package to utilize depending on the relationship and makeup of your data. I invite you to explore those further with your improved understanding of what XGBoost can do.

XGBoost is just one of many boosting algorithms. AdaBoost (Adaptive Boosting) was the boosting algorithm that started it all. The adaptive attribute of AdaBoost is its ability to tweak its weak learners in favor of instances misclassified in previous rounds. Other boosting techniques include LPBoost (Linear Programming Boosting), which maximizes the margin between training samples in the linear combination it produces; BrownBoost, which ‘gives up’ on examples that are repeatedly misclassified, treating them as noise that should not contribute to the final classifier; and LogitBoost, which applies the AdaBoost approach to logistic regression.

XGBoost, and boosting in general, is a powerful machine learning tool. Its speed and underlying predictive algorithm make it one of the most powerful boosting techniques. That being said, there is an art and a science to machine learning, and there is no single recipe for developing the strongest model. So use your powers wisely.

References:

https://xgboost.readthedocs.io/en/latest/parameter.html

https://www.datacamp.com/community/tutorials/xgboost-in-python

https://en.wikipedia.org/wiki/Boosting_(machine_learning)

https://www.kaggle.com/lucidlenn/data-analysis-and-classification-using-xgboost
