XGBoost: A Deep Dive into Boosting

Rohan Harode
SFU Professional Computer Science
12 min read · Feb 4, 2020

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit SFU PMP.

Authors: Shubham Malik, Rohan Harode and Akash Singh Kunwar

Every day we hear about new breakthroughs in Artificial Intelligence across a wide range of applications. But have you ever wondered what challenges had to be solved to achieve these breakthroughs?

Artificial Intelligence faces challenges when it has to deal with data like DNA sequences, credit card transactions, and even cybersecurity, which is the backbone of keeping our online presence safe. Curious? Do not worry, we've got you covered. Machine Learning (ML), as a field of Artificial Intelligence, has provided solutions to many problems once considered unsolvable, including the ones mentioned above, through the implementation of Gradient Boosting Machines (GBM). There are lots of algorithms readily available to perform gradient boosting; however, depending on the application, we may still encounter issues like poor accuracy, high loss, or large variance in the results. Hence, we are going to introduce you to XGBoost, a state-of-the-art machine learning algorithm built by Tianqi Chen, which not only overcomes these issues but also performs exceptionally well on regression and classification problems. This blog will help you discover the rationale, techniques, and skills behind XGBoost that you can then bring to your own machine learning projects.

XGBoost, at a glance!

eXtreme Gradient Boosting (XGBoost) is a scalable, improved version of the gradient boosting algorithm (terminology alert) designed for efficiency, computational speed, and model performance. It is an open-source library and part of the Distributed Machine Learning Community (DMLC). XGBoost is a blend of software and hardware optimizations designed to push existing boosting techniques to high accuracy in the shortest amount of time. Here's a quick look at an objective benchmark comparison of XGBoost with other gradient boosting implementations, each training a random forest with 500 trees, performed by Szilard Pafka.

Benchmark Performance of XGBoost (source)

Too complex to comprehend? We have got your back! Let's first understand the rudiments of the machine learning methods that lead us to XGBoost and why its performance matters.

A quick flashback to Boosting:

Boosting generally means increasing performance. In ML, boosting is a sequential ensemble learning technique (another terminology alert, fret not! we will explain this as well) that converts weak hypotheses, or weak learners, into strong learners to increase the accuracy of the model.
We can understand the need for boosting through a simple classification example: classifying a Twitter account as Bot or Human with the help of a few underlying rules (limited in extent):

● No account information and profile photo → Bot
● Username is gibberish → Bot
● Tweeting in more than one language → Bot
● Has an adequate amount of activity and a profile picture → Human
● An enormous amount of tweets per day → Bot
● Other social media accounts linked → Human

Now that we are aware of the rules, let's apply them to the example below:
Scenario — An account posting tweets in English and French.

The 3rd rule will classify it as a bot, but this will be a false prediction, as a person can know and tweet in multiple languages. Consequently, based on a single rule, our prediction can be flawed. Since these individual rules are not strong enough to make an accurate prediction on their own, they are called weak learners. Technically, a weak learner is a classifier that has only a weak correlation with the actual value. Therefore, to make our predictions more accurate, we devise a model that combines the weak learners' predictions into a strong learner, and this is done using the technique of boosting.

Weak rules are generated at each iteration by base learning algorithms which in our case can be of two types:
● Tree as base learner
● Linear base learner
Generally, decision trees are default base learners for boosting.

Interested in knowing more about learners? Rest assured, we will demonstrate how to use both base learners to train XGBoost models in a later section.

First, let’s get to know about the ensemble learning technique mentioned above.

Ensemble Learning:

Ensemble learning is a process in which decisions from multiple machine learning models are combined to reduce errors and improve predictions compared to a single ML model. A voting technique such as majority (maximum) voting is then applied to the aggregated decisions (or predictions, in machine learning jargon) to arrive at the final prediction. Puzzled?

Think of it as planning efficient routes to work, college, or the grocery store. As you can take multiple routes to reach your destination, you learn the traffic and the delays it might cause at different times of the day, allowing you to devise a better route by combining what you learned from every trip. Ensemble learning works the same way!

This image shows a clear distinction of a single ML model with respect to Ensemble Learner:

Single Model Prediction vs Ensemble Learner

Types of Ensemble learning:

Ensemble learning methods can be performed in two ways:
● Bagging (parallel ensemble)
● Boosting (sequential ensemble)
Though both have some fascinating math under the hood, we do not need to know it to be able to pick them up as tools. Our focus will be more on understanding Boosting due to its relevance to XGBoost.

Working of boosting algorithm:

The boosting algorithm creates new weak learners (models) and sequentially combines their predictions to improve the overall performance of the model. For any incorrect prediction, larger weights are assigned to the misclassified samples and lower ones to samples that are correctly classified. Weak learner models that perform better have higher weights in the final ensemble model. Boosting never changes a previous predictor; it only corrects the next predictor by learning from the mistakes made so far. Since boosting is greedy, it is recommended to set a stopping criterion, such as model performance (early stopping) or a number of stages (e.g. the depth of the tree in tree-based learners), to prevent overfitting of the training data. The first implementation of boosting was named AdaBoost (Adaptive Boosting).

F(i) = F(i-1) + f(i), where capital F(i) is the current model, F(i-1) is the previous model, and small f(i) represents a weak model
Internal working of the boosting algorithm
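To make this concrete, here is a minimal sketch of boosting in practice using scikit-learn's AdaBoostClassifier (the AdaBoost implementation mentioned above). The synthetic dataset and every parameter value are purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each stage fits a weak learner (a depth-1 decision stump by default) on
# reweighted samples, then adds it to the previously fitted stages.
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)
print("AdaBoost accuracy:", model.score(X_test, y_test))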

Feeling confident? Take a look at two more algorithms (CART and Gradient Boosting) to understand the mechanics of XGBoost before we delve deeper into the topic.

Classification And Regression Trees (CART):

A decision tree is a supervised machine learning algorithm used for predictive modeling of a dependent variable (target) based on the input of several independent variables. It has a tree-like structure with the root at the top. CART which stands for Classification And Regression Trees is used as an umbrella term to refer to the following types of decision trees:

Classification Trees: where the target variable is fixed or categorical, this algorithm is used to identify the class/category within which the target would most likely fall.

Regression Trees: where the target variable is continuous and the tree/algorithm is used to predict its value, e.g. predicting the weather.
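As a quick illustration (a sketch assuming scikit-learn is available, with illustrative datasets), both flavours of CART can be fit in a few lines:

from sklearn.datasets import load_iris, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the target is categorical (iris species).
iris = load_iris()
clf_tree = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

# Regression tree: the target is continuous (synthetic data for illustration).
X_reg, y_reg = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
reg_tree = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)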

Gradient Boosting:

Gradient boosting is a special case of boosting in which errors are minimized by a gradient descent algorithm, producing a model as an ensemble of weak prediction models, e.g. decision trees.

The major difference between boosting and gradient boosting is how the two algorithms update the model (weak learners) from wrong predictions. Gradient boosting adjusts weights using the gradient (a direction in the loss function) via an algorithm called gradient descent, which iteratively minimizes the model's loss by updating the weights. Loss normally means the difference between the predicted value and the actual value. For regression problems we typically use MSE (Mean Squared Error) as the loss, while for classification problems we use logarithmic loss.

w = w − η ∇L(w), where w represents the weight vector and η is the learning rate

Gradient Boosting Process:

Process flow of Gradient Boosting

Gradient boosting uses additive modeling, in which new decision trees are added to the model one at a time so as to minimize the loss using gradient descent. Existing trees in the model remain untouched, which slows down the rate of overfitting. The output of each new tree is combined with the output of the existing trees until the loss falls below a threshold or a specified limit on the number of trees is reached.

Additive Modeling in mathematics is a breakdown of a function into the addition of N subfunctions. In statistical terms, it can be thought of as a regression model in which response y is the arithmetic sum of individual effects of predictor variables x.
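To see additive modeling at work, here is a toy, from-scratch sketch of gradient boosting for squared-error loss (assuming NumPy and scikit-learn; the dataset and hyperparameters are illustrative): each new tree is fit to the residuals, i.e. the negative gradient of the loss, and its scaled output is added to the running prediction.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())   # F(0): start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                       # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # F(i) = F(i-1) + η·f(i)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))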

XGBOOST in action

What makes XGBoost a go-to algorithm for winning Machine Learning and Kaggle competitions?

XGBoost Overview

XGBOOST Features

Isn’t it interesting to see a single tool handle all our boosting problems! Here are the features in detail, and how they are incorporated in XGBoost to make it robust.

Algorithm Enhancements:

  1. Tree Pruning — Pruning is a machine learning technique that reduces the size of regression trees by replacing nodes that don’t contribute to improving classification at the leaves. The idea of pruning a regression tree is to prevent overfitting of the training data. The most efficient method is Cost Complexity (or Weakest Link) Pruning, which internally uses the mean squared error, k-fold cross-validation, and the learning rate. XGBoost creates nodes (also called splits) up to the max_depth specified and then prunes backwards until the loss improvement falls below a threshold. Consider a split with a loss of -3 followed by a node with a loss of +7: XGBoost will not remove the split just by looking at its negative loss. It computes the total loss (-3 + 7 = +4), and since it turns out to be positive, it keeps both. (See the parameter sketch after this list.)
  2. Sparsity Aware Split Finding — It is quite common that the data we gather has sparsity (a lot of missing or empty values) or becomes sparse after feature engineering (e.g. feature encoding). To be aware of the sparsity patterns in the data, a default direction is assigned to each tree node. XGBoost handles missing data by sending it along the default direction, learning the direction that minimizes the training loss. The optimization here is to visit only the non-missing entries, which makes the algorithm run up to 50x faster than the naïve method.
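As a rough illustration of how these enhancements surface to the user (a sketch with hypothetical parameter values), max_depth bounds how far splits are grown before backward pruning, gamma sets the minimum loss reduction a split must achieve to be kept, and missing values (np.nan) are routed along each node's learned default direction without manual imputation:

import numpy as np
import xgboost as xgb

# Hypothetical parameter values, purely for illustration.
params = {
    "max_depth": 6,                      # grow splits up to this depth, then prune backwards
    "gamma": 1.0,                        # minimum loss reduction required to keep a split
    "objective": "reg:squarederror",     # the newer name for reg:linear
}

# Sparsity-aware split finding: np.nan entries are sent along each node's
# learned default direction; no manual imputation is required.
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
model = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=10)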

System Enhancements:

  1. Parallelization — Tree learning needs the data in sorted order. To cut down on sorting costs, the data is stored in compressed column blocks (each column sorted by the corresponding feature value). XGBoost processes these blocks in parallel using all available CPU cores/threads. This optimization is valuable since a large number of nodes is created frequently in a tree. In summary, XGBoost parallelizes the split-finding work inside the otherwise sequential process of generating trees.
  2. Cache Aware — With cache-aware optimization, gradient statistics (direction and value) for each split node are stored in an internal buffer of each thread and accumulated in a mini-batch manner. This reduces the time overhead of immediate read/write operations and also prevents cache misses. Cache awareness is achieved by choosing an optimal block size (generally 2¹⁶).

Flexibility in XGBoost:

  1. Customized Objective Function — An objective function is something we intend to maximize or minimize. In ML, we try to minimize the objective function, which is a combination of the loss function and a regularization term:
L(Φ) = Σᵢ l(ŷᵢ, yᵢ) + Σₖ Ω(fₖ), where L(Φ) is the objective function, the first term is the training loss and the second term is the regularization

Optimizing the loss function encourages predictive models, whereas optimizing the regularization term leads to smaller variance and makes predictions more stable. Different objective functions available in XGBoost include:
● reg:linear for regression
● reg:logistic and binary:logistic for binary classification
● multi:softmax and multi:softprob for multiclass classification
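For example, a multiclass model can be trained with multi:softprob like this (a minimal sketch on the iris dataset; multi:softmax would return class labels instead of probabilities):

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, label=iris.target)

params = {"objective": "multi:softprob", "num_class": 3, "max_depth": 3}
model = xgb.train(params, dtrain, num_boost_round=20)
probabilities = model.predict(dtrain)    # shape: (n_samples, num_class)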

2. Customized Evaluation Metric — This is a metric used to monitor the model’s accuracy on validation data.
● rmse — root mean squared error (regression)
● mae — mean absolute error (regression)
● error — binary classification error (classification)
● logloss — negative log-likelihood (classification)
● auc — area under the curve (classification)
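Here is a minimal sketch of monitoring evaluation metrics on a validation set during training (the dataset and parameter values are illustrative):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# logloss and auc are reported on the validation set after every boosting round.
params = {"objective": "binary:logistic", "eval_metric": ["logloss", "auc"]}
model = xgb.train(params, dtrain, num_boost_round=50, evals=[(dvalid, "validation")])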

Cross-validation:

  1. Built-in Cross-validation — Cross-validation is a statistical method for evaluating machine learning models on unseen data. It comes in handy when the dataset is limited, since it avoids carving out a single fixed holdout sample from the training data for validation. By reducing the size of the training data, we would be giving up features and patterns hidden in the data, which can further induce errors in our model. XGBoost’s built-in cross-validation is similar to the cross_val_score functionality provided by the scikit-learn library.

XGBoost uses built-in cross validation function cv():

xgb.cv()

Want to get your feet wet?

DMatrix is an internal data structure used by XGBoost, optimized for memory efficiency and training speed. We need to transform our NumPy array of data into a DMatrix so that it can later be used as the dtrain and dtest arguments of XGBoost’s built-in functions.
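Here is a minimal sketch of the built-in cross-validation on an illustrative dataset: wrap the data in a DMatrix and let cv() handle the fold splitting and evaluation.

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)     # wrap the numpy arrays in a DMatrix

params = {"objective": "binary:logistic", "max_depth": 3}
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5,
                    metrics="error", seed=42)
print(cv_results.tail())             # per-round train/test error (mean and std over folds)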

2. k-fold Cross-validation — In k-fold cross-validation, the data is shuffled and divided into k equal-sized subsamples. One of the k subsamples is used as the test/validation set and the remaining (k-1) subsamples are put together as training data. We then fit a model on the training data and evaluate it on the test set. This process is repeated k times so that every data point is in the validation set exactly once. The k results are averaged to get the final estimate. The advantage of this method is that it reduces the bias and variance of the performance estimate and increases the robustness of the evaluation.

k-fold cross-validation using sklearn with XGBoost:

from sklearn.model_selection import KFold, cross_val_score

# xg_cl is an XGBoost classifier (e.g. xgb.XGBClassifier()),
# and X, y are the feature matrix and labels.
kfold = KFold(n_splits=15)
xgboost_score = cross_val_score(xg_cl, X, y, cv=kfold)

Tuning like a pro!

Model tuning in XGBoost can be implemented by cross-validation strategies like GridSearchCV and RandomizedSearchCV.

1. Grid Search — We pass a dictionary of parameters to the function; it evaluates the cross-validation score for every combination of parameters in the dictionary (the full Cartesian product of the grid) and returns the set with the best parameters.

2. Random Search — We draw a random value during each iteration from the range of specified values for each hyperparameter searched over, and evaluate a model with those hyperparameters. After completing all iterations, it picks the hyperparameter configuration with the best score.

output using grid search
output using random search

Total configurations in Grid Search → 2*4*1*3*3 = 72
Total configurations in the Random Search space → 100*1*10*50 = 50,000 (of which only n_iter combinations are actually sampled)
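Here is a minimal sketch of both strategies with scikit-learn (the parameter grid and dataset are hypothetical, chosen only for illustration):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
estimator = xgb.XGBClassifier(objective="binary:logistic")

# Hypothetical grid: 2 * 2 * 3 = 12 configurations.
param_grid = {"max_depth": [3, 6],
              "n_estimators": [50, 100],
              "learning_rate": [0.05, 0.1, 0.3]}

grid = GridSearchCV(estimator, param_grid, cv=3).fit(X, y)               # tries all 12
random = RandomizedSearchCV(estimator, param_grid, n_iter=5, cv=3,
                            random_state=42).fit(X, y)                   # samples only 5

print(grid.best_params_)
print(random.best_params_)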

Extendibility:

  1. Classification using XGBoost
xgboost classification accuracy
Classification Error
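Since the original classification code appears as screenshots, here is a minimal, hypothetical re-creation of the workflow with xgb.XGBClassifier (not the original dataset):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

xg_cl = xgb.XGBClassifier(objective="binary:logistic", n_estimators=100, max_depth=3)
xg_cl.fit(X_train, y_train)

preds = xg_cl.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print("accuracy:", accuracy)
print("classification error:", 1 - accuracy)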

2. Regression using XGBoost:

2.1. Decision Tree Base Learning

variance from xgboost regression with decision tree as base learner
Tree plots using decision tree (XGBRegressor)
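The original plots appear as images, so here is a minimal, hypothetical sketch of regression with trees as base learners (booster="gbtree", which is the default):

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

xg_reg = xgb.XGBRegressor(booster="gbtree", n_estimators=100, max_depth=3)
xg_reg.fit(X_train, y_train)
print("RMSE:", mean_squared_error(y_test, xg_reg.predict(X_test)) ** 0.5)

# The individual fitted trees can be drawn with xgb.plot_tree(xg_reg, num_trees=0)
# (requires graphviz and matplotlib).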

2.2. Linear Base Learning

rmse using xgboost regression with linear base learner
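And a corresponding hypothetical sketch with a linear base learner (booster="gblinear"):

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {"booster": "gblinear", "objective": "reg:squarederror"}
model = xgb.train(params, dtrain, num_boost_round=50)
print("RMSE:", mean_squared_error(y, model.predict(dtrain)) ** 0.5)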

Plot Importance Module:

The XGBoost library provides a built-in function to plot features ordered by their importance. The function is plot_importance(model) and it takes the trained model as its parameter. It produces an informative bar chart representing the significance of each feature, naming features according to their index in the dataset. The importance is calculated based on the importance_type parameter, which takes one of the following values:
● weight (default) — the number of times a feature appears in a tree (i.e. is used to split the data)
● gain — the average gain (training loss reduction) from the splits that use the feature
● cover — the average coverage of the splits that use the feature, i.e. the number of training points that pass through those splits

xgb.plot_importance(model)
Feature importance analysis on housing dataset
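A minimal, hypothetical sketch of the call (not the housing dataset from the figure):

import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

xgb.plot_importance(model, importance_type="weight")   # or "gain" / "cover"
plt.show()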

____________________________________________

Thanks for reading this article!

We hope this post helped you understand the fundamentals behind the XGBoost algorithm and gave you some in-depth insights. Let us know if you have any questions or comments. We look forward to hearing your suggestions and feedback.

Happy reading!

Connect with us on LinkedIn: Shubham Malik, Rohan Harode and Akash Singh Kunwar.

