XGBoost: A BOOSTING Ensemble

Rishabh Kesarwani
Published in AlmaBetter · Jun 6, 2021

Does it really work as the name implies, Boosting?

As we know, XGBoost is an ensemble learning technique, specifically a BOOSTING one. Let's take a step back and have a look at "Ensembles".

A quick glance at Ensembles:

In layman's terms, ensembles are nothing but grouping, and trust me, this is the whole idea behind them. They combine the decisions from multiple models to improve the overall performance. It is a bit like asking different people for their opinion on something and then collectively forming an overall opinion from those.

Below is an animated illustration (GIF) of ensembles that explains the idea well and relates it to a real-life scenario.

Ensemble Learning, Source: machinelearningknowledge.ai

Ensemble learning is considered one of the ways to tackle the bias-variance tradeoff in Decision Trees.

There are various approaches to Ensemble learning, but two of them are widely used:

  1. Bagging
  2. Boosting

Let's quickly see how Bagging & Boosting work...

BAGGING is an ensemble technique used to reduce the variance of our predictions by combining the results of multiple classifiers modeled on different sub-samples of the same data set.

In a nutshell, BAGGING comes from two words, "Bootstrap" & "Aggregation". Bootstrap refers to sub-sampling the data (with replacement), and Aggregation refers to aggregating the results we get from the different models.

Bagging Ensemble Learning

Random forest is one of the most famous and widely used Bagging models.

Note: Bagging is a PARALLEL process.
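To make the Bootstrap + Aggregation idea concrete, here is a minimal Python sketch of bagging by hand. The dataset, the number of trees, and the majority-vote rule are my own illustrative choices, not something prescribed in this post:

```python
# A minimal sketch of bagging: bootstrap sub-samples + aggregation by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):                        # 25 bootstrap rounds (arbitrary choice)
    idx = rng.integers(0, len(X), len(X))  # bootstrap: sample rows WITH replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregation: majority vote across the independently trained trees
votes = np.stack([tree.predict(X) for tree in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the bagged ensemble:", (y_pred == y).mean())
```

Because every tree only sees its own bootstrap sample and never looks at the other trees, all of them can be trained in parallel, which is exactly the point made in the note above. scikit-learn's BaggingClassifier and RandomForestClassifier wrap this same idea.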

BOOSTING is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models depend on the previous ones and hence work sequentially. It fits a sequence of weak learners − models that are only slightly better than random guessing, such as small decision trees − to weighted versions of the data, where more weight is given to examples that were misclassified in earlier rounds/iterations.

Boosting Intuition, Source: machinelearningknowledge.ai

Mathematically, it can be expressed as below:

F(i) = F(i-1) + f(i), where F(i) is the current model, F(i-1) is the previous model, and f(i) represents the weak model added at step i.
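As a quick illustration of this sequential re-weighting idea, here is how an off-the-shelf booster of that kind (AdaBoost from scikit-learn) can be used; the dataset and the hyperparameter values are my own illustrative choices:

```python
# Boosting illustration: AdaBoost fits weak learners (depth-1 "stumps" by default)
# sequentially, up-weighting the examples misclassified in earlier rounds.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

booster = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0)
booster.fit(X, y)  # F(i) = F(i-1) + f(i): each new stump is added to the ensemble
print("Training accuracy after 50 boosting rounds:", booster.score(X, y))
```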

In a nutshell, Bagging vs Boosting looks like this:

Bagging vs Boosting

Many boosting algorithms can give an additional boost to a model's accuracy; a few of them are:

  1. AdaBoost
  2. Gradient Boosting
  3. XGBoost
  4. CatBoost
  5. LightGBM

Remember, the basic principle behind all Boosting algorithms is the same as discussed above; it is just some specialty that makes each of them different from the others. We will now focus on XGBoost and look at its functionality.

XGBoost

Is it truly a winning model?

XGBoost has become a widely used and really popular tool among Kaggle competitors and Data Scientists in the industry, as it has been battle-tested for production on large-scale problems.

XGboost in a nutshell

The amount of flexibility and the range of features XGBoost offers back up that claim. Its name stands for eXtreme Gradient Boosting. The implementation of XGBoost offers several advanced features for model tuning, computing environments, and algorithm enhancement. It is capable of performing the three main forms of gradient boosting (Gradient Boosting (GB), Stochastic GB, and Regularized GB), and it is robust enough to support fine-tuning and the addition of regularization parameters.

Salient Features of XGboost:

  • Regularization: XGBoost has an option to penalize complex models through both L1 and L2 regularization. Regularization helps in preventing overfitting.
  • Handling sparse data: Missing values or data processing steps like one-hot encoding make data sparse. XGBoost incorporates a sparsity-aware split finding algorithm to handle different types of sparsity patterns in the data (a tiny sketch is shown right after this list).
  • Weighted quantile sketch: Most existing tree-based algorithms can find the split points when the data points have equal weights (using a quantile sketch algorithm). However, they are not equipped to handle weighted data. XGBoost has a distributed weighted quantile sketch algorithm to effectively handle weighted data.
  • Block structure for parallel learning: For faster computing, XGBoost can make use of multiple cores on the CPU. This is possible because of a block structure in its system design. Data is sorted and stored in in-memory units called blocks. Unlike in other algorithms, this enables the data layout to be reused by subsequent iterations instead of being computed again. This feature also proves useful for steps like split finding and column sub-sampling.
  • Cache awareness: In XGBoost, non-contiguous memory access is required to get the gradient statistics by row index. Hence, XGBoost has been designed to make optimal use of hardware. This is done by allocating internal buffers in each thread, where the gradient statistics can be stored.
  • Out-of-core computing: This feature optimizes the available disk space and maximizes its usage when handling huge datasets that do not fit into memory.
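As a small sketch of the sparsity-awareness point above (synthetic data; the xgboost package is assumed to be installed), missing values can be passed to XGBoost directly as NaNs and it will learn a default direction for them at each split:

```python
# Tiny sketch: XGBoost handles missing values (NaNs) natively via its
# sparsity-aware split finding; no separate imputation step is required.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((200, 5))
X[rng.random(X.shape) < 0.2] = np.nan           # inject ~20% missing values
y = (np.nan_to_num(X[:, 0]) > 0.5).astype(int)  # synthetic binary target

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)  # NaN marks "missing"
params = {"objective": "binary:logistic", "max_depth": 3}
model = xgb.train(params, dtrain, num_boost_round=20)
```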

Why use XGBoost?

The two major reasons to use XGBoost:

  1. Execution Speed.
  2. Model Performance.

But, how does it work?

The XGBoost library implements the gradient boosting decision tree algorithm.

Let's quickly recap Gradient Boosting. Gradient boosting is an ensemble method that sequentially adds predictors, each one correcting its predecessor. However, instead of assigning different weights to the classifiers after every iteration, this method fits the new model to the residuals of the previous prediction and then minimizes the loss when adding the latest prediction. So, in the end, you are updating your model using gradient descent, hence the name gradient boosting. This is supported for both regression and classification problems.

Gradient Boosting Process
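To make the "fit the new model to the residuals" step concrete, here is a bare-bones sketch of gradient boosting for regression with squared error. All settings here (tree depth, learning rate, number of rounds) are illustrative assumptions:

```python
# Gradient boosting from scratch (squared-error regression):
# each new tree is fit to the residuals, i.e. the negative gradient of the loss.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # F_0: start from a constant (the mean)
trees = []

for _ in range(100):
    residuals = y - prediction                     # negative gradient of 0.5*(y - F)^2
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # F(i) = F(i-1) + eta * f(i)
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```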

Objective function and Optimization

The objective function (loss function and regularization) at iteration t that we need to optimize is the following:

Loss function with Regularization Term
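In case the image above does not render, the objective at iteration t can be written out as follows (a reconstruction of the standard form from the XGBoost paper and documentation, which the handwritten notes below work through):

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

Here l is a differentiable loss function, ŷᵢ^(t−1) is the prediction of the first t−1 trees, f_t is the new tree being added at step t, and Ω is the regularization term defined a bit further down.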

Attaching my handwritten notes to help understand this better:

Handwritten Notes

The regularization term in XGBoost is basically given as:

Regularization Term
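Written out (again a reconstruction of the standard form, which should correspond to the image above):

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$

where T is the number of leaves in the tree, w_j is the score (weight) of leaf j, γ penalizes every additional leaf, and λ is the L2 penalty on the leaf weights (an L1 term, controlled by alpha, can also be added).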

The mean squared error loss function has a very friendly form, with a linear term (often called the residual term) and a quadratic term. It is not easy to get such a nice form for other notable loss functions (such as the logistic loss). So, in general, we take the Taylor expansion of the loss function up to the second order.

Handwritten Notes
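For reference, the second-order Taylor expansion of the objective (the standard form that the notes above derive) is:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[\, l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$

where g_i and h_i are the first- and second-order derivatives of the loss with respect to the previous prediction ŷᵢ^(t−1).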

Why do we need to use the Taylor expansion?

Because we need to transform the original objective function to a function in the Euclidean domain, in order to be able to use traditional optimization techniques.

This becomes our optimization goal for the new tree. An important advantage of this definition is that the value of the objective function depends only on g(i) and h(i), the first- and second-order gradient statistics of the loss. This is how XGBoost supports custom loss functions, as sketched below.
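Here is a sketch of that custom-objective hook in the Python package: you return the per-example gradients g_i and hessians h_i and XGBoost handles the rest. The squared-error objective below merely reproduces built-in behaviour, purely for illustration:

```python
# Custom objective sketch: supply g_i (gradient) and h_i (hessian) per example.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

def squared_error(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels        # g_i: first derivative of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)   # h_i: second derivative (constant 1 for this loss)
    return grad, hess

model = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                  num_boost_round=50, obj=squared_error)
```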

Hyperparameters in XGboost:

The authors of XGBoost have divided the parameters into four categories: general parameters, booster parameters, learning task parameters & command line parameters. Here, I have highlighted the majority of the parameters to be considered while performing tuning.

General Parameters: Parameters that define overall functionality

  1. booster (default = gbtree): can select the type of model (gbtree or gblinear) to run at each iteration.
  2. silent (default = 0): if set to one, silent mode is set and the modeler will not receive any feedback after each iteration.
  3. nthread (default = max # of threads): used to set the number of cores to use for processing.

Booster Parameters: Tree & Linear

  1. eta (default = 0.3): Learning rate used to shrink weights on each step. Typical final values fall between 0.01 and 0.2.
  2. min_child_weight (default = 1): Used to control overfitting; defines the minimum sum of weights of all observations required in a child. A larger value restricts the model's ability to learn finer details of the training set.
  3. max_depth (default = 6): Typically 3–10, defines the maximum depth of a tree.
  4. max_leaf_nodes: The number of terminal nodes or leaves in a tree. Has a mathematical relationship with the depth of the tree.
  5. gamma (default = 0): Specifies the minimum loss reduction required to make a split.
  6. subsample (default = 1): Defines the fraction of observations to be randomly sampled for each tree. Typical values range between 0.5 and 1.
  7. colsample_bytree (default = 1): Fraction of columns to be used when random sampling for tree build-out.
  8. lambda (default = 1): L2 regularization term.
  9. alpha (default = 0): L1 regularization term.
  10. scale_pos_weight (default = 1): A value greater than 0 should be used in case of high class imbalance.

Learning Task Parameters: define the optimization objective

  1. objective (default = reg:linear): Defines the loss function to be minimized. Options include binary:logistic, multi:softmax, multi:softprob.
  2. eval_metric: Metric used for validation data. Options include rmse (default for regression), mae, logloss, error (default for classification), merror, mlogloss, auc.
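Before moving on to the command line parameters, here is a minimal Python sketch that ties several of the parameters above together. The dataset, the train/test split, and the specific values chosen are illustrative assumptions, not recommendations:

```python
# Minimal XGBoost training run using several of the parameters discussed above.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)

params = {
    "booster": "gbtree",             # general parameter
    "eta": 0.1,                      # learning rate
    "max_depth": 4,
    "min_child_weight": 1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "lambda": 1,                     # L2 regularization
    "alpha": 0,                      # L1 regularization
    "objective": "binary:logistic",  # learning task parameter
    "eval_metric": "logloss",
}

model = xgb.train(params, dtrain, num_boost_round=200,
                  evals=[(dtest, "test")], early_stopping_rounds=20,
                  verbose_eval=False)
```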

Command Line Parameters: Used in the console version of XGBoost

  1. num_round: The number of rounds for boosting
  2. data: The path of training data
  3. test:data: The path of test data used for prediction
  4. save_period [default=0]: The period to save the model. Setting save_period=10 means that for every 10 rounds XGBoost will save the model. Setting it to 0 means not saving any model during the training.
  5. task [default = train]: Options are train, pred, eval, dump.
  6. model_in [default=NULL]: Path to input model, needed for test, eval, dump tasks. If it is specified in training, XGBoost will continue training from the input model.
  7. model_out [default=NULL]: Path to the output model after training finishes. If not specified, XGBoost will output files with names such as 0003.model, where 0003 is the number of boosting rounds.

After covering all these things, you might be realizing that XGBoost truly is a winning model, right?

This is it for this blog. I will do a practical implementation in Python and share the amazing results of XGBoost in an upcoming blog.

If you want to know something more specific to XGBoost, you can refer to this repository: https://github.com/Rishabh1928/xgboost

Happy Learning!!
