Understanding XGBoost & its growing popularity among the ML community

Deep Borkar · Published in Analytics Vidhya · Dec 9, 2019 · 4 min read

Introduction

Gradient Boosting is one of the most popular machine learning techniques and is widely written about. Of all the implementations of this algorithm, one stands out and has become the talk of the town: XGBoost, a.k.a. Extreme Gradient Boosting. It was created by Tianqi Chen from the University of Washington as a scalable tree boosting system.

In this article, I intend to talk about how XGBoost adds features and improvements on top of the Gradient Boosting technique to make it more scalable, efficient, and fast.

Before we try to understand what XGBoost does, let’s try to understand its components and how they come together to become XGBoost.

Boosting

Boosting follows the principle of the famous quote by Walter Payton, “We are stronger together than we are alone.” Boosting is an ensemble learning technique that builds an aggregate of weak learners which predict much better together than any of them does alone. In simple terms, these weak learners learn from each other’s mistakes and present a combined result.

In the figure, W1 is a weak learner that produces a prediction FW1(x) based on a given rule. The next weak learner, W2, is fit to the errors of the first model and tries to correct its predictions. This process is iterated until the error is sufficiently small.

There are various ways a weak learner can correct its predecessor’s predictions. Consider assigning equal weights to all training samples and then adjusting those weights as we build new learners, increasing (boosting) the weights of the samples that were predicted incorrectly. This technique is called AdaBoost, and it minimizes the error by manipulating the sample weights.
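To make the weight-boosting idea concrete, here is a minimal sketch using scikit-learn’s AdaBoostClassifier (the dataset and parameter values below are illustrative assumptions, not part of the original discussion):

```python
# Minimal AdaBoost sketch: an ensemble of weak learners (decision stumps by default)
# where each new learner pays more attention to the samples the previous ones got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```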

So far, the steps of a boosting algorithm are:

1. Build a weak learner and predict

2. Minimize loss function (errors)

3. Fit a new model on the errors and predict. Repeat step 2

Let us now try to minimize errors using the Gradient Descent method, one of the most effective approaches and thus the most popular in the community. Boosting learners by minimizing the loss function while moving along the negative gradient gives us the Gradient Boosting algorithm, sketched below.
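As a rough sketch of that loop (my own illustration, not XGBoost’s implementation), the snippet below fits each new tree to the residuals of the current ensemble, which for squared-error loss are exactly the negative gradient; the depth, number of rounds, and learning rate are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

n_rounds, learning_rate = 100, 0.1
prediction = np.full(len(y), y.mean())         # step 1: start from a trivial prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                 # step 2: errors = negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3)  # step 3: fit a new weak learner to those errors
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```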

Calculus Alert! Let us just look at how Gradient Descent works.

For a given cost function, we need to find its minimum point.

θᵢ = θᵢ − ρ ∂J/∂θᵢ

We initialize at a random point and move against the gradient, which points in the direction of the fastest increase. The gradient is calculated by taking the first-order derivative of the function with respect to the parameters being updated. The negative sign in front of the gradient is what moves us towards the minimum of the given function. This process is repeated step by step until a minimum is reached.

ρ is called the learning rate and is also a hyper-parameter in the Gradient Boosting algorithm. It denotes the size of the steps. If the learning rate is too small, it will take a long time to find the minimum; if it is too large, the updates may overshoot and bounce around the curve without ever settling at the best minimum.
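As a toy illustration of the update rule above (an example of my own, unrelated to any particular library), here is gradient descent on J(θ) = θ², whose minimum sits at θ = 0:

```python
# Gradient descent on J(theta) = theta**2, whose gradient is dJ/dtheta = 2*theta.
theta = 5.0   # arbitrary starting point
rho = 0.1     # learning rate: too small -> slow convergence, too large -> overshooting
for _ in range(100):
    gradient = 2 * theta
    theta = theta - rho * gradient   # step against the gradient
print("theta after descent:", theta)  # very close to the minimum at 0
```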

These are the components that make up XGBoost. But what does XGBoost do differently from Gradient Boosting?

Parallel Learning

As we saw, Gradient Boosting builds trees in sequential order, which makes training slow. XGBoost improves speed by parallelizing the work done within each tree. It does this by storing the data in compressed sparse column (CSC) blocks, with each column already sorted, so the model can reuse these blocks throughout the algorithm instead of re-sorting the data at every split. These blocks can also be stored on distributed machines, which makes XGBoost more scalable and time-efficient.
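In practice, this parallelism is exposed mainly through the library’s thread settings. A hedged sketch using the scikit-learn wrapper (the dataset and the particular values of n_jobs and tree_method are my own assumptions):

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=50, random_state=1)

# n_jobs controls how many threads work on the pre-sorted column blocks during split finding;
# the histogram-based tree_method is one of the faster, more scalable options.
model = xgb.XGBClassifier(n_estimators=200, n_jobs=4, tree_method="hist")
model.fit(X, y)
```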

Regularization Term

XGBoost also adds a regularization term to the loss function.

Objective Function = Loss Function + Regularization Term

The regularization term penalizes the complexity of the model (in XGBoost, the number of leaves and the magnitude of the leaf weights) to avoid overfitting. The algorithm optimizes this objective function to improve on the performance of Gradient Boosting.
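In the XGBoost API, this regularization is controlled (roughly) by parameters such as reg_lambda (L2 penalty on leaf weights), reg_alpha (L1 penalty on leaf weights), and gamma (minimum loss reduction required to split); the values below are illustrative, not recommendations:

```python
import xgboost as xgb

# Heavier regularization -> simpler trees -> less overfitting (at the cost of some bias).
model = xgb.XGBRegressor(
    n_estimators=300,
    max_depth=4,
    reg_lambda=1.0,  # L2 penalty on leaf weights
    reg_alpha=0.1,   # L1 penalty on leaf weights
    gamma=0.5,       # minimum loss reduction needed to make a split
)
```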

Sparsity Aware Algorithm

XGBoost introduces sparsity-aware split finding to handle data that is missing or contains a lot of zeros. For each split it learns a default direction, chosen using only the non-missing data; samples with missing values are simply sent in that direction. This sparsity-aware algorithm works much faster than a basic algorithm that ignores sparsity.
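Because of this, missing values can usually be passed to XGBoost directly as NaNs instead of being imputed first; a small hedged sketch with a toy matrix of my own:

```python
import numpy as np
import xgboost as xgb

# Tiny toy matrix with missing entries; at each split, NaNs follow the learned default direction.
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 0.5], [4.0, 1.0]])
y = np.array([0, 1, 0, 1])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)
```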

XGBoost is built to be scalable and fast, which is why it is so popular in the data science industry and widely used to find business solutions.

Cache Aware Access

XGBoost allocates an internal buffer in memory, fetches the gradient statistics needed for computation into it, and accumulates them in a mini-batch manner. This cache-aware prefetching reduces cache misses and speeds up the process.

Out-of-Core Computation

In order to make XGBoost a scalable solution, it needs to use all the available hardware resources efficiently. To achieve this, XGBoost stores the data in blocks on disk and uses a separate thread to pre-fetch those blocks into memory while the main computation runs. Two techniques make this more efficient: Block Compression and Block Sharding.

Block Compression compresses the blocks by column and decompresses them on the fly as they are read, trading some CPU time for less disk I/O. Block Sharding splits the data across multiple disks, with a pre-fetcher thread assigned to each disk.
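One hedged example: older XGBoost releases exposed out-of-core training through a cache-file suffix appended to a libsvm-format data path (newer releases use an iterator-based interface instead). The file path below is hypothetical:

```python
import xgboost as xgb

# Assumes a libsvm-format training file on disk. The '#dtrain.cache' suffix asks XGBoost
# to stream the data through an on-disk cache instead of loading it all into memory
# (the older external-memory syntax).
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=50)
```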

Conclusion

XGBoost improves on the Gradient Boosting algorithm to make it more flexible, efficient, and fast without compromising on performance. This is the key reason why the industry considers it a strong contender for one of the best machine learning algorithms.
