What Makes XGBoost Fast and Powerful?

Özge Ersöyleyen
Published in HYPATAI
Jun 29, 2020

XGBoost is a highly efficient library designed to solve regression and classification problems under the Gradient Boosting framework. It was developed by T. Chen & C. Guestrin in 2016 [1], and its name stands for eXtreme Gradient Boosting.

There are many optimized characteristics in XGBoost that make it fast and powerful. The characteristics covered in this story can be grouped under three headings: Regularization; Split Finding Algorithms & Parallelization; and Cache-aware Access & Blocks for Out-of-core Computation.

Now, let’s go through those characteristics to understand the power behind XGBoost.

Regularization

XGBoost performs tree boosting by establishing a Regularized Learning Objective and using a Shrinkage Factor (the learning rate, eta). Shrinkage scales down the effect of each individual tree and leaves room for future trees to improve the model. XGBoost uses these two techniques, as well as Column Subsampling (which also increases speed), to overcome overfitting.
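Below is a minimal sketch, assuming the xgboost Python package and purely synthetic data, of where these knobs surface as hyperparameters: the shrinkage factor (learning_rate / eta), the L1/L2 penalties of the regularized objective (reg_alpha, reg_lambda), the minimum split loss (gamma), and column subsampling (colsample_bytree).

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,     # shrinkage factor eta: damps each tree's contribution
    reg_lambda=1.0,        # L2 penalty on leaf weights (regularized objective)
    reg_alpha=0.0,         # L1 penalty on leaf weights
    gamma=0.0,             # minimum loss reduction required to make a split
    colsample_bytree=0.8,  # column subsampling: each tree sees 80% of the features
)
model.fit(X, y)
```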

Split Finding Algorithms & Parallelization

Split finding algorithms used in XGBoost work efficiently and accurately on both small and large data. Like traditional gradient boosting implementations, XGBoost can use the Exact Greedy Algorithm: for each feature, every value in the sorted list of that feature's values is visited as a candidate splitting threshold. The classical gradient boosting decision tree implemented in scikit-learn, for instance, works on this principle. This technique works well for data with a small to moderate number of instances and/or features, but it becomes extremely costly for large data sets. XGBoost overcomes this issue with the Approximate Algorithm (histogram-based split finding), in which feature values are bucketed into percentiles and only the percentile boundaries become candidate splitting thresholds.

However, sorting the whole data set to obtain those percentiles can still be very time consuming. This is where a system design called Column Block for Parallel Learning comes in. A block is basically a small subset of the data, which can be distributed across different machines. Sorting is performed on each block in parallel, the results are merged, and quantile calculation stops being a bottleneck.
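As a rough illustration, assuming the xgboost Python package and synthetic data, the choice between the exact greedy algorithm and the approximate, histogram-based algorithm is exposed through the tree_method parameter:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

common = {"objective": "binary:logistic", "eta": 0.1}

# Exact greedy: enumerates every sorted feature value as a split candidate.
booster_exact = xgb.train({**common, "tree_method": "exact"}, dtrain, num_boost_round=50)

# Approximate / histogram-based: buckets feature values and only evaluates
# the bucket boundaries as split candidates, which scales to large data.
booster_hist = xgb.train({**common, "tree_method": "hist", "max_bin": 256}, dtrain, num_boost_round=50)
```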

The Weighted Quantile Sketch is the technique for obtaining the quantile values after the parallel sorting operations, by merging the results into a histogram. In an ordinary Quantile Sketch, each quantile of the histogram holds an equal number of data points. In the Weighted Quantile Sketch, each (approximate) quantile instead holds equal total weight. For regression problems, the weight (or cover) in XGBoost is simply the number of points in the node (or leaf), so maintaining equal weight among quantiles is the same as maintaining an equal number of observations; in other words, the quantiles in regression are just ordinary quantiles. For classification problems, however, the weight is calculated differently: badly predicted instances receive higher weights, so to keep the total weight equal across quantiles, instances with large residuals end up in narrower, more specialized quantiles containing fewer data points. This effectively increases the number of quantiles around hard examples and thereby improves accuracy.
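The following toy sketch (plain NumPy, not the mergeable sketch data structure from the paper) illustrates the idea: split candidates are chosen so that each bucket carries roughly equal total Hessian weight rather than an equal number of rows.

```python
import numpy as np

def weighted_quantile_candidates(feature_values, hessians, num_buckets):
    """Return candidate split points with ~equal total weight per bucket."""
    order = np.argsort(feature_values)
    values, weights = feature_values[order], hessians[order]
    cum_weight = np.cumsum(weights)
    total = cum_weight[-1]
    # Target cumulative weights at the bucket boundaries.
    targets = total * np.arange(1, num_buckets) / num_buckets
    idx = np.searchsorted(cum_weight, targets)
    return values[idx]

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)

# For squared-error regression the Hessian is 1 for every row, so the
# weighted quantiles reduce to ordinary quantiles.
hess_regression = np.ones_like(x)

# For logistic loss the Hessian is p * (1 - p), so rows with less confident
# predictions carry more weight and pull the bucket boundaries toward them.
p = rng.uniform(0.01, 0.99, size=x.shape)
hess_logistic = p * (1 - p)

print(weighted_quantile_candidates(x, hess_regression, 8))
print(weighted_quantile_candidates(x, hess_logistic, 8))
```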

Next comes Sparsity-aware Split Finding, which describes how XGBoost handles sparsity in the data, such as the presence of 1) missing values, 2) dense zero entries, and 3) one-hot encoded features. To make the algorithm aware of this sparsity, XGBoost learns a default direction for it: the optimal default direction is found by trying both directions at a split and choosing the one that yields the maximum gain. Only non-missing observations are visited while searching for the split, which keeps the computational complexity low.
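A minimal sketch, assuming the xgboost Python package and synthetic data with artificially introduced NaNs, showing that missing values can be fed in directly and are routed along the learned default direction rather than being imputed beforehand:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
# Knock out ~20% of the entries to create a sparse / missing pattern.
mask = rng.uniform(size=X.shape) < 0.2
X[mask] = np.nan
y = np.nan_to_num(X[:, 0]) + rng.normal(scale=0.1, size=1_000)

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)  # np.nan is treated as missing
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=20)
```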

Cache-aware Access & Blocks for Out-of-core Computation

Other XGBoost characteristics also make it fast, and these are more closely related to computer hardware. One of them is Cache-aware Access: to calculate the gain at each split, XGBoost keeps the calculated gradients and Hessians (cover) in the CPU cache so that the necessary accumulations stay fast. When the data does not fit into the cache and main memory, using disk space becomes important, and Blocks for Out-of-core Computation become essential. Reading and writing data on a hard drive is a costly operation. To reduce this cost, Block Compression compresses each block by columns and decompresses it while loading into main memory; although decompression takes time, it is still faster than reading the uncompressed data from disk. With Block Sharding, if multiple disks are available, the data is split across them, increasing disk reading throughput.
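As a rough illustration of the block-compression trade-off (this is not XGBoost's internal code, just the idea), a sorted column block compresses very well, so paying the decompression cost on load tends to be cheaper than reading the larger uncompressed bytes from disk:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# One column of a block: values stored in sorted order, as in a column block.
column_block = np.sort(rng.integers(0, 256, size=1_000_000)).astype(np.uint8)

compressed = zlib.compress(column_block.tobytes(), 1)  # written to disk in compressed form
print(f"raw: {column_block.nbytes} bytes, compressed: {len(compressed)} bytes")

# Decompressed only when the block is loaded back into main memory.
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.uint8)
assert np.array_equal(restored, column_block)
```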

Note that this story is not meant to compare these characteristics with other Gradient Boosting frameworks such as LightGBM. Many characteristics are shared (such as histogram-based split finding), while others make one framework preferable to the other in certain settings (for instance, leaf-wise splitting and Gradient-based One-Side Sampling in LightGBM). The main objective of this story is to give an idea of how XGBoost works so efficiently.

I hope you have enjoyed it!

[1] Tianqi Chen, Carlos Guestrin (2016). XGBoost: A Scalable Tree Boosting System.
