Building Scalable Tree Boosting Methods: Tuning of Parameters

Tree boosting is one of the most widely used machine learning techniques. In this blog, we explore how cache-aware access patterns, data compression, and sharding are combined to build a scalable tree boosting algorithm.

Ahan M R
Intel Student Ambassadors
Aug 6, 2019


XGBoost or Extreme Gradient Boosting is an optimized implementation of the Gradient Boosting algorithm.

Since its introduction in 2014, it has been one of the most popular algorithms in machine learning hackathons and competitions because of its predictive performance and processing speed.

In decision-tree-based machine learning, boosting algorithms implement a sequential process in which each model attempts to correct the mistakes of the previous models.

The idea is to combine many weak base learners into a single stronger learner.

Gradient Boosting employs gradient descent to minimize the loss as models are added sequentially. Its major inefficiency is that it builds only one decision tree at a time.
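To make the sequential error-correction idea concrete, here is a minimal sketch of gradient boosting with squared-error loss (plain gradient boosting, not XGBoost); the function names are ours and the hyperparameter values are placeholders.

```python
# A minimal sketch of gradient boosting with squared-error loss
# (plain gradient boosting, not XGBoost). Each new tree is fit to the
# negative gradient of the loss, which for squared error is the residual.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = y.mean()                            # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - prediction              # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                  # each tree corrects the current ensemble
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```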

XGBoost, on the other hand, features parallelized tree building, cache-aware access, sparsity awareness, regularization, and a weighted quantile sketch among its algorithmic and systems optimizations.

Split-finding algorithms

To find the best split, it is common to use an exact greedy algorithm, which evaluates every possible split on every feature. This is a powerful and thorough method, but it is computationally expensive and infeasible when the data does not fit into memory. To remain scalable in that setting, XGBoost also provides an approximate algorithm.
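As an illustration of what the exact greedy search does, here is a rough sketch for a single feature, using the gain formula from the XGBoost paper; g and h stand for the per-instance first- and second-order gradients of the loss, and the function names are ours, not part of any library API.

```python
# Sketch of exact greedy split finding on a single feature (illustrative only).
# g, h are per-instance first- and second-order gradients of the loss;
# lam (lambda) and gamma are the regularization terms from the XGBoost objective.
import numpy as np

def leaf_score(G, H, lam):
    return G * G / (H + lam)

def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_value = 0.0, None
    for i in range(len(x) - 1):               # enumerate every possible split point
        GL += g[i]
        HL += h[i]
        if x[i] == x[i + 1]:                  # cannot split between equal values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
                      - leaf_score(G, H, lam)) - gamma
        if gain > best_gain:
            best_gain, best_value = gain, (x[i] + x[i + 1]) / 2.0
    return best_value, best_gain
```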

Weighted quantile sketch

Whereas the exact greedy algorithm considers all splitting points, the approximate algorithm only considers a set of candidate split points. Candidates are chosen using quantiles, so they are spread evenly across the data. In other words, even if there are 100 possible split points, roughly 10 candidates taken at the 10-quantiles (an approximation factor ϵ of about 0.1) already give a good approximation.

Each candidate point is weighted by hᵢ, the second-order gradient (Hessian) of the loss for instance i. If every instance carried equal weight, plain quantiles would suffice; because instances contribute unequally to the loss, weighted quantiles are used instead, so the first 10-quantile is the first point that exceeds 10% of the total weight, rather than 10% of the points.
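A simplified way to picture this is below; the real quantile sketch in XGBoost is a mergeable, approximate data structure, while this toy helper just places candidates so that each bucket accounts for roughly an ϵ share of the total Hessian weight.

```python
# Toy illustration of weighted quantile candidates (not XGBoost internals).
# Candidates are chosen so each bucket holds about eps of the total h-weight.
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.1):
    order = np.argsort(x)
    x, h = x[order], h[order]
    cum = np.cumsum(h) / h.sum()          # cumulative share of Hessian weight
    candidates, next_rank = [], eps
    for value, rank in zip(x, cum):
        if rank >= next_rank:             # first point exceeding the next eps boundary
            candidates.append(value)
            next_rank += eps
    return candidates
```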

Sparsity awareness

In real-world problems, we often encounter sparse datasets as a result of missing data or feature engineering. XGBoost handles this sparsity by learning, at each split, a default direction in which instances with missing values are sent. The default direction is chosen as the one that results in the greatest reduction in training loss. The impact is clear when comparing the sparsity-aware algorithm with a naive algorithm on a very sparse dataset: XGBoost runs about 50x faster.

Figure: Comparison of the basic and sparsity-aware algorithms
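Here is a toy version of the default-direction idea, assuming missing values are encoded as NaN: for a given split, instances with missing values are tried on the left and on the right, and the direction with the larger gain is kept. Names and simplifications are ours, not XGBoost internals.

```python
# Toy version of sparsity-aware split evaluation: instances with a missing
# feature value are routed to a learned "default" direction, chosen as the
# side that yields the larger gain. Illustrative only.
import numpy as np

def leaf_score(G, H, lam):
    return G * G / (H + lam)

def gain_with_default_direction(x, g, h, threshold, lam=1.0):
    missing = np.isnan(x)
    left = ~missing & (x < threshold)
    right = ~missing & (x >= threshold)
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[right].sum(), h[right].sum()
    Gm, Hm = g[missing].sum(), h[missing].sum()
    parent = leaf_score(GL + GR + Gm, HL + HR + Hm, lam)
    # Try sending the missing values left, then right, and keep the better gain.
    gain_left = 0.5 * (leaf_score(GL + Gm, HL + Hm, lam) + leaf_score(GR, HR, lam) - parent)
    gain_right = 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR + Gm, HR + Hm, lam) - parent)
    if gain_left >= gain_right:
        return gain_left, "left"
    return gain_right, "right"
```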

Penalized Gradient Boosting

In addition to constraints on their structure, the parameterized trees can also be regularized directly.

Classical decision trees like CART are not used as the weak learners; instead, a modified form called a regression tree is used, which has numeric values in its leaf nodes (also called terminal nodes). The values in the leaves are referred to as weights in some of the literature.

As such, the leaf weight values of the trees can be regularized using popular regularization functions (see the sketch after this list), such as:

  • L1 regularization of weights.
  • L2 regularization of weights.
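In the XGBoost library these map onto the reg_alpha (L1) and reg_lambda (L2) parameters. Below is a minimal sketch with the scikit-learn wrapper; the random data and parameter values are placeholders, not tuned recommendations.

```python
# Minimal sketch: leaf-weight regularization via XGBoost's reg_alpha (L1)
# and reg_lambda (L2) parameters. The data and values are placeholders.
import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(500, 10)
y = np.random.rand(500)

model = XGBRegressor(
    n_estimators=200,
    reg_alpha=0.1,     # L1 penalty on leaf weights
    reg_lambda=1.0,    # L2 penalty on leaf weights
)
model.fit(X, y)
```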

Improvements to Basic Gradient Boosting

Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.

It can benefit from regularization methods that penalize various parts of the algorithm and generally improve performance by reducing overfitting.

In the following sections we will look at three further enhancements to basic gradient boosting, in addition to the penalized learning described above:

  1. Tree Constraints
  2. Shrinkage
  3. Random sampling

Below are some constraints that can be imposed on the construction of decision trees (a parameter sketch follows this list):

  • Number of trees: adding more trees makes the model overfit only very slowly, so the usual advice is to keep adding trees until no further improvement is observed.
  • Tree depth: deeper trees are more complex, and shorter trees are preferred. Better results are generally seen with 4–8 levels.
  • Number of nodes or number of leaves: like depth, this constrains the size of the tree, but the tree is not forced into a symmetrical structure if other constraints are used.
  • Number of observations per split: imposes a minimum on the amount of training data at a node before a split can be considered.
  • Minimum improvement to loss: a constraint on the improvement of any split added to a tree.
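These constraints, together with the shrinkage and random sampling enhancements listed above, correspond to familiar XGBoost parameters. The sketch below is illustrative; the values are placeholders, not tuned recommendations.

```python
# Hedged sketch of the constraints above expressed as XGBoost parameters
# (values are illustrative, not tuned recommendations).
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=500,        # number of trees
    max_depth=6,             # tree depth (typically 4-8 levels)
    min_child_weight=1,      # minimum Hessian weight (observations, for squared loss) per node
    gamma=0.0,               # minimum loss reduction required to make a split
    learning_rate=0.1,       # shrinkage applied to each tree's contribution
    subsample=0.8,           # random sampling of rows per tree
    colsample_bytree=0.8,    # random sampling of columns per tree
)
```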
Mathematical algorithm explanation
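As a brief sketch of the underlying math, following the XGBoost paper, the regularized objective, the optimal leaf weights, and the split gain used in the code sketches above can be written as:

```latex
% Regularized objective minimized by XGBoost
% (T = number of leaves, w = vector of leaf weights).
\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

% With g_i, h_i the first- and second-order gradients of the loss and
% G_j = \sum_{i \in I_j} g_i, \; H_j = \sum_{i \in I_j} h_i for leaf j,
% the optimal leaf weight and the gain of a candidate split are:
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{Gain} = \tfrac{1}{2}\!\left[\frac{G_L^2}{H_L+\lambda}
  + \frac{G_R^2}{H_R+\lambda}
  - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
```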
