Deciding on How to Boost Your Decision Trees

Stephanie Bourdeau
5 min read · Dec 5, 2019


A Comparison of the Gradient Boosting Decision Tree Algorithms XGBoost, CatBoost, and LightGBM

Gradient boosting is an ensemble technique in machine learning that builds an accurate classifier by sequentially adding weak learners, each one fit to the residual errors left by the models before it. A weak learner on its own predicts only slightly better than random chance; boosting improves on this by measuring the ensemble's loss with a loss function after each round and training the next model to correct that error. Gradient boosting decision trees therefore minimize the loss iteratively: the gradients and loss from each tree are used to train the next tree against, producing a steadily stronger model. No wonder so many Kaggle competitors have used one of these algorithms with great success!
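
As a quick illustration of the idea, here is a minimal sketch using scikit-learn's GradientBoostingClassifier on a toy dataset. The dataset and hyperparameter values are arbitrary choices for demonstration, not recommendations.

```python
# Minimal sketch: gradient boosting with scikit-learn on a toy dataset.
# The dataset and hyperparameter values are illustrative, not tuned.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 shallow trees is fit to the gradient of the loss left by
# the ensemble built so far, then added with a small learning rate.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```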

Originally, AdaBoost was the most effective algorithm for adaptive boosting: after each round it re-weights the observations, changing the distribution of sampled data used to train the next learner. Training with AdaBoost can be time consuming, however, since every iteration recomputes those weights, and the method is sensitive to outliers and noisy data. To alleviate this, several decision-tree-based gradient boosting libraries have been created to overcome the shortcomings of weak learners and earlier ensemble methods. Here, we will focus on XGBoost, LightGBM, & CatBoost and the advantages of each library.
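
For reference, a minimal AdaBoost sketch with scikit-learn looks like this; the estimator count and learning rate are illustrative placeholders.

```python
# Minimal sketch: AdaBoost re-weights observations after each round so the
# next weak learner (by default a depth-1 decision stump) focuses on the
# examples the previous learners misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = AdaBoostClassifier(n_estimators=200, learning_rate=0.5)
model.fit(X, y)
print("Training accuracy:", model.score(X, y))
```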

XGBoost

Extreme Gradient Boosting, or XGBoost, uses a pre-sorted splitting algorithm to separate the features of the data into leaves. It enumerates over all the features at each node, sorts the instances of each feature by their respective values, determines the best split based on information gain, takes that split, and repeats the process for n iterations. The library has various features that support boosting weak learners, such as a sparsity-aware implementation that automatically handles missing values and a block structure that supports parallel construction of the trees. The key feature of XGBoost is that it is a scalable tree boosting system.
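
A minimal example using XGBoost's scikit-learn wrapper might look like the following; the parameter values are placeholders, not tuned settings.

```python
# Minimal sketch: training an XGBoost classifier on purely numeric data.
# Parameter values are illustrative placeholders, not tuned settings.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,        # row sampling per tree
    colsample_bytree=0.8  # feature sampling per tree
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```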

The main limitation of XGBoost is that it has no built-in handling of categorical features; it only accepts numerical data. One must encode categorical variables, for example with one-hot or mean (target) encoding, before passing them to this library. Despite being a reliable gradient boosting method, XGBoost underperforms its newer counterparts in terms of tuning and training time.
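
For example, categorical columns could be one-hot encoded with pandas before being handed to XGBoost; the DataFrame and column names below are hypothetical.

```python
# Sketch: one-hot encoding a categorical column before training XGBoost.
# The DataFrame and column names are hypothetical.
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red"],  # categorical feature
    "size": [1.2, 3.4, 2.2, 0.7, 1.9],                 # numeric feature
    "label": [0, 1, 1, 0, 1],
})

# pd.get_dummies expands "color" into one binary column per category.
X = pd.get_dummies(df.drop(columns="label"), columns=["color"])
y = df["label"]

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
```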

LightGBM

The light gradient boosting machine (LightGBM) was created by Microsoft early in 2017 to cut down the time previous models spent building decision trees. By growing each tree leaf-wise, as opposed to expanding every leaf of the previous level at once, LightGBM tends to achieve higher accuracy than XGBoost and does so with faster training. Leaf-wise growth is effective at decreasing loss, but it can lead to overfitting if a max_depth parameter is not set. The training speed of LightGBM can be attributed to its lower memory usage, achieved by replacing continuous feature values with discrete bins.
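
A minimal LightGBM example, with num_leaves and max_depth set to rein in the leaf-wise growth, could look like this; the values are illustrative.

```python
# Sketch: LightGBM grows trees leaf-wise; capping num_leaves and max_depth
# helps guard against the overfitting that unrestricted leaf-wise growth can cause.
# Parameter values are illustrative, not tuned.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = lgb.LGBMClassifier(
    n_estimators=300,
    num_leaves=31,   # leaves per tree (leaf-wise growth)
    max_depth=7,     # depth cap to limit overfitting
    learning_rate=0.1,
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```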

LightGBM implements Gradient-based One-Side Sampling (GOSS) to subsample the data, under the assumption that not all data points contribute equally to training the model. Instances with small gradients are largely ignored when computing the best split; GOSS then randomly samples from these neglected data points and increases the weight of the sampled instances, which compensates for the sampling and keeps the split estimates from becoming biased toward the data with larger gradients.
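
In the Python API, GOSS can be selected through the boosting type, with top_rate and other_rate controlling how many large- and small-gradient instances are kept. The values below are illustrative, and depending on your LightGBM version GOSS may instead be enabled via a data_sample_strategy parameter, so check the docs for your release.

```python
# Sketch: enabling Gradient-based One-Side Sampling (GOSS) in LightGBM.
# Note: newer LightGBM releases may expose this via data_sample_strategy="goss"
# instead of the boosting type; verify against your installed version.
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

model = lgb.LGBMClassifier(
    boosting_type="goss",
    top_rate=0.2,    # fraction of largest-gradient instances kept every iteration
    other_rate=0.1,  # fraction randomly sampled from the small-gradient remainder
    n_estimators=200,
)
model.fit(X, y)
```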

Several tuning parameters in LightGBM can be adjusted to keep its edge over XGBoost. For instance, the overall fit of the model can be improved by setting the number of leaves formed in each decision tree (num_leaves), which is theoretically related to the max_depth of the tree. To avoid overfitting, num_leaves should be kept below 2^(max_depth), given that splits here occur leaf-wise rather than depth-wise. Another factor in controlling overfitting is the min_data_in_leaf parameter, the minimum number of data points that must be placed in each leaf; with too small a minimum, many tiny leaves are created, which can misconstrue the desired fit. In terms of speed, the bagging and feature_fraction parameters determine the fraction of rows and features used in each iteration, making each iteration cheaper. And as mentioned with the discrete binning of features, setting max_bin to a smaller number makes bucketing features less expensive to compute and can also help the model generalize.
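
Put together, the parameters mentioned above might be passed like this; the specific numbers are placeholders that you would tune with cross-validation rather than recommended defaults.

```python
# Sketch: the LightGBM parameters discussed above, with placeholder values.
# num_leaves is kept below 2**max_depth to limit leaf-wise overfitting.
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    "max_depth": 7,
    "num_leaves": 70,          # kept below 2**7 = 128
    "min_data_in_leaf": 50,    # minimum observations per leaf
    "feature_fraction": 0.8,   # fraction of features sampled per iteration
    "bagging_fraction": 0.8,   # fraction of rows sampled per iteration
    "bagging_freq": 5,         # perform bagging every 5 iterations
    "max_bin": 127,            # fewer histogram bins = cheaper split finding
    "learning_rate": 0.05,
}

booster = lgb.train(params, train_set, num_boost_round=300)
```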

CatBoost

Categorical Boosting, or CatBoost, handles categorical features better than LightGBM, as the name suggests, by using permutation techniques and building symmetric trees. Features whose number of distinct values is less than or equal to the given parameter value (one_hot_max_size) are simply one-hot encoded.
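
For example, CatBoost can be told which columns are categorical and how many distinct values a feature may have before it switches from one-hot encoding to its statistics-based encoding; the data and parameter values below are hypothetical.

```python
# Sketch: passing raw categorical columns to CatBoost directly.
# Features with <= one_hot_max_size distinct values are one-hot encoded;
# the rest use CatBoost's permutation-based target statistics.
# The DataFrame and parameter values are hypothetical.
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "lima", "tokyo", "lima"],
    "plan": ["free", "pro", "pro", "free", "free", "pro"],
    "usage": [3.1, 9.8, 7.5, 1.2, 2.4, 8.9],
    "label": [0, 1, 1, 0, 0, 1],
})

model = CatBoostClassifier(iterations=100, one_hot_max_size=2, verbose=False)
model.fit(
    df[["city", "plan", "usage"]],
    df["label"],
    cat_features=["city", "plan"],  # columns treated as categorical
)
```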

To reduce overfitting, CatBoost encodes the remaining categorical features by randomly permuting the input observations (generating multiple permutations), converting the label values to integers, and then transforming each categorical value into a numeric one: the starting parameter (prior) is added to CountInClass (the number of objects with label value 1 for the current categorical feature value among the preceding objects), and that sum is divided by the total count of preceding objects with that value plus 1.
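
Written out, the encoding for a given category value is roughly (CountInClass + prior) / (TotalCount + 1), computed only over the objects that appear before the current one in the permutation. A rough Python sketch of that idea, which is an illustration rather than CatBoost's actual implementation, is below.

```python
# Rough sketch of an ordered target statistic for one permutation.
# Illustration of the idea only, not CatBoost's actual implementation.
import numpy as np

def ordered_target_statistic(categories, labels, prior=0.5):
    """Encode each categorical value using only the observations seen before it."""
    count_in_class = {}  # per-category count of label == 1 seen so far
    total_count = {}     # per-category count of objects seen so far
    encoded = []
    for cat, label in zip(categories, labels):
        cic = count_in_class.get(cat, 0)
        tot = total_count.get(cat, 0)
        encoded.append((cic + prior) / (tot + 1))
        count_in_class[cat] = cic + (label == 1)
        total_count[cat] = tot + 1
    return np.array(encoded)

# Example usage on an already-permuted toy column:
cats = ["a", "b", "a", "a", "b"]
labels = [1, 0, 1, 0, 1]
print(ordered_target_statistic(cats, labels))
```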

The main limitation of CatBoost is that it is most effective only when we take advantage of its categorical feature handling and one_hot_max_size. Accuracy is significantly compromised when these parameters are not tuned, since the main focus of this library is boosting categorical data. When CatBoost is compared to XGBoost on data with no categorical features, XGBoost tends to provide higher accuracy, though it takes longer to get there.

[Figure: Accuracy score vs. number of iterations]
