What makes “XGBoost” so Extreme?

Eoghan Keany · Published in Analytics Vidhya · 24 min read · Jan 26, 2020

Introduction

Have you ever wondered how “XGBoost” works, how it differs from gradient boosting, or wanted a simple example in Python? If so, you’re in the right place. “XGBoost” is one of the most powerful machine learning tools available for tabular data. Its efficiency and performance in learning non-linear decision boundaries have made it a staple in both industry and academia. So what makes this algorithm so special?

This article was written in response to the frustration I encountered when trying to answer this question. Don’t get me wrong, there are numerous resources online about the topic, but none simply explain why “XGBoost” is so special and how it differs from gradient boosting.

In order to fully test my understanding of the algorithm, I decided to implement “XGBoost” from scratch in Python. This article therefore explains the learning process and the knowledge gained in replicating the core components of “XGBoost”, including external material that I found helpful and a simple numerical example that shows how the algorithm works.

In essence, “XGBoost” builds upon the foundations of gradient boosting by introducing regularization to combat overfitting, along with numerous other additions. One could even argue that “regularized gradient boosting” would be a more appropriate name, as regularization is one of the most crucial aspects of its success. To understand “XGBoost” fully it is necessary to first grasp the ideas behind it, so this article will begin with Regression Trees, move on to Gradient Boosting, and then cover “XGBoost” itself.

  • Regression Trees
  • Gradient Boosting
  • XGBoost

Regression Trees

The regression tree is a simple machine learning model that can be used for regression tasks. Unlike linear models, decision trees can capture non-linear interactions between the features and the target variable. Tree-based models also have the added benefit of being highly interpretable, as the model is constructed by recursively partitioning the data with simple Yes/No conditions on the feature values. However, they do have downsides: regression trees are notoriously unstable, meaning that if you perturb your data even slightly you might get a completely different tree structure. Regression trees are also prone to overfitting and lack a principled probabilistic framework, which prevents them from providing useful outputs such as confidence intervals or posterior probabilities.

The basic pseudo code of an exact greedy regression tree (one that evaluates every possible split value) is as follows; a minimal Python sketch is given after the list:

  1. Create the root node.
  2. Iterate through every feature and find the best split value that maximizes some splitting criteria.
  3. Check whether any stopping criteria have been met; if so, stop building the tree, otherwise continue with another node.
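To make the pseudo code concrete, here is a minimal Python sketch of the exact greedy split search for a single node, using reduction in the sum of squared errors as the splitting criterion (the function name and criterion are my own illustrative choices, not part of any library):

```python
import numpy as np

def best_split(X, y):
    """Exact greedy search: try every value of every feature as a split point
    and keep the one that most reduces the sum of squared errors.
    X is a 2D NumPy array of features, y a 1D array of targets."""
    base_error = np.sum((y - y.mean()) ** 2)   # error if we do not split at all
    best = {"gain": 0.0, "feature": None, "value": None}
    for j in range(X.shape[1]):
        for value in np.unique(X[:, j]):
            left, right = y[X[:, j] <= value], y[X[:, j] > value]
            if len(left) == 0 or len(right) == 0:
                continue
            split_error = (np.sum((left - left.mean()) ** 2)
                           + np.sum((right - right.mean()) ** 2))
            gain = base_error - split_error
            if gain > best["gain"]:
                best = {"gain": gain, "feature": j, "value": value}
    return best
```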

I am linking a brilliant video by Josh Starmer, who explains the process far better than I ever could, so if you don’t understand how a regression tree works under the hood I really suggest that you watch it. Understanding these base algorithms translates directly to the more complicated ones.

This video perfectly illustrates the split-finding procedure of a regression tree. However, it does not really cover the tree constraints used to combat overfitting. A regression tree has four main constraints that influence the tree structure during training: tree depth, the number of nodes or leaves, the number of observations per split, and the minimum improvement, each of which is explained below.

  1. Tree depth refers to the number of layers in the tree. A deeper tree is more complex and more likely to overfit the data, so shorter trees are preferred and people usually set a depth constraint. Generally, the best results are seen with 4–8 layers.
  2. The number of nodes or leaves is very similar to the depth parameter, but whereas depth is layer-wise and therefore symmetric, this constraint is asymmetric and can lead to unbalanced tree structures.
  3. The number of observations per split imposes a minimum on the amount of training data at a node before a split can be considered.
  4. Minimum improvement in gain requires a threshold improvement before any new split is added to the tree.

Tree pruning is another technique used to mitigate overfitting. However, it is applied after training and uses a metric such as cost complexity pruning. This method iterates over the tree, removing leaf nodes until only the root remains. For each candidate tree structure it computes a score that considers not only the accuracy of the tree’s predictions but also the complexity of its structure.
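For reference, the textbook cost complexity score (the standard CART formulation, stated here for context rather than anything specific to the library discussed later) trades training error against tree size:

R_\alpha(T) = \sum_{j=1}^{|T|} \sum_{i \in \text{leaf } j} \left(y_i - \bar{y}_j\right)^2 + \alpha\,|T|

where |T| is the number of leaves, \bar{y}_j is the mean target in leaf j, and \alpha controls how strongly extra leaves are penalized; the subtree with the lowest score is kept.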

Classical Gradient Boosting

Gradient Boosting is a machine learning algorithm that can be used for both classification and regression problems. It creates an ensemble from numerous weak predictors, usually regression trees, which are added in a stage-wise fashion with each new tree focusing on the errors of the previous ones. This additive approach, with its focus on previous mistakes, essentially converts these weak learners into a single strong predictor.

The final prediction is the sum of all M trees, each weighted by a shrinkage factor λ:

F_M(x) = \sum_{m=1}^{M} \lambda\, h_m(x)

One concept that I found really helpful for building an intuition for gradient boosting is to think of the problem in terms of gradient descent. Here I am essentially summarizing a fantastic blog post written by Nicolas Hug, which is linked here.

Just to quickly recap how gradient descent is applied to a simple Ordinary Least Squares estimator: if we follow the negative of the gradient of the loss function, it will guide us to the minimum of that function, i.e. our best model. Models such as OLS apply gradient descent to the model’s parameters, such as the slope and intercept, to find the optimal solution.

Image shows gradient descent for linear regression. Image sourced from http://alykhantejani.github.io

This can be expressed mathematically as follows. Our loss function is defined as the sum of the squared residuals:

L = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2

Sum of the squared residuals

To minimize the loss function we take its derivative with respect to the slope and the intercept, and add the negative of this value, scaled by some learning rate, to the current weight. This process is repeated until convergence.

\text{slope} \leftarrow \text{slope} - \eta\,\frac{\partial L}{\partial\,\text{slope}}

The updated slope; the equivalent equation is used to calculate the new intercept, except that the derivative of the loss function is taken with respect to the intercept.
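As a concrete illustration of those update rules, a few lines of NumPy are enough to fit a line by gradient descent (a minimal sketch of my own; it uses the mean rather than the sum of squared residuals purely to keep the step size stable):

```python
import numpy as np

def fit_line_gradient_descent(x, y, lr=0.05, n_iter=2000):
    """Fit y ~ slope * x + intercept by repeatedly stepping along the
    negative gradient of the mean squared residuals."""
    slope, intercept = 0.0, 0.0
    for _ in range(n_iter):
        residuals = y - (slope * x + intercept)
        grad_slope = -2.0 * np.mean(x * residuals)      # dL/d(slope)
        grad_intercept = -2.0 * np.mean(residuals)      # dL/d(intercept)
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept

# x = np.array([1., 2., 3., 4.]); y = np.array([2., 4., 6., 8.])
# fit_line_gradient_descent(x, y)  # converges to roughly (2.0, 0.0)
```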

So why do we care, and what has this got to do with gradient boosting? Notice that the sum of the squared residuals is written in terms of the predictions themselves (ŷ). What would happen if we took the derivative with respect to the predictions and iteratively updated the predictions directly?

\hat{y}_i \leftarrow \hat{y}_i - \eta\,\frac{\partial L}{\partial \hat{y}_i}

Updating the predictions themselves using gradient descent

This is clever, but there is one vital flaw: we can’t make predictions on new data with this method, because we need the true value y in order to calculate the loss and update our predictions. This is where the sequential regression trees (weak learners) come into play. Instead of updating our predictions with the true value of the gradient, we train a regression tree at each iteration to predict these gradients, which allows us to make predictions on unseen data points.

Unfortunately we are not finished, as the negative gradient only gives the direction of the step. Further effort is necessary to determine the step length (ρm) that will be taken in that direction. The most popular way to do this is a line search; I will not go through all the math as someone already has here. Briefly, we can write the loss function in terms of a tree, do some algebra, and end up with an equation that measures the quality of a tree structure with respect to minimizing the loss function.

Gj is the sum of the gradients in leaf j, nj is the number of samples in that leaf node, and T is the number of leaf nodes in the tree.

Since it is impractical to enumerate every possible tree structure, we can rearrange this function to minimize the loss one layer of the tree at a time. This is a much better approach than simply using the regression tree’s internal splitting criterion to fit the residuals.

GL and nL are the sum of the gradients and the number of samples on the left-hand side of the split respectively, and GR and nR are the same quantities for the right-hand side.

This is the core idea behind gradient boosting and the pseudo code for such an algorithm is as follows:

Gradient Boosting Pseudo Code

1. The algorithm begins by making an initial prediction that minimizes the equation below. If the task is regression, this boils down to using the average value of our target column as the initial prediction. If we are concerned with classification, the initial prediction is the log odds of our target variable.

2. Next we need to compute the negative gradient of the loss function with respect to our predictions. A regression tree is then trained to estimate this gradient in order to predict unseen values.

3. Now that we have our regression tree structure, we still need to find a single value for each leaf node that minimizes the summation below. Again this looks complicated, but when you take the derivative with respect to gamma and set it to zero to find the optimal leaf value, it essentially tells us that the average residual in a leaf node is the optimal value for regression.

4. This new regression tree is now added to the initial prediction, scaled by some shrinkage factor (learning rate). The negative gradient of this new model is then calculated and the entire process is repeated for a specified number of boosting rounds.

All of this may seem quite complicated, but in practice it is surprisingly manageable. I will not work through an example of gradient boosting by hand, as there are many resources available online, such as this video series by StatQuest, which I highly recommend watching to gain an intuition for how the algorithm works in practice.

Although this series of videos eloquently describes the process of gradient boosting, if you were paying attention above you will notice that nearly all of the online resources use a naive approach to boosting: they use a plain regression tree with its own internal split metric to fit the residuals.

In other words, instead of using the defined loss function to construct the tree, they simply grow the tree using some standard metric that has no direct link to the problem at hand. However, this is only a trivial fix: we replace the gain metric and compute the leaf values differently, but the process itself remains unchanged. On my GitHub repository I implement both the naive approach and one that follows Friedman’s model, the “Gradient Boosting Machine”.

Gradient Boosting in Python from Scratch

For my own understanding I have implemented gradient boosting for both classification and regression in Python using only NumPy and Pandas. I would not advise anyone to use this implementation seriously; rather, it is a demonstration of all the math above. If you are more comfortable with code, it might help you understand the material a bit better, as you can deconstruct the parts of the algorithm, mess around with it, and figure out exactly how everything works.

Implementation of the pseudo code described above
Implementation of the video series linked above
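Since the embedded gists may not render here, below is a condensed sketch of the same idea for squared error regression. For brevity it borrows scikit-learn’s DecisionTreeRegressor as the weak learner rather than a hand-rolled tree, so treat it as an illustration of the pseudo code above rather than the repository code itself:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGBMRegressor:
    """Gradient boosting for squared error: each tree is fit to the negative
    gradient (the residuals) and added to the model scaled by a learning rate."""

    def __init__(self, n_rounds=100, learning_rate=0.1, max_depth=3):
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        self.initial_prediction = np.mean(y)                 # step 1: constant model
        prediction = np.full(len(y), self.initial_prediction)
        for _ in range(self.n_rounds):
            residuals = y - prediction                       # step 2: negative gradient of 0.5*(y - p)^2
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, residuals)
            prediction += self.learning_rate * tree.predict(X)   # step 4: shrunken additive update
            self.trees.append(tree)
        return self

    def predict(self, X):
        prediction = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            prediction += self.learning_rate * tree.predict(X)
        return prediction
```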

XGBoost

XGBoost was developed by T. Chen & C. Guestrin in 2016 and is described as an “optimized distributed gradient boosting library designed to be highly efficient, flexible and portable”. This algorithm has several noticeable improvements over classical gradient boosting notably:

  • Newton Boosting
  • Regularization
  • Handling Missing Values
  • Weighted Quantile Sketch
  • Custom loss function
  • Parallel Processing

Newton Boosting

As you already know by now, XGBoost implements a gradient boosting framework that uses regression trees as weak learners, plus some extra innovations. As we saw above, the classical implementation of gradient boosting can be viewed in terms of gradient descent, where the direction of travel is provided by the negative gradient of the cost function and the step size is provided by a line search. This is the first major difference between “XGBoost” and the classical boosting algorithms: “XGBoost” is built upon the Newton-Raphson method. Instead of just computing the gradient and following it, it uses the second-order derivative to gather more information and make a better approximation of both the direction of steepest decrease in the loss function and the step size. If we use gradient descent, we update our point x(i) at every iteration as follows:

x^{(i+1)} = x^{(i)} - \rho_m \nabla f\left(x^{(i)}\right)

∇ is the gradient and ρm is the step size

Whereas, when using “Newton’s Method” x(i) is updated as follows:

x^{(i+1)} = x^{(i)} - \frac{\nabla f\left(x^{(i)}\right)}{\text{Hess}\left(f\right)\left(x^{(i)}\right)}

where “Hess” refers to the Hessian, the second-order derivative of the loss function. This update takes second-order information into account, which gives a more precise estimate of the direction of steepest decrease and should allow the model to converge at a much faster rate. Also, since the Hessian is a constant for many loss functions, such as the mean squared error, it is computationally inexpensive. This is the first major change you will notice in my code: the regression trees no longer simply fit the negative gradient of the loss function with respect to the predictions, but both the negative gradient and the Hessian.
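Concretely, the two quantities each tree is now fit on look like this for the two most common loss functions (a small sketch of my own; for log loss the raw prediction is assumed to be on the log odds scale):

```python
import numpy as np

def grad_hess_squared_error(y, y_pred):
    """0.5 * (y - y_pred)^2: the gradient is the signed residual, the Hessian is 1."""
    grad = y_pred - y
    hess = np.ones_like(y, dtype=float)
    return grad, hess

def grad_hess_log_loss(y, raw_pred):
    """Binary log loss with raw_pred in log odds: grad = p - y, hess = p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-raw_pred))   # sigmoid
    return p - y, p * (1.0 - p)
```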

Taylor Expansion of Loss Function

In order to use Newton boosting, we use the Taylor expansion to rewrite the loss function around the current estimate in terms of the gradient and Hessian. As we are using a tree to fit the data, this can then be further simplified into a summation over the leaves of the tree.

\tilde{\mathcal{L}}^{(m)} = \sum_{j=1}^{T_m}\left[G_{jm}\, w_{jm} + \tfrac{1}{2}\left(H_{jm} + \lambda\right) w_{jm}^{2}\right] + \gamma T_m

Tm is the number of leaves in the tree, Gjm and Hjm are the sums of the gradients and Hessians in region (leaf) j, and wjm is the weight of that leaf. The λ and γ terms form the regularization.

This becomes our optimization goal for the new tree. One important advantage of this definition is that the value of the loss function only depends on Gi and Hi. This is how XGBoost can support custom loss functions.

Model Complexity

A large proportion of “XGBoost’s” versatility and accuracy can be attributed to its focus on model complexity, whereas previous gradient boosting models focused only on improving impurity/gain. “XGBoost” applies a variety of regularization techniques to avoid overfitting, and defines the complexity of a tree as:

\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \left|w_j\right|

where T is the number of leaves, γ is the penalty on the number of terminal nodes, λ and α are the L2 and L1 regularization terms respectively, and wj is the weight (score) of leaf j.

Finally, since L1 regularization in trees is applied to the leaf scores rather than directly to the features as in linear regression, it serves to reduce the depth of trees. It reduces the impact of less-predictive features, but not as severely as in regression, where L1 regularization (Lasso) can set the contribution of features to zero. Both L1 and L2 are used together in XGBoost, similarly to Elastic Net: the L1 term punishes the less-predictive features, and the L2 (Ridge) term further punishes large leaf scores without having a huge impact on the less-predictive features.

Structure Score

After reformulating the tree model using the Taylor expansion, we can write the loss function with respect to the t-th tree:

As the tree structure is fixed we can take the derivative with respect to a leaf weight to find the optimal leaf weight value exactly like we did for gradient boosting.

If we then plug this optimal value back into the previous equation, we have an approximation of the loss function in terms of the new tree structure. This can also be viewed as a metric for the quality of the structure of the new tree, similar to the cost complexity pruning we saw earlier for regression trees: the smaller the value, the better the tree structure is at minimizing the loss.
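Written out in the same notation (and, like the simplified form in the paper, dropping the L1 term for clarity), the optimal weight of leaf j and the resulting structure score are:

w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \text{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T

The smaller Obj* is, the better the tree structure.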

Learn the Tree Structure

Now that we have a way to measure how good a tree is, we could in principle evaluate every possible tree structure, but this is highly inefficient and impractical. Instead, XGBoost optimizes one level at a time. This is done by expressing the gain of a potential split in terms of the equation above, so we are using the loss function itself to construct the tree, controlled by the model complexity:

\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda}\right] - \gamma

GL and HL are the sums of the gradients and Hessians of the samples on the left-hand side of the split, and GR and HR are the corresponding sums on the right-hand side.

This formula can be decomposed as: 1) the score of the new left leaf, 2) the score of the new right leaf, 3) the score of the original leaf, and 4) the regularization on the additional leaf. We can see an important fact here: if the gain is smaller than γ, we would do better not to add that branch. The role of the Hessian in the denominator of the optimal leaf value equation is a bit ambiguous. However, if we are using the squared error loss, its second-order derivative is 1 per sample, so the summed Hessian in a leaf is essentially the number of samples and normalizing by it gives us the average leaf weight. In the case of log loss for classification it is a bit more intricate, and the optimal leaf value essentially translates to the Z-score of the leaf; for more information look here. This derivation is in line with the actual paper written by Tianqi Chen and corresponds to equation seven, although for simplicity he drops the α term corresponding to L1 regularization; for a full derivation I would recommend visiting this article.
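In code, evaluating a candidate split therefore needs nothing more than the summed gradients and Hessians on each side. A minimal sketch (my own helper, with lam standing in for the L2 term λ and gamma for the split penalty γ):

```python
def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split: score of the two children minus the score
    of the unsplit node, minus the penalty gamma for adding an extra leaf."""
    def score(G, H):
        return G ** 2 / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma
```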

Weighted Quantile Sketch

To find the best split over a continuous feature, the data needs to be sorted and stored in memory, which becomes an issue when dealing with large data sets. To solve this, a novel method for finding approximate split points is used.

Candidate split points are proposed based on the percentiles of the feature distribution, and the continuous features are binned into buckets bounded by these candidates. For example, if you had a feature with 100 sorted values {1, 2, 3, …, 100}, you could calculate the gain only at the decile split points {10, 20, 30, ⋯, 90} and still have a good approximation of the distribution. This is what the eps value in “XGBoost” controls: “XGBoost” only considers a new split point when it has roughly eps·N more points under it than the last split point. If eps = 0.01 in the example above, you would end up with roughly 100 split points, each larger than {1%, 2%, …, 99%} of the other points. In other words, “XGBoost” does not propose a new candidate when the sum changes by more than eps, but when the number of points under the current point is larger by eps than under the last one.

If your feature values are similar then there is no real point splitting between them; instead you want to split the parts of your data set where the predictions are very wrong. This is where the weights come in: we use the Hessian as a weight that represents how wrong the prediction is. So the first 10-quantile candidate is no longer the first point that is larger than 10% of the points, but the first point that is larger than 10% of the total Hessian mass.
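A rough way to picture the idea (not the actual sketch data structure, which is more involved): rank the sorted feature values by cumulative Hessian rather than by count, and cut at evenly spaced fractions of the total Hessian. A hedged sketch with names of my own choosing:

```python
import numpy as np

def weighted_quantile_candidates(feature_values, hessians, eps=0.1):
    """Propose split candidates so that the Hessian mass between consecutive
    candidates is roughly eps of the total, instead of an equal number of rows."""
    order = np.argsort(feature_values)
    values, weights = feature_values[order], hessians[order]
    cum = np.cumsum(weights) / weights.sum()     # weighted rank in [0, 1]
    targets = np.arange(eps, 1.0, eps)           # e.g. 0.1, 0.2, ..., 0.9
    idx = np.searchsorted(cum, targets)          # first value exceeding each target
    return np.unique(values[idx])
```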

Handling Missing Values (Sparsity Aware)

Unlike previous algorithms, each node not only has a split point that determines whether samples are filtered to the left or to the right, but also a default direction. This direction is predefined in XGBoost as left, meaning that any missing values (NAs) are sent to the left, since a logical condition cannot be evaluated on a missing value.

However, if you train XGBoost with NAs included in the dataset, the model will automatically learn the best direction to send them, either left or right. It does this by calculating the gain at each split point twice: first with all the NA values on the left-hand side, and then with the NA values on the right-hand side. Whichever option maximizes the gain is chosen and the default direction is set. Any unknown values at prediction time are then passed in this direction at that particular node. We can do this because, even though the feature value is missing, we don’t use it to compute the gain; all we need are the gradient and Hessian values. The calculation looks like this:

The gain is calculated with the sums of the gradients and Hessians of the missing values placed first on the left-hand side of the split and then on the right-hand side.
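The original gist embed may not survive here, so this is a hedged sketch of the same idea, reusing the split_gain helper sketched earlier: compute the gain with the gradient/Hessian mass of the missing rows added to the left child, then to the right child, and keep the better direction.

```python
def gain_with_default_direction(G_left, H_left, G_right, H_right,
                                G_missing, H_missing, lam=1.0, gamma=0.0):
    """Try sending the missing-value statistics to each side and keep the best."""
    gain_left = split_gain(G_left + G_missing, H_left + H_missing,
                           G_right, H_right, lam, gamma)
    gain_right = split_gain(G_left, H_left,
                            G_right + G_missing, H_right + H_missing, lam, gamma)
    return (gain_left, "left") if gain_left >= gain_right else (gain_right, "right")
```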

Parameters

XGBoost contains a wide variety of hyper-parameters, some of which are quite cryptic relative to those of a standard regression tree, so I will try my best to explain them.

Scale_Pos_Weight

This is a super useful parameter when dealing with an unbalanced data set, and I have personally used it on a real-life application with great success. Typically this parameter is set to the ratio of the majority class to the minority class; I usually use a small helper function to calculate the scale_pos_weight factor, although like any other parameter some sort of optimization/tuning is advised for maximum accuracy.
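The helper I refer to is embedded as a gist that may not render here; the calculation it performs is simply the class ratio (a trivial sketch under that assumption):

```python
import numpy as np

def calc_scale_pos_weight(y):
    """Ratio of negative to positive samples — the usual starting value."""
    y = np.asarray(y)
    return (y == 0).sum() / (y == 1).sum()

# calc_scale_pos_weight([0] * 90 + [1] * 10)  # -> 9.0
```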

I can only speculate about exactly how this parameter affects the model internally, as the paper does not discuss it and the documentation only describes it briefly as controlling the balance of positive and negative weights. Assume we have an imbalanced data set with 90% negative samples and only 10% positive samples. The algorithm could up-sample, repeating the positive examples 9 times to make it a balanced problem. However, I believe the model instead behaves as if it builds a collection of models trained on 1/9th of the negative samples, as this allows for easier cross-validation of results, with the outputs from each model ensembled together to create a more accurate result.

Min_Child_Weight

This parameter controls the model complexity. It is described as the “Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.”

Mean Squared Error

For regression with the mean squared error, the second derivative with respect to the predictions is simply 1, so the sum of the Hessians in a node essentially equals the number of samples in that node. For binary logistic loss, the Hessian for each point looks like:

h_i = \sigma\!\left(\hat{y}_i\right)\left(1 - \sigma\!\left(\hat{y}_i\right)\right)

Hessian of the log loss, where σ represents the sigmoid function

Let’s get an intuition for this equation. If we have a pure node containing only ones, then the predictions will all be confidently large, so every Hessian term will be near zero; similar logic holds if the node contains only zeros. So essentially min_child_weight stops the tree from making further splits once a certain degree of purity has been reached, to avoid overfitting.
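A quick numerical check of that intuition, using the log loss Hessian p(1 − p) from above:

```python
for p in (0.5, 0.9, 0.99):           # increasingly confident predictions
    print(p, round(p * (1 - p), 4))  # 0.25, 0.09, 0.0099 — the Hessian shrinks
```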

Eta

Step size shrinkage is used to prevent overfitting. After each boosting step, the contribution of the new base learner to the prediction is scaled by eta; this parameter is essentially the learning rate.

Gamma

This parameter also prevents overfitting and appears in the calculation of the gain (structure score). As it is subtracted from the gain, it essentially sets a minimum gain required to make a split at a node.

Subsample

“XGBoost” will subsample this ratio of the training instances to train each tree at a particular boosting step. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing its trees. This happens for every tree, so each tree is trained on a different sample of the data, which reduces variance by reducing the dependence on particular values in the training sample. Both colsample and subsample are predicated on the same idea as a Random Forest.

Colsample

This parameter is based on the same idea as subsample: reduce the variance of the model by training on different parts of the data. Unlike row sampling, there are three ways in which XGBoost samples columns.

Colsample_bytree: the entire tree is constructed using the same sub-sample of the columns.

Colsample_bylevel: At each tree level/depth we will only consider a sub sample of the columns to use. This sub sample will thus change for each level of the constructed tree.

Colsample_bynode: the most granular level, where each node is constrained to a different sub-set of the columns.

The colsample_by* parameters can also be combined. For instance, the combination {‘colsample_bytree’:0.5, ‘colsample_bylevel’:0.5, ‘colsample_bynode’:0.5} with 64 features will leave 8 features to choose from at each split.

Lambda

L2 regularization term on weights. Increasing this value will make model more conservative.

Alpha

L1 regularization term on weights. Increasing this value will make model more conservative.

Sketch_eps

This parameter is used in the weighted quantile sketch (the ‘approx’ tree method) and roughly translates to 1 / sketch_eps bins. If it were 0.1 and we had 100 examples, we would only calculate the gain at around ten points.

Tree Method

Refers to the algorithm used to construct the tree layer by layer.

Exact: the exact greedy algorithm, where the gain is calculated at every split point of every column, giving the most accurate result.

Approx: equates to the weighted quantile sketch algorithm described above.

Hist: a very fast, histogram-optimized approximate greedy algorithm inspired by the approach used in LightGBM.

Max delta step

This parameter can be used for extremely unbalanced class problems. As we saw above for min_child_weight, if a leaf node is almost pure, which will happen when we have very unbalanced classes, the Hessian tends to zero; since it sits in the denominator of the leaf weight equation, the leaf weight can become very large:

w_j = -\frac{G_j}{H_j + \lambda}

So despite multiplying the leaf values by eta/the learning rate, in the case of severe class imbalance this will not be enough to stop overfitting. What max_delta_step does is introduce an ‘absolute’ cap on the leaf weight before the eta correction is applied. Personally I have never had great success with this parameter and have found from experience that the scale_pos_weight factor gives better results.

“XGBoost” By Hand

Apologies, that was a lot of complicated math, but I think it is beneficial to include it and to have some knowledge of the theory behind the algorithm rather than leave it out. I know all that math looks scary, but it really isn’t that complicated to apply. To prove this, I am going to show exactly how “XGBoost” works under the hood by doing all the calculations by hand on a toy data set.

We start off with our toy data. Our goal is to predict whether someone plays video games based on two features, age and gender. The data looks like this:

“XGBoost” starts by making an initial prediction of 1 for each of the sample data points.

This is used as the starting point for the algorithm. In theory we could start with any value here; in fact, classical gradient boosting begins with the log odds of the target variable as its initial prediction, which in my opinion is a better starting point. Using this prediction we can now calculate the negative gradient and Hessian of our loss function with respect to our predictions in order to construct our first tree. But first we need to convert our initial prediction of 1 into a probability using the sigmoid function, because with the logistic loss all the values inside the tree are in log odds.

To use the sigmoid function we simply pass the initial prediction for each row through it. Using these newly calculated probabilities we can calculate the negative gradient and the Hessian for each row in our data set. Again this is a simple procedure: we just replace p in our equations with the row’s probability and yi with the target variable, in this case whether the person plays video games.
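For the logistic loss used here, these per-row quantities are:

p_i = \sigma(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}, \qquad -g_i = y_i - p_i, \qquad h_i = p_i\,(1 - p_i)

With the initial log odds prediction of 1, p = σ(1) ≈ 0.731, so every row that plays video games gets a negative gradient of about 1 − 0.731 = 0.269, every row that does not gets about −0.731, and every row has a Hessian of about 0.731 × 0.269 ≈ 0.197.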

Now we can start constructing our tree. For the purpose of this example I have chosen gamma, lambda and learning rate values of 0, 1 and 0.4 respectively. The tree will be constructed using the exact greedy method, where the gain is calculated at every split point: for every value of a feature we split the data, filtering values less than or equal to it to the left and values greater than it to the right. If you are unsure about the procedure for creating a regression tree I highly suggest going back to the start and watching the regression trees video provided. We start with the root node.

Above, the second table stores the gain values for every split point across both features. To calculate the gain at a specific split point we use the equation on the right-hand side of the table. Let’s do an example for a potential split at age 24: we take the sums of the Hessians and gradients of all the values less than or equal to 24 (in green) on the left-hand side, and the sums of the Hessians and gradients of all the values greater than 24 (in red) on the right. Below are the calculations required for this particular split point. The equation essentially compares the benefit of splitting the node against not splitting it (the subtraction part), and gamma sets a minimum threshold we require before allowing a split. In this case the gain is negative, so we would not split at this value.

After repeating this calculation for every possible split, we can see that our first split point will be at 16 years of age, as this value has the largest gain. The node is split in two, creating two new nodes: the people who are 16 or younger and those who are older than 16. We can see that everyone on the left, i.e. those who are 16 or younger, plays video games. This means it is a pure node and there is no point making any further splits there, as they would all have negative gain values, so instead we focus on further splits on the right-hand side.

Now we calculate the gain for every split inside our new node. The maximum gain is given by gender, so we apply another split based on gender to the people who are older than 16. Again the left-hand node is pure and requires no further splits, whereas the right-hand side can be split one more time.

We have now exhausted our splits and our tree is constructed. Notice that we have also calculated our leaf values: using the equation provided, we simply sum the gradients in the leaf node and divide by the sum of the Hessians plus the constant lambda. The table format looks a bit messy in my opinion, so I am going to clean it up and show it in the form of a tree structure.

This is our first tree, which we now use to update our initial predictions. The values in the leaf nodes are again in log odds form, and we multiply them by a learning rate of 0.4. Whichever leaf a sample falls into provides its update value. Our new predictions now look like this:

We have finished our first boosting round and now have our new predictions. To continue, we do the exact same process again: convert the newest predictions to probabilities and calculate the new negative gradient and Hessian values.

Second Tree

Now that we have our new updated values for our Gradient and Hessian we can again fit a tree structure using the exact same procedure as before.

Again this looks messy, so I will represent our new tree like this instead, where red represents a split node and green a leaf node.

We now use this tree to make predictions and add them to our previous prediction. Whichever leaf a sample falls into provides its update value.

That’s it, we have completed two boosting rounds by hand. In summary, this is what we have done: first we made an initial log odds prediction of 1 for our entire data set. By converting this prediction into a probability using the sigmoid function we were able to calculate the negative gradient and Hessian with respect to our initial prediction. Then we constructed a regression tree from both of these quantities and calculated the leaf node values from the equations above. This tree was used to update our initial log odds prediction, and the process was repeated for our new updated prediction in a second boosting round. I hope this simple example was helpful in demonstrating the main mechanism behind “XGBoost”; note that, to keep the example coherent, it leaves out the weighted quantile sketch and sparsity-aware fitting.

Implementation in Python from Scratch

This is a Python implementation of “XGBoost” written using only NumPy and Pandas. It was developed for my own understanding, so I would not recommend using it in practice, but feel free to deconstruct it; hopefully it can make some of the concepts more tangible. This example does not include the sparsity-aware fitting (the ability to handle NAs), as I think this is a more novel aspect of the algorithm and any good pipeline should handle missing data before it reaches the model.
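As the gist embed may not survive the article format, here is my own condensed sketch of the core of such an implementation (illustrative names, not the repository’s actual classes): a tree grown greedily on gradient/Hessian statistics using the leaf weight and gain formulas from above.

```python
import numpy as np

class XGBNode:
    """A single tree node grown on gradient/Hessian statistics."""

    def __init__(self, X, grad, hess, depth=0, max_depth=3, lam=1.0, gamma=0.0):
        self.weight = -grad.sum() / (hess.sum() + lam)   # optimal leaf value
        self.feature, self.value, self.left, self.right = None, None, None, None
        if depth >= max_depth:
            return
        best_gain = 0.0
        for j in range(X.shape[1]):
            for v in np.unique(X[:, j]):
                mask = X[:, j] <= v
                if mask.all() or not mask.any():
                    continue
                gain = self._gain(grad[mask], hess[mask], grad[~mask], hess[~mask], lam, gamma)
                if gain > best_gain:
                    best_gain, self.feature, self.value, best_mask = gain, j, v, mask
        if self.feature is not None:
            self.left = XGBNode(X[best_mask], grad[best_mask], hess[best_mask],
                                depth + 1, max_depth, lam, gamma)
            self.right = XGBNode(X[~best_mask], grad[~best_mask], hess[~best_mask],
                                 depth + 1, max_depth, lam, gamma)

    @staticmethod
    def _gain(gl, hl, gr, hr, lam, gamma):
        # Structure-score gain of splitting the node into a left and right child.
        def score(g, h):
            return g.sum() ** 2 / (h.sum() + lam)
        return 0.5 * (score(gl, hl) + score(gr, hr)
                      - score(np.concatenate([gl, gr]), np.concatenate([hl, hr]))) - gamma

    def predict_row(self, x):
        if self.feature is None:
            return self.weight
        child = self.left if x[self.feature] <= self.value else self.right
        return child.predict_row(x)
```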

Conclusion

I hope you enjoyed this post and that it answered some of the questions you had about “XGBoost”. If it receives positive feedback then I might even consider creating something similar for CatBoost. Feel free to ask questions if you have any doubts or if anything is unclear. This link will take you to the GitHub repository where the code lives.
