In the last post we focused on the mathematical intuition behind XGBoost. The objective function is the ultimate goal of a machine learning model: we want to minimize the difference between the actual label and the predicted label, and XGBoost does this through the structure score of the tree. The most important difference between XGBoost and a plain gradient boosting model is that its performance is amplified through several enhanced algorithmic approaches. In this post I will go over the algorithms that make XGBoost extreme.
Split Finding Algorithm
The key task for a tree model to perform well is finding the best split point, the one that makes the clearest distinction between data points. As we discussed before, XGBoost’s split decision depends on the structure score.
You might wonder: why can’t we just assess all the possible split points, compute the score for each, and choose the feature and threshold that give us the best score?
You are right; in fact, that is one of the most popular methods in single-machine boosting models. This type of algorithm is called the Basic Exact Greedy Algorithm. As the name suggests, it searches over every possible split point to find the best split.
Basic Exact Greedy Algorithm
Let’s go over the pseudocode line by line:
1. Input: I and d
- I: the instance set of the current node (that is, all the samples within the node)
- d: the dimension of the features
2. Initialize G and H (*If you are confused about what G and H represent, please refer to XGBoost Part I)
- G = sum of the first-order derivatives (the gradients) of the instances in I
- H = sum of the second-order derivatives (the Hessians) of the instances in I
3. for k = 1 to m do
- Iterate through all the features, from feature 1 to m, and do the following
4. Initialize G-Left and H-Left to 0
5. for j in sorted(I, by x_jk) do
*The important point here is the ‘sorted’ part. Because the algorithm goes over every possible split point, it is computationally heavy, and sorting the data mitigates some of that burden.
For example, suppose we are dealing with a certain feature x that has many associated values. If we perform the split without sorting, the algorithm must first search for the lowest value of feature x, then the next lowest, and so on; with an unsorted dataset every candidate split requires another search, which is very inefficient. When the values are sorted before the split, the algorithm can simply iterate through the rows in order.
6. GL = GL + gj and HL = HL + hj
Update GL and HL with the g and h of each instance j as it moves into the left node:
- The left node’s G is the running sum of g over the instances assigned to the left so far
- The left node’s H is the running sum of h over the instances assigned to the left so far
7. GR = G - GL and HR = H - HL
- The right node’s G is the parent node’s G minus the left node’s G
- The right node’s H is the parent node’s H minus the left node’s H
8. Score = Max(Score, New Structure Score)
- Recall the loss reduction from a split: the model needs to decide whether the split actually reduces the error. This is done by comparing the sum of the left and right nodes’ scores against the parent node’s score (explained in Part I); when the sum of the left and right scores is larger than the parent’s score, the loss is reduced and the split improves performance. Therefore we keep Max(previous best score, score after this split). A minimal sketch of the full scan follows right after this list.
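To make the steps concrete, here is a minimal Python sketch of the exact greedy scan for a single feature. It assumes g and h are the per-instance gradients and Hessians from Part I and lam is the L2 regularization term; the names and structure are illustrative, not XGBoost’s actual internals, and the 1/2 factor and gamma penalty from the gain formula are dropped for brevity.

```python
import numpy as np

def best_split_exact(x, g, h, lam=1.0):
    """Scan every split point of one feature and return the best structure-score gain."""
    order = np.argsort(x)                       # sorted(I, by x_jk)
    x_sorted, g_sorted, h_sorted = x[order], g[order], h[order]
    G, H = g_sorted.sum(), h_sorted.sum()       # parent node totals
    GL = HL = 0.0
    best_score, best_threshold = -np.inf, None
    for j in range(len(x_sorted) - 1):          # every position between two sorted values
        GL += g_sorted[j]; HL += h_sorted[j]    # G_L <- G_L + g_j, H_L <- H_L + h_j
        GR, HR = G - GL, H - HL                 # right node = parent minus left
        score = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
        if score > best_score:
            best_score = score
            best_threshold = (x_sorted[j] + x_sorted[j + 1]) / 2
    return best_score, best_threshold

# Toy usage with random gradients and unit Hessians
rng = np.random.default_rng(0)
x, g, h = rng.normal(size=100), rng.normal(size=100), np.ones(100)
print(best_split_exact(x, g, h))
```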
*Note: If you are familiar with data structures and algorithms, the concept behind this algorithm will be familiar:
Characteristics of Greedy Algorithm
- Data must be sorted
- It does not foresee the future, meaning it chooses the path that benefits it right now.
- In XGBoost this means that when building a tree, splits are made sequentially based on the score each one provides. Whenever there is a split, the algorithm looks at all of the feature values and chooses the best split at that moment; when another split is made, it again looks for the split providing the best score at that point. (A greedy algorithm does not look at the big picture →the entire tree’s score, but only at what matters right now →the split point of node j.)
Approximate Algorithm for Split Finding
The exact greedy algorithm is commonly used in tree models, but it does not fit the intent of XGBoost. One of XGBoost’s main advantages is its run time, and the approximate algorithm is one of the methods that lets it compute at extreme speed.
The process of the approximate algorithm is quite simple, yet confusing when viewed through pseudocode. I will explain the logic here:
1. for k = 1 to m do
- Loop through all the features and perform the following
2. Propose Sk by percentiles on feature k
What the approximate algorithm does here is divide the range of the feature into percentiles. For example, suppose feature x has 100 samples with values 1 →100. The approach simply divides the distribution of feature x into n buckets. Say we want 10 buckets: because we are using percentiles, each bucket holds the same number of samples, and in this example the buckets contain the values [1 →10], [11 →20], and so on. (Note: as in the greedy algorithm, we assume the values are sorted.)
3. The proposal can be done globally or locally (I will explain this later)
4. Find G and H for feature k for each bucket
Recall that G and H are the values the model needs in order to find the optimal split point. One notable feature of XGBoost is that the score computation for each bucket can be processed in parallel; this is a feature that differentiates XGBoost from other boosting models. (Note: a boosting model is usually a sequential process, while bagging is considered parallel.)
Similar to the greedy algorithm, we treat each bucket separately and evaluate the split at each data point within it. For example, if an individual bucket has 4 data points, the split score is computed 3 times. Since the buckets are processed in parallel, the best score can be identified rapidly.
The figure above shows an example with 40 data points: the basic greedy algorithm would have to evaluate 39 candidate splits sequentially, whereas the approximate version with 10 buckets evaluates 30 candidate splits in total, spread across buckets that can run in parallel. A rough sketch of the candidate-proposal and bucketing steps follows below.
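Here is a rough Python sketch of the candidate-proposal and per-bucket statistics steps, assuming eps controls the number of buckets (roughly 1/eps candidates). The helper names are made up for illustration; XGBoost’s real implementation builds on a quantile sketch rather than calling np.percentile directly.

```python
import numpy as np

def propose_candidates(x, eps=0.1):
    """Propose ~1/eps candidate split points from the percentiles of feature x."""
    n_buckets = int(1 / eps)
    cut_percentiles = np.linspace(0, 100, n_buckets + 1)[1:-1]   # interior cut points
    return np.percentile(x, cut_percentiles)

def bucket_statistics(x, g, h, candidates):
    """Accumulate G and H per bucket; each bucket could be processed independently."""
    bucket_ids = np.digitize(x, candidates)
    G = np.bincount(bucket_ids, weights=g, minlength=len(candidates) + 1)
    H = np.bincount(bucket_ids, weights=h, minlength=len(candidates) + 1)
    return G, H

# Toy usage: 40 points, 10 buckets -> 9 candidate cut points
rng = np.random.default_rng(0)
x, g, h = rng.normal(size=40), rng.normal(size=40), np.ones(40)
candidates = propose_candidates(x, eps=0.1)
print(bucket_statistics(x, g, h, candidates))
```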
Bucketing: Global Variant vs Local Variant
Bucketing can be done in two different ways, either global (per tree) or local (per split). This means the bucketing can be done once for the entire tree, or redone every time there is a split.
As you can see from the figure above, with the global variant the parent node splits into two child nodes at the best split point, and the red lines representing the bucket boundaries show that the children maintain the parent node’s bucket points.
The local variant, on the other hand, redoes the bucketing every time there is a split from the parent node: here the two child nodes (left and right) each get 10 fresh buckets. A small sketch contrasting the two variants is shown below.
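The difference between the two variants can be sketched in a few lines, reusing the percentile-proposal idea from above; propose here is a stand-in helper, not an XGBoost function.

```python
import numpy as np

def propose(x, eps=0.1):
    """Stand-in for the percentile-based candidate proposal."""
    cuts = np.linspace(0, 100, int(1 / eps) + 1)[1:-1]
    return np.percentile(x, cuts)

x_root = np.random.default_rng(0).normal(size=1_000)
left_mask = x_root < 0.0                       # pretend this is the first split

# Global variant: propose once on the root's data, children inherit the cut points.
global_candidates = propose(x_root)
left_candidates_global = global_candidates

# Local variant: re-propose on each child's own subset after every split.
left_candidates_local = propose(x_root[left_mask])
```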
*If you understand the concept, you might ask: what is the best number of buckets? For the global variant, since the same buckets are reused for the whole tree, it is best to use as many buckets as possible. The local variant, since it re-buckets at every split, is more beneficial for deeper trees. The bucket count is controlled by one of XGBoost’s hyperparameters, eps:
1/eps = number of candidate points
e.g., 1/0.1 = 10 →10 buckets
Weighted Quantile Sketch Algorithm
The approximate algorithm forms the basis of XGBoost’s split finding, but the model builds a further enhanced algorithm on top of it.
What is Sketch Algorithm?
Before going over the weighted quantile sketch algorithm, it helps to know the basics of a sketch algorithm. When a dataset is very large, computing over the entire dataset at once is slow and inefficient; it is better to run computations on small subsets in parallel and then combine the results into the final output. Think of it as using sample distributions to approximate the population distribution.
What is Quantile Sketch Algorithm?
Now that we know the idea of a sketch algorithm, let’s add quantiles to it. Just as we used percentiles in the approximate algorithm, we can use quantiles to find candidate split points.
- Split the large dataset into smaller datasets and process them in parallel to build an approximate histogram of the dataset (sample datasets →approximation of the original dataset)
- Use the approximate histogram to approximate the quantiles (the same approach as the approximate algorithm) and, using eps, turn them into n buckets. (It is called approximate because we are not 100% confident that combining the smaller datasets reproduces the original dataset exactly.)
This is indeed almost the same as the approximate algorithm, except that the sketch is applied when we are dealing with a large amount of data; a toy illustration follows below.
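The following is a toy illustration of that idea, assuming we may approximate the quantiles of a large dataset by summarizing chunks independently (in parallel) and merging the summaries. It is a deliberate simplification of a real quantile sketch such as the weighted sketch described in the XGBoost paper.

```python
import numpy as np

def chunk_summaries(x, n_chunks=4, points_per_chunk=50):
    """Summarize each chunk by a small set of its quantiles (each chunk can run in parallel)."""
    qs = np.linspace(0, 1, points_per_chunk)
    return [np.quantile(chunk, qs) for chunk in np.array_split(x, n_chunks)]

def merged_quantiles(summaries, eps=0.1):
    """Merge the chunk summaries and read off ~1/eps approximate quantiles."""
    merged = np.sort(np.concatenate(summaries))
    qs = np.linspace(0, 1, int(1 / eps) + 1)[1:-1]
    return np.quantile(merged, qs)

x = np.random.default_rng(0).normal(size=100_000)
print(merged_quantiles(chunk_summaries(x)))
```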
Now for the weighted quantile sketch.
Unlike a normal quantile, where each quantile holds an equal number of samples, a weighted quantile does not hold the same number of samples; instead, each quantile holds an equal (or similar) summed weight.
By rearranging the loss function after the Taylor expansion, we get the following equation, and as you can see hi is used as the weight. hi is the second-order derivative of the loss function (a.k.a. the Hessian), and it can be calculated for every instance i.
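As written in the XGBoost paper (up to constant terms and sign conventions), the rearranged objective is

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \frac{1}{2}\, h_i \left( f_t(x_i) - \frac{g_i}{h_i} \right)^2 + \Omega(f_t) + \mathrm{constant},$$

which is a weighted squared loss with label gi/hi and weight hi for each instance.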
For a regression problem with squared error, the weight of each instance is 1 (if you do the math this will make sense). For a classification problem, the Hessian of the log loss function becomes:
previous probability * (1 - previous probability)
Therefore, when the previously predicted probability is confident (close to 0 or 1), the data point receives little weight, but when the prediction is uncertain (close to 0.5), it receives more weight. As a result, quantiles covering regions where the predictions are not confident carry higher weight, and fewer data points end up in each of those quantiles. A small sketch of this weighting is shown after this paragraph.
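Below is a hedged sketch of how those Hessian weights could drive the candidate points: the log-loss Hessian is previous probability * (1 - previous probability), and the cut points are then chosen so that each bucket holds roughly equal total weight rather than an equal number of samples. The helper names are illustrative only.

```python
import numpy as np

def logloss_hessian(prev_prob):
    """Second-order derivative of the log loss w.r.t. the raw prediction: p * (1 - p)."""
    return prev_prob * (1.0 - prev_prob)

def weighted_cut_points(x, weights, n_buckets=10):
    """Pick cut points so the cumulative Hessian weight is split roughly evenly."""
    order = np.argsort(x)
    x_sorted, w_sorted = x[order], weights[order]
    cum_w = np.cumsum(w_sorted) / w_sorted.sum()
    targets = np.linspace(0, 1, n_buckets + 1)[1:-1]
    return x_sorted[np.searchsorted(cum_w, targets)]

# Toy usage: uncertain predictions (p near 0.5) get the largest weights
rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
prev_prob = rng.uniform(0.01, 0.99, size=1_000)
print(weighted_cut_points(x, logloss_hessian(prev_prob)))
```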
Sparsity-Aware Split Finding Algorithm
It would be nice if all real-world data were dense, but that is rarely the case: real-world datasets are almost always sparse. Sparsity can occur for various reasons:
- Missing values
- 0 entries
- Result of feature engineering (One-Hot Encoding)
Another advantage of XGBoost, its scalability, comes in here: it can process sparse data very efficiently (with expedited runtime). Let’s look at the pseudocode and understand the logic behind this algorithm.
1. Same input I and d as the previous algorithms, plus an additional input Ik
- Ik is specific to feature k: it is the set of instances whose value for feature k is not missing, while the missing (sparse) entries are the ones the algorithm assigns a default direction
2. Same initialization of G and H, because these are required for the split decision
3. For each feature k = 1 →m, do the following
Let’s assume we have the following dataset, with the orange boxes representing the sparse (missing) data.
4. Enumerate with missing values going to the right
Send all the missing values to the right side of the split and perform the split search (the same process as in the previous algorithms). The best split point will look like the figure below (the class labels are separated as well as possible).
5. Enumerate with missing values going to the left
The same approach, but this time send the missing entries to the left; computing the score with the same process results in:
As you can see from the figure, when the sparse data are sent to the left we get a cleaner split between the two classes (0 and 1), and therefore the left enumeration is chosen as the default direction for this feature.
This default-direction decision is made for every feature, so each feature gets its own direction. In this example, the age feature has left as its default direction and the male feature has right as its default.
The first split occurs on the age feature; because X1 is sparse in the age column, it moves to the left (default) node, while X2 is not sparse in age and, having an age less than 20, also ends up in the left node. Next, the male feature is chosen as the splitting criterion: X1 is not sparse for this feature and ends up in the left child node, while X2 is sparse in the gender column and therefore goes right, the default direction. A simplified sketch of the default-direction scoring follows below.
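Here is a simplified Python sketch of the default-direction idea under the same structure-score convention as the earlier greedy sketch: score the same threshold twice, once sending the missing entries right and once sending them left, and keep the better direction. Missing values are represented as NaN; this is illustrative only, not XGBoost’s actual code.

```python
import numpy as np

def split_with_default_direction(x, g, h, threshold, lam=1.0):
    """Try both default directions for the missing entries and keep the better one."""
    missing = np.isnan(x)
    present = ~missing
    left = np.zeros_like(missing)
    left[present] = x[present] < threshold      # non-missing values below the threshold
    right = present & ~left                     # non-missing values at or above it

    def score(left_mask, right_mask):
        GL, HL = g[left_mask].sum(), h[left_mask].sum()
        GR, HR = g[right_mask].sum(), h[right_mask].sum()
        return GL**2 / (HL + lam) + GR**2 / (HR + lam)

    score_missing_right = score(left, right | missing)
    score_missing_left = score(left | missing, right)
    if score_missing_left > score_missing_right:
        return "left", score_missing_left
    return "right", score_missing_right

# Toy usage: NaN marks the sparse entries of the feature
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
g = np.array([0.5, -1.0, 0.8, -0.3, 0.9])
h = np.ones(5)
print(split_with_default_direction(x, g, h, threshold=3.0))
```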
Conclusion
We have gone over the most fundamental parts of XGBoost: the mathematical intuition behind the objective function, and the algorithmic approaches that make XGBoost extreme and distinct from typical tree models. There is more to this model that makes it even more sophisticated, such as its system design and hardware optimization via cache-aware access, but those details are less important for understanding the theory behind the model.