Math behind GBM and XGBoost

Vikesh Singh Baghel
Published in Analytics Vidhya · Apr 13, 2019

Both GBM and XGBoost are gradient boosting based algorithms, but there is a significant difference in the way new trees are built in the two algorithms. Today, I am going to write about the math behind both of them.

Before I start, let’s understand what boosting in general is. Boosting in machine learning belongs to the ensemble model family, where the focus is primarily on reducing bias. The idea is to first build a model, find its residuals, and then build another model on those residuals. Repeat this process as many times as required. Mathematically, it can be represented as below.
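In additive form, after M rounds of boosting the final model is just the sum of the individual models:

F_M(x) = f_0(x) + \sum_{m=1}^{M} h_m(x)

Here f_0 is the initial model and each subsequent h_m is fitted on the residuals left by the models before it.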

Let’s say f(x) is your model, y is the actual value, gamma is the predicted value, and L is the loss function. The first model f0(x) is built on {xi, yi} as below:
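f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)

That is, f_0 is simply the constant prediction that minimizes the loss over the training set (for squared-error loss, the mean of y).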

Now, calculate the residuals (ri = yi − gamma, the actual value minus the predicted value) and build a second model h1(x) on {xi, ri}. Add h1(x) to f0(x) to get a new, improved model f1(x):
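f_1(x) = f_0(x) + h_1(x)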

Repeat the above process again and again. So, the generalized form of the equation at, say, iteration m will be:
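f_m(x) = f_{m-1}(x) + h_m(x)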

The above equation is the general equation for a boosting algorithm. There are several ways to decide what proportion of hm(x) should be added to fm−1(x). With a weight alpha, the equation looks like:
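f_m(x) = f_{m-1}(x) + \alpha_m\, h_m(x)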

There are several ways of calculating alpha, such as a simple average, a weighted average, AdaBoost-style weighting, or gradient boosting. Today, I am writing about gradient boosting.

Now the question comes: what is gradient boosting? The term gradient in gradient boosting comes from incorporating gradient descent into boosting. A gradient-descent-based method is used to decide alpha, or the step size. At iteration m, the pseudo residuals (rim) are calculated first and a new model hm(x) is built on {xi, rim}. The pseudo residual is calculated by:
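r_{im} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}

For squared-error loss L = \tfrac{1}{2}(y_i - f(x_i))^2, this is just the ordinary residual y_i - f_{m-1}(x_i), which is why it is called a pseudo residual.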

Now, calculate alpha such that the loss function is minimized:
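\alpha_m = \arg\min_{\alpha} \sum_{i=1}^{n} L\big(y_i,\; f_{m-1}(x_i) + \alpha\, h_m(x_i)\big)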

Now plug those values of alpha and hm(x) in to get fm(x):
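f_m(x) = f_{m-1}(x) + \alpha_m\, h_m(x)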

In GBM, the algorithm is the same as in gradient boosting. The model is decision-tree based, i.e. f(x) and h(x) are CART trees. For a tree with T leaves, the model hm(x) can be written as:
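h_m(x) = \sum_{j=1}^{T} b_{jm}\, \mathbb{1}\big(x \in R_{jm}\big)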

bjm is the value predicted (mean, median, majority vote, etc.) in the region Rjm (leaf j). If hm(x) for a tree is plugged into the gradient boosting equation, there will be both alpha and bjm. In GBM, alpha and bjm are combined to get a step rate for each leaf, so there will be T alphas (step rates) in a tree with T leaves. The equations for GBM become:
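\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i,\; f_{m-1}(x_i) + \gamma\big)

f_m(x) = f_{m-1}(x) + \sum_{j=1}^{T} \gamma_{jm}\, \mathbb{1}\big(x \in R_{jm}\big)

To make the loop concrete, below is a minimal sketch of a GBM-style fit for squared-error loss, assuming scikit-learn’s DecisionTreeRegressor as the base learner hm(x). For squared error, the mean that a regression tree predicts in each leaf already solves the per-leaf minimization above, so the sketch only applies a global shrinkage (learning_rate) instead of a separate line search; the function names gbm_fit and gbm_predict are just illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    y = np.asarray(y, dtype=float)
    f0 = float(np.mean(y))                 # f0(x): constant minimizing squared error
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred               # pseudo residuals for L = 1/2 (y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # h_m(x) fitted on {x_i, r_im}
        pred = pred + learning_rate * tree.predict(X)   # f_m = f_{m-1} + lr * h_m
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred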

So, we saw how GBM is built. You must have noticed that the step rate calculation in GBM requires two steps: 1) calculate the pseudo residuals, 2) calculate alpha. Also, there is a chance of overfitting in GBM, which can be reduced by adding regularization. XGBoost combines the two steps for calculating the step rate into one and adds a regularization term to the loss function to counter overfitting. In XGBoost, the tree at each iteration is built such that the gain is maximum at each split. Wait, what? Let’s see it step by step.

First, understand that the loss function in XGBoost has an additional regularization term, omega. This loss function is used to calculate the maximum gain, which can be used directly in the tree-building process while splitting each node. This is how the two-step process in GBM is reduced to a single step, with regularization.

Let’s see how Gain is calculated while building an XGBoost tree, starting with the loss function that needs to be minimized:
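L^{(m)} = \sum_{i=1}^{n} L\big(y_i,\; f_{m-1}(x_i) + h_m(x_i)\big) + \Omega(h_m)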

There can be several types of loss function (L in the equation), like MSE for regression or cross-entropy for classification. In order to form a general equation irrespective of the loss function, a second-order Taylor series expansion of L is done as below.
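L^{(m)} \approx \sum_{i=1}^{n} \Big[ L\big(y_i, f_{m-1}(x_i)\big) + g_i\, h_m(x_i) + \tfrac{1}{2}\, h_i\, h_m(x_i)^2 \Big] + \Omega(h_m)

g_i = \frac{\partial L\big(y_i, f_{m-1}(x_i)\big)}{\partial f_{m-1}(x_i)}, \qquad h_i = \frac{\partial^2 L\big(y_i, f_{m-1}(x_i)\big)}{\partial f_{m-1}(x_i)^2}

Here g_i and h_i are the gradient and hessian of the loss for training point i with respect to the current prediction (scalars per point, not to be confused with the model h_m(x)).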

The loss function is expanded like a Taylor series. The constant term L(yi, fm−1(xi)) is removed because it does not depend on hm(x) and will not contribute to the gain while splitting a node in the tree:
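\tilde{L}^{(m)} = \sum_{i=1}^{n} \Big[ g_i\, h_m(x_i) + \tfrac{1}{2}\, h_i\, h_m(x_i)^2 \Big] + \Omega(h_m)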

The regularization term omega for a tree can be written as:
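\Omega(h_m) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} b_j^2

Here \gamma penalizes the number of leaves T and \lambda penalizes large leaf values b_j (this \gamma is a different quantity from the predicted value gamma used earlier).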

Substituting the tree equation for hm(x) and dropping m (the iteration number) for neatness, the loss function becomes:
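\tilde{L} = \sum_{i=1}^{n} \Big[ g_i\, h(x_i) + \tfrac{1}{2}\, h_i\, h(x_i)^2 \Big] + \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} b_j^2
 = \sum_{j=1}^{T} \Big[ \Big(\sum_{i \in I_j} g_i\Big) b_j + \tfrac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) b_j^2 \Big] + \gamma T

where I_j is the set of training points that fall into leaf j, so that h(x_i) = b_j for every i \in I_j.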

Simplifying the equation further,
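Writing G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i, the loss becomes

\tilde{L} = \sum_{j=1}^{T} \Big[ G_j\, b_j + \tfrac{1}{2}\,(H_j + \lambda)\, b_j^2 \Big] + \gamma T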

Our objective is to minimize this loss function. If you look at the summation, each term is a parabola in bj. It can be seen as ax + bx², and the minimum of such a parabola occurs at x = −a/(2b). In our case, the optimal leaf value is bj = −Gj/(Hj + lambda). Substituting bj back into the loss function, the optimum loss becomes:
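\tilde{L}^{*} = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T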

The Gain when a leaf is split into two leaves (Left L and Right R) becomes:
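\mathrm{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma

The three fractions are the optimal-loss contributions of the left leaf, the right leaf, and the unsplit parent; the trailing \gamma is the cost of the extra leaf, so a split is only worthwhile if the gain is positive.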

Using this gain, XGBoost builds the tree structure, choosing at each node the split with the maximum gain.

I hope you liked this, and please clap if you did :)
