# Understanding xgboost’s Generalized Linear Model (gblinear)

Laurae: This post is about xgboost’s gblinear and its parameters. Elastic Net? Generalized Linear Model? Gradient Descent? Coordinate Descent?… The post was originally at Kaggle.

Ben Gorman wrote:

I’m trying to understand what gblinear does under the hood. Note that I’ve Googled this quite a bit and I’ve already seen all the very vague answers like “it implements a regularized linear model”, yada yada. I really want to understand this enough to reproduce XGBoost’s results.
Here’s a minimal example I worked up.
```
> train
        x      y
 1: 13.36  37.54
 2:  5.35  14.54
 3:  0.26  -0.72
 4: 84.16 261.19
 5: 24.67  76.90
 6: 22.26  67.15
 7: 18.02  53.89
 8: 14.29  43.48
 9: 61.66 182.60
10: 57.26 179.44
```
After I train a linear regression model and an xgboost model with 1 round and parameters {`booster="gblinear"`, `objective="reg:linear"`, `eta=1`, `subsample=1`, `lambda=0`, `lambda_bias=0`, `alpha=0`}, I get the following results:
```
> test
        x      y Pred.linreg Pred.xgb
 1: 47.75 153.23      146.25    155.7
 2: 12.13  40.05       35.78    107.9
 3: 89.05 274.37      274.34    211.1
 4: 38.87 116.51      118.71    143.8
 5: 27.30  80.61       82.83    128.2
 6: 87.66 267.95      270.02    209.3
 7: 39.33 114.97      120.14    144.4
 8: 64.32 191.73      197.64    177.9
 9: 13.18  48.28       39.04    109.3
10:  8.89  23.30       25.73    103.5
```
Can anyone replicate this and/or give me the formula I need to do it? Thanks!

If by “formula” you mean the code for the optimization, you will have to rewrite all the code from scratch.

To put it simply, it really is a “regularized linear model”, using delta updates with elastic net regularization (L1 + L2 + L2 bias) and parallel coordinate descent optimization.
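In other words, the objective being minimized has a standard elastic net form with a separately regularized bias. This is my own sketch of the objective using xgboost’s parameter names, not a formula taken from xgboost’s documentation or source:

```latex
\min_{w,\,b} \; \sum_i L\big(y_i,\; w^\top x_i + b\big)
  + \alpha \lVert w \rVert_1
  + \tfrac{\lambda}{2} \lVert w \rVert_2^2
  + \tfrac{\lambda_{\text{bias}}}{2}\, b^2
```

where `alpha` is the L1 (lasso) term, `lambda` the L2 (ridge) term on the feature weights, and `lambda_bias` the L2 term on the intercept.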

Therefore, what you need is:

• Code for coordinate descent
• Code for elastic net model with delta + delta bias
• Code for updating weights

Or, you pick an already existing elastic net model with delta + delta bias, and replace gradient descent with coordinate descent.

Computation-wise, its complexity is identical to gradient descent’s. Coordinate descent was widely covered in papers around 2012, in the wake of “big data”, because it outperforms gradient descent for very simple linear models, so finding papers explaining it should not be hard. Finding exactly which variant xgboost uses might be harder (if I remember correctly, there are 3 variations of the basic coordinate descent optimization — cross-checking against the xgboost source code will be necessary).
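As a rough illustration of the idea (not xgboost’s actual code), cyclic coordinate descent for plain least squares updates one weight at a time using the exact one-dimensional minimizer, holding the other weights fixed, while gradient descent moves all weights at once along the gradient:

```python
import numpy as np

def coordinate_descent_ols(X, y, n_passes=100):
    """Cyclic coordinate descent for least squares: each step exactly
    minimizes the squared error in one coordinate, holding the rest fixed."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_passes):
        for j in range(p):
            # Residual with feature j's contribution removed.
            residual = y - X @ w + X[:, j] * w[j]
            # Exact 1-D minimizer for coordinate j.
            w[j] = (X[:, j] @ residual) / (X[:, j] @ X[:, j])
    return w

# Toy check: a single feature with y = 3x recovers the slope in one pass.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
w = coordinate_descent_ols(X, y)
```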

The delta bias is for the intercept, while the delta is for the feature weights. The bias is updated by delta_bias * shrinkage. To compute the delta bias, you apply the regularization to the gradient over the hessian, which gives, for an optimization update: update = shrinkage * delta_bias = shrinkage * (gradient + L2_bias * value) / (hessian + L2_bias), where L2_bias is the ridge regularization (lambda_bias) on the intercept.

For the delta, it is 0 when the hessian is under 0.00001 (1e-5). To compute it, you proceed as for the delta bias, but with both L1 and L2 regularization, for an optimization update: update = shrinkage * delta = shrinkage * (gradient + L2 * value + L1) / (hessian + L2), where L1 is the lasso regularization (alpha) and L2 is the ridge regularization (lambda).
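The two update rules above can be sketched in a few lines. This is a minimal Python sketch of the formulas as described, not xgboost’s source; I assume xgboost’s gradient convention (gradient = prediction − label), so the step is taken with a negative sign, and I use a plain sign subgradient for the L1 term:

```python
def bias_delta(sum_grad, sum_hess, bias, lambda_bias, shrinkage):
    """Delta bias: L2-regularized gradient over hessian, scaled by shrinkage.
    Negative sign because grad = prediction - label (descent direction)."""
    return -shrinkage * (sum_grad + lambda_bias * bias) / (sum_hess + lambda_bias)

def weight_delta(sum_grad, sum_hess, w, alpha, lam, shrinkage):
    """Delta for a feature weight: zero below the hessian threshold,
    otherwise L1 + L2 regularized gradient over hessian, scaled by shrinkage."""
    if sum_hess < 1e-5:  # the 0.00001 threshold mentioned above
        return 0.0
    l1_term = alpha if w > 0 else -alpha  # subgradient of alpha * |w|
    return -shrinkage * (sum_grad + lam * w + l1_term) / (sum_hess + lam)
```

With all regularization off and shrinkage at 1 (as in the question’s parameters), each update reduces to -sum_grad / sum_hess, i.e. the Newton step for that single coordinate.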

```r
library(xgboost)

dtrain <- xgb.DMatrix(
  data = as.matrix(c(13.36, 5.35, 0.26, 84.16, 24.67, 22.26, 18.02, 14.20, 61.66, 57.26)),
  label = c(37.54, 14.54, -0.72, 261.19, 76.90, 67.15, 53.89, 43.48, 182.60, 179.44)
)

predict(
  xgb.train(data = dtrain, booster = "gblinear", objective = "reg:linear",
            eta = 1, subsample = 1, lambda = 0, lambda_bias = 0, alpha = 0,
            nrounds = 1),
  dtrain
)
# [1] 109.53953  98.78446  91.95010 204.60300 124.72551 121.48959 115.79652 110.66740 174.39214 168.48424
```