Understanding a bit of xgboost’s Generalized Linear Model (gblinear)

Laurae: This post is about xgboost’s gblinear and its parameters. Elastic Net? Generalized Linear Model? Gradient Descent? Coordinate Descent?… The post was originally at Kaggle.

Ben Gorman wrote:

I’m trying to understand what gblinear does under the hood. Note that I’ve Googled this quite a bit and I’ve already seen all the very vague answers like “it implements a regularized linear model”, yada yada. I really want to understand this enough to reproduce XGBoost’s results.
Here’s a minimal example I worked up.
> train
 x y
 1: 13.36 37.54
 2: 5.35 14.54
 3: 0.26 -0.72
 4: 84.16 261.19
 5: 24.67 76.90
 6: 22.26 67.15
 7: 18.02 53.89
 8: 14.29 43.48
 9: 61.66 182.60
 10: 57.26 179.44
After I train a linear regression model and an xgboost model with 1 round and parameters {`booster="gblinear"`, `objective="reg:linear"`, `eta=1`, `subsample=1`, `lambda=0`, `lambda_bias=0`, `alpha=0`}, I get the following results:
> test
 x y Pred.linreg Pred.xgb
 1: 47.75 153.23 146.25 155.7
 2: 12.13 40.05 35.78 107.9
 3: 89.05 274.37 274.34 211.1
 4: 38.87 116.51 118.71 143.8
 5: 27.30 80.61 82.83 128.2
 6: 87.66 267.95 270.02 209.3
 7: 39.33 114.97 120.14 144.4
 8: 64.32 191.73 197.64 177.9
 9: 13.18 48.28 39.04 109.3
 10: 8.89 23.30 25.73 103.5
Can anyone replicate this and/or give me the formula I need to do it? Thanks!

If by “formula” you mean the code for the optimization, you will have to rewrite all the code from scratch.

Put simply, it really is a “regularized linear model”: an elastic net (L1 + L2 + L2 bias regularization) fitted by delta updates with parallel coordinate descent optimization.
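As a sketch, the objective being minimized can be written as below. The function and variable names are mine, and the 0.5 scaling factors are an assumption for illustration, not taken from xgboost’s source:

```python
import numpy as np

def elastic_net_objective(w, bias, X, y, lam, lam_bias, alpha):
    """Sketch of the penalized squared-error objective described above:
    squared error + L1 and L2 penalties on the weights + an extra L2
    penalty on the bias. Scaling factors are illustrative assumptions."""
    resid = X @ w + bias - y
    return (0.5 * (resid ** 2).sum()
            + alpha * np.abs(w).sum()        # L1 (lasso), parameter alpha
            + 0.5 * lam * (w ** 2).sum()     # L2 (ridge), parameter lambda
            + 0.5 * lam_bias * bias ** 2)    # L2 on the intercept (lambda_bias)
```

With all three regularization parameters at 0, this reduces to plain least squares, which matches the parameter choices in the question.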

Therefore, what you need is:

  • Code for coordinate descent
  • Code for elastic net model with delta + delta bias
  • Code for updating weights

Or, you pick an already existing elastic net model with delta + delta bias, and replace gradient descent by coordinate descent.

Computation-wise, its complexity is identical to gradient descent. Coordinate descent was widely covered in papers around 2012, driven by “big data” and by how it outperforms gradient descent for very simple linear models, so finding papers explaining it should not be hard. Finding exactly which variant xgboost uses might be harder (if I remember correctly, there are 3 variations of the basic coordinate descent optimization; cross-checking with the xgboost source code will be necessary).
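To make the comparison concrete, here is a minimal sketch of cyclic coordinate descent on plain least squares (names and structure are mine, not xgboost’s). Each coordinate is minimized exactly while the others are held fixed, and one full sweep over the coordinates costs about as much as one gradient descent step:

```python
import numpy as np

def coordinate_descent_ls(X, y, n_sweeps=100):
    """Cyclic coordinate descent for ordinary least squares.
    Illustrative sketch only; xgboost's gblinear adds regularization
    and its own update variant on top of this basic scheme."""
    n, p = X.shape
    w = np.zeros(p)
    r = y - X @ w                  # current residual
    col_sq = (X ** 2).sum(axis=0)  # precomputed ||x_j||^2 per column
    for _ in range(n_sweeps):
        for j in range(p):
            # Exact minimizer along coordinate j, the others held fixed.
            delta = (X[:, j] @ r) / col_sq[j]
            w[j] += delta
            r -= delta * X[:, j]   # keep the residual in sync
    return w
```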

Delta bias is the update for the intercept, while delta is the update for the feature weights. The intercept is updated by delta_bias * shrinkage. To compute delta_bias, you apply the regularization to the gradient over the hessian, which gives, for one optimization update: update = shrinkage * delta_bias = shrinkage * (gradient + L2_bias * value) / (hessian + L2_bias), where L2_bias is the ridge regularization on the intercept (lambda_bias).

For delta, the update is 0 when the hessian is below 0.00001. Otherwise you compute it like delta_bias, but with both L1 and L2 regularization, for one optimization update: update = shrinkage * delta = shrinkage * (gradient + L2 * value + L1) / (hessian + L2), where L1 is the lasso regularization (alpha) and L2 is the ridge regularization (lambda).

As a minimization process is used, a sign inversion is required: the update is subtracted from, not added to, the current value.
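Putting the two update rules together, here is a minimal Python sketch of one such round for squared-error loss. It follows the delta / delta_bias formulas as described above (bias first, then each feature weight, refreshing the gradients after each move); it is my illustrative reconstruction, not xgboost’s exact source, and the L1 handling is a simplified soft-thresholding flavour (a full implementation also clamps the weight at zero):

```python
import numpy as np

def gblinear_round(X, y, weights, bias, eta=1.0, lam=0.0, lam_bias=0.0, alpha=0.0):
    """One gblinear-style update round for squared-error loss.
    Sketch of the described update rule, not xgboost's exact code."""
    pred = X @ weights + bias
    g = pred - y            # per-instance gradient of 0.5 * (pred - y)^2
    h = np.ones_like(y)     # per-instance hessian (constant 1 for squared error)

    # Intercept: ridge-regularized Newton step; the minus sign is the
    # sign inversion required by minimization.
    db = -eta * (g.sum() + lam_bias * bias) / (h.sum() + lam_bias)
    bias += db
    g += db                 # predictions moved, so refresh the gradients

    # Feature weights, one coordinate at a time.
    for j in range(X.shape[1]):
        hess_j = (h * X[:, j] ** 2).sum()
        if hess_j < 1e-5:   # delta is zero when the hessian is tiny
            continue
        grad_j = (g * X[:, j]).sum() + lam * weights[j]
        grad_j += -alpha if grad_j > 0 else alpha  # L1 shrinks the step
        dw = -eta * grad_j / (hess_j + lam)
        weights[j] += dw
        g += dw * X[:, j]   # keep gradients in sync for the next coordinate
    return weights, bias
```

On the 10-point example from the question (eta = 1, all regularization at 0), a single round of this sketch appears to reproduce predictions very close to the gblinear output shown below (around 109.5 for x = 13.36).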

Reproducing the result exactly is not guaranteed, as you need the same initialization. For nround = 1 it should be deterministic with any nthread, although it is not guaranteed across OSes. When nround > 1 and nthread > 1, results cannot be reproduced (unless this issue got fixed).

I’m getting this as a result:

> dtrain <- xgb.DMatrix(
+   data = as.matrix(c(13.36, 5.35, 0.26, 84.16, 24.67, 22.26, 18.02, 14.20, 61.66, 57.26)),
+   label = c(37.54, 14.54, -0.72, 261.19, 76.90, 67.15, 53.89, 43.48, 182.60, 179.44))
> predict(xgb.train(data = dtrain, booster = "gblinear", objective = "reg:linear",
+   eta = 1, subsample = 1, lambda = 0, lambda_bias = 0, alpha = 0, nrounds = 1), dtrain)
[1] 109.53953 98.78446 91.95010 204.60300 124.72551 121.48959 115.79652 110.66740 174.39214 168.48424

For the exact code, look at src/gbm/gblinear.cc in the xgboost repository. You can read it to understand in depth what is done when you use gblinear, but reproducing it will require you to solve the RNG + optimization code issues if you want “your own gblinear”.
