Let me learn the learning rate (eta) in xgboost! (or in anything using Gradient Descent optimization)

Laurae: This post is about choosing the learning rate in an optimization task (or in a supervised machine learning model, like xgboost in this example). The original post can be found on Kaggle.

It does not work exactly like this (the process is not supposed to be linear at all), but the analogy is close:

  • Imagine you need to take exactly 5.235 steps to reach the optimum for your model (optimum = perfect bias/variance tradeoff = no underfitting and no overfitting).
  • The learning rate is the shrinkage applied to every step you make. If you make 1 step at eta = 1.00, the step weight is 1.00. If you make 1 step at eta = 0.25, the step weight is 0.25.
  • If your learning rate is 1.00, you will land on either 5 or 6 (in either 5 or 6 computation steps), which is not the optimum you are looking for.
  • If your learning rate is 0.10, you will land on either 5.2 or 5.3 (in either 52 or 53 computation steps), which is closer to the optimum than before.
  • If your learning rate is 0.01, you will land on either 5.23 or 5.24 (in either 523 or 524 computation steps), which is closer still (see the numeric sketch after this list).
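To make the analogy concrete, here is a minimal numeric sketch (hypothetical toy code, not xgboost itself): we accumulate steps of weight eta until another step would no longer bring us closer to the 5.235 target, and count how many rounds that takes.

```python
# Toy simulation of the analogy above: steps of weight eta toward a fixed target.
def steps_to_target(eta, target=5.235):
    position, rounds = 0.0, 0
    # Keep stepping while one more step of size eta brings us closer to the target.
    while abs(position + eta - target) < abs(position - target):
        position += eta
        rounds += 1
    return position, rounds

for eta in (1.0, 0.1, 0.01):
    position, rounds = steps_to_target(eta)
    print(f"eta={eta:>5}: landed on {position:.3f} after {rounds} rounds")
```

Running this prints landings around 5, 5.2, and 5.23 after roughly 5, 52, and 523 rounds, matching the bullet points above: smaller eta lands closer to the target but needs proportionally more rounds.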

Therefore, to get the most out of xgboost, the learning rate (eta) must be set as low as possible. However, the lower the learning rate (eta), the more steps (rounds) you need to reach the optimum:

  • Increasing eta makes computation faster (because you need fewer rounds) but makes it harder to reach the best optimum.
  • Decreasing eta makes computation slower (because you need more rounds) but makes it easier to reach the best optimum (see the sketch after this list).
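In practice, you rarely guess the number of rounds by hand: you pick a low eta and let early stopping on a validation set decide how many rounds that eta needs. Here is a minimal sketch of that tradeoff, assuming the xgboost and scikit-learn packages are available; the dataset and hyperparameters are placeholders for illustration.

```python
# Sketch: trade eta against the number of boosting rounds, with early stopping
# on a validation set deciding how many rounds each eta actually needs.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

for eta in (1.0, 0.1, 0.01):
    params = {"objective": "binary:logistic", "eval_metric": "logloss", "eta": eta}
    booster = xgb.train(
        params,
        dtrain,
        num_boost_round=10000,        # generous cap; early stopping picks the real count
        evals=[(dval, "validation")],
        early_stopping_rounds=50,     # stop once validation logloss stops improving
        verbose_eval=False,
    )
    print(f"eta={eta}: best iteration={booster.best_iteration}, "
          f"best validation logloss={booster.best_score:.4f}")
```

With a setup like this, lower eta typically stops at a much later best iteration (more rounds, slower training) but often reaches an equal or better validation score, which is exactly the tradeoff described above.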