The Gradient in Gradient Boosted Trees: How to fit a tree to an arbitrary loss function
I’ve been trying to understand gradient boosted trees on and off for the last few weeks, and finding an answer to this question turned out to be surprisingly frustrating, since most introductions skim over it.
The general boosting algorithm is fairly intuitive and is covered in a lot of places, but the part that no-one seemed to explain was how to fit a single tree to minimize an arbitrary differentiable loss function.
The answer was to be found in The Elements of Statistical Learning, Second Edition, on pages 358–9, in the sections Steepest Descent and Gradient Boosting (note that this is 22 pages into the chapter covering gradient boosting):
1. Take your current model’s predictions on the training data as a vector, then take a single gradient step from that vector to reduce your loss function.
2. Subtract your original model’s predictions from this vector. You now have a “residual” for your arbitrary differentiable loss function — it is just the negative gradient of the loss with respect to the predictions, scaled by the step size.
3. Fit a tree to that residual using squared error loss, which has a simple non-gradient solution.

You have now taken a single gradient step using a tree.
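The steps above can be sketched in a few lines of NumPy. This is a toy illustration, not a production implementation: the data, the choice of absolute-error loss, the learning rate, and the hand-rolled depth-1 stump (standing in for a real regression tree) are all my own assumptions for the example.

```python
import numpy as np

# Toy 1-D regression data (assumed for illustration): a step function plus noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 50))
y = np.where(X < 5, 1.0, 3.0) + rng.normal(0, 0.1, 50)

# Current model: start with a constant prediction (the mean of y).
F = np.full_like(y, y.mean())

# An arbitrary differentiable loss: here, absolute error |F - y|.
# Its gradient with respect to the prediction F is sign(F - y).
grad = np.sign(F - y)

# Step 1: take a single gradient step of size lr from the current predictions.
lr = 0.5
stepped = F - lr * grad

# Step 2: subtract the original predictions; the "residual" is just -lr * grad.
residual = stepped - F

# Step 3: fit a tree to the residual with squared error. Here a depth-1 stump
# stands in for the tree; squared error gives a closed-form fit (the mean of
# the residual on each side of the best split).
def fit_stump(x, r):
    best = (np.inf, x[0], r.mean(), r.mean())
    for t in np.unique(x)[1:]:
        left, right = r[x < t], r[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

t, left_val, right_val = fit_stump(X, residual)
tree_pred = np.where(X < t, left_val, right_val)

# The boosted model after one tree: the loss should have decreased.
F_new = F + tree_pred
print("L1 loss before:", np.abs(F - y).mean())
print("L1 loss after: ", np.abs(F_new - y).mean())
```

Note that for squared error loss the negative gradient is proportional to the familiar residual y − F, which is why ordinary boosting on residuals is a special case of this procedure.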