The Gradient in Gradient Boosted Trees: How to fit a tree to an arbitrary loss function

I’ve been trying to understand gradient boosted trees on and off for the last few weeks, and finding the answer to the question in the title was surprisingly frustrating, since most introductions skim over it.

The general boosting algorithm is fairly intuitive and is covered in a lot of places, but the part that no one seemed to explain was how to fit a single tree to minimize an arbitrary differentiable loss function.

The answer was to be found in The Elements of Statistical Learning, Second Edition, on pages 358–9, in the sections “Steepest Descent” and “Gradient Boosting” (note that this is 22 pages into the chapter covering gradient boosting):

Take the predictions of your current model on the data you’re fitting, as a vector, and then take a single gradient step from that vector to reduce your loss function.

Subtract your original model output from this vector. You now have a residual for your arbitrary differentiable loss function (it is just the negative gradient of the loss, scaled by the step size).
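To make that arithmetic concrete, here is a minimal sketch in Python (using numpy); the choice of losses and the step size eta are my own illustrative assumptions, not something the post or ESL prescribes:

```python
import numpy as np

def residual_after_gradient_step(y, f, loss="squared", eta=1.0):
    """Take one gradient step from the current predictions f, then subtract f:
    what remains is -eta * dL/df, the scaled negative gradient of the loss."""
    if loss == "squared":        # L = 0.5 * (y - f)**2  ->  dL/df = f - y
        grad = f - y
    elif loss == "absolute":     # L = |y - f|           ->  dL/df = sign(f - y)
        grad = np.sign(f - y)
    else:
        raise ValueError("only 'squared' and 'absolute' are sketched here")
    stepped = f - eta * grad     # the single gradient step
    return stepped - f           # subtract the original model output
```

With squared error and eta = 1 this works out to y − f, the ordinary residual; with absolute error it is sign(y − f), matching the pseudo-residuals ESL lists for those losses.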

Fit a regression tree to that residual using squared-error loss, which has a simple non-gradient solution.

You have now done a single gradient step using a tree.
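Putting the steps together, one boosting iteration might look like the sketch below; scikit-learn’s DecisionTreeRegressor, the learning rate, and the tree depth are my own illustrative choices rather than anything prescribed by the post or ESL:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def squared_error_gradient(y, f):
    # dL/df for L = 0.5 * (y - f)**2
    return f - y

def boosting_step(X, y, current_pred, loss_gradient, learning_rate=0.1, max_depth=3):
    """One gradient-boosting iteration: fit a regression tree to the
    negative gradient (the 'residual' above) and add it to the model."""
    # Residual for the arbitrary differentiable loss: a single gradient
    # step away from the current predictions, minus the current predictions.
    residual = -loss_gradient(y, current_pred)

    # Fit a tree to that residual with squared-error splitting, which has
    # a simple closed-form (non-gradient) solution for each leaf value.
    tree = DecisionTreeRegressor(max_depth=max_depth, criterion="squared_error")
    tree.fit(X, residual)

    # The updated model is the old predictions plus a shrunken tree prediction.
    new_pred = current_pred + learning_rate * tree.predict(X)
    return tree, new_pred

if __name__ == "__main__":
    # Tiny synthetic example: start from a constant model and repeat the step.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X[:, 0] + rng.normal(scale=0.1, size=100)
    pred = np.full_like(y, y.mean())
    for _ in range(10):
        _, pred = boosting_step(X, y, pred, squared_error_gradient)
```

Repeating that step, each time fitting a new tree to the residuals of the updated model, is exactly the boosting loop described above.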
