The Gradient in Gradient Boosted Trees: How to fit a tree to an arbitrary loss function

kuza55
Jul 29, 2017

I’ve been trying to understand gradient boosted trees on and off for the last few weeks, and finding the answer to this question was surprisingly frustrating, since most introductions skim over it.

The general boosting algorithm is fairly intuitive and is covered in many places, but the part no one seemed to explain was how to fit a single tree to minimize an arbitrary differentiable loss function.

The answer is in The Elements of Statistical Learning, Second Edition, on pages 358–9, in the sections “Steepest Descent” and “Gradient Boosting” (note that this is 22 pages into the chapter on boosting):

1. Take the prediction of your current model on the data you’re fitting as a vector, then take a single gradient step from that vector to minimize your loss function.

2. Subtract your original model output from this vector. You now have a residual for your arbitrary differentiable loss function.

3. Fit a tree to that residual using squared error loss, which has a simple non-gradient solution.

4. You have now done a single gradient step using a tree.
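Concretely, here is a minimal sketch of those four steps in Python. The loss_grad callback, the logloss_grad example, the learning rate, and the use of scikit-learn’s DecisionTreeRegressor are my own illustrative choices, not anything prescribed by the book:

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_step(X, y, current_pred, loss_grad, learning_rate=0.1):
    """One boosting step: fit a tree to the gradient-step residual.

    loss_grad(y, pred) returns dL/dpred elementwise (hypothetical signature).
    """
    # Step 1: take a single gradient step from the current prediction vector.
    stepped = current_pred - learning_rate * loss_grad(y, current_pred)

    # Step 2: subtract the original predictions. This "residual" is just
    # the (scaled) negative gradient of the loss.
    residual = stepped - current_pred

    # Step 3: fit a regression tree to the residual with squared error,
    # which has a closed-form solution (the mean) at each leaf.
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)

    # Step 4: the tree's output approximates the gradient step.
    return current_pred + tree.predict(X), tree

# Example with log loss for binary classification (labels in {0, 1});
# predictions are log-odds, so dL/dpred = sigmoid(pred) - y.
def logloss_grad(y, pred):
    return 1.0 / (1.0 + np.exp(-pred)) - y

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

pred = np.zeros(len(y))
for _ in range(50):
    pred, _ = gradient_boost_step(X, y, pred, logloss_grad)
```

Note that since the residual is just the negative gradient times the learning rate, swapping in a different loss only means swapping the gradient function; the tree-fitting step never needs to know what the loss was.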
