The forgotten initial estimator in GBMs

Krzysztof Joachimiak
6 min read · Jun 24, 2023


Trying to find the lost estimator among the trees. Image from Freepik

Gradient Boosting Machines are an excellent complementary tool in a world dominated by neural networks. They succeed on relatively small, tabular datasets, on which neural networks tend to underperform.

There is one hyperparameter in these models, commonly known as n_estimators, which may be quite misleading. Typically, when the number of estimators equals X, the real number of estimators is actually X+1. In this article, we’ll try to extract this silently omitted, very first weak estimator in the GBM model.

The initial estimator

The way GBMs work is succinctly expressed in the picture with a golfer on explained.ai. During training, we create a series of weak models, one by one. Each newly fitted weak model tries to improve on the joint result of its predecessors. Typically, the initial naïve estimate we start with is:

  • regression: the average of the training targets
  • classification: the log-odds, i.e. the logarithm of the relative class frequencies¹⁰

Average doesn’t necessarily mean… the mean :). If we’re optimizing MSE, we’re looking for a conditional mean, so calculating the mean of the training targets in the initial step is a good choice. For MAE, we’d rather calculate the median, and so on. Because of its character, the initial estimator is simply referred to as the model bias. It’s named that way, for example, in the CatBoost and XGBoost documentation.

Equation based on a similar one found in the CatBoost docs
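To make this concrete, here’s a minimal sketch of how such initial constants can be computed, using hypothetical toy arrays (y_reg for regression targets, y_clf for binary labels):

import numpy as np

# Hypothetical toy targets, just for illustration
y_reg = np.array([1.0, 2.0, 2.0, 5.0, 10.0])   # regression targets
y_clf = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])  # binary labels

# MSE loss: the constant minimizing squared error is the mean
init_mse = y_reg.mean()

# MAE loss: the constant minimizing absolute error is the median
init_mae = np.median(y_reg)

# Log loss (binary classification): the optimal constant raw score is the log-odds
p = y_clf.mean()
init_log_loss = np.log(p / (1 - p))

print(init_mse, init_mae, init_log_loss)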

The XGBoost docs correctly state that:

For sufficient number of iterations, changing this value will not have too much effect.

But who knows what a sufficient number of iterations is? So the choice of the initial guess may still matter.

How is it done in GBM libraries?

XGBoost

In XGBoost, we can manually define the base_score / base_margin hyperparameter. Confusingly, it used to be set to 0.5 by default, even for regression (although it looks like a reasonable starting value for binary classification).

StatQuest on XGBoost regression with base_score=0.5

According to the XGBoost documentation, the base_score now

(…) is automatically estimated for selected objectives before training. To disable the estimation, specify a real number argument.

The library doesn’t expose any nice interface to fetch this value, but it’s still possible, unlike in LightGBM. Or, to be more precise, we can access the base_score attribute, but if we don’t pass a custom value, it stores None. Digging into the GitHub repo, we come across a useful get_basescore function. And something puzzles me here: in theory, XGBoost automatically estimates a relevant initial guess, so we should get e.g. the mean target value for the MSE loss. We can even find C++/CUDA code that seems to do that. But as far as I checked, the base_score extracted from the bowels of XGBRegressor equals 0.5. Maybe it’s time to report an issue…? I used the newest (1.7.6) XGBoost version when testing it.

UPDATE 19th October 2023

Starting from version 2.0.0, XGBoost estimates base_score instead of using a predefined constant. See the fragment below from the release notes.

Automatically estimated intercept

In the previous version, base_score was a constant that could be set as a training parameter. In the new version, XGBoost can automatically estimate this parameter based on input labels for optimal accuracy. (#8539, #8498, #8272, #8793, #8607)

import json
import xgboost as xgb

# See: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/testing/updater.py
def get_basescore(model: xgb.XGBModel) -> float:
    """Get base score from an XGBoost sklearn estimator."""
    base_score = float(
        json.loads(model.get_booster().save_config())["learner"]["learner_model_param"][
            "base_score"
        ]
    )
    return base_score

# Training a model
xgb_reg = xgb.XGBRegressor()
xgb_reg.fit(X_train, y_train)

# It seems it always returns... 0.5
# At least as long as we don't set a custom value manually
# with xgb.XGBRegressor(base_score=np.mean(y_train))
get_basescore(xgb_reg)
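As a workaround, we can always set the initial guess ourselves, as the comment above hints. A minimal sketch, reusing the get_basescore helper and the X_train/y_train variables from the snippet above:

import numpy as np
import xgboost as xgb

# Explicitly passing the target mean as the initial guess
# (a sensible choice for squared-error regression)
xgb_reg_mean = xgb.XGBRegressor(base_score=float(np.mean(y_train)))
xgb_reg_mean.fit(X_train, y_train)

# This time get_basescore() returns the value we passed in
get_basescore(xgb_reg_mean)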

CatBoost

CatBoost also uses a baseline estimator and considers it to be the bias of the model. We can easily get this value using the get_scale_and_bias method.

import numpy as np
from catboost import CatBoostClassifier, CatBoostRegressor

# True by default
cb_clf = CatBoostClassifier(boost_from_average=True, verbose=False)
cb_reg = CatBoostRegressor(boost_from_average=True, verbose=False)

# Fitting
cb_clf.fit(X_train, y_train)
cb_reg.fit(X_train, y_train)

# Returning (scale, bias); the bias is the initial (zeroth) estimate
cb_clf.get_scale_and_bias()

# When boost_from_average=False, the bias equals zero
CatBoostRegressor(boost_from_average=False, verbose=False) \
    .fit(X_train, y_train) \
    .get_scale_and_bias()
# (1.0, 0.0)

# Compare biases calculated for different loss functions
mean_bias = \
    CatBoostRegressor(
        boost_from_average=True,
        objective='RMSE',
        verbose=False,
    ) \
    .fit(X_train, y_train) \
    .get_scale_and_bias()[1]

np.isclose(mean_bias, np.mean(y_train))

# Surprisingly, the bias for MAE is not equal to the median
median_bias = \
    CatBoostRegressor(
        boost_from_average=True,
        objective='MAE',
        verbose=False,
    ) \
    .fit(X_train, y_train) \
    .get_scale_and_bias()[1]

np.isclose(median_bias, np.median(y_train))
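For classification we can run an analogous sanity check. The sketch below assumes 0/1 labels in y_train and that the bias for the default Logloss objective corresponds to the log-odds of the positive class frequency; it’s worth verifying this assumption on your CatBoost version:

import numpy as np
from catboost import CatBoostClassifier

# Log-odds of the positive class frequency (assumes 0/1 labels in y_train)
p = np.mean(y_train)
log_odds = np.log(p / (1 - p))

# Bias reported by CatBoost for the default Logloss objective
clf_bias = CatBoostClassifier(boost_from_average=True, verbose=False) \
    .fit(X_train, y_train) \
    .get_scale_and_bias()[1]

log_odds, clf_bias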

LightGBM

If boost_from_average=True, LightGBM applies the average as the first estimator. At the moment, it’s only used with the regression (MSE), binary, multiclassova and cross-entropy objectives. Unfortunately, despite leveraging this technique, we can’t extract the initial value. This was explained in a GitHub issue.

(…) E.g. for a binary prediction problem where the training labels are 90% ones and 10% zeros, we would start with a constant prediction of 0.9 and then add trees to improve the accuracy. This is indeed how LightGBM works, and this constant value is added to the leaf values of the first tree. So if you use Booster.save_model() (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html?highlight=save%20model#lightgbm.Booster.save_model) the leaf values for the first tree include this baseline value.
The relevant part of the C++ code is in TrainOneIter

LightGBM/src/boosting/gbdt.cpp

In the first iteration (when gradients and hessians are nullptr), it calls BoostFromAverage which calculates the constant initial prediction. Then it calculates the optimal tree (fitting to the error from the constant prediction), and later it calls AddBias to add the constant to the individual leaf values.

This answer was written in December 2020, but I verified this information with the current version, and it still works that way. So what should we do if we’d like to know this value? We have to compute it on our own. Still, we have to remember that the initial naïve model is only used for the limited set of popular loss functions named above. See the relevant pieces of the C++ code.

import numpy as np
from lightgbm import LGBMClassifier, LGBMRegressor

# boost_from_average is True by default
lgb_clf = LGBMClassifier(boost_from_average=True)
lgb_reg = LGBMRegressor(boost_from_average=True)

# Fitting - the initial naïve estimate is computed internally
# and folded into the first tree's leaf values
lgb_clf.fit(X_train, y_train)
lgb_reg.fit(X_train, y_train)

# We can only calculate it "by hand", e.g. the mean for regression (MSE)
y_mean = np.mean(y_train)
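For binary classification, the analogous by-hand value would be the log-odds of the positive class frequency; that’s my reading of the explanation quoted above, assuming 0/1 labels in y_train:

import numpy as np

# The raw-score counterpart of "starting from the average label":
# the log-odds of the positive class frequency
p = np.mean(y_train)
y_init_raw_score = np.log(p / (1 - p))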

scikit-learn

By default, it takes the value of the init_estimator attribute of the loss function object. In all cases, it’s implemented using DummyRegressor or DummyClassifier. You can also pass a custom estimator through the init argument. Browse the scikit-learn repo to see the full list of the loss functions’ initial estimators.

from sklearn.ensemble import \
    GradientBoostingClassifier, GradientBoostingRegressor

# loss and init are the two arguments which control
# the choice of the initial estimator
gbm_clf = GradientBoostingClassifier(loss=..., init=...)
gbm_reg = GradientBoostingRegressor(loss=..., init=...)

# Fitting
gbm_clf.fit(X_train, y_train)
gbm_reg.fit(X_train, y_train)

# You can access the fitted initial estimator with the init_ attribute
gbm_clf.init_
gbm_reg.init_

# estimators_ only contains the tree estimators
gbm_clf.estimators_
gbm_reg.estimators_
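Finally, a small sketch of plugging in a custom initial estimator via init; the loss name 'absolute_error' assumes a reasonably recent scikit-learn version:

from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor

# A median predictor is a natural initial guess when optimizing absolute error
gbm_median = GradientBoostingRegressor(
    loss='absolute_error',
    init=DummyRegressor(strategy='median'),
)
gbm_median.fit(X_train, y_train)

# The fitted initial estimator lands in init_ and can be queried directly
gbm_median.init_.predict(X_train[:1])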

See also

Extracting trees from GBM models as data frames | by Krzysztof Joachimiak | Jun, 2023 | Medium

References

  1. Some Details on Running xgboost, 2019
  2. Pitfalls of Incorrectly Tuned XGBoost Hyperparameters, 2022
  3. xgboost: The meaning of the base_score parameter — Stack Overflow
  4. offset — base_margin or init_score for catboost regressor — Stack Overflow
  5. How to get base_score from trained booster — XGBoost
  6. How to get individual tree’s prediction value for XGBoost Regressor? — Stack Overflow
  7. How exactly LightGBM predictions are obtained? — GitHub issue
  8. Using base_score with XGBClassifier to provide initial priors for each target class — Stack Overflow
  9. XGBoost — Prediction
  10. Multiclass gradient boosting: how to derive the initial guess, how to predict a probability — Cross Validated
  11. Gradient Boosting Classification Supervised Learning Algorithms, Maël Fabien
