Incremental Learning with XGBoost

Kiran Kumar
3 min read · Jun 17, 2024


In machine learning, incremental learning is an approach in which a model is trained on new data while retaining what it has already learned from older data. This is very useful when the input data naturally becomes available over time or is too large to fit in available memory for training.

The two main advantages of incremental learning are:

  1. Efficient use of resources: because each training run uses only a small volume of new data, incremental learning can save resource costs and train models fast.
  2. Adapting to evolving patterns: since the model is trained on newly available data, it can adapt to the new patterns found in that data.

XGBoost has been one of the best-performing algorithms for many tasks involving tabular data, which makes it a prime candidate for classification and regression problems on such data.

In this article, we will explore how to perform incremental learning for XGBoost models and share some practical insights.

In the next article, we will explore the impact of incremental learning on the latency of model predictions, where we will see that latency increases linearly with the number of incremental training iterations.

Incremental Learning in XGBoost

Incremental learning in XGBoost is done by continuing to train new gradient-boosted trees (estimators) on newly available data, adding them to the existing estimators.

That is, say you have a dataset D1 on which you have trained an XGBoost model xgb1. After some time you receive d2, which contains new data the model has not been trained on. You can then train an updated model xgb2 that starts from xgb1 and fits new estimators on d2.

Of course, you need to make sure that d2 uses the same feature space as D1. To achieve that, you may have to run the same data processing and feature engineering steps before starting the training iterations.
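For example, if the batches are prepared as pandas DataFrames, a simple consistency check before continuing training could look like the sketch below (d1_df and d2_df are hypothetical names for the processed feature frames):

# a minimal sketch, assuming both batches are pandas DataFrames
# d1_df / d2_df are hypothetical names for the processed feature frames
missing = set(d1_df.columns) - set(d2_df.columns)
extra = set(d2_df.columns) - set(d1_df.columns)
assert not missing and not extra, f"feature mismatch: missing={missing}, extra={extra}"
# align the column order of the new batch with the original training data
d2_df = d2_df[d1_df.columns]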

At a high level, this is what it looks like:

# high level pseudo code
# on day 1
xgb1 = xgb.fit(D1)
xgb1.tofile("filename_xgb_1.model")
predictions = load_model("filename_xgb_1.model").predict_proba(new_data)
# ____________________
# on day 30
xgb1 = load_model("filename_xgb_1.model")
xgb2 = xgb.fit(d2, xgb1)  # continue training from xgb1 on the new data
xgb2.tofile("filename_xgb_2.model")
predictions = load_model("filename_xgb_2.model").predict_proba(new_data)
# ____________________
# on day 60
xgb2 = load_model("filename_xgb_2.model")
xgb3 = xgb.fit(d3, xgb2)  # and so on
xgb3.tofile("filename_xgb_3.model")
predictions = load_model("filename_xgb_3.model").predict_proba(new_data)

# ____________________
## continue as needed

Here is an example you can run yourself:

## refer to
## https://xgboost.readthedocs.io/en/stable/python/examples/continuation.html
## for more examples
import os
import pickle
import tempfile

from sklearn.datasets import load_breast_cancer

import xgboost


## this function demonstrates one iteration of incremental learning,
## or "training continuation" as it is called here;
## for a more generalized implementation you can persist the model output and
## build appropriate helper functions to orchestrate the training
def training_continuation(tmpdir: str, use_pickle: bool) -> None:
    """Basic training continuation."""
    # Train 128 iterations in 1 session
    X, y = load_breast_cancer(return_X_y=True)
    clf = xgboost.XGBClassifier(n_estimators=128, eval_metric="logloss")
    clf.fit(X, y, eval_set=[(X, y)])
    print("Total boosted rounds:", clf.get_booster().num_boosted_rounds())

    # Train 128 iterations in 2 sessions, with the first one running for 32
    # iterations and the second one running for 96 iterations
    clf = xgboost.XGBClassifier(n_estimators=32, eval_metric="logloss")
    clf.fit(X, y, eval_set=[(X, y)])
    assert clf.get_booster().num_boosted_rounds() == 32

    # load back the model, this could be a checkpoint
    if use_pickle:
        path = os.path.join(tmpdir, "model-first-32.pkl")
        with open(path, "wb") as fd:
            pickle.dump(clf, fd)
        with open(path, "rb") as fd:
            loaded = pickle.load(fd)
    else:
        path = os.path.join(tmpdir, "model-first-32.json")
        clf.save_model(path)
        loaded = xgboost.XGBClassifier()
        loaded.load_model(path)

    # train the remaining 96 rounds, continuing from the loaded model
    clf = xgboost.XGBClassifier(n_estimators=128 - 32, eval_metric="logloss")
    clf.fit(X, y, eval_set=[(X, y)], xgb_model=loaded)

    print("Total boosted rounds:", clf.get_booster().num_boosted_rounds())

    assert clf.get_booster().num_boosted_rounds() == 128
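
The function above is only defined, not called; following the documentation example it is adapted from, you can drive it with a temporary directory like this:

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        training_continuation(tmpdir, use_pickle=False)
        training_continuation(tmpdir, use_pickle=True)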

If you have a dataset that is too large to fit into memory, store and/or read it in parts and train incrementally using the following approach:

from xgboost import XGBClassifier


def read_my_data(lower_index, upper_index):
    # your logic to load one batch of rows [lower_index, upper_index) goes here;
    # it should return a feature matrix X and a label vector y
    ...


# build the estimator with the best params from HPO results
estimator = XGBClassifier(max_depth=3, eta=0.3, gamma=1, colsample_bytree=0.8,
                          min_child_weight=1, subsample=0.8, n_estimators=50)
training_iterations = 10
for i in range(training_iterations):
    lower_index = i * 10000
    upper_index = (i + 1) * 10000
    X, y = read_my_data(lower_index, upper_index)
    # continue from the previously fitted booster after the first batch
    prev_model = estimator.get_booster() if i > 0 else None
    estimator.fit(X, y, xgb_model=prev_model)
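
As a quick sanity check (assuming the parameters above), each fit call adds 50 new trees, so after 10 batches the booster should contain 500 boosted rounds; you can then persist the final model:

# 10 batches x 50 estimators per fit call = 500 boosted rounds in total
print("Total boosted rounds:", estimator.get_booster().num_boosted_rounds())
estimator.save_model("incremental_xgb.json")  # hypothetical file name

Keep in mind that the total tree count keeps growing with every incremental iteration, which is what drives the prediction latency increase discussed in the next article.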

For more examples and documentation:

  1. https://xgboost.readthedocs.io/en/stable/python/examples/continuation.html
  2. https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.training
  3. https://shunya-vichaar.medium.com/incremental-learning-in-xgboost-b3eac6135ce
