Building a tunable and configurable custom objective function for XGBoost

Ido Finder
AppsFlyer Engineering
11 min read · Mar 16, 2023


Extreme Gradient Boosting, or XGBoost, has gained immense popularity in recent years for solving a wide range of problems involving tabular data. With its ability to handle missing values, perform feature selection, and leverage parallel processing, XGBoost has emerged as a top choice among data scientists and machine learning practitioners.

This blog post will focus on building advanced custom objective functions; we will discuss a few best practices for implementing them and integrating them into the hyperparameter tuning process.

XGBoost and objective functions

This post will not cover the full mechanism of XGBoost or dive deep into its mathematics. However, one concept worth mentioning is how XGBoost uses its defined loss function. Basically, XGBoost uses an approximation of the loss function based on a Taylor expansion up to the second derivative. Then, when minimizing the approximated loss, the equations required for building the trees can be expressed using the first and second derivatives, also known as the Gradient and Hessian, respectively. Don’t worry, we will demonstrate how to use the gradient and hessian in this blog post (visit the XGBoost documentation to learn more about the mechanism).
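To make this concrete, at each boosting step t XGBoost minimizes roughly the following second-order approximation (a sketch of the math from the XGBoost paper, where f_t is the new tree and Ω is its regularization term):

L(t) ≈ Σᵢ [ l(yᵢ, ŷᵢ(t−1)) + gᵢ · f_t(xᵢ) + ½ · hᵢ · f_t(xᵢ)² ] + Ω(f_t)

where gᵢ = ∂l/∂ŷᵢ(t−1) is the gradient and hᵢ = ∂²l/∂(ŷᵢ(t−1))² is the Hessian of the loss l for sample i.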

The XGBoost library comes with several built-in objective functions, each designed for a different use case. For example, the “reg:squarederror” objective is usually used for regression problems, while the “binary:logistic” objective is used for binary classification problems (more objective functions can be found in XGBoost’s documentation).

Why use built-in objective functions? Well, for one thing, they can be easily defined in the hyperparameters configuration of the XGBoost model. With that being said, if we decide to use a built-in objective function, we must make sure that the chosen objective is the right one for our predictive task. Otherwise, the training will not yield a very useful model.
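For instance, a built-in objective is selected with a single key in the parameters dictionary. A minimal sketch (the other values here are purely illustrative):

# a built-in objective is selected via the "objective" key
params = {
    "objective": "binary:logistic",  # built-in objective for binary classification
    "max_depth": 5,
    "eta": 0.15
}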

The biggest drawback of built-in objectives is that they aren’t always capable of dealing with complex problems. Occasionally, business needs might require us to focus on specific KPIs for which no built-in objective is suitable.

One way of overcoming such challenges is to build a custom objective function that follows the required business needs. The ability to customize XGBoost’s objective function makes it a very useful tool for solving unique and complex problems, while leveraging the power and ease of use of the entire XGBoost library.

Is hyperparameter tuning relevant to our objective function?

The answer is yes. XGBoost has a large number of hyperparameters that can be tuned to optimize the model’s performance, such as the learning rate, maximum depth of trees, and regularization strength. Choosing the right ones can improve the model’s predictive power and reduce overfitting.

But what about custom objective functions? Well, in some cases, objective functions contain their own hyperparameters that need to be tuned. One example is the “reg:tweedie” objective, which has a “tweedie_variance_power” parameter that controls the variance power of the distribution. While “tweedie_variance_power” is a built-in parameter that can easily be set via the model’s parameters dictionary, things become more complicated when constructing a custom objective with new, custom parameters.

Being able to tune these custom objective functions allows data scientists to tailor their models to their specific needs and optimize performance for their specific use cases. In the following sections, we will explain how to tune new hyperparameters that we’ve added to our custom objective function, thus making our XGBoost models better than ever!

Implementing a custom objective function

Let’s start by understanding how to implement our own simple objective function for our XGBoost model. For simplicity, let’s look at the squared error objective that tries to minimize the mean squared error metric (MSE), which is defined by the following equation:

Fig. 1 — MSE = (1/n) · Σᵢ (y_pred,i − y_actual,i)², where y_pred and y_actual denote the vectors of predictions and labels respectively, and n is the number of samples in the data

To implement it as an objective function, we will need to calculate the first and second derivatives of the squared error term w.r.t y_pred:
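Differentiating the per-sample squared error term gives (the 1/n factor is conventionally dropped, as XGBoost works with per-sample gradients):

grad_i = ∂(y_pred,i − y_true,i)² / ∂y_pred,i = 2 · (y_pred,i − y_true,i)
hess_i = ∂²(y_pred,i − y_true,i)² / ∂y_pred,i² = 2

In code: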

import numpy as np
import xgboost as xgb
from typing import Tuple

def my_squared_error(y_pred: np.ndarray,
                     dtrain: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
    y_true = dtrain.get_label()
    grad = 2 * (y_pred - y_true)          # first-order derivative (gradient) per sample
    hess = np.repeat(2, y_true.shape[0])  # second-order derivative (Hessian) is the constant 2
    return grad, hess

grad represents the first-order derivative of the squared error term w.r.t. y_pred, and hess is the second-order derivative.

A few things to keep in mind regarding the objective function:

  1. The order of the function’s arguments must remain as shown in the code, starting with the y_pred vector, followed by dtrain, which is a DMatrix of the training data.
  2. The function must return two values: the first-order derivative of the loss term (gradient) and the second-order derivative (Hessian).
  3. Pay attention to the sign of the gradients and Hessians, which depends on the position of the y_pred vector in the error term (see the short sketch after this list).
  4. The grad and hess variables are vectors whose length equals the number of samples in the training data.
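To illustrate point 3, here is a hypothetical variant (not from the original post) in which the error term is written as (y_true − y_pred) instead; the chain rule then flips the sign of the gradient:

def my_squared_error_flipped(y_pred: np.ndarray,
                             dtrain: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
    # With error = (y_true - y_pred), the derivative w.r.t. y_pred picks up a minus sign:
    # d/dy_pred (y_true - y_pred)^2 = -2 * (y_true - y_pred),
    # which is numerically identical to 2 * (y_pred - y_true) from before.
    y_true = dtrain.get_label()
    error = y_true - y_pred
    grad = -2 * error
    hess = np.repeat(2, y_true.shape[0])
    return grad, hess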

In order to train the XGBoost model with our implementation of the objective function, we can pass our function to the train method via the “obj” argument as follows:

params = {
    "max_depth": 5,
    "eta": 0.15,
    "eval_metric": "rmse"
}

# dtrain is an xgb.DMatrix holding the training features and labels
model = xgb.train(params=params,
                  dtrain=dtrain,
                  num_boost_round=100,
                  obj=my_squared_error)

Let’s look at the following synthetic example, where the target variable is a user’s lifetime value (LTV) in USD. Fitting our model to this data resulted in the following plot:

Fig. 2 — The X-axis represents the actual values in $, the Y-axis represents the predicted values in $, and the 45-degree red line represents a perfect result

The above plot shows that our implementation of the squared error objective function worked as expected. While minimizing the MSE, the predictions are scattered pretty much symmetrically around the 45-degree line, with no visible success in predicting the high-LTV users. That was pretty simple, right?

Taking the custom objective to the next level

Now, let’s make things more complicated. Suppose fitting a simple regressor to our data is not enough, and we want to build an objective function that follows certain business needs. For example, let’s define a “valuable user” as one who spent more than a certain amount of dollars. Our goal is to avoid underestimating the potential revenue of these users as much as possible.

We can write the equation of our new asymmetric custom loss as follows:

Fig. 3 — L(y_pred, y_true) = delta · (y_pred − y_true)² if y_true > tau and y_pred < y_true, else (y_pred − y_true)²; delta denotes the level of “punishment” for under-predicting valuable users, and tau denotes the amount of $ over which a user is defined as valuable

Before jumping into the objective function implementation, it is worth noting two important aspects:

  1. Two new variables have been added to the objective function, in addition to the y_pred vector and the dtrain object. The first is tau, the threshold defining a “valuable user”. This variable is configurable and decided based on business logic. The second is delta, the level of “punishment” for under-predicting the “valuable users”. The delta is a new hyperparameter, and its value should be decided through the hyperparameter tuning procedure.
  2. The new variables cannot be passed to the function in the current setup, since the custom objective function must keep the structure presented earlier. The “obj” argument of XGBoost’s train method expects a Callable with exactly that structure.
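A quick check of the math before the implementation: differentiating the loss in Fig. 3 piecewise w.r.t. y_pred gives grad = delta · 2 · (y_pred − y_true) and hess = delta · 2 whenever y_true > tau and y_pred < y_true (an under-predicted valuable user), and the plain grad = 2 · (y_pred − y_true), hess = 2 everywhere else. This is exactly the condition encoded by np.where in the code below.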

To overcome this challenge, we will leverage Python closures and wrap the implementation of the gradient and hessian with another function that takes the additional variables as arguments.

def my_assymetric_error_wrapper(tau, delta):
    def my_assymetric_error(y_pred, dtrain):
        y_true = dtrain.get_label()
        error = y_pred - y_true
        # apply the extra "punishment" only to under-predicted valuable users
        grad = np.where((y_true > tau) & (error < 0), delta * 2 * error, 2 * error)
        hess = np.where((y_true > tau) & (error < 0), delta * 2, 2)
        return grad, hess
    return my_assymetric_error

Now, we can easily use our new asymmetric custom objective within the train method of XGBoost, while the values of tau and delta are set outside of the function. This technique enables us to either configure a parameter (tau) or tune one (delta). For example purposes, let’s set tau = $8 and delta = 10:

params = {
    "max_depth": 5,
    "eta": 0.15,
    "eval_metric": "rmse"
}

model = xgb.train(params=params,
                  dtrain=dtrain,
                  num_boost_round=150,
                  obj=my_assymetric_error_wrapper(tau=8, delta=10))

The training will now be based on our new custom objective logic while using the defined values of tau and delta. Fitting the model on our data resulted in the following:

Fig. 4 — The effect of the new asymmetric custom objective function on the model’s predictions

We can see that the new asymmetric custom objective did a good job of avoiding (as much as possible) underestimating the valuable users (revenue > $8), while behaving pretty much the same way as the “squarederror” objective for the rest of them.

Tune the new hyperparameter

Following our previous example, we can understand that tau is a configurable parameter that can be determined based on some business logic. On the other hand, determining the value of delta is much harder as it behaves more like a hyperparameter of the model. Therefore, the delta parameter needs to be tuned together with the other hyperparameters of the model.

But should we worry about the fact that delta is a new hyperparameter that is not part of the built-in parameters of the XGBoost library? Not at all. Since we used Python closures to implement our new custom objective function, we can now pass any value to our new parameter. Thus, we can leverage this capability in a hyperparameter tuning procedure and optimize our model’s performance.

In this blog post, we chose to use Optuna, which is an open-source hyperparameter optimization framework. It provides several sampling and pruning algorithms and supports various machine learning and deep learning frameworks. One of the key advantages of Optuna is that it is highly customizable, allowing users to define their own search spaces, objective functions, and optimization strategies. Optuna also supports parallel and distributed computing, making it scalable and able to handle large-scale hyperparameter optimization tasks.

Custom objective function + Optuna

The first step in working with Optuna is to implement the optimization objective, which is the core logic of the optimization process. It consists of three main components:

  1. Setting the hyperparameter sampling space. Here we define which hyperparameters we want to tune and their value spaces. In our case, besides setting the XGBoost built-in hyperparameters, we set a sampling space for our new hyperparameter, delta. Notice that we don’t define delta in the XGBoost param dict, since it’s not a built-in hyperparameter; it is fed to the model through the custom objective function.
  2. Training the XGBoost model. Here we initialize the training using our new asymmetric custom objective, where the delta parameter changes based on the optimization process.
  3. Defining a score metric. This score is used to determine the best trial of the tuning process.

We will use Python closures again to wrap the objective with another function that takes as arguments the train and validation sets, as well as the configurable variable of our custom objective, tau.

def objective_wrapper(dtrain, dval, tau):
    def objective(trial):
        # 1. setting the parameter sampling space
        param = {
            "subsample": trial.suggest_float("subsample", 0.0, 1.0),
            "max_depth": trial.suggest_int("max_depth", 1, 9),
            "eta": trial.suggest_float("eta", 0.005, 0.35),
            "gamma": trial.suggest_float("gamma", 1, 10.0),
        }

        # our new hyperparameter
        delta = trial.suggest_int("delta", 1, 50, step=5)

        # 2. model training using the custom objective
        model = xgb.train(param,
                          dtrain,
                          num_boost_round=100,
                          obj=my_assymetric_error_wrapper(tau=tau,
                                                          delta=delta))

        # 3. calculate score on the validation set
        preds = model.predict(dval)
        labels = dval.get_label()
        mse = np.mean((preds - labels) ** 2)
        return mse
    return objective

Now that the optimization objective function is built, we can initialize the Optuna study. When initializing a study, it is possible to choose a specific sampler and pruner, and of course to set the direction of the optimization. For simplicity, we will use the default TPE sampler, and since we implemented MSE as our evaluation score, we will set the optimization direction to “minimize”. We set tau = $8 and optimized for 10 trials:

import optuna

study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=42),
                            direction='minimize')
study.optimize(objective_wrapper(dtrain, dval, tau=8), n_trials=10)

The outputs of the tuning process will look something like this:

[I 2023-02-15 09:17:36,163] A new study created in memory with name: demo
[I 2023-02-15 09:17:37,556] Trial 0 finished with value: 0.8846970796585083 and parameters: {'subsample': 0.3745401188473625, 'max_depth': 9, 'eta': 0.25753790992493475, 'gamma': 6.387926357773329, 'delta': 6}. Best is trial 0 with value: 0.8846970796585083.
[I 2023-02-15 09:17:37,960] Trial 1 finished with value: 2.7566282749176025 and parameters: {'subsample': 0.15599452033620265, 'max_depth': 1, 'eta': 0.3038307702923526, 'gamma': 6.41003510568888, 'delta': 36}. Best is trial 0 with value: 0.8846970796585083.
[I 2023-02-15 09:17:38,749] Trial 2 finished with value: 5.54156494140625 and parameters: {'subsample': 0.020584494295802447, 'max_depth': 9, 'eta': 0.2921927110761455, 'gamma': 2.9110519961044856, 'delta': 6}. Best is trial 0 with value: 0.8846970796585083.
[I 2023-02-15 09:17:39,232] Trial 3 finished with value: 1.1226681470870972 and parameters: {'subsample': 0.18340450985343382, 'max_depth': 3, 'eta': 0.18604096891312205, 'gamma': 4.887505167779041, 'delta': 11}. Best is trial 0 with value: 0.8846970796585083.
[I 2023-02-15 09:17:39,716] Trial 4 finished with value: 1.5384970903396606 and parameters: {'subsample': 0.6118528947223795, 'max_depth': 2, 'eta': 0.10578990374465026, 'gamma': 4.297256589643226, 'delta': 21}. Best is trial 0 with value: 0.8846970796585083.
[I 2023-02-15 09:17:40,218] Trial 5 finished with value: 0.8311933279037476 and parameters: {'subsample': 0.7851759613930136, 'max_depth': 2, 'eta': 0.182410881252696, 'gamma': 6.331731119758382, 'delta': 1}. Best is trial 5 with value: 0.8311933279037476.
[I 2023-02-15 09:17:40,709] Trial 6 finished with value: 2.2779664993286133 and parameters: {'subsample': 0.6075448519014384, 'max_depth': 2, 'eta': 0.02744279957992143, 'gamma': 9.539969835279999, 'delta': 46}. Best is trial 5 with value: 0.8311933279037476.
[I 2023-02-15 09:17:41,295] Trial 7 finished with value: 1.5023095607757568 and parameters: {'subsample': 0.8083973481164611, 'max_depth': 3, 'eta': 0.03869687933220243, 'gamma': 7.158097238609412, 'delta': 21}. Best is trial 5 with value: 0.8311933279037476.
[I 2023-02-15 09:17:41,870] Trial 8 finished with value: 1.3249372243881226 and parameters: {'subsample': 0.12203823484477883, 'max_depth': 5, 'eta': 0.016864039784750345, 'gamma': 9.18388361870904, 'delta': 11}. Best is trial 5 with value: 0.8311933279037476.
[I 2023-02-15 09:17:42,435] Trial 9 finished with value: 0.8856847286224365 and parameters: {'subsample': 0.662522284353982, 'max_depth': 3, 'eta': 0.18442346730634473, 'gamma': 5.920392514089517, 'delta': 6}. Best is trial 5 with value: 0.8311933279037476.

You can see that alongside built-in hyperparameters like “max_depth” and “eta”, our new hyperparameter delta was tuned, with different candidate values evaluated across the trials.

Further examining the output logs, we can see that the chosen value of delta in the best trial (trial #5) was 1, meaning that we don’t apply any extra “punishment” for under-predicting valuable users. How is that possible? Well, that leads me to the next point: when using a custom objective that follows a certain business logic, it’s important to align our evaluation with it. Here, since we used MSE as the evaluation score in the tuning process, it makes sense that the chosen value of delta is 1, since that converts our custom objective back to the plain “squarederror” objective, which is designed to minimize MSE. Isn’t that cool?
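For completeness, here is a minimal sketch (not part of the original post) of extracting the winning configuration from the study and retraining with it. Note that delta must be popped out of the dictionary, since it is not a built-in XGBoost parameter:

best = dict(study.best_params)  # e.g. {'subsample': ..., 'max_depth': ..., 'eta': ..., 'gamma': ..., 'delta': 1}
delta = best.pop("delta")       # our custom hyperparameter goes through the objective, not the params dict

final_model = xgb.train(params=best,
                        dtrain=dtrain,
                        num_boost_round=100,
                        obj=my_assymetric_error_wrapper(tau=8, delta=delta))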

Thus, it is important to define an evaluation score that reflects the same business logic as the custom objective function. Aside from the evaluation score of the tuning process, the “eval_metric” parameter of XGBoost should also follow the same business logic, since it affects the early-stopping mechanism of the training process. How can we do that? That will be saved for a future blog post.

You’re now ready to create your own awesome custom objective function and to tune all of its new hyperparameters!
