The Monotonic, Darting, Best-First Boosted Tree

Some nuanced features to the XGBoost algorithm

Published in

Data & Waffles

8 min readJul 8, 2019

This article assumes an intermediate level familiarity with Gradient Boosted Trees and specifically the XGBoost implementation.

The Gradient Boosted Decision Tree (GBDT) has long been the de-facto technique for achieving best-in-class machine learning results on structured data. It is a machine learning technique which combines multiple weak decision trees, improving each prediction sequentially to generate an improved and optimized output. They have been used very extensively in the competitive learning scene such as in Kaggle or Analytics Vidhya etc and often form part of stacked ensembles that make up winning solutions. One famous GBDT implementation is known as XGBoost, introduced by a research group led by Chen Tianqi from the University of Washington and until recent years sat at the top of the boosting food chain (now arguably occupied by Microsoft’s LightGBM).

From my own experience working as a data scientist in a multinational bank dealing mainly with tabular datasets, as well as from past Kaggle experience, XGBoost has been my go to technique in terms of getting relatively good results in a short period of time, without the need for heavy hyperparameter tuning (same can’t be said for LightGBM). However even though much of the merits of XGBoost is in its off-the-shelf use, understanding some of the more nuanced tricks within the algorithm will definitely enrich a data scientist’s arsenal. Therefore this article was put together with the aim of exploring three such nuances within XGBoost:

Enforcement of Monotonic constraints: so that input and target variable strictly varies in one direction.
Best-first tree growth policy: where nodes in the tree are expanded in the order that will reduce most loss throughout.
Dropouts in Tree boosting: a.k.a DART booster, where dropout regularization is used in Boosted trees.

1. Monotonic Constraints

We first start off by exploring the ability of XGBoost to constrain the relationship learned between an input variable and target variable such that it is only ever non-decreasing (monotonically increasing) or non-increasing (monotonically decreasing), which can change at a defined rate.

Monotonic constraints are especially appealing in highly regulated industries where regulators and domain experts might require monotonic behaviour between the variables being modeled. One good example of this is in Credit Risk modeling where reason codes generated from scoring are expected to be directly interpretable and hence an enforcement on monotonicity.

To visually portray how monotonicity affects the learning process, we first start by generating a simple function which adds a sine function and a quadratic function to fit our XGBoost model on as shown in the code snippet below:

# Generating simple data points
x = np.linspace(0, 10, 1000)
y = 16*np.sin(x) + x**2 + 0.42*x — (20*np.random.random(1000))

Next we invoke a really deep XGBoost regressor model without monotonic constraints to fit on the generated data and plot the resultant predictions:

# Fit a really deep XGBoost regressor
xgb_overfitted = xgb.XGBRegressor( max_depth=30, n_estimators=50)
xgb_overfitted.fit(x.reshape(-1,1), y)
prediction = xgb_overfitted.predict(x.reshape(-1,1))# Plot the predictions against actual 
plt.figure(figsize=(12,6))
plt.scatter(x, y, c=’b’, s=1.5)
plt.plot(x, prediction, c=’r’, alpha=0.65)

The unconstrained predictions given by the red line undergo a very jagged pattern which evidently shows that the model has fitted very well (too well) to the training data. If a regulator were trying to interpret the relationship between the input and target variable, he would have a hard time due to the wild swings in the prediction behavior.

Therefore we now come onto the application of monotonic constraints which would be able to alleviate this. Beforehand one could run a Pearson correlation in the case of all numerical inputs or a box-plot heuristic if the inputs were categorical to inform the general trend. In our case, it is visually apparent that the trend is increasing and therefore we set our value in the parameters to +1 to signify a monotonically increasing trend as follows:

params = {‘monotone_constraints’: 1}

With the constraints applied, we fit the same model against the data and plot the predictions as follows:

xgb_monotonic = xgb.XGBRegressor(
                         max_depth=30, 
                         n_estimators=50, 
                         **params # monotonic constraints applied
)xgb_monotonic.fit(x.reshape(-1,1), y)
prediction = xgb_monotonic.predict(x.reshape(-1,1))plt.figure(figsize=(12,6))
plt.scatter(x, y, c=’b’, s=1.5)
plt.plot(x, prediction, c=’r’, alpha=0.65, label=”Predicted line”)
plt.xlabel(“x”)
plt.ylabel(“y”)
plt.title(“Overfitted XGBoost predictions with Montonic constraints\n”)
plt.legend()

Clearly the application of monotonic constraints serve to “smooth” out the jagged predictions thus ensuring that there is only an ever-increasing relationship making it much easier to explain this univariate model. One could also achieve this by controlling the tree-to-depth ratio or tweaking the regularization parameter of the algorithm, but monotonic constraints is a very easy and intuitive way to to combat overfitting.

However a drawback of applying monotonic constraints is that we lose a certain degree of predictive power as it will be more difficult to model subtler aspects of the data due to the constraints (e.g. not picking up the downward trend from approx. x=1 to x=5 in the generated data).

2. Best-First Tree growth

We come onto the second nuanced feature of XGBoost around how trees are grown during the learning process. The default policy of growing trees in XGBoost is via the Depth-wise approach where all nodes closest to the root node (i.e in the same row) are expanded first before moving onto the next level as shown in the diagram below:

Depth-wise tree growth policy: The number in the diagram shows the order in which nodes are evaluated.

The Best-first approach on the other hand expands the nodes based on the order that will reduce the most loss throughout the tree. This is the default way of growing trees in LightGBM and coupled with its own method of evaluating splits, why LightGBM can perform at the same level of XGBoost but in a much faster period of time. The diagram below shows the best-first growth policy:

Best-first tree growth policy: Nodes are evaluated first from left to right based on which split reduces loss the most.

Code-wise, growing the tree via the best-first method involves setting two very simple parameters when instantiating your XGBoost model.

Setting the evaluation of splits in the tree to that of the Histogram approximation method.
Using lossguide as the growth policy i.e. choosing splits that results in greatest change in loss.

params = {
    “tree_method”: “hist”,
    “grow_policy”: “lossguide”
}

The main drawback is that XGBoost models (or any others) grown in this manner are more prone to overfitting, therefore requiring more attention to hyperparameter tuning and cross-validation of the model.

3. DART booster: Dropouts meets tree ensembles

Picture taken from paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” : http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf

The famous dropout technique used in the regularization of Deep Learning networks has been adapted for use in gradient boosting trees and is available as an option in XGBoost as an alternative to the default shrinkage regularization. This works by dropping a fraction of trees that have been trained previously before in each training cycle, shown pictorially:

Each row represents a different index in the ensemble: 1st, 100th, 200th and 400th tree in the ensemble. In each tree, the size of nodes is proportional to the percentage of the instances that reach this node. The color gradient of leaves represent the range of values where green stands for the positive extreme, yellow for zero, and red for negative extreme. Picture taken from paper: “DART: Dropouts meet Multiple Additive Regression Trees” : https://arxiv.org/pdf/1505.01866.pdf

For XGBoost, dropout comes in the form of the DART tree booster option which is an acronym for Dropouts meet Multiple Additive Regression Trees.

As a benchmark, two XGBoost classifiers are developed where one is invoked with the default tree booster and the other with the DART booster. The dataset of choice for this comparison will be a toy diamond dataset taken from Kaggle containing historical prices and other physical attributes of approximately 54,000 diamonds as shown below:

Diamond dataset taken from Kaggle: https://www.kaggle.com/shivam2503/diamonds

Discounting all categorical variables in the dataset and only modelling on the numerical ones as a first start, we train an XGBoost regressor set to the default booster along with a large number of 1,000 trees to accentuate the dropout effect later on. This is fitted on the price column in our dataset which serves as our target variable.

# Fixing the target variable to be that of price 
target = df[“price”]cols_to_drop = [“Unnamed: 0”, “cut”, “color”, “clarity”, “price”]
df.drop(cols_to_drop, axis=1, inplace=True)# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,target, test_size=0.3, random_state=42)xgb_no_dart = xgb.XGBRegressor(max_depth=3,
                               n_estimators=1000,
                               booster=”gbtree”,
                               random_state=42,
                               n_jobs=3,
                               min_child_weight=4,
                               learning_rate=0.99)
# Training process
xgb_no_dart.fit(X_train, y_train)
# Predicting on the holdout
test_predict = xgb_no_dart.predict(X_test)
r2_test = r2_score(y_test, test_predict)
mae = mean_absolute_error(y_test, test_predict)
print(“R2 score on test data is {0}% with mean error of {1}”.format(round(r2_test*100,3), round(mae,2)))

Calculating metrics on a simple holdout test set using R2 scores and the mean absolute error, the default XGBoost booster gives:

R2 score on test data is 85.645% with mean error of 830.34

Now switching to the DART booster with a 5% drop rate whilst keeping the rest of the parameters unchanged:

xgb_dart = xgb.XGBRegressor(max_depth=3,
                            n_estimators=1000,
                            booster="dart",
                            rate_drop=0.05,
                            random_state=42,
                            n_jobs=3,
                            min_child_weight=4,
                            learning_rate=0.99)
# Training process
xgb_dart.fit(X_train, y_train)# Predicting on the test
test_predict = xgb_dart.predict(X_test)
r2_test = r2_score(y_test, test_predict)
mae = mean_absolute_error(y_test, test_predict)
print("R2 score on test data is {0}% with mean error of {1}".format(round(r2_test*100,3), round(mae,2)))

The new metrics with dropout are as follows:

R2 score on test data is 87.625% with mean error of 766.46

In our very simple and contrived example, we show that the use of the DART booster is able to on first glance increase the performance of the model and reduce the prediction error. However one major drawback of using dropouts is that the training is achingly slow compared to the default booster so once again, there is a trade-off to consider when applying these in practice.

Conclusion

This article has taken the reader through three nuances of the xgboost algorithm which either serve as regularization options in the case of monotonic constraints and dropouts-with-DART or for better predictive performance in the case of the best-first growth policy.

There are many other interesting nuances to XGBoost but there is simply not enough room to do all of them justice here. This article will however conclude with a couple that are definitely worth bringing to the reader’s attention as follows:

Feature interactions via the xgbfi package: This allows the user to rank variable importances by different metrics and more importantly to examine 2-way, 3-way, … , N-way interactions between features. Link to the github can be found here.
Tuning of regularization hyperparameters: The use of the hyperparameters Gamma and min_child_weight to control the bias-variance tradeoff. A very nice medium article written by Kaggle Master Laurae can be found here.

Peace out.