Generating features with gradient boosted decision trees

Krzysztof Joachimiak
4 min read · Apr 19, 2023


Main part of the image from iStock; compilation by the author

I’m not the first person to publish an article on this topic on Medium. There is already at least one similar article by Carlos Mougan, which I warmly recommend you read. My goal is not only to remind you of this interesting technique and explain once again how it works. I’d also like to introduce my library, scikit-gbm, a package intended to gather additional tools for working with GBM models. The first one I’ve placed there is GBMFeaturizer, a scikit-learn compatible implementation of the GBM feature extractor.

Algorithm

The main source on this algorithm is the paper about ad click prediction². If you are surprised that we can extract features from a GBM, in a manner somewhat similar to neural networks, you’ll soon realize the algorithm is really simple. First, let’s recall what a tree-based GBM consists of. Such a model is an ensemble of multiple decision trees, trained sequentially with the gradient boosting algorithm, whose outputs are then combined (added) to produce the final prediction. As presented in the image below, each tree has a number of leaves, and the output of a tree is always generated by exactly one of those terminal regions. This means a single tree can be considered a classifier that divides the dataset into multiple bins.

A GBM model with three trees: A, B and C

In the picture above we can see a simplified example of a GBM model with three trees. We feed the model a single example, which traverses all the trees and falls into exactly one leaf of each of them. We then get three categorical features (columns A, B and C, corresponding to the individual trees), which have 3, 2 and 3 possible values respectively. Sticking to our example, after this transformation we get a vector [2, 1, 3] (or [1, 0, 2], depending on the indexing) assigned to the sample we passed through the model. After transforming all the input samples, we obtain a matrix of shape (n_samples, n_trees). Typically, in order to handle those features in the next step, we turn them into binary matrices using one-hot encoding. Below I place a similar visualization, taken from the paper about ad click prediction².

Image and description by authors of Practical Lessons from Predicting Clicks on Ads at Facebook²
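To make the idea concrete, below is a minimal sketch of this transformation written with plain scikit-learn. It illustrates the algorithm itself, not the scikit-gbm implementation; the apply method returns the leaf node every sample falls into.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=3, max_depth=2, random_state=0)
gbm.fit(X, y)

# apply() returns, for every tree in the ensemble, the id of the leaf node
# each sample falls into: shape (n_samples, n_trees, 1) for binary classification
leaves = gbm.apply(X)
leaves = leaves.reshape(X.shape[0], -1)  # -> (n_samples, n_trees)

# Treat the leaf ids as categorical features and one-hot encode them
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaves)

print(leaves.shape, leaf_features.shape)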

Generating features with scikit-gbm

Unlike the other implementations I found (sktools³ and xgboostExtension⁴), mine handles most of the popular gradient boosted decision tree implementations, namely:

  • GBMs from scikit-learn
  • XGBoost
  • LightGBM
  • CatBoost

You can choose whichever model you want, so this wrapper is very convenient when trying many different GBM models, running them in a loop, etc. It takes three arguments:

  • estimator — an instance of one of the above-mentioned classes
  • one_hot — apply one-hot encoding to the output
  • append — append the extracted features to the original input

The answer to the crucial question, “Is it worth generating such features?”, depends, as always, on the particular dataset, so you have to test it on your own.

Installation

pip install scikit-gbm

# or

pip install git+https://github.com/krzjoa/scikit-gbm.git

Classification

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from skgbm.preprocessing import GBMFeaturizer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)


# Try also (remember to import the corresponding class):
# ('gbm_featurizer', GBMFeaturizer(GradientBoostingClassifier())),
# ('gbm_featurizer', GBMFeaturizer(LGBMClassifier())),
# ('gbm_featurizer', GBMFeaturizer(CatBoostClassifier())),

pipeline = Pipeline([
    ('gbm_featurizer', GBMFeaturizer(XGBClassifier(), one_hot=True, append=False)),
    ('logistic_regression', LogisticRegression())
])

# Training
pipeline.fit(X_train, y_train)

# Predictions for the test set
pipeline_pred = pipeline.predict(X_test)
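As mentioned above, whether those extracted features actually pay off depends on the dataset. A quick sanity check is to compare the pipeline against a plain LogisticRegression trained on the raw features. This is just a minimal sketch; the scores you get will vary with the data and hyperparameters.

from sklearn.metrics import accuracy_score

# Baseline: logistic regression on the raw features only
baseline = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print('Baseline accuracy:  ', accuracy_score(y_test, baseline.predict(X_test)))
print('Featurized accuracy:', accuracy_score(y_test, pipeline_pred))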

Regression

from sklearn.datasets import fetch_california_housing
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from skgbm.preprocessing import GBMFeaturizer
from lightgbm import LGBMRegressor

X, y = fetch_california_housing(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Try also (remember to import the corresponding class):
# ('gbm_featurizer', GBMFeaturizer(GradientBoostingRegressor())),
# ('gbm_featurizer', GBMFeaturizer(XGBRegressor())),
# ('gbm_featurizer', GBMFeaturizer(CatBoostRegressor())),

pipeline = Pipeline([
    ('gbm_featurizer', GBMFeaturizer(LGBMRegressor(), one_hot=True, append=True)),
    ('linear_regression', LinearRegression())
])

# Training
pipeline.fit(X_train, y_train)

# Predictions for the test set
pipeline_pred = pipeline.predict(X_test)
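Since GBMFeaturizer is a scikit-learn compatible transformer, you can also inspect the features it produces by calling transform on the fitted step. A short sketch (the exact number of generated columns depends on the trees grown by LGBMRegressor):

# Peek at the transformed feature matrix; relies only on the standard
# fit/transform API of a scikit-learn transformer inside a Pipeline
featurizer = pipeline.named_steps['gbm_featurizer']
X_test_transformed = featurizer.transform(X_test)

# With one_hot=True and append=True we expect the original columns
# plus one binary column per leaf of every tree
print(X_test.shape, X_test_transformed.shape)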
