Discretizing features with gradient boosted decision trees

Krzysztof Joachimiak
7 min read · Jun 17, 2023


Image from Freepik (by azerbaijan_stockers)

Discretization is one of the well-known techniques we can use when working with continuous features. It's especially recommended alongside linear models, because non-linear models are usually able to do their job without any additional helpers. There is a multitude of methods for dividing such features into a set of bins. The simplest one splits the total feature range into segments of equal length (e.g. KBinsDiscretizer with strategy='uniform'). The more sophisticated approaches can be classified as supervised algorithms, which means they take into account the relation between the variable being binned and the target (response variable). Examples are discretizers based on decision trees.
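For reference, here is what that simplest, equal-width variant looks like in scikit-learn; each value is replaced by the index of the bin it falls into:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Unsupervised, equal-width binning: 5 bins of equal length over the feature range
X = np.random.default_rng(0).lognormal(size=(1000, 1))
kbins = KBinsDiscretizer(n_bins=5, strategy='uniform', encode='ordinal')
X_binned = kbins.fit_transform(X)  # values 0..4, one bin index per row
print(kbins.bin_edges_)            # the equal-width thresholds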

Discretization with decision trees

Fitting a decision tree to find an optimal set of binning thresholds can be considered a smarter way to perform the discretization step. We can easily plug this algorithm into our scikit-learn pipeline using DecisionTreeDiscretiser from the feature-engine library (it offers alternative discretizers, too). Before we get to discretizers based on GBDT ensembles, let's first understand how it works for a single decision tree.

  1. Train a decision tree taking a single predictor as input (the variable being discretized). We need a separate model for each variable we want to bucketize.
  2. Use the decision tree's predictions as bucket labels (see the sketch below).
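Here is a minimal sketch of these two steps using plain scikit-learn (for illustration only; DecisionTreeDiscretiser does this per variable with cross-validation, as in the examples below):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)

# 1. Train a shallow tree on the single feature being discretized
tree = DecisionTreeRegressor(max_depth=2).fit(x.reshape(-1, 1), y)

# 2. Use the tree's predictions as bucket labels: every leaf corresponds
#    to one bin, and its prediction becomes the bin's label
buckets = tree.predict(x.reshape(-1, 1))
print(pd.Series(buckets).nunique())  # at most 2**max_depth distinct values (here: 4)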
Discretization with a decision tree is typically considered to be something similar to target encoding

As we can see, this type of discretization is close to target encoding. Not to be groundless, let's take a peek at two examples with the aforementioned DecisionTreeDiscretiser. The original values get replaced with target variable predictions (which indicate a specific bucket in a one-feature decision tree). In the examples below, I don't even use X_test, but you can continue the experiment and build a pipeline with the final regression/classification model.

Regression

# Regression - a modified example from feature-engine documentation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.discretisation import DecisionTreeDiscretiser

# Preparing data
URL = 'http://jse.amstat.org/v19n3/decock/AmesHousing.xls'

data = pd.read_excel(URL)
data.columns = data.columns.str.replace(' ', '')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['PID', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# Creating a discretizer
reg_disc = DecisionTreeDiscretiser(
    cv=3,
    scoring='neg_mean_squared_error',
    variables=['LotArea', 'GrLivArea'],
    regression=True
)

# Fitting the transformer
reg_disc.fit(X_train, y_train)

# Juxtaposing original and transformed data
pd.concat([
    # Original columns
    X_train[['LotArea', 'GrLivArea']].add_suffix('_orig'),
    # Transformed columns
    reg_disc.transform(X_train)[['LotArea', 'GrLivArea']].add_suffix('_disc')
], axis=1).sort_index(axis=1)

As we can see, the units the variables are expressed in have changed. Both columns now contain values referring to a certain level of the target variable.

Classification

from sklearn.datasets import load_iris

# Loading data
iris = load_iris()
# https://stackoverflow.com/questions/38105539/how-to-convert-a-scikit-learn-dataset-to-a-pandas-dataset
data = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target']
)
# Strip the ' (cm)' suffix from the feature names (note: 'target' becomes 't')
data.columns = data.columns.str[:-5]
data.columns = data.columns.str.replace(' ', '_')

# Data splitting
X, y = data.iloc[:, :4], data.iloc[:, 4:]
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=0)
X_cols = X.columns.tolist()

# Creating a discretizer
class_disc = DecisionTreeDiscretiser(
    cv=3,
    scoring='accuracy',
    variables=X_cols,
    regression=False
)

# Fitting the transformer
class_disc.fit(X_train, y_train)

# Original variables vs transformed ones
pd.concat([
    X_train[X_cols].add_suffix('_orig'),
    class_disc.transform(X_train)[X_cols].add_suffix('_disc')
], axis=1).sort_index(axis=1)

The original values have been replaced with the predicted probability of the class indexed with 1.

I'm a little bit puzzled by the output, because I expected each original column to be replaced with as many new columns as there are classes in the target variable. I still need to verify whether using a single class probability here is truly unambiguous. Nonetheless, this is how it's implemented in the feature-engine library; Akash Dubey does it the same way in his Medium article⁷.
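For comparison, this is the kind of output I initially expected: one probability column per class, built here for a single feature with a plain scikit-learn tree (a sketch of my expectation, not what DecisionTreeDiscretiser actually does):

from sklearn.tree import DecisionTreeClassifier

# One probability column per class for a single feature (illustration only)
tree = DecisionTreeClassifier(max_depth=2).fit(X_train[['petal_length']], y_train.values.ravel())
proba_cols = pd.DataFrame(
    tree.predict_proba(X_train[['petal_length']]),
    columns=[f'petal_length_class_{int(c)}' for c in tree.classes_],
    index=X_train.index,
)
proba_cols.head()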

Discretization with tree ensembles

I was wondering about the potential effects of discretization thresholds distilled from a tree ensemble trained with the gradient boosting algorithm. It quickly turned out I wasn't the only one who had thought about it.

Among R's recipes steps (something that can be seen as the equivalent of a scikit-learn pipeline with compatible transformers), I've found an interesting function in the embed⁵ package: step_discretize_xgb, authored by Konrad Semsch.

library(rsample)
library(recipes)
data(credit_data, package = "modeldata")

set.seed(1234)
split <- initial_split(credit_data[1:1000, ], strata = "Status")

credit_data_tr <- training(split)
credit_data_te <- testing(split)

xgb_rec <-
  recipe(Status ~ Income + Assets, data = credit_data_tr) %>%
  step_impute_median(Income, Assets) %>%
  step_discretize_xgb(Income, Assets, outcome = "Status")

xgb_rec <- prep(xgb_rec, training = credit_data_tr)

bake(xgb_rec, credit_data_te, Assets)
#> # A tibble: 251 × 1
#> Assets
#> <fct>
#> 1 [3000,4000)
#> 2 [3000,4000)
#> 3 [9500, Inf]
#> 4 [3000,4000)
#> 5 [-Inf,2500)
#> 6 [-Inf,2500)
#> 7 [-Inf,2500)
#> 8 [4000,4500)
#> 9 [-Inf,2500)
#> 10 [3000,4000)
#> # ℹ 241 more rows

In the example above, the Assets variable has been discretized and transformed into a variable of type factor (an R data type for categorical data). We don't replace the variable with the model predictions. Instead, we retrieve all the splits from all the trees and then use them to create bin ranges.

In this scenario, we’re creating “typical” bins, i.e. categorical variables
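To build intuition for this split-harvesting idea, here is a rough Python sketch (this is not the embed implementation; the helper gbm_bin_edges below is hypothetical): fit a one-feature GBM, collect every split threshold from its trees, and use those thresholds as bin edges.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def gbm_bin_edges(x: pd.Series, y, **gbm_params):
    """Collect the unique split thresholds learnt by a one-feature GBM."""
    gbm = GradientBoostingRegressor(**gbm_params).fit(x.to_frame(), y)
    thresholds = []
    for tree in gbm.estimators_.ravel():
        t = tree.tree_
        # Internal nodes have a left child; their `threshold` is the split point
        thresholds.append(t.threshold[t.children_left != -1])
    edges = np.unique(np.concatenate(thresholds))
    return np.concatenate(([-np.inf], edges, [np.inf]))

# Usage (assuming X_train / y_train from the Ames regression example above):
# edges = gbm_bin_edges(X_train['GrLivArea'], y_train, n_estimators=20, max_depth=1)
# binned = pd.cut(X_train['GrLivArea'], bins=edges)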

A similar approach, leveraging random forests, has also been proposed and then implemented in the ForestDisc³ R library.

# A slightly modified example from the ForestDisc's documentation
library(ForestDisc)
library(tibble)

set.seed(1234)
data(iris)

id_target <- 5
iris_disc <- ForestDisc(iris,id_target, max_splits=10)
as_tibble(iris_disc$data_disc)

#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <fct> <fct> <fct> <fct> <fct>
#> 1 (4.95,5.87] (3.36, Inf] (-Inf,2.46] (-Inf,0.75] setosa
#> 2 (-Inf,4.95] (2.71,3.12] (-Inf,2.46] (-Inf,0.75] setosa
#> 3 (-Inf,4.95] (3.12,3.36] (-Inf,2.46] (-Inf,0.75] setosa
#> 4 (-Inf,4.95] (2.71,3.12] (-Inf,2.46] (-Inf,0.75] setosa
#> 5 (4.95,5.87] (3.36, Inf] (-Inf,2.46] (-Inf,0.75] setosa
#> 6 (4.95,5.87] (3.36, Inf] (-Inf,2.46] (-Inf,0.75] setosa
#> 7 (-Inf,4.95] (3.36, Inf] (-Inf,2.46] (-Inf,0.75] setosa
#> 8 (4.95,5.87] (3.36, Inf] (-Inf,2.46] (-Inf,0.75] setosa
#> 9 (-Inf,4.95] (2.71,3.12] (-Inf,2.46] (-Inf,0.75] setosa
#> 10 (-Inf,4.95] (2.71,3.12] (-Inf,2.46] (-Inf,0.75] setosa
#> # ℹ 140 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Superficially, it works in a similar manner to step_discretize_xgb. In fact, it applies an additional algorithm to select the desired number of splits (defined with the max_splits argument).
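As a naive illustration of what "selecting a limited number of splits" could mean (this is not ForestDisc's actual selection algorithm, which is more involved), one could simply cluster the candidate thresholds and keep one representative per cluster:

import numpy as np
from sklearn.cluster import KMeans

def select_splits(candidate_thresholds, max_splits=10):
    """Pick max_splits representative thresholds from a larger candidate set (toy approach)."""
    thresholds = np.asarray(candidate_thresholds).reshape(-1, 1)
    km = KMeans(n_clusters=max_splits, n_init=10, random_state=0).fit(thresholds)
    return np.sort(km.cluster_centers_.ravel())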

GBMDiscretizer

I decided to implement the same functionality in Python (fully compatible with scikit-learn pipelines) and put it into my scikit-gbm package. The initial version works in the same way as step_discretize_xgb. I'm also considering making use of the split selection algorithm found in ForestDisc. For the moment, it uses all the unique splits learnt by the one-feature gradient boosted tree ensembles.

Regression

from skgbm.preprocessing import GBMDiscretizer
from sklearn.ensemble import GradientBoostingRegressor
# Select the GBM model you want
# from lightgbm import LGBMRegressor
# from xgboost import XGBRegressor
# from catboost import CatBoostRegressor

# Preparing data
URL = 'http://jse.amstat.org/v19n3/decock/AmesHousing.xls'

data = pd.read_excel(URL)
data.columns = data.columns.str.replace(' ', '')
X_cols = ['LotArea', 'GrLivArea']

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['PID', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# Fitting the discretizer on the selected columns & transforming the data
gbm_discretizer = GBMDiscretizer(GradientBoostingRegressor(), X_cols)
gbm_discretizer.fit_transform(X_train, y_train)[X_cols]

Features discretized with GBMDiscretizer and GradientBoostingRegressor

Classification

from xgboost import XGBClassifier
# Select the GBM model you want
# from lightgbm import LGBMClassifier
# from sklearn.ensemble import GradientBoostingClassifier
# from catboost import CatBoostClassifier

# Preparing data
iris = load_iris()
# https://stackoverflow.com/questions/38105539/how-to-convert-a-scikit-learn-dataset-to-a-pandas-dataset
data = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target']
)
data.columns = data.columns.str[:-5]
data.columns = data.columns.str.replace(' ', '_')

# Data splitting
X, y = data.iloc[:, :4], data.iloc[:, 4:]
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=0)
X_cols = X.columns.tolist()

# Fitting the discretizer & transforming the data
gbm_discretizer = GBMDiscretizer(XGBClassifier(), X_cols, one_hot=False)
gbm_discretizer.fit_transform(X_train, y_train)

Features discretized with GBMDiscretizer and XGBClassifier

As was said at the beginning of this article, a typical scenario for applying this method is automated feature extraction, with the resulting feature set then fed into an easily interpretable linear model.
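As a closing sketch, this is roughly how such a pipeline could look for the iris example above. Treat it as an illustration, not documented usage: the GBMDiscretizer call mirrors the examples above, and the assumption that one_hot=True produces dummy columns suitable for a linear model is mine.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sketch only: discretize with a GBM, then fit an interpretable linear model
# on the (assumed) one-hot encoded bins
pipe = make_pipeline(
    GBMDiscretizer(XGBClassifier(), X_cols, one_hot=True),  # one_hot=True is an assumption
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train.values.ravel())
print(pipe.score(X_test, y_test.values.ravel()))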

References

  1. Liu H., Hussain F., Tan C. L. and Dash M. (2001). Discretization: An Enabling Technique
  2. Berrado A. and Runger G. C. (2009). Supervised multivariate discretization in mixed data with Random Forests
  3. Haddouchi M. (2022). ForestDisc: Forest Discretization
  4. García S., Luengo J., Sáez J. A., López V. and Herrera F. (2013). A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning
  5. Hvitfeldt E. and Kuhn M. (2023). embed: Extra Recipes for Encoding Predictors
  6. Galli S. (2021). Feature-engine: A Python package for feature engineering for machine learning
  7. Dubey A. Discretisation Using Decision Trees. Towards Data Science
