Open-sourcing ShapRFECV — Improved feature selection powered by SHAP

Mateusz Garbacz
ING Blog
Published in
9 min readDec 15, 2020

When building a machine learning model, selecting a suitable subset of features may significantly improve the performance of the final solution. This post introduces ShapRFECV, a new method for feature selection in decision-tree-based models that is particularly well-suited to binary classification problems.

This method, implemented in Python and now open-sourced, is based on a common algorithm: recursive feature elimination (RFE). However, by leveraging SHapley Additive exPlanations (SHAP) feature importance and supporting hyperparameter optimization, one can achieve more accurate and reliable results with ShapRFECV, with an API familiar to users of scikit-learn.

This method proved effective at improving the performance of the models in multiple projects at ING, especially in the credit risk domain. In one of the models, this technique was used to eliminate 60 out of 110 features, while slightly improving the AUC and significantly increasing the speed of the model. Based on our positive experience, we decided to open-source our implementation in the Probatus package in order to share the method with the data science community.

Photo by Trent Erwin

Feature selection

Feature selection is a crucial step in building your data science solution. It has the following properties:

  • Feature selection simplifies the model, making it easier to interpret.
  • Feature selection reduces the computational time required to train and predict.
  • Feature selection prevents the curse of dimensionality from negatively affecting the model performance,
  • Feature selection minimizes overfitting the data.

There are many different feature selection methods. Choosing the one most suitable for a given model depends on various factors, such as the target evaluation metric, the available computational power, and the task complexity.

Recursive Feature Elimination

RFE is a popular feature selection method for classification and regression models. This method starts with the full feature set and then iterates over the following steps, using a predictive algorithm:

  1. Fit the model and calculate feature importance. Depending on the implementation, this step might include use of cross validation (CV).
  2. Remove n features that have the lowest importance from the dataset, where n is a parameter.

The method uses feature importance as a proxy for overall contribution of a given variable to model’s performance, which leads to high efficiency of the algorithm. Moreover, tuning the n parameter allows one to strike a balance between the accuracy (lower value of n) and the speed of the algorithm (higher value of n).

Popular Python implementations of RFE are RFE and RFECV, both available in scikit-learn. However, there are two issues to consider when using them, described in the following subsections.

The pitfall of feature importance in RFE

As already mentioned, RFE strongly depends on the feature importance metric used to choose the variables to be removed at each iteration. In the case of the scikit-learn implementations RFE and RFECV, these feature importance metrics are the following: the coefficients (betas) for linear models and the impurity-based feature importance for tree-based models.

The coefficients of a linear model represent the magnitude and the direction, with which values of a given feature impact the prediction. Using RFE with the Logistic Regression algorithm, for example, allows one to select a subset of features with a strong linear relationship with the target. However, due to the simplicity of linear models, it disregards the predictive power of relations between variables, as well as the non-linear patterns in the data. Therefore, this approach is suitable for problems where the relationship between the data and the target is relatively simple.

Decision-tree-based models allow one to exploit highly non-linear relationships in the data. Therefore, using RFE with the Random Forest algorithm, for example, allows one to find a subset of features that contains complex patterns that are highly predictive for the target. Unfortunately, RFE and RFECV (as implemented in scikit-learn) use the impurity-based importance metric for tree-based models by default.

The impurity-based importance is calculated as the normalized total reduction of the split criterion, brought by a given feature in all trees of the model. Essentially, it represents how high and how often a given feature is used within the model’s tree(s). Not only this does not directly translates to the importance of the feature, but also, the method is strongly biased towards high cardinality features, which means that the variable having more unique values, has a higher chance of having high importance. Thus, using RFE or RFECV with tree-based model, for instance, Random Forest or XGBoost, might lead to biased results.

Why you need hyperparameter optimization in RFE

Another crucial aspect of RFE is that, while there is a decrease in the number of features at each iteration, the number of samples remains the same at each iteration. Depending on the model and its parameters, this may cause the model to overfit in the initial iterations and underfit towards the final iterations. As a consequence, the feature importances of such models may be far from optimal. In order to ensure that the model fits the data well throughout this process, one should optimize the hyperparameters at each iteration before calculating feature importance.

When it comes to simple linear models, the risk of over- or under-fitting is low. Furthermore, one can apply LASSO (e.g. using LassoCV) to adjust the amount of regularization applied to the current feature set.

However, as far as tree-based models are concerned, there are multiple parameters to optimize, which significantly influence the performance of the model and its feature importance ranking. In order to tune these hyperparameters, search schemas such as GridSearchCV or RandomizedSearchCV are often applied. Unfortunately, use of these techniques in scikit-learn’s RFE/RFECV is not currently supported, which may consequently bias the output feature set for tree-based models.

Recursive Feature Elimination using SHAP

These issues are addressed by Probatus, an open-source Python package built by Data Scientists at ING Bank. Probatus provides tools to analyze binary classification models, with attention also to the data used to develop them. The package now provides ShapRFECV, a RFE algorithm for tree-based binary classification models that supports hyperparameter tuning at each iteration while using SHAP to estimate feature importance.

SHAP feature importance — the new best practice

SHAP is an open-source Python package that, as described in their documentation,

”… is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions….”

In a nutshell, SHAP computes Shapley values, which represent the contribution of a given feature towards the prediction of the model for a given sample. The further the value is from 0 for a given prediction, the stronger the influence of that feature is, while the sign of the value represents the class towards which the prediction is pushed by that feature.

The SHAP importance of a given feature is computed by taking the average of the absolute Shapley values computed on a given dataset. It represents the average strength with which that feature influences predictions on a given dataset, e.g. the test set.

The main advantages of SHAP feature importance are the following:

  • Its core, the Shapley values, has a strong mathematical foundation, boosting confidence in the results.
  • SHAP also takes into consideration relationships between features while calculating the Shapley values.
  • The computation of Shapley values for tree-based models is very efficient.

Moreover, using SHAP feature importance within RFE provides a more accurate and less biased way of scoring features of tree-based models, without significantly increasing computation time. Taking these advantages into consideration, it is no suprise that SHAP is quickly gaining popularity, not just at ING, but also in the data science community.

ShapRFECV in Probatus

ShapRFECV in Probatus implements the RFE algorithm for tree-based classification models using SHAP feature importance. In each iteration of RFE, the following steps are applied:

  1. (Optional) Tune the hyperparameters of the model using GridSearchCV or RandomizedSearchCV.
  2. Apply cross validation (CV) to estimate the SHAP feature importance on the provided dataset. In each CV fold, the tuned model is fitted on train splits, then evaluated on validation splits to estimate the SHAP feature importance.
  3. Remove n lowest SHAP importance features from the dataset, where n is a parameter.

Furthermore, the choice of the optimal number of features often depends on the performance of the tuned model at each iteration of RFE. Optimally, one should select the lowest number of variables for which the validation performance on the target evaluation metric is high. The Probatus implementation also provides a convenient visualization that allow the user to better choose the optimal number of features.

Example use

Let us see ShapRFECV used in practice for the following toy dataset:

from probatus.feature_elimination import ShapRFECV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import numpy as np
import pandas as pd
import lightgbm
feature_names = ['f1_categorical', 'f2_missing', 'f3_static', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20']# Prepare dataset
X, y = make_classification(n_samples=1000, class_sep=0.05, n_informative=6, n_features=20, random_state=0, n_redundant=10, n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)X['f1_categorical'] = X['f1_categorical'].apply(lambda x: str(np.round(x*10)))
X['f2_missing'] = X['f2_missing'].apply(lambda x: x if np.random.rand()<0.8 else np.nan)
X['f3_static'] = 0

Next, we need to choose the model and the hyperparameter optimization search schema to be used:

clf = lightgbm.LGBMClassifier(max_depth=5, class_weight='balanced')param_grid = {'n_estimators': [5, 7, 10], 'num_leaves': [3, 5, 7, 10]}search = RandomizedSearchCV(clf, param_grid, cv=5, scoring='roc_auc', refit=False)

Having the target dataset, the model, and the hyperparameter search space, we can now apply ShapRFECV:

shap_elimination = ShapRFECV(search, step=0.2, cv=10, scoring='roc_auc', n_jobs=3)
report = shap_elimination.fit_compute(X, y)

The code above uses randomized grid search to tune the hyperparameters of the model at each iteration of ShapRFECV. Moreover, the step parameter defines how many, or what percentage of, features to remove in each iteration. You can use it to modify the speed with which the elimination is performed. For example, step=1 is the most thorough setting, but also has the highest computational cost.

performance_plot = shap_elimination.plot()

The plot above illustrates the model performance (using the AUC metric) for train and validation splits of the model, for different numbers of features. The line represents the CV-averaged AUC score, while the shaded region behind shows the standard deviation of the CV scores. We see that the validation AUC starts dropping around 8 or 7 features. Therefore, we can safely eliminate 12 out of 20 features without a significant loss of performance.

We can output the final feature set as follows:

shap_elimination.get_reduced_features_set(num_features=8)

[‘f9’, ‘f16’, ‘f1_categorical’, ‘f14’, ‘f15’, ‘f19’, ‘f5’, ‘f8’]

ShapRFECV vs. RFECV

Let us set up an experiment, in which we will compare the performance of the model, trained on features selected by ShapRFECV and by the scikit-learn RFECV. Below, we present the code snippet used to set up the experiment and the most important plots. The full code used in this experiment is available in the following page.

from probatus.feature_elimination import ShapRFECV
import numpy as np
import pandas as pd
import lightgbm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
# Prepare train and test data:
X, y = make_classification(n_samples=10000, class_sep=0.1, n_informative=40, n_features=50, random_state=0, n_clusters_per_class=10)
X = pd.DataFrame(X)X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)# Set up the model:
clf = lightgbm.LGBMClassifier(n_estimators=10, num_leaves=7)
# Run RFECV and ShapRFECV with the same parameters
rfe = RFECV(clf, step=1, cv=10, scoring='roc_auc', n_jobs=3).fit(X_train, y_train)
shap_elimination = ShapRFECV(clf=clf, step=1, cv=10, scoring='roc_auc', n_jobs=3)shap_report = shap_elimination.fit_compute(X_train, y_train)

Now, we can visualize the results of both methods:

The plot above presents the averaged CV Validation AUC of model performance for each round of the RFE process in both ShapRFECV and RFECV. The optimal number of features is 21 for the former, and 13 for the latter.

Let us compare the performance of the model trained on:

  • All 50 available features (baseline),
  • 13 features selected by RFECV (optimal),
  • 21 features selected by ShapRFECV (optimal),
  • 13 feature selected by ShapRFECV (baseline).

As shown in the plot, ShapRFECV provides superior results for both: CV Validation and Test AUC, compared to RFECV and the baseline model with all the available features. Not only the introduced method allows to eliminate features without the loss in performance, but also it may increase the performance of the model.

When it comes to time required to perform the feature selection in the experiment above, RFECV takes 6.11 s ± 33.7 ms, while ShapRFECV takes 10.1 s ± 72.8 ms, which shows that the latter is more computationally expensive, due to SHAP values calculation.

Conclusion

In summary, ShapRFECV is an implementation of RFE for tree-based models that benefits from the advantages of the original RFE algorithm and extends it with additional functionality suited for tree-based models. By using SHAP feature importance and supporting hyperparameter optimization, one can achieve more accurate and reliable results.

ShapRFECV has recently been released to Probatus 1.5.1. You can help us refine it by contributing to the package or making an issue in the GitHub repository.

Special thanks to Probatus team, especially Ryan Chaves and Tim Vink for their contributions to the ShapRFECV feature and this post.

--

--