# Is “BorutaShap” the Best Feature Selection Algorithm?

# What is Feature Selection?

Feature selection is an important but often forgotten step in the machine learning pipeline. The process involves reducing the dimension of the input space by selecting a relevant subset of the input features. **But why should we care?**

- Firstly, the reduction in model size and complexity makes the model easier to interpret.
- The smaller model is faster and easier to train. With fewer features, the feature engineering pipeline, not to mention data collection, can also be simplified.
- Feature selection can also reduce overfitting by removing noisy features, increasing the model’s ability to generalize to new, unseen data and improving accuracy.

The central idea behind using a feature selection algorithm is that modern datasets contain a certain amount of *redundant* or *irrelevant* features that can be removed without a significant loss of information.

A feature selection algorithm can be broken down into two components: a search technique, which proposes new subsets, and an evaluation metric, which scores those subsets. The simplest feature selection algorithm would therefore ideally test every possible subset of the input features and select the one which minimizes the error rate of the model. However, this exhaustive search is not computationally feasible for any large data set, as the number of non-empty subsets of n features is 2^n − 1, which grows exponentially.
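
A quick sketch makes the combinatorial explosion concrete:

```python
# Number of non-empty feature subsets an exhaustive search would have
# to score: 2**n - 1, which grows exponentially with n.
def n_subsets(n_features: int) -> int:
    return 2 ** n_features - 1

for n in (5, 10, 20, 30):
    print(f"{n} features -> {n_subsets(n):,} candidate subsets")
# 5 features  -> 31 candidate subsets
# 30 features -> 1,073,741,823 candidate subsets
```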

# What’s out there?

Well, feature selection methods are typically presented in three classes based on how they combine the selection algorithm and the model building: **filter**, **wrapper** and **embedded** methods.

## Filter Methods

Filter methods select the subset of features independently of the model. Their main premise is that they use some sort of correlation metric between each feature and the variable to predict. Common metrics include *“Mutual Information”*, *“Point-wise Mutual Information”* or even simply the *“Pearson Coefficient”*. As filter methods are based on these statistical metrics, they are usually less computationally intensive than wrapper methods. However, the subset of features selected is not tuned to our model, nor does it consider the relationships or interactions between the variables themselves.
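
As a minimal illustration of a filter method, here is a sketch that ranks features by their absolute Pearson correlation with the target. The function names are mine, not from any library:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def filter_select(features, target, k):
    """Keep the k features with the highest |correlation| to the target."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:k]

# Toy data: 'signal' tracks the target, 'noise' does not.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {"signal": [1.1, 1.9, 3.2, 3.8, 5.1],
            "noise": [0.3, -1.2, 0.8, 0.1, -0.5]}
print(filter_select(features, target, k=1))  # ['signal']
```

Note that no model is involved anywhere: the ranking is computed purely from the data, which is exactly the strength and the weakness of filter methods.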

## Wrapper Methods

Wrapper methods use a predictive model to score feature subsets: each new subset is used to train a model, which is then tested on a hold-out set. The prediction error is computed and used as a representative score for that subset. As you can imagine, training a new model on each subset is very computationally expensive, but this method usually provides the best-performing feature set for that particular type of model and problem. An example of this type of method would be *“Step Wise Regression”*.
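
A wrapper method can be sketched as a greedy forward search over subsets; the `score` callback stands in for training a model and evaluating it on a hold-out set (all names here are illustrative):

```python
def forward_stepwise(features, score, max_features):
    """Greedy forward selection: at each step add the feature that most
    improves the score of the current subset. `score(subset)` is any
    evaluation you choose, e.g. hold-out accuracy of a refitted model."""
    selected = []
    remaining = list(features)
    best_score = score(selected)
    while remaining and len(selected) < max_features:
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda t: t[1])
        if cand_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected

# Toy scorer: pretend only "a" and "c" carry signal.
useful = {"a": 0.4, "b": 0.0, "c": 0.3}
score = lambda subset: sum(useful[f] for f in subset)
print(forward_stepwise(["a", "b", "c"], score, max_features=3))  # ['a', 'c']
```

Each call to `score` would retrain a model in a real wrapper, which is where the computational cost comes from.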

## Embedded Methods

Embedded methods try to combine the advantages of the previous two methods by incorporating feature selection as part of model construction. The textbook example of this is “*LASSO Regression*”, which penalizes the regression coefficients with an L1 penalty. Any features with non-zero regression coefficients are essentially *“selected”* by the LASSO algorithm.
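
A minimal embedded-selection sketch, assuming scikit-learn is available: fit a Lasso and keep the features whose coefficients the L1 penalty did not drive to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features carry signal; the rest are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = [i for i, coef in enumerate(lasso.coef_) if coef != 0.0]
print(selected)  # the noise features' coefficients are driven to exactly zero
```

Here selection happens as a by-product of fitting, with no separate search loop.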

# What is Boruta?

“**Boruta**” is an elegant wrapper method built around the **Random Forest** model. The algorithm is an extension of the idea introduced in the “**Party On**” paper, which determines feature importance by comparing the relevance of the real features to that of random probes. In short, the algorithm operates as follows (if you would like a more detailed step-by-step breakdown, I would suggest this Medium **article**):

- Start by creating copies of all the features in the data set, naming them shadow + feature_name, then shuffle these newly added features to remove their correlations with the response variable.
- Run a random forest classifier on the extended data set with the random shadow features included, then rank the features using a feature importance metric (the original algorithm used permutation importance as its metric of choice).
- Create a threshold using the maximum importance score among the shadow features, then assign a hit to any feature that exceeded this threshold.
- For every undecided feature, perform a two-sided T-test of equality.
- Deem attributes with importance significantly lower than the threshold ‘unimportant’ and remove them permanently from the process; deem attributes with importance significantly higher than the threshold ‘important’.
- Remove all shadow attributes and repeat the procedure until an importance has been assigned to each feature, or the algorithm has reached the previously set limit of random forest runs.
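
The shadow-feature step above can be sketched in a few lines. This is an illustrative sketch, not the full algorithm: a simple absolute-correlation score stands in for the random-forest importance, and `boruta_iteration` is a name I made up:

```python
import numpy as np

def boruta_iteration(X, y, importance, rng):
    """One Boruta step (an illustrative sketch, not the full algorithm):
    extend the data with shuffled 'shadow' copies of every column, score
    all columns with `importance(X, y)`, and record a 'hit' for every
    real feature that beats the best shadow feature."""
    n_features = X.shape[1]
    shadows = X.copy()
    for j in range(n_features):
        rng.shuffle(shadows[:, j])          # break each shadow's link with y
    scores = importance(np.hstack([X, shadows]), y)
    threshold = scores[n_features:].max()   # best shadow importance
    return scores[:n_features] > threshold

# Stand-in importance: absolute correlation with y (Boruta itself would
# use a random-forest importance here).
def abs_corr(X, y):
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)  # only feature 0 matters
hits = boruta_iteration(X, y, abs_corr, rng)
print(hits)  # feature 0 registers a hit
```

Repeating this over many random-forest runs and testing the accumulated hit counts is what turns one noisy comparison into a statistically grounded accept/reject decision.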

In my opinion, “Boruta” is one of the best automated feature selection algorithms available, as it provides accurate and stable results. However, it can be improved in two areas: its computational complexity and the metric it uses to rank a feature’s importance. Enter SHAP.

# SHAP (SHapley Additive exPlanations)

SHAP is a unified approach to explaining the output of any machine learning model. SHAP connects game theory with local explanations to create the only consistent and locally accurate explainer. At a **very** high level, SHAP calculates the average marginal contribution of each feature across all permutations at a local level. As this method is additive, the mean of these marginal contributions can then be used to obtain a global feature importance.
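
The “average marginal contribution across all permutations” idea can be made concrete with a tiny exact (and therefore exponential-time) implementation; `shapley_values` is an illustrative name, not part of the shap package:

```python
from itertools import permutations
from math import factorial

def shapley_values(model, instance, baseline):
    """Exact Shapley values for one prediction (exponential-time sketch):
    average each feature's marginal contribution over every ordering in
    which features can be 'revealed'. Absent features take their value
    from `baseline`."""
    n = len(instance)
    phi = [0.0] * n
    for order in permutations(range(n)):
        x = list(baseline)
        prev = model(x)
        for j in order:
            x[j] = instance[j]        # reveal feature j
            curr = model(x)
            phi[j] += curr - prev     # marginal contribution in this ordering
            prev = curr
    return [p / factorial(n) for p in phi]

# For an additive model the Shapley values recover each term's contribution.
model = lambda x: x[0] + 2 * x[1]
print(shapley_values(model, instance=[3.0, 1.0], baseline=[0.0, 0.0]))  # [3.0, 2.0]
```

Averaging the absolute values of these per-instance attributions over a data set is what yields the global importance mentioned above.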

SHAP has truly unified the field of model interpretability, and in my eyes it is one of the most impactful recent additions to the machine learning community. If you haven’t already, I would highly recommend delving deeper into interpretable machine learning, and there is no better place to start than Christoph Molnar’s *“Interpretable Machine Learning: A Guide for Making Black Box Models Explainable”*.

However, one flaw is that SHAP’s Kernel explainer, which has to create and measure the marginal contribution of every coalition in the dataset, is computationally expensive. Luckily, as the “Boruta” algorithm is based on a Random Forest, there is a solution: TreeSHAP, which provides an efficient estimation approach for tree-based models, reducing the time complexity to O(TLD²), where T is the number of trees, L the maximum number of leaves and D the maximum depth.

# The Problem with Boruta / Boruta + Shap

Previously we outlined two major problems with the original implementation of the “Boruta” algorithm: the first being computational complexity, the second being the consistency/accuracy of the feature rankings themselves. However, both problems can be resolved simply by replacing the feature importance algorithm (permutation importance) with the TreeSHAP explainer.

Just to reiterate: the “Boruta” algorithm adds noise to the data set by creating random permutations of the original features. It then fits a Random Forest to this noisy data and uses a feature importance ranking to decide whether or not each feature should be kept.

Traditionally, global feature importance values of tree ensemble methods are calculated in three primary ways:

- Gain: A classic approach, calculated as the total reduction of loss or impurity contributed by all splits on a given feature. This is the default feature importance method implemented in Scikit-learn’s tree ensemble methods.
- Split Count: A second common approach is simply to count how many times a feature is used to make a split inside the tree structure.
- Permutation: A third common approach is to randomly permute the values of a feature in the test set and then observe the change in the model’s error. If a feature’s value is important then permuting it should create a large increase in the model’s error.
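
The permutation approach can be sketched directly; `predict` here is a hand-written stand-in for a trained model, and the function name is illustrative:

```python
import numpy as np

def permutation_importance(predict, X, y, metric, rng, n_repeats=5):
    """Sketch of permutation importance: shuffle one column at a time and
    record how much the model's score drops. `predict(X)` returns
    predictions, `metric(y, y_hat)` a score where higher is better."""
    base = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])     # break feature j's link with the target
            drops.append(base - metric(y, predict(Xp)))
        importances[j] = np.mean(drops)
    return importances

# Hand-written stand-in for a trained model: it only ever uses feature 0.
predict = lambda X: 2 * X[:, 0]
metric = lambda y, y_hat: -np.mean((y - y_hat) ** 2)  # negative MSE
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 2 * X[:, 0]
imp = permutation_importance(predict, X, y, metric, rng)
print(imp)  # feature 0 gets a large score, feature 1 exactly zero
```

The repeated refits-free shuffling is also why this method's cost scales with both the number of rows and the number of columns.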

The original implementation of the “Boruta” package in R uses the permutation-based approach. This is definitely the most consistent method of the three; however, it scales badly:

Surprisingly, there is no Python implementation of the original “Boruta” package. There is an implementation, “Boruta_py”, that uses the Mean Decrease in Impurity as its feature importance score. Despite its speed, **neither this nor split count are reliable measures of global feature importance**, despite their widespread use.

# Comparison

To test the benefits of “BorutaShap” I created a Python implementation that includes all three metrics (Gain, Shap and Permutation). I then compared the various methods on three test datasets to evaluate their performance.

In each iteration of the test, a default Scikit-learn Random Forest was used as the base model. It should be noted, however, that both the Shap and Permutation importance metrics are model agnostic, whereas the Gain importance only works for tree-based models.

**Gain Importance:** To compute the gain metric, the base “*feature_importances_*” attribute implemented in Scikit-learn was used. This approach is also sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in node impurity, weighted by the probability of reaching that node, averaged over all trees of the ensemble.

**Permutation Importance:** For the Permutation Importance, “*sklearn.inspection*”’s implementation of “permutation_importance” was used. This method calculates the mean decrease in the model’s accuracy when a feature in the dataset is randomly shuffled. Shuffling a feature breaks its relationship with the target, so the drop in model accuracy is indicative of how much the model depends on that feature.

**SHAP Importance:** The SHAP Python package created by “slundberg” was used to calculate the importance values. Within this package there are numerous explainers optimized for different model structures; the TreeExplainer, a fast, model-specific alternative to KernelSHAP for tree-based models, is therefore used to explain the behavior of the Random Forest model.

## Ozone Dataset

The Los Angeles ozone pollution dataset contains 366 daily observations of 13 variables from 1976. This is a regression problem where the goal is to predict the daily maximum one-hour-average ozone reading (‘V4’) using the other variables. The variables are as follows:

- “V1” Month: 1 = January, …, 12 = December
- “V2” Day of month
- “V3” Day of week: 1 = Monday, …, 7 = Sunday
- “V4” Daily maximum one-hour-average ozone reading
- “V5” 500 millibar pressure height (m) measured at Vandenberg AFB
- “V6” Wind speed (mph) at Los Angeles International Airport (LAX)
- “V7” Humidity (%) at LAX
- “V8” Temperature (degrees F) measured at Sandburg, CA
- “V9” Temperature (degrees F) measured at El Monte, CA
- “V10” Inversion base height (feet) at LAX
- “V11” Pressure gradient (mm Hg) from LAX to Daggett, CA
- “V12” Inversion base temperature (degrees F) at LAX
- “V13” Visibility (miles) measured at LAX

As seen below, all three variants of the “Boruta” algorithm were used to subset the Ozone dataset, with each variant achieving different results. Both “Shap” and “Gain” selected similar subsets, although “Shap” also deemed “V1”, the month of the year, to be a significant feature. In contrast, the “Permutation” variant was much more lenient and considered both “V1” and “V5” to be significant features.

## Breast Cancer Dataset

The Wisconsin breast cancer data set contains 569 instances with 32 features. Each observation relates to a single patient, and the challenge is to classify whether the scan is benign or malignant. The features describe characteristics of the cell nuclei and are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. As described in the UCI Machine Learning Repository, the attributes are:

- a) radius (mean of distances from center to points on the perimeter)
- b) texture (standard deviation of gray-scale values)
- c) perimeter
- d) area
- e) smoothness (local variation in radius lengths)
- f) compactness (perimeter² / area — 1.0)
- g) concavity (severity of concave portions of the contour)
- h) concave points (number of concave portions of the contour)
- i) symmetry
- j) fractal dimension (“coastline approximation” — 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.

From the figure below we can observe a difference not only between the feature subsets produced but also in the ranking of the individual features. The fact that “Shap” provides the only consistent interpretation means these outputs can also be used for inference about the model’s behavior at a global level. To test the quality of the subsets produced by the three variants, I performed a 5-fold cross-validation on each of the three subsets; the results are tabulated below.

Here we can see that the subset produced by the “Shap” variant actually increases the accuracy of our “Random Forest” classifier by one percent. Although this may not sound all that impressive, it reduced the feature subspace from 32 features to 21, a reduction of approximately 28%, so we now have a more accurate and simpler model than before. One thing to note is that the Gain variant achieved the same accuracy as the “Shap” variant but chose a larger, different subset of features. The last column shows the time taken by each variant: the Gain variant is significantly faster than the others, as it uses the structure of the model itself and is therefore independent of the number of observations.

## Madelon Dataset

According to the UCI Machine Learning Repository, Madelon is an artificial data set containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features; 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. 480 randomized features were then added to create a data set containing 4,400 observations and 500 columns.

Again the subsets produced are quite different, with “Shap” choosing 15 features as statistically significant, whereas the “Gain” and “Permutation” variants chose 19 and 20 features respectively. Surprisingly, all three variants achieved similar accuracy levels; however, the model produced from the “Shap” subset is much simpler, with almost 25% fewer features than the other two models. One comparison to note is the linear dependence of both “Shap” and “Permutation” on the number of observations in the data set, with both algorithms having a much longer run time relative to the “Gain” variant. Another interesting aspect is how different the individual feature rankings are, with each variant essentially producing its own interpretation of the random forest model.

## Time Complexity

When running the above analysis it was hard not to notice the differences between the run times of each variant, so I decided to investigate further by creating two experiments which highlight each variant’s dependency on the number of rows and columns. As we can see below, the Gain/Gini method is independent of the size of the data set, as it uses the internal structure of the tree to interpret the model; this allows its runtime to stay consistent across various data sizes. Both the “Shap” and “Permutation” algorithms, in contrast, depend on the input data to create their interpretations, meaning they have scaling issues. “Permutation” importance is linearly dependent on both the number of observations and the number of columns, while “Shap’s” TreeExplainer is linearly dependent on the number of observations but not on the number of columns, depending instead on the complexity of the tree, as seen above: O(TLD²).

This creates a dilemma: despite the “Shap” implementation’s relative success in both accuracy and time complexity, the algorithm will struggle on huge data sets from a time perspective. Furthermore, the “Boruta” algorithm cannot be parallelized, as it is computed sequentially. Instead, I propose reducing the runtime by sampling a proportion of your data, at the expense of some estimation variance.

Rather than simply taking a random sample at each iteration, I created a method to find the smallest representative sample at each iteration. To do this I scored every instance with an Isolation Forest, which returns a score between -1 and 1 describing how anomalous that data point is. Then, starting at a sample size of 5% of the original data, I take a random sample in which the probability of selection is weighted by how weird/strange each data point is according to the Isolation Forest. This sample is then compared to the original distribution using a KS-test, and only samples that pass at the 0.95 confidence level are accepted. If no acceptable sample is found within a set number of tries, the sample size is increased by 5%, and so on. In testing, this method performed really well, reducing the run time on some problems by up to 70%. It does, however, add more variance to the “Shap” values themselves; despite this, the overall ranking and feature subset produced remain an accurate representation of the entire data set.
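
The sampling loop can be sketched as follows. This is a simplified stand-in: a plain score vector replaces the Isolation Forest output, the selection weighting is hypothetical, and the KS comparison uses the large-sample critical-value approximation:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def representative_sample(scores, rng, start_frac=0.05, tries=20):
    """Grow the sample by 5% until a weighted random draw passes a KS
    comparison against the full distribution at the 0.95 confidence level
    (large-sample critical value 1.36 * sqrt((n + m) / (n * m)))."""
    n = len(scores)
    w = np.abs(scores) + 1.0        # hypothetical weighting: weirder points more likely
    weights = w / w.sum()
    frac = start_frac
    while frac <= 1.0:
        m = max(2, int(frac * n))
        crit = 1.36 * np.sqrt((n + m) / (n * m))
        for _ in range(tries):
            idx = rng.choice(n, size=m, replace=False, p=weights)
            if ks_statistic(scores[idx], scores) <= crit:
                return idx
        frac += 0.05
    return np.arange(n)             # fall back to the full data set

rng = np.random.default_rng(1)
scores = rng.normal(size=2000)      # stands in for Isolation Forest scores
idx = representative_sample(scores, rng)
print(f"accepted a sample of {len(idx)} of {len(scores)} rows")
```

The accepted index set is then the subsample the Boruta iteration runs on, trading some variance in the resulting SHAP values for a shorter run time.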

Before settling on this approach I implemented another sampling technique, inspired by the K-means sampling approach provided by the “Shap” Kernel Explainer. This method clusters the data and uses the centroids found by the K-means algorithm as the sample points; these samples are then weighted by the number of points in the cluster they represent. This method is attractive as the number of samples is very low, providing excellent improvements in speed. However, the overall feature ranking and feature subset it produced left a lot to be desired.

# BorutaShap

After all that, I decided to package this code up and publish it on PyPI for anyone to use. The package can easily be installed with pip as follows.

`pip install BorutaShap`

This package has a few nice features that would certainly stand out to me as a potential user.

- It is the only implementation I know of that combines the “Boruta” algorithm with Shapley values.
- It can be used with any tree-based learner of your choice. (In the future this may be expanded depending on demand.)
- Despite “Shap’s” lack of speed, a unique sampling method has been implemented to speed up the process; a gain importance metric can also be used if further speed is desired.
- The results can be easily visualized, and if the data set is too large to visualize, the results can simply be exported to a csv.

## Example

In my opinion the API is very accessible and user friendly. To demonstrate, I will go through a simple feature selection process on the Boston House Pricing data set. This is a staple data set in the machine learning community where the challenge is to use 13 real-valued features to predict the median house value. More information on this data set can be found here.

In order to use the “BorutaShap” feature selection tool we simply initialize it by creating an instance. This initialization takes a maximum of five parameters: a tree-based model of your choice, for example a “Decision Tree” or “XGBoost” (the default is a “Random Forest”); the importance metric you would like to evaluate the features with, either Shapley values (the default) or Gini importance; a flag specifying whether the problem is classification or regression; a percentile parameter, which takes a percentage of the max shadow feature, making the selector less strict; and finally a p-value, or significance level, beyond which a feature will be either rejected or accepted. In this example we will use all the default parameters.

After the instance has been created, we simply call the fit function, which also takes five parameters. X and y are self-explanatory, and the random state has been added for reproducibility. n_trials refers to the number of trials, or iterations of the algorithm, that you wish to run. There is also a boolean sample parameter that, when **True**, uses the sampling technique mentioned above to speed up the process.

As we can see, the algorithm has confirmed 9 out of the 13 features to be important predictors of our target variable, whereas 4 out of the 13 features have been rejected, as they consistently performed worse than the random variables.

In this case zero attributes were tentative. A tentative feature is one the algorithm is unsure whether to reject or accept. When this case arises you have three options: run the algorithm for more iterations; look at the visual graph and decide yourself whether it is important; or use the TentativeRoughFix() function, which compares the median importance value of each tentative feature with the median importance of the maximum shadow feature to decide whether to accept or reject it.

To visualize the results you can simply call the plot function, which has a few parameters. The log of the y-axis is chosen by default, as it leads to a better visualization; without it, the boxes are hard to see due to the difference in scale. The *which_features* parameter has four different modes, ‘all’, ‘accepted’, ‘rejected’ and ‘tentative’, as the plot sometimes becomes overcrowded. If this is still the case, there is a function to save your results to a csv to make them easier to inspect.

We can then simply call the subset function to return the optimal feature subset of the original data frame to either save or use for modelling purposes.

For more examples of how to use this package please look at these notebooks. The following embedded notebook shows how to use “BorutaShap” with various tree-based models. I thought the results were interesting, as each model returns its own optimal subset of the data, which truly shows the importance of using your selected model for the feature selection process.

# Conclusion

“BorutaShap” definitely provides the most accurate subset of features when compared to both the “Gain” and “Permutation” methods. Although I know “there is no free lunch”, the “BorutaShap” algorithm is a great choice for any automatic feature selection task. Not only does this algorithm provide the best subset of features, but in theory it is model agnostic, allowing you to replace the “Random Forest” with your intended model. The “Shapley” values also provide the most accurate and consistent global rankings; combined with the variance from each iteration, this makes an excellent model inference tool, giving the user true insight into the inner workings of their black box model.