Selecting the Best Feature Selection Method

Insights based on real-world datasets

vatvengers
Apr 21, 2021

Written By: Yuval Cohen, Noa Cohen and Aviad Atlas.


Introduction

One of the subjects that every data scientist, junior or senior, comes across in every data science project is feature selection. In this two-part article, we want to share our experience regarding real-world problems. In Part 1 below, we present a time analysis of the different feature selection methods we used and describe how they affect the accuracy of our models. In Part 2, we will address the interesting idea of combining feature selection methods — how to do it and why you should.

Our data science team at Blue dot (AKA — The Vatvengers) deals with production issues as well as research tasks and new ideas. One of the problems that we tackle a lot is feature selection, as you can imagine. Some of our models contain a vast number of features that can slow the training process and cause a loss of accuracy on the test set. So, we decided to take a deep dive into the problem and came back with new and interesting insights.

Methods Descriptions

ReliefF

The main idea behind ReliefF is to estimate the quality of a feature according to how well it distinguishes between samples that are close to each other. ReliefF belongs to the Relief family of algorithms and is the variant designed for multi-class classification problems. You can find more information about this method here.

import numpy as np
import pandas as pd
from ReliefF import ReliefF

def relieff(X_train, y_train, n_neighbors, n_features_to_keep):
    fs = ReliefF(n_neighbors=n_neighbors, n_features_to_keep=n_features_to_keep)
    fs.fit_transform(np.array(X_train), np.array(y_train))
    # rank features by their ReliefF scores and keep the top n
    pos = (pd.DataFrame(fs.feature_scores.reshape(-1, 1))
           .sort_values(by=0, ascending=False)
           .head(n_features_to_keep)
           .index.tolist())
    return list(X_train.columns[pos])

Chi-square

In this method, we calculate chi-square statistics between each feature variable and the target variable to understand the relationship between them. If the target variable is independent of the feature variable, we can discard that feature variable. If they are dependent, the feature variable is important. You can find more information about this method here.

from sklearn.feature_selection import SelectKBest, chi2

def chi_square(X_train, y_train, n):
    # score each feature with the chi-square statistic and keep the top n
    selector = SelectKBest(chi2, k=n)
    selector.fit(X_train, y_train)
    cols = selector.get_support(indices=True)
    cols_names = list(X_train.iloc[:, cols].columns)
    return cols_names

Boruta

The basic concept is that we add a shadow (shuffled) copy of each feature in our set, train a tree-based model (e.g., Random Forest) on the extended data, and then filter out features that performed poorly compared to the shadow features according to the model’s built-in feature importance. You can find a more in-depth explanation here.

from boruta import BorutaPy
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_jobs=-1)

def boruta(clf, X, y):
    selector = BorutaPy(estimator=clf,
                        n_estimators='auto',
                        max_iter=100)  # number of trials to perform
    # fit Boruta (it accepts np.array, not pd.DataFrame)
    selector.fit(np.array(X), np.array(y))
    green_area = X.columns[selector.support_].to_list()      # confirmed features
    blue_area = X.columns[selector.support_weak_].to_list()  # tentative features
    return green_area, blue_area

Permutation Importance

In this method, we shuffle each feature separately and re-evaluate the already-fitted model. We keep the features whose shuffling causes the largest drop in accuracy (or any other metric). You’ll find additional explanations in sklearn’s documentation.

from sklearn.inspection import permutation_importance

# named *_fs so it does not shadow sklearn's permutation_importance
def permutation_importance_fs(clf, X, y, n_repeats):
    result = permutation_importance(clf, X, y, n_repeats=n_repeats, random_state=0)
    # Selection rule proposed in scikit-learn's docs: keep a feature only if its
    # mean importance is more than two standard deviations above zero
    chosen_feats = []
    for i in result.importances_mean.argsort()[::-1]:
        if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
            chosen_feats.append(X.columns[i])
    return chosen_feats, result

RFE

RFE is a wrapper-type feature selection algorithm. RFE works by searching for a subset of features, starting with all features in the training dataset and successively removing features until the desired number remains. How is it different from plain backward selection? After creating the full model, we measure the importance of each feature and sort the predictors from highest to lowest. At each iteration, we drop the least important predictors. To compute feature importance, we can either use ML algorithms that provide these scores directly (e.g., Random Forest) or use a more general approach that is independent of the full model. See additional explanations here.

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

def rfe(X_train, y_train, n):
    model = DecisionTreeClassifier()
    selector = RFE(model, n_features_to_select=n, step=1, verbose=2)
    selector = selector.fit(X_train, y_train)
    return selector.support_  # boolean mask of the selected features
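
Unlike the other helpers above, this one returns a boolean mask rather than a list of column names. A minimal usage sketch (the value of n is arbitrary, for illustration only):

mask = rfe(X_train, y_train, n=50)           # hypothetical n; pick per your dataset
selected_cols = list(X_train.columns[mask])  # map the mask back to feature names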

MI

Mutual Information (MI) is used to measure the dependency between two variables. To describe MI, we start with Shannon’s definition of entropy:

H(Y) = −∑_y p(y) log p(y)

This equation represents the uncertainty in the output Y. Suppose we observe a variable X; then the conditional entropy is given by:

H(Y|X) = −∑_x p(x) ∑_y p(y|x) log p(y|x)

This implies that by observing a variable X, the uncertainty in the output Y is reduced. The decrease in uncertainty is given as:

I(X; Y) = H(Y) − H(Y|X)

This method can be used in both supervised and unsupervised feature selection tasks. We can apply it to both regression and classification, since we are only interested in the relationship between any two variables, or between each variable and the response variable. In our case, we measured the MI score between each feature and the label and created a list of scores. Now, to filter out the less meaningful features, we have several options. Let’s look at three of them:

  1. K-best features: Choose the top k features with the highest MI scores.
  2. Setting a threshold: Take all features whose MI score is above a given threshold.
  3. Cumulative “energy”: Calculate a new score for each feature, s_i = MI_i / ∑_j MI_j.

In other words, the new score of the i-th feature is the ratio between its MI score and the sum of all MI scores. We sort these scores from highest to lowest and compute a cumulative sum over them. Now we can set an “energy” threshold (e.g., 80%) and take the smallest set of features whose cumulative sum reaches that threshold.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def mi_fs_classification(X, y, discrete_features):
    # MI score between each feature and the label
    mi = pd.DataFrame(mutual_info_classif(X, y, discrete_features=discrete_features, random_state=42, n_neighbors=3, copy=True))
    return mi
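
To illustrate the cumulative “energy” option (option 3 above), here is a minimal sketch that consumes the scores returned by mi_fs_classification; the select_by_energy name and the 0.8 default are ours, for illustration only:

def select_by_energy(X, mi, energy=0.8):
    # mi is the single-column DataFrame returned by mi_fs_classification;
    # its row order matches the column order of X
    scores = mi[0]
    ratios = (scores / scores.sum()).sort_values(ascending=False)
    cumulative = ratios.cumsum()
    # keep features until the cumulative "energy" first reaches the threshold
    n_keep = int((cumulative < energy).sum()) + 1
    return list(X.columns[ratios.index[:n_keep]])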

Results

The following table compares the different methods described above. We took three tabular datasets from our domain and tested the accuracy and runtime of each method. For training we used a tree-based model, a Random Forest. For each dataset, we created five test sets, which enabled us to compare the accuracy of the feature selection methods more precisely. We also ran a t-test (presented as a tuple of t-value, p-value) between the accuracies obtained with all features and the accuracies obtained with only the features selected by each method. The datasets’ attributes are as follows:

  • Dataset1 — Multiclass classification. Train set of 100K rows, with 4391 features.
  • Dataset2 — Multiclass classification. Train set of 10K rows, with 400 features.
  • Dataset3 — Binary classification. Train set of 398K rows, with 44 features.
[Table: Feature Selection Methods Comparison]

The code to calculate the t-test, using the scipy package:

from scipy.stats import ttest_ind
ttest_ind(accuracies_all_feats, accuracies_selected_feats)

This outputs the t-value and the corresponding p-value.

If you’re not familiar with the concept of hypothesis tests, here’s a description of our t-test usage:

  1. The null hypothesis (H0) is always the conservative assumption. In our case, we assume that the accuracy of the full model is equal to the accuracy of the partial model.
  2. The alternative hypothesis (H1) is the assumption we want to verify. In our case, we want to check whether the accuracy of the full model is not equal to the accuracy of the partial model. This is called a two-tailed test because we test both directions (i.e., x > y or x < y).

If we succeed in rejecting H0, we can conclude H1. But what happens if we don’t reject H0? In that case, we cannot claim a significant difference, so we keep essentially the same accuracy with fewer features.
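
To make the decision rule concrete, here is a minimal sketch of how we read the t-test output; the 0.05 significance level is a common convention, not something dictated by the test:

from scipy.stats import ttest_ind

t_value, p_value = ttest_ind(accuracies_all_feats, accuracies_selected_feats)
alpha = 0.05  # conventional significance level (our choice)
if p_value < alpha:
    print("Reject H0: the accuracies differ significantly.")
else:
    print("Cannot reject H0: no significant difference in accuracy.")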

Highlights from the table:

1. The Boruta method tends to filter out many of the features. It’s an advantage that the number of features to keep doesn’t need to be defined, but sometimes too many features are removed (dataset2).

2. The Chi2 method works very fast, and we see that it is robust across the different datasets. The only downside is that we need to define how many features to keep (perhaps a grid search is possible; see the sketch after this list).

3. The permutation importance method is quite robust (and doesn’t require defining the number of features). The downside is that it runs for a long time.

4. The RFE method isn’t very feasible at this scale; we had to stop it before it finished running on dataset1.
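
As a rough illustration of the grid-search idea mentioned in point 2 (not something we ran in this comparison), k can be tuned by wrapping SelectKBest in a scikit-learn Pipeline; the candidate values below are placeholders:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: your training data (chi2 requires non-negative features)
pipe = Pipeline([
    ('select', SelectKBest(chi2)),
    ('clf', RandomForestClassifier(n_jobs=-1)),
])
param_grid = {'select__k': [50, 100, 200, 400]}  # placeholder candidates for k
search = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)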

Summary

In this article, we briefly described some useful feature selection methods and compared them on real-world datasets. We showed that the methods are generally useful for reducing dimensionality, though they usually do not improve the accuracy of the model (and in some cases, the accuracy dropped considerably).

In Part 2, we will show a more complex method of feature selection, combining the results of the different methods above. Please join us!
