Feature Selection Methods for Data Science (just a few)

Svideloc · Analytics Vidhya · Jan 31, 2020

Why Feature Selection Matters

Before we get started, let’s look into why feature selection should be a part of your modeling process. There are two main reasons. First, simpler models are (usually) better: they are easier to understand and scale, and they are cheaper to run in the long term. By reducing the number of features through feature selection, you can identify the features the model actually needs to perform well while shrinking the overall size of your data set. Second, too many features can cause over-fitting, meaning your model learns the training data very well but does not do so well on new data it has not seen before.

Basic feature selection can be done with a simple Pearson correlation or the chi-squared test, but we will focus on a few other methods for the purposes of this post.

Methods

1. Recursive Feature Elimination (RFE)

RFE aims to “select features by recursively considering smaller and smaller sets of features.” Essentially, this method trains the model on the full set of features and assigns an importance to each feature. The least important features are dropped, and the process is repeated until a specified number of features remains. See Figure 1 for a basic picture of how RFE works.

Figure 1: Recursive Feature Elimination Methodology

Example

We will look at the sklearn breast cancer data-set for a simple model. My full code can be found on my GitHub by clicking here, but I will only show parts of the code for purposes of this blog.

Let’s look at the code for logistic regression without recursive feature selection.

#No Feature Selection
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced', solver='lbfgs', random_state=42, n_jobs=-1, max_iter=500)
lr.fit(X_train, y_train)
y_guess = lr.predict(X_train)  # predictions on the training set
y_score = lr.predict(X_test)   # predictions on the hold-out test set

Now let’s look at how it looks with RFE:

from sklearn.feature_selection import RFE

# Wrap the estimator in RFE and keep only the 7 most important features
rfe = RFE(lr, n_features_to_select=7)
rfe.fit(X_train, y_train)
y_guess = rfe.predict(X_train)
y_score = rfe.predict(X_test)

Let’s look at the results of these:

Original Model Vs. RFE Selected Features Model

From here you can see that we actually score a bit higher with the RFE version, likely because we threw out some of the noise in the data. The RFE model also used only 7 features, which makes it much more efficient than the original model.
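If you want to see which 7 features RFE actually kept, the fitted selector exposes that directly. A minimal sketch, assuming X_train is a pandas DataFrame with named columns:

import pandas as pd

# rfe.support_ is a boolean mask over the original columns;
# rfe.ranking_ gives rank 1 to kept features and higher ranks to eliminated ones
selected = list(X_train.columns[rfe.support_])
print("Selected features:", selected)

ranking = pd.Series(rfe.ranking_, index=X_train.columns).sort_values()
print(ranking.head(10))  # the 7 kept features all have rank 1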

Benefits

This method should work well for most linear-type models and runs relatively quickly, which is a bonus. RFE is also computationally less demanding than the next feature selection methods we will see.

Drawbacks

Note the parameter ‘n_features_to_select.’ RFE is not the smartest feature selection method, so you need to tell it how many features you want to keep. It will run its elimination until it reaches the number of features you specified. This means you have to adjust this number a bit to find the feature sweet spot, which is how I arrived at 7 features. You could also write a for loop to try different numbers of features!
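Here is roughly what that loop could look like; a minimal sketch, assuming the usual X_train/X_test/y_train/y_test split from the full notebook, the same lr estimator from above, and accuracy as an illustrative metric:

from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score

# Try several feature counts and keep whichever scores best on the test set
scores = {}
for n in range(3, 16):
    rfe_n = RFE(lr, n_features_to_select=n)
    rfe_n.fit(X_train, y_train)
    scores[n] = accuracy_score(y_test, rfe_n.predict(X_test))

best_n = max(scores, key=scores.get)
print(f"Best number of features: {best_n} (accuracy {scores[best_n]:.3f})")

sklearn also ships RFECV, which automates this kind of search with cross-validation if you would rather not write the loop yourself.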

If you don’t specify a number of features, RFE will aim to remove half of the features. This can be problematic because you could be eliminating too many features or not enough. That is why it is imperative that you choose parameters mindfully.

2. Sequential Feature Selector (SFS)

From the documentation: SFS will “remove or add one feature at a time based on the classifier performance until a feature subset of the desired size k is reached,” where k is the desired number of features, which will be smaller than the original feature-space dimensionality, d. In Figure 2 below, the model first runs each feature as its own model and then chooses Feature 2 as the best feature. From there it pairs Feature 2 with each other feature and determines that Features 2 and 3 perform best. It groups them, adds in each remaining feature again, and sees that Features 2, 3, and 1 give the best model. It keeps iterating in this manner.

Figure 2: Sequential Feature Selection Methodology

Example

Once again, let’s look at the same example, but this time we will use Sequential Feature Selection.

# Sequential Feature Selection (forward=False means backward elimination)
from mlxtend.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(lr, k_features='best', forward=False, n_jobs=-1)
sfs.fit(X_train, y_train)

# Refit the logistic regression on just the selected feature names
features = list(sfs.k_feature_names_)
lr.fit(X_train[features], y_train)
y_score = lr.predict(X_test[features])

Let’s look at the results of this method:

Original Model vs. SFS Selected Features Model

Okay, so we eliminated 7 features and got a slightly worse score using SFS. However, I did not have to manually tune the number of features as I did with RFE, so we now have a slightly more scalable model.

Benefits

Run time for SFS is a little longer than RFE but is manageable. The main benefit is that this method chooses the number of features on its own and can be more intelligent than RFE. If you are lazy, SFS does a little more of the thinking for you than RFE does.

Drawbacks

The user has a little less control with this method, and it can fail to select the absolute best combination of features, as we saw in the example above. Nonetheless, it will still reduce the dimensionality of your data-set fairly well.
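If you want to see what SFS considered along the way, the fitted selector keeps a record of every subset it evaluated. A minimal sketch, assuming the sfs object fitted above and a recent mlxtend version that stores feature names in subsets_:

# sfs.subsets_ maps the number of features at each step to the feature names
# and the (cross-validated) score for that subset
for k, info in sorted(sfs.subsets_.items()):
    print(k, round(info['avg_score'], 4), info['feature_names'])

print("Best subset score:", sfs.k_score_)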

3. Exhaustive Feature Selector (EFS)

Exhaustive feature selection is the most robust of the three methods covered in this blog. It is a brute-force evaluation of each feature subset, meaning that it tries every possible combination of features and chooses the best-performing model. In Figure 3, we have four features. EFS tries each possible combination of those features and, in the diagram, finds that Features 1, 3, and 4 give the best model.

Figure 3: Exhaustive Feature Selection Methodology
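mlxtend also ships an exhaustive selector if you want to try this yourself. A minimal sketch, assuming the same lr estimator, a pandas DataFrame for X_train, and a cap on subset size so the run finishes in reasonable time (the min_features/max_features values here are illustrative):

from mlxtend.feature_selection import ExhaustiveFeatureSelector

# Evaluate every combination of between 1 and 4 features
efs = ExhaustiveFeatureSelector(lr, min_features=1, max_features=4,
                                scoring='accuracy', cv=5, n_jobs=-1)
efs.fit(X_train, y_train)

print("Best score:", efs.best_score_)
print("Best features:", efs.best_feature_names_)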

Benefits

If you have the computing power, you are sure to optimize your feature selection using Exhaustive Feature Selection. However, I have not used this method in practice because of the time it takes to run. It can also be useful to run a simpler feature selection method first to cut down your features, and then try EFS once you have far fewer features in your data-set.

Drawbacks

As stated earlier, EFS takes a lot of computing power. For just a four-feature model, this method requires the model to be run 15 times, as you can see in Figure 3. As the number of features increases, this turns into an extreme number of models.

For our 30-feature breast cancer data-set example, this means that the model would have to try 1,073,741,823 different combinations (2³⁰-1). This is probably not a good idea!

Note that you can specify a maximum number of features to use as a parameter when using this method, but the model will still have many iterations to run through.
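To get a feel for how much that cap helps, here is a quick back-of-the-envelope count (plain Python, nothing specific to any library):

from math import comb

# Every non-empty subset of 30 features
print(sum(comb(30, k) for k in range(1, 31)))  # 1073741823, i.e. 2**30 - 1

# Capping the search at subsets of at most 4 features
print(sum(comb(30, k) for k in range(1, 5)))   # 31930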

Conclusion

In Data Science there is never a one-size-fits-all solution to a problem, and that holds true for feature selection. The methods outlined in this post provide a few options that may be useful for eliminating excessive features from your models, but there are countless other ways to reduce features as well. It is also very reasonable to combine several methods as you adjust your models.
