Feature Selection Methods with Code Examples

Haitian Wei · Published in Analytics Vidhya · Aug 21, 2019 · 6 min read

Why feature selection?

Feature selection is the process of finding and selecting the most useful features in a dataset. It is a crucial step of the machine learning pipeline. We should care about feature selection because unnecessary features hurt our model in several ways:

  • overfitting: worse generalization performance on the test set
  • slower training
  • reduced model explainability

Major Types of feature selection tools

In general, there are three types of feature selection tools (although I don’t know who first defined this taxonomy):

  • Filter-based: Filter approaches use a ranking or scoring criterion to filter out the features that are less useful.
  • Wrapper-based: Wrapper methods treat the selection of a set of features as a search problem. They generally select features by directly testing their impact on the performance of a model.
  • Embedded: Embedded methods use algorithms that have built-in feature selection, for instance Lasso.

Now, let’s go through each method in more detail.

1 — Filter-based Method

Filter methods are usually applied as a preprocessing step.

1.1 — Variance Threshold

Variance thresholds remove features whose values don’t change much from observation to observation (i.e. their variance falls below a threshold). These features provide little value. We can easily apply this method using sklearn feature selection tools.

from sklearn.feature_selection import VarianceThreshold
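
A minimal sketch of how this could look in practice (the toy data and the 0.1 threshold are arbitrary examples):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy data: the first column is constant, the third is nearly constant
X = np.array([[0, 2, 0.1],
              [0, 1, 0.2],
              [0, 3, 0.1],
              [0, 2, 0.3]])

# drop features whose variance is at or below 0.1
selector = VarianceThreshold(threshold=0.1)
X_high_variance = selector.fit_transform(X)
print(selector.get_support())  # boolean mask of the kept features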

1.2 — Correlation Threshold

Correlation thresholds remove features that are highly correlated with others (i.e. whose values change very similarly to another feature’s). These features provide redundant information.

First we calculate all pair-wise correlations. Then, if the correlation between a pair of features is above a given threshold, we remove the one that has the larger mean absolute correlation with the other features.
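
A rough sketch of this procedure, assuming the features live in a pandas DataFrame (drop_correlated_features is a hypothetical helper and the 0.9 cutoff is just an example):

import pandas as pd

def drop_correlated_features(X, threshold=0.9):
    corr = X.corr().abs()      # pair-wise absolute correlations
    mean_corr = corr.mean()    # mean absolute correlation of each feature
    to_drop = set()
    for i, col_i in enumerate(corr.columns):
        for col_j in corr.columns[i + 1:]:
            if corr.loc[col_i, col_j] > threshold:
                # drop whichever member of the pair is, on average,
                # more correlated with the remaining features
                to_drop.add(col_i if mean_corr[col_i] > mean_corr[col_j] else col_j)
    return X.drop(columns=sorted(to_drop))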

1.3 — Univariate Feature Selection

Univariate feature selection examines each feature individually to determine the strength of its relationship with the response variable. There are a lot of different options for univariate selection.

  • Pearson’s Correlation

One of the simplest methods for understanding a feature’s relation to the response variable is the Pearson correlation coefficient, which measures the linear correlation between two variables.
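
As a minimal sketch (the synthetic data here is purely illustrative), scipy’s pearsonr can score each feature against a continuous target:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = 2 * X[:, 0] + rng.randn(100)  # the target mainly depends on feature 0

# report the correlation coefficient and p-value for every feature
for i in range(X.shape[1]):
    r, p_value = pearsonr(X[:, i], y)
    print(f"feature {i}: r = {r:.2f}, p = {p_value:.3f}")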

  • Chi-Squared

In this method, we calculate the chi-squared statistic between each feature and the target and select only the desired number of features with the best chi-squared scores.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# chi2 requires non-negative feature values (e.g. counts or frequencies);
# keep the 2 features with the highest chi-squared scores
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)

If the features are categorical, calculate a chi-square (χ2) statistic between each feature and the target vector. However, if the features are quantitative, compute the ANOVA F-value between each feature and the target vector.

The F-value scores examine if, when we group the numerical feature by the target vector, the means for each group are significantly different.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# keep the 2 features with the highest ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=2)
X_kbest = fvalue_selector.fit_transform(X, y)

  • Mutual Information and maximal information coefficient (MIC)

A more robust option for correlation estimation is mutual information, which measures mutual dependence between variables.

It can be inconvenient to use directly for feature ranking, though, for two reasons. First, it is not a metric and it is not normalized (i.e. it doesn’t lie in a fixed range), so MI values can be incomparable between two datasets. Second, it can be inconvenient to compute for continuous variables: in general the variables need to be discretized by binning, but the mutual information score can be quite sensitive to bin selection.

The maximal information coefficient is a technique developed to address these shortcomings. It searches for an optimal binning and turns the mutual information score into a metric that lies in the range [0, 1]. In Python, MIC is available in the minepy library.
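
Alternatively, scikit-learn ships mutual-information estimators that handle continuous features without manual binning. A minimal sketch (the iris dataset and k=2 are arbitrary examples):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# keep the 2 features with the highest estimated mutual information
mi_selector = SelectKBest(mutual_info_classif, k=2)
X_kbest = mi_selector.fit_transform(X, y)
print(mi_selector.scores_)  # estimated mutual information per feature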

2 — Wrapper-based Method

Wrapper methods treat feature selection as a search problem: candidate subsets of features are evaluated by training a model on each subset, and the combination that produces the best result for a specific machine learning algorithm is selected. A downside of this approach is that testing all possible combinations of features is computationally very expensive, particularly if the feature set is large, so in practice greedy search strategies such as forward and backward selection are used.

Even so, this remains a computationally expensive way of finding the best-performing subset of features, since it requires many calls to the learning algorithm.

2.1 — Forward Search

In the first step of forward feature selection, the performance of the classifier is evaluated with respect to each feature individually, and the best-performing feature is selected.

In the second step, that first feature is tried in combination with each of the remaining features, and the two-feature combination that yields the best performance is selected. The process continues until the specified number of features has been selected.

from sklearn.ensemble import RandomForestClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

# greedily add features until 15 are selected, scoring each candidate
# subset with 4-fold cross-validated ROC AUC
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
                                             k_features=15, forward=True,
                                             verbose=2, scoring='roc_auc', cv=4)
feature_selector = feature_selector.fit(X, y)

2.2 — Backward Search

Backward feature selection, as the name suggests, is the exact opposite of the forward feature selection we studied in the last section. In the first step, each feature is removed in turn from the full feature set and the performance of the classifier is evaluated; the set that yields the best performance is retained.

In the second step, again one feature at a time is removed from the remaining features and the performance of every resulting combination is evaluated. This process continues until only the specified number of features remains in the dataset.

from sklearn.ensemble import RandomForestClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

# greedily remove features until 15 remain, scoring each candidate
# subset with 4-fold cross-validated ROC AUC
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
                                             k_features=15, forward=False,
                                             verbose=2, scoring='roc_auc', cv=4)
feature_selector = feature_selector.fit(X, y)

2.3 — Recursive Feature Elimination

Recursive feature elimination is a greedy optimization algorithm that aims to find the best-performing feature subset. It repeatedly builds a model and sets aside the worst-performing feature (judged by the model’s coefficients or feature importances) at each iteration, then constructs the next model with the remaining features until the features are exhausted. Finally, it ranks the features based on the order of their elimination.

>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFE(estimator, n_features_to_select=5, step=1)
>>> selector = selector.fit(X, y)

3 — Embedded Method

Embedded methods use algorithms that have built-in feature selection. For example, Lasso and random forests have their own feature selection mechanisms: the Lasso regularizer forces many feature weights to zero.

We can also use a random forest to select features based on feature importance, or a LightGBM or XGBoost model, or indeed any estimator that exposes a feature_importances_ attribute.
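
A minimal sketch of both flavours using scikit-learn’s SelectFromModel (the breast-cancer dataset and the alpha value are arbitrary examples):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_breast_cancer(return_X_y=True)

# Lasso: features whose coefficient is driven to zero are dropped
lasso_selector = SelectFromModel(Lasso(alpha=0.01))
X_lasso = lasso_selector.fit_transform(X, y)

# random forest: keep features with above-average importance (the default threshold)
rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, n_jobs=-1))
X_rf = rf_selector.fit_transform(X, y)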
