THE DEFINITIVE GUIDE

Feature Selection: Wrapper Methods

5 Wrapper-based methods to choose relevant features

Elli Tzini
Analytics Vidhya

This post is the second part of a blog series on Feature Selection. Have a look at the Filter (part 1) and Embedded (part 3) methods.

Wrapper Methods

In part 1, we talked about Filter methods, which help you select features that are related to the target variable. However, I feel that we are leaving something out… Oh yes, the actual Machine Learning algorithm!

That is actually one of the core differences between Filter and Wrapper methods. The latter measures the importance of a subset of features by actually training a model on it.

The whole idea builds upon greedy search and an evaluation criterion. Greedy, because at each iteration the method chooses the locally optimal subset of features. The evaluation criterion then plays the role of the judge: it is the performance metric we use while training our model, and it depends entirely on the task. For Regression it can be the p-value or R-squared, while for Classification it can be accuracy or f1-score. In the end, the combination of features that gives the optimal results according to the evaluation criterion is selected.

Choosing the Optimal Model

We need a way to find the best model given an evaluation criterion. To that end, we turn our attention to the test error. It is known that the training error may be a poor estimate of the test error; thus, we focus on Cross-validation (CV), a powerful approach for estimating the test error.
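As a quick illustration, estimating the test error with k-fold CV in scikit-learn could look like the following minimal sketch (the data here is a synthetic placeholder, not the dataset used below):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data
scv = StratifiedKFold(n_splits=5)
# the mean score across folds is our estimate of the test performance
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=scv, scoring='f1_macro')
print(scores.mean())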

For every method, we use the Credit Card Fraud Detection dataset: it contains transactions made by credit cards, labeled as fraudulent or genuine. The dataset includes 30 features, 28 of which are the result of a PCA transformation; due to confidentiality issues, the original features are not provided. This is a classification task on a quite imbalanced dataset, and therefore I will use f1-score (macro avg) as the evaluation metric.

Forward Selection

We start with a null model (a model with no predictors). In the first round, we train 30 models with k-fold CV (why 30? because we have 30 features/variables) and add to the null model the variable that gave the lowest test error. In the second round, we repeat the same idea, fitting 29 models and adding a second variable to the one-variable model, resulting in a two-variable model. This continues until there is no statistically significant improvement of the fit (the stopping criterion).

import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# load the data and split off the target
credit = pd.read_csv('creditcard.csv')
X = credit.loc[:, credit.columns != 'Class']
y = credit['Class']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)
scv = StratifiedKFold(n_splits=5)
# create the SequentialFeatureSelector object
sfs = SequentialFeatureSelector(LogisticRegression(C=10, max_iter=10000),
                                k_features=8,
                                forward=True,
                                floating=False,
                                verbose=2,
                                scoring='f1_macro',
                                cv=scv,
                                n_jobs=-1)
# fit the selector to the training data only
sfs = sfs.fit(X_train, y_train)
selected_features = X.columns[list(sfs.k_feature_idx_)]
print(selected_features)
Selected Features

We now use these features to train the models:

# Build the final model with the selected features
from sklearn.metrics import classification_report

clf = LogisticRegression(C=10, max_iter=10000)
clf.fit(X_train[selected_features], y_train)
y_test_pred = clf.predict(X_test[selected_features])
print(classification_report(y_test, y_test_pred))
Classification Report; FS

Backward Elimination

It is also known as Recursive Feature Elimination. This approach is the opposite of the previous one: we start with all the variables and remove the least important one at a time. The code is exactly the same, with one difference: we change the parameter forward to False in SequentialFeatureSelector, as shown below.
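For reference, a sketch of the only change relative to the Forward Selection code above:

# identical to the Forward Selection setup, except for forward=False
sfs = SequentialFeatureSelector(LogisticRegression(C=10, max_iter=10000),
                                k_features=8,
                                forward=False,  # backward elimination
                                floating=False,
                                verbose=2,
                                scoring='f1_macro',
                                cv=scv,
                                n_jobs=-1)
sfs = sfs.fit(X_train, y_train)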

These are the most important features and the classification report when we train on this subset:

Selected Features
Classification Report; BE

Forward vs Backward

Backward Elimination cannot be used if the number of features is greater than the number of samples, while Forward Selection can always be used. The reason is that, with more features than samples, the full model can fit the training data perfectly and drive the training error to zero. Such a perfect fit is not a good thing: what is actually happening is that you are overfitting. This is not a problem with Forward Selection, as you start with no features and successively add one at a time.

On the other hand, Forward Selection is a greedy approach, and might include variables early that later become redundant.

Boruta

The algorithm is pretty interesting and, if you ask me, really smart. So far, the whole idea in Wrapper methods has been features competing against each other for a position in the final subset. In Boruta, instead, the features compete with randomized versions of themselves (shadow features). A feature is selected only if it performs better than the best-performing shadow feature.
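To make the idea concrete, here is a minimal sketch of how shadow features can be built as shuffled copies of the originals (the toy columns are illustrative; BorutaPy below handles this internally):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'v1': [1, 2, 3, 4], 'v2': [4, 3, 2, 1]})  # toy data
# each shadow feature is a randomly permuted copy of a real feature
shadow = X_toy.apply(lambda col: rng.permutation(col.values))
shadow.columns = ['shadow_' + c for c in X_toy.columns]
X_boruta = pd.concat([X_toy, shadow], axis=1)
print(X_boruta.columns.tolist())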

The procedure is repeated N times (the number of trials), and the final step is to compare, using a binomial distribution, the number of times each feature performed better than its shadow features. For the features in the blue area the Boruta algorithm is indecisive, while features in the green and red areas should be selected and eliminated, respectively. You can read more about it here.

import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

credit = pd.read_csv('creditcard.csv')
X = credit.loc[:, credit.columns != 'Class']
y = credit['Class']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced', max_depth=4, n_jobs=-1)
# max_iter = number of trials to perform
boruta = BorutaPy(estimator=clf, n_estimators='auto', max_iter=100)
# BorutaPy works on numpy arrays, not DataFrames
boruta.fit(X_train.to_numpy(), y_train.to_numpy())
green_area = X.columns[boruta.support_].to_list()
blue_area = X.columns[boruta.support_weak_].to_list()
print('Green area:', green_area)
print('Blue area:', blue_area)
Selected Features
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced', max_depth=4, n_jobs=-1)
clf.fit(X_train[green_area].to_numpy(), y_train)
preds = clf.predict(X_test[green_area].to_numpy())
print(classification_report(y_test, preds))
Classification Report; Boruta

Note: Boruta needs a classifier or regressor that has the attribute feature_importances_

Genetic Algorithm

A little more advanced method, but extremely interesting in my opinion. Does the term Evolutionary Algorithms ring a bell? If yes, then you already know what this is all about; if not, read this. Long story short, scientists drew inspiration from genetics and natural selection to come up with Evolutionary Algorithms, which determine the fittest individuals of a population. There are 5 stages:

Initialization

Create a population of solutions in a random manner: an arbitrary number N of different subsets of features. But what does a solution look like? It is simply a binary vector of dim = number of features, where 1 indicates the presence of a feature and 0 its absence. For example, the vector [1,0,0,1] indicates that we have 4 features in total and that this member of the population is the subset of features 0 and 3.
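A minimal sketch of this initialization step (the population size here is a hypothetical choice):

import numpy as np

rng = np.random.default_rng(42)
n_features = 30    # as in the credit card dataset
n_population = 50  # hypothetical number of candidate subsets
# each row is one member of the population: a binary mask over the features
population = rng.integers(0, 2, size=(n_population, n_features))
print(population[0])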

Fitness Assignment

Now it is time to evaluate each member of our population. To that end, we train N models, one per member, and calculate the error; if the error is large, the fitness is low. The errors of all N individuals are sorted, and the fitness value depends squarely on the position of the individual in this ranking, not on the error itself.
The fitness value is Φ(i) = k·R(i), where k is a constant in [1, 2] and R(i) is the rank of individual i. The smaller the error, the higher the rank, and so the larger the fitness value.
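A minimal sketch of this rank-based fitness assignment, using hypothetical error values:

import numpy as np

errors = np.array([0.30, 0.10, 0.25, 0.05])  # hypothetical CV errors, one per member
k = 1.5  # constant in [1, 2]
# rank 1 = largest error, rank N = smallest error
ranks = errors.argsort()[::-1].argsort() + 1
fitness = k * ranks
print(fitness)  # the member with the smallest error gets the largest fitness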

Selection Process

How do we select? Roulette-wheel selection or, by its less fun name, stochastic sampling with replacement. This method places all members of the population on a roulette wheel, with areas proportional to their fitness. Then we select half of the population at random by “spinning the wheel”, as in the sketch below.
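A minimal sketch of roulette-wheel selection, continuing from the hypothetical fitness values above:

import numpy as np

rng = np.random.default_rng(0)
fitness = np.array([1.5, 4.5, 3.0, 6.0])
# areas on the roulette are proportional to fitness
probs = fitness / fitness.sum()
# "spin the wheel" to keep half of the population (sampling with replacement)
selected = rng.choice(len(fitness), size=len(fitness) // 2, replace=True, p=probs)
print(selected)  # indices of the selected members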

Crossover Operation

In this step, we choose 2 members at random and combine their features to get 4 offspring for the new population. Then we repeat with another 2 members, and so on. When do we stop? When we have generated a population of the same size as the initial one. (A combined sketch of crossover and mutation follows the next step.)

Mutation Operation

This step exists because of the previous one. During Crossover there is a high chance of generating offspring that are very similar to their ancestors, which leads to low diversity. Thus, we mutate the offspring by changing the value of some of their features at random, as sketched below.
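A combined minimal sketch of single-point crossover (one common variant) followed by bit-flip mutation; the parents and probabilities are illustrative:

import numpy as np

rng = np.random.default_rng(1)
parent_a = np.array([1, 0, 0, 1])
parent_b = np.array([0, 1, 0, 1])
# single-point crossover: swap the tails of the two parents
point = rng.integers(1, len(parent_a))
child_1 = np.concatenate([parent_a[:point], parent_b[point:]])
child_2 = np.concatenate([parent_b[:point], parent_a[point:]])
# mutation: flip each bit with a small probability to preserve diversity
mutation_proba = 0.05
for child in (child_1, child_2):
    mask = rng.random(len(child)) < mutation_proba
    child[mask] = 1 - child[mask]
print(child_1, child_2)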

You repeat until the next generation does not significantly differ from the previous one; then you select the member that produced the lowest error.

import numpy as np
import pandas as pd
from genetic_selection import GeneticSelectionCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split

credit = pd.read_csv('creditcard.csv')
X = credit.loc[:, credit.columns != 'Class']
y = credit['Class']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)
scv = StratifiedKFold(n_splits=5)
selector = GeneticSelectionCV(LogisticRegression(C=10, max_iter=10000, random_state=42),
                              cv=scv,
                              verbose=2,
                              scoring="f1_macro",
                              max_features=5,
                              n_population=50,
                              crossover_proba=0.5,
                              mutation_proba=0.2,
                              n_generations=40,
                              crossover_independent_proba=0.5,
                              mutation_independent_proba=0.05,
                              tournament_size=3,
                              n_gen_no_change=10,
                              caching=True,
                              n_jobs=-1)
selector = selector.fit(X_train, y_train)
# get the selected features
cols = X.columns.tolist()
selected_feats = [cols[i] for i in np.where(selector.support_)[0]]
# train and test on the selected subset
clf = LogisticRegression(C=10, max_iter=10000, random_state=42)
clf.fit(X_train[selected_feats], y_train)
preds = clf.predict(X_test[selected_feats])
print(classification_report(y_test, preds))
Selected Features
Classification Report; GA

What happens though if we do nothing? If we do not remove any features…

Classification Report; No feature Selection

Clearly, by using any of the above methods we gain 7–14% in f1-score (macro avg).

Conclusion

Wrapper methods measure the importance of a feature based on its usefulness while training the Machine Learning model on it. Filter methods, on the other hand, select features based on their relation to the target variable. Unfortunately, there is no recipe to follow. You may wish to use Filter methods because they are computationally less expensive, or a combination of Filter and Wrapper methods, also known as Hybrid Feature Selection: first, use Filter methods to eliminate redundant features and then Wrapper methods to select a useful subset. However, it all comes down to the data; this will determine your approach.
