Next-Level Feature Selection Methods with Code Examples

Haitian Wei
3 min read · Oct 17, 2019


Introduction

In my previous article, I briefly talked about common feature selection methods. In this article, I will go into more detail on feature selection.

Permutation Importance

The core idea of permutation importance can be summed up as the answer to this question: if you randomly shuffle a single column of the validation data, leaving the target and all other columns in place, how would that affect the accuracy of predictions on that now-shuffled data?

As this kernel pointed out, the process is as follows:

1. Get a trained model.

2. Shuffle the values in a single column and make predictions using the resulting dataset. Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.

3. Return the data to the original order (undoing the shuffle from step 2). Repeat step 2 with the next column in the dataset, until you have calculated the importance of each column.

One big advantage of permutation importance is that, unlike forward or backward feature selection, it does not require retraining the model, so it is a lot faster.

Another reason we should pay attention to permutation importance is that the default feature importance is not always reliable, as this article pointed out:

The most common mechanism to compute feature importances, and the one used in scikit-learn’s RandomForestClassifier and RandomForestRegressor, is the mean decrease in impurity (or gini importance) mechanism.

The mean decrease in impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within RFs. The problem is that this mechanism, while fast, does not always give an accurate picture of importance.
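For comparison, here is a minimal sketch of reading that default impurity-based importance from scikit-learn. The DataFrame X, the target y, and the random forest settings are hypothetical placeholders, not something from the original article:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on hypothetical training data X, y
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X, y)

# feature_importances_ holds the mean decrease in impurity (gini importance)
default_importance = pd.Series(rf.feature_importances_, index=X.columns)
print(default_importance.sort_values(ascending=False))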

We can use the PermutationImportance class from the eli5 package to calculate permutation importance.

import eli5
from eli5.sklearn import PermutationImportance

# Compute permutation importance on the held-out validation set
perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)

# Display the ranked weights (renders nicely in a Jupyter notebook)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())

But when the data contain NA values, we need to write our own function, as sketched below.
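As one possible starting point, here is a minimal sketch of such a function. The function name, the AUC metric, and the assumption that the model (for example LightGBM) can score rows containing NaN are my own choices, not the author's:

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def permutation_importance_na(model, X_val, y_val, metric=roc_auc_score, random_state=1):
    # Score the untouched validation data first
    rng = np.random.RandomState(random_state)
    baseline = metric(y_val, model.predict_proba(X_val)[:, 1])
    importances = {}
    for col in X_val.columns:
        saved = X_val[col].copy()
        # Shuffle one column; NaN values are simply shuffled along with the rest
        X_val[col] = rng.permutation(X_val[col].values)
        importances[col] = baseline - metric(y_val, model.predict_proba(X_val)[:, 1])
        X_val[col] = saved  # restore the original order before moving on
    return pd.Series(importances).sort_values(ascending=False)

Calling permutation_importance_na(my_model, val_X, val_y) on the same validation set should give a ranking comparable to the eli5 output above, provided the model tolerates missing values.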

Time Consistency

This is a trick I learned from Chris Deotte. Here is how he describes it:

One interesting trick called “time consistency” is to train a single model using a single feature (or small group of features) on the first month of train dataset and predict for the last month of train dataset. This evaluates whether a feature by itself is consistent over time.

In fact, a feature's time consistency is quite important, because many problems suffer from the so-called 'concept drift' challenge.
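Here is a minimal sketch of this check, assuming a train DataFrame with hypothetical 'month' and 'target' columns and a candidate column name stored in feature:

import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Split the training data by time: earliest month vs. latest month
first = train[train['month'] == train['month'].min()]
last = train[train['month'] == train['month'].max()]

# Train on the single candidate feature and validate on the last month
model = lgb.LGBMClassifier(n_estimators=200, random_state=1)
model.fit(first[[feature]], first['target'])

auc = roc_auc_score(last['target'], model.predict_proba(last[[feature]])[:, 1])
print(f'{feature}: last-month AUC = {auc:.3f}')  # an AUC near 0.5 suggests the feature is not consistent over time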

Adversarial Validation

In many cases, the train and test data have very different distributions, which is again the 'concept drift' challenge. Here we want to identify the variables whose distributions shift severely and exclude them from our model.

A simple way to achieve this is to train a probabilistic classifier to distinguish train and test examples, then find out which variables contribute most to that model.

If a variable predicts the train/test split, its distribution differs between train and test, and that may lead to errors when predicting on the test set. We should be very careful when using such features.
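Here is a minimal sketch of adversarial validation, assuming train and test DataFrames that share the same feature columns (the 'target' column name is a hypothetical placeholder):

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# Label train rows 0 and test rows 1, then try to tell them apart
features = [c for c in train.columns if c != 'target']
adv_X = pd.concat([train[features], test[features]], ignore_index=True)
adv_y = np.concatenate([np.zeros(len(train)), np.ones(len(test))])

clf = lgb.LGBMClassifier(n_estimators=200, random_state=1)
pred = cross_val_predict(clf, adv_X, adv_y, cv=5, method='predict_proba')[:, 1]
print('adversarial AUC:', roc_auc_score(adv_y, pred))  # close to 0.5 means train and test look alike

# Fit once on all rows to see which features drive the separation
clf.fit(adv_X, adv_y)
shift = pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)
print(shift.head(10))  # the strongest predictors of the split are candidates for removal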

A good practice of adversarial validation feature selection can be found here.
