Feature Selection Using Statistical Testing

Vadim Uvarov
Feb 28, 2018 · 6 min read

Introduction

Feature selection is one of the most common yet challenging parts of machine learning system design. Often one acquires or engineers a brand new shiny feature which simply has to improve the model. But how do you find out whether it really makes a difference, especially when the feature takes time to compute or requires expensive third-party data?

Naive Approach

One way is to train the model, measure its performance on the validation set, then plug in the new feature, repeat the process and compare the two scores. If the second score is higher, the feature works! However, such an approach depends heavily on the training / validation split: just by moving objects from the training sample to the validation sample or vice versa we can obtain a completely different result.

Example of training / validation scheme
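
To see how fragile a single split is, here is a small illustration. The use of scikit-learn, the toy dataset and the logistic regression model are my own assumptions for the sketch, not part of the original post:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix and target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The same model scored on two different random splits can give
# noticeably different numbers, which is exactly the weakness described above.
for seed in (1, 2):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"split {seed}: accuracy {accuracy_score(y_val, model.predict(X_val)):.3f}")
```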

Another solution is to use a cross-validation scheme with k folds, or even a leave-one-out scheme. In that case each of the k parts of the data takes its turn as the validation sample on which the score is computed, while the model is trained on the other k-1 parts. Here we obtain k scores (up to the number of observations in the case of the leave-one-out scheme), and by averaging them one gets quite a stable estimate of the true score. This scheme is much better, although the estimate still depends on how the folds are selected.

Example of 3-fold cross-validation scheme
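
For illustration, a k-fold estimate might look like this in scikit-learn (toy data and model again, with k = 3 as in the figure):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the k = 3 folds takes its turn as the validation sample;
# the model is trained on the remaining folds and the 3 scores are averaged.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")
print(scores, scores.mean())
```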

The Proposed Approach

Multiple cross-validations

Why should one limit oneself to cross-validation with fixed k folds if one can go deeper? It is perfectly possible to perform several different k-fold cross-validations (say, m of them) for the same model. It goes without saying that the folds should be different each time. We thus obtain k × m scores, and their average provides a much better estimate of the real score.

Example of multiple cross-validation scheme applied to the same model
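
As a sketch, scikit-learn's RepeatedKFold can generate exactly such a scheme; the data, model and fold counts below are placeholders rather than anything from the post:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# m = 10 repetitions of k = 3-fold cross-validation with different folds
# each time, giving k * m = 30 scores for the same model.
cv = RepeatedKFold(n_splits=3, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")
print(len(scores), scores.mean())
```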

Statistical testing

If the base model yields, say, an 80% accuracy score and the model with the new feature yields, say, 81%, can we conclude that the feature makes a difference? Actually, we can't answer that without knowing the variance of these scores. For example, if the scores of the first and second models vary by ±10% around their averages, then it is likely that the two models perform identically. However, if the scores vary by only ±0.1% around the average, then the second model is clearly better.

In such moments of doubt statistics comes to the rescue. Remember that we have k × m scores for each of the models. (Note: it is important that the folds on which the scores are computed are identical for the competing models!) One can then subtract the corresponding scores from each other, obtaining k × m differences, and compute their average X and standard deviation S. The final measure of difference (the statistic t) between the two arrays of scores is the average difference X divided by its standard error: t = X / (S / √n), where n is the number of differences.

Statistic for the dependent Student's t-test for paired samples. Here a and b are the dependent samples (the multiple cross-validation scores of the two models), k is the number of folds, m is the number of cross-validations, X is the average difference of scores, S is the standard deviation of the score differences, and n is the number of scores per model.

What we have just described is the dependent Student's t-test for paired samples. Our null hypothesis is that there is no difference between the two samples of scores. If the null hypothesis is true, the statistic t must come from Student's t-distribution with n-1 degrees of freedom (and will probably be close to zero). However, if the value of t is too extreme (very large positive or negative), we may conclude that the two samples of scores differ significantly.

Statisticians usually put it this way: let's accept that we erroneously reject the null hypothesis in 5% of cases (significance level α = 0.05). They then calculate the p-value (the probability of obtaining a statistic at least this extreme under the null hypothesis), and if the p-value is less than α, the difference between the samples is considered significant, i.e. one of the models performs significantly better. Student's t-test is built into many statistical packages; in Python it is implemented in the scipy.stats.ttest_rel function.
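
Below is a minimal sketch of the whole comparison. It assumes scikit-learn for the repeated cross-validation and synthetic data in place of a real new feature; only scipy.stats.ttest_rel comes from the post itself, everything else (model, data, fold counts) is illustrative:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Toy data: pretend the last column is the new feature we want to test.
X_new, y = make_classification(n_samples=500, n_features=11, random_state=0)
X_base = X_new[:, :-1]

# Fixing random_state keeps the folds identical for both competing models,
# which the paired test requires.
cv = RepeatedKFold(n_splits=3, n_repeats=10, random_state=0)
model = LogisticRegression(max_iter=1000)

scores_base = cross_val_score(model, X_base, y, cv=cv, scoring="accuracy")
scores_new = cross_val_score(model, X_new, y, cv=cv, scoring="accuracy")

# Manual paired statistic: average difference divided by its standard error.
diff = scores_new - scores_base
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

# The same test via scipy.
t_stat, p_value = ttest_rel(scores_new, scores_base)
print(t_manual, t_stat, p_value)
```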

Mind the outliers

The bad news is that the tested feature may often turn out to be a good predictor for an outlier object (an outlier is an object whose feature values, or their relationship with the target variable, differ significantly from the rest of the sample) and absolutely useless for the rest of the data. So generally one would like to prevent outliers from influencing our judgement. Any sensible procedure for detecting outliers can be used for that purpose; however, I would like to propose a simple universal method. Let's say one has reason to believe that roughly 1% of the data are outliers. First, fit the model without the new feature on the training data and measure the prediction errors. Then remove the 1% of objects with the highest errors from the sample. Finally, perform the comparison using the same multiple cross-validation and statistical testing scheme, but without the outliers.

Do keep in mind, though, that we remove outliers only for the comparison procedure. Removing them from the training sample when fitting the final model should be done with great care (or not at all), and removing them from the test data would potentially lead to an overly optimistic estimate of model performance.
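
Here is a possible sketch of the outlier-trimming step, with an illustrative Ridge model, toy regression data and a 1% threshold, none of which come from the post:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Toy regression data standing in for the real training sample.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Fit the base model (without the new feature) and measure per-object errors.
base_model = Ridge().fit(X, y)
errors = np.abs(y - base_model.predict(X))

# Drop the ~1% of objects with the largest errors, then run the same
# multiple cross-validation + t-test comparison on the trimmed sample.
keep = errors < np.quantile(errors, 0.99)
X_trimmed, y_trimmed = X[keep], y[keep]
print(len(y), len(y_trimmed))
```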

Important notes

The advantage of the proposed method is that you can apply it to any machine learning model and choose any appropriate scoring metric. It is also possible to test the addition of several features to the model at once.

Nevertheless, one shouldn't forget that statistical significance doesn't imply practical significance. Statistical significance merely indicates that there IS some stable difference between the two samples, no matter how small. That's why one should always look at the increase in the average score and ignore features that don't provide the required improvement.

Another potential caveat is the multiple testing problem. In short, the probability of erroneously declaring a feature significant increases with the number of features checked. There is a nice XKCD comic strip that illustrates the problem. My approach is simply to be aware of this issue, to perform fewer tests with features that make sense, and to look for a practically significant increase in scores. However, if it is critical for you not to choose even a single unimportant feature, or if you run a lot of tests at once, I recommend using a controlling procedure such as the Bonferroni correction or the Holm–Bonferroni method.
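
If you do need such a correction, one option is the multipletests helper from statsmodels; the p-values below are placeholders for the results of several paired t-tests:

```python
from statsmodels.stats.multitest import multipletests

# p-values from testing several candidate features (placeholder numbers).
p_values = [0.003, 0.04, 0.20, 0.01]

# Holm-Bonferroni keeps the family-wise error rate at alpha = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(reject, p_adjusted)
```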

In addition, it should be noted that such a procedure with many cross-validations is quite time-consuming, so I wouldn't recommend it for resource-intensive learning algorithms like neural networks.

Other Applications

Strictly speaking, the feature set is just one of the hyperparameters of your machine learning system that can be tuned. I consider the following hierarchy of hyperparameters:
1. the machine learning algorithm (e.g. Ridge regression, Random forest, etc.);
2. the feature set (the features used to train the algorithm);
3. the hyperparameters of the learning algorithm (e.g. the regularization parameter in Ridge regression or the number of trees in Random forest).

Hyperparameter hierarchy

The proposed approach can be applied to each of these levels.

In my own pipeline the learning algorithm is fixed, the feature set is chosen with the proposed approach, and the hyperparameters are tuned using standard cross-validation or a leave-one-out scheme.

References

https://habrahabr.ru/company/ods/blog/336168/
I borrowed the main idea from this post, where the author used a similar approach to select the optimal hyperparameters for an XGBoost model.

Thanks for reading!

I would really appreciate feedback or any suggestions/discussion. You can reach me via mail uvarov.vadim42@gmail.com or Telegram @v_uvarov.
Let’s connect on LinkedIn https://www.linkedin.com/in/vadim-uvarov-6478b697/


Vadim Uvarov

Data scientist at Tele2 Russia, PhD student in machine learning at NSTU