A Deeper Look into Feature Selection

Anya Pfeiffer

by Oliver Knocklein, Anya Pfeiffer, & Saiteja Suvarna

By the end of this article you should know what feature selection is and why it is important. You should also:

  1. Know what stepwise feature selection is, and how it relates to forwards and backwards feature selection.
  2. Know under what circumstances interaction terms should be included in the model.
  3. Understand, and be able to differentiate between, Ridge and Lasso regression.
  4. Understand what ANOVA testing is used for, when to use each of the two types of ANOVA tests, and how to test for feature impact using Python and/or R code.

What is Feature Selection?

Most data science and machine learning questions of interest are complex, and usually require large datasets with many features to adequately explain the underlying process. However, just because data and variables are available does not mean that they should be used. The goal of feature selection is to build the best possible model without redundant or irrelevant variables. Extraneous variables add to model complexity, make inference harder, and can distort the true relationship of interest. A reduction of dimensionality is thus desirable if it can be achieved without sacrificing model fit.

Stepwise Regression

Stepwise selection is a method of searching for the optimal regression model. The quality of a regression model can be evaluated through a variety of fit metrics, the most common being the adjusted R squared, the various information criteria (AIC, BIC, AICc), the F test of overall significance, and the t tests for individual predictors. The three main methods of selecting the features to include in a regression model are all variations of greedy algorithms: forwards selection, backwards selection, and stepwise selection.

In backwards selection, the initial model contains all available predictors. After the model is fitted, the predictors are evaluated for statistical significance. If there are statistically insignificant predictors (predictors whose p values are greater than a predetermined level of alpha), then the least significant predictor is removed from the model, and the model is refitted on the data and evaluated again. This continues until all remaining predictors are statistically significant.

Forwards selection applies the same method as backwards selection, just in reverse order. Here the model starts with just the intercept term, and one then cycles through the possible predictors, adding the most significant one to the model. This continues until no more significant predictors can be added.

Stepwise selection is a combination of forwards and backwards selection, with variables potentially being added or dropped at each iteration. It is therefore the most computationally expensive of the three methods, and is usually an automated process. Stepwise selection has been the topic of much criticism recently, with critics pointing out biased R squared values, artificially reduced standard errors of variables, and altered distributions of the F-statistics. Some therefore caution that such feature selection is best avoided, and that methods such as Lasso or Ridge regression should be used instead. For a small number of features, one could fit all possible combinations of features and rank the resulting models, but with many features such a method is computationally too expensive.
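To make the forwards procedure concrete, here is a minimal sketch in Python using only NumPy. It is not the R routine used for the results in this article: it scores candidate models by adjusted R squared (one of the fit metrics listed above) rather than by p values, and the data and the names age, income, and noise are synthetic stand-ins chosen for illustration.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

def forward_select(X, y, names):
    """Greedy forwards selection: keep adding the predictor that helps most."""
    selected, remaining = [], list(range(X.shape[1]))
    best = 0.0  # the intercept-only model has an adjusted R squared of 0
    while remaining:
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best:  # no candidate improves the fit, so stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected], best

# Synthetic data (an assumption, not the mall dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 1] + 1.5 * X[:, 0] + rng.normal(size=200)
names = ["age", "income", "noise"]

chosen, score = forward_select(X, y, names)
print(chosen, round(score, 3))
```

Backwards selection would follow the same pattern, starting from all predictors and dropping the one whose removal improves the score most.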

To demonstrate the differences in model performance based on feature selection, a dataset on shopping behavior in a mall was downloaded. It includes several variables which could be used to explain spending habits. Since Python does not support automated stepwise regression, and since there is currently an issue with statsmodels.api which makes the creation of a manual evaluation function prohibitively complex, the selection and regression were run using established and verified functions in R. Below one can see the predictors included in the final model suggested by each of the above feature selection techniques, along with the respective fit metrics.

One can see that different selection techniques can suggest different optimal models. The fact that several methods provide the same suggestion should not be a surprise, as there were only three potential independent variables.

Interaction Terms

Interaction terms are needed when the effect that one independent variable has on the dependent variable depends on the value of a different independent variable. In the above dataset, an interaction term between gender and annual income would be justified if the slope of the relationship between spending and income were different for males than for females. Below is a picture of a case where an interaction term would be justified.

Image from: https://newonlinecourses.science.psu.edu/stat501/node/307/

Here one can see that the relationship between y and age varies depending on the value of TRT. This can easily be seen by visualizing slopes between points of the same shape and colour. The slope for the green points is steeper than the slope for the blue points, for example.

In the dataset above, one might be interested in whether or not an interaction term between income and gender would be justified. The plot below, however, does not suggest a meaningful difference in slopes by gender, which suggests that such an interaction term is not statistically significant. This is confirmed by including an interaction term between income and gender in the model suggested by stepwise regression, and observing that the interaction term has a p value of 0.373, which is insignificant at any reasonable alpha value.

If interaction terms do appear to be significant, they should be included in the model. If interaction terms are included in the model, one must consider whether or not to include the main effects for the terms as well. Leaving them out may or may not be prudent: excluding them is akin to saying that the components of the interaction term do not have any bearing on the dependent variable on their own.
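As a rough illustration of what an interaction term does, the sketch below fits a regression with both main effects and their product using NumPy. The data are simulated, not the mall dataset, and the variable names and coefficient values are assumptions chosen so that the income slope really does differ by gender.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
income = rng.uniform(15, 140, n)             # synthetic stand-in for annual income
male = rng.integers(0, 2, n).astype(float)   # 1 = male, 0 = female

# Simulated outcome: the income slope is 0.2 for females and 0.5 for males
spending = 10 + 0.2 * income + 5 * male + 0.3 * income * male + rng.normal(0, 3, n)

# Design matrix: intercept, both main effects, and their product
X = np.column_stack([np.ones(n), income, male, income * male])
beta, *_ = np.linalg.lstsq(X, spending, rcond=None)

slope_female = beta[1]            # income slope when male == 0
slope_male = beta[1] + beta[3]    # interaction coefficient shifts the slope
print(round(slope_female, 3), round(slope_male, 3))
```

The coefficient on the product column is the estimated difference in slopes between the two groups. Keeping the income and male columns in the design matrix is what it means to retain the main effects.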

Ridge & Lasso Regression

Both ridge and lasso regression attempt to prevent the over-fitting that a linear regression might create. These methods add a “penalty” to the cost function in order to reduce or eliminate the effects of features that aren’t as important to the model. The penalty grows with the size of the coefficients, so a feature only keeps a large coefficient if it reduces the error enough to justify it, and less influential features end up with coefficients at or near zero.

Below are predicted values for the spending score of a user based on a linear regression, which can be compared against the values generated for ridge and lasso regression.

Ridge regression shrinks the values of the coefficients associated with the different features. Because the penalty is built from the squared coefficients, large coefficients are penalized most heavily, and the coefficients of features that contribute little to the fit are pulled close to zero. This can be thought of as making the model less sensitive to the training set, which helps prevent overfitting. While this might produce a model that does not fit the training set as well (higher bias), it is more likely to generalize to new data, since the model has lower variance. In the equation below, the last term represents the penalty, where the square of each coefficient is taken into account. In the visualization, the slope of the ridge regression line is closer to zero than that of the linear regression line. This reduces variance, but again comes with a higher bias than the initial linear regression model.

Equation from: https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b
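For reference, a standard way of writing the ridge cost function is given here. The notation (n observations, p features, coefficients \beta_j, penalty weight \lambda) is an assumption and may differ from that of the cited source.

```latex
J_{\text{ridge}}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
```

The first term is the usual sum of squared errors, and the last term is the penalty on the squared coefficients. The intercept \beta_0 is not penalized.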

Lasso, or Least Absolute Shrinkage and Selection Operator, also reduces the coefficients of features that are less important. Unlike ridge regression, however, lasso regression allows the coefficients to be reduced exactly to zero. This can be particularly useful in scenarios where there are a large number of features to be pared down, as fewer features will be included in the model (ridge regression will still include these features and only reduces their effects, while lasso regression disregards them completely). In the formula below, the final term of the equation represents the penalty, which takes into account the absolute value of each coefficient. In the visualization, the lasso regression model has a slope that is closer still to zero than the ridge regression model. Again, the variance in this model should be the lowest of the three, but the bias is higher. Below the visualization is the code snippet used to run and display this regression.

Equation from: https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b
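For reference, a standard way of writing the lasso cost function is given here, using the same assumed notation as for ridge regression (n observations, p features, coefficients \beta_j, penalty weight \lambda). It may differ in notation from the cited source.

```latex
J_{\text{lasso}}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The only change from ridge regression is the final term, which penalizes the absolute value of each coefficient instead of its square.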
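import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

#lasso regression
#create the model (alpha is sklearn's name for lambda; the default of 1.0 is assumed here)
lasso = Lasso()
#fit the model based on the training set
lasso.fit(x_train, y_train)
#predict y values based on lasso regression
lasso_y_predict = lasso.predict(x_test)
#display results
plt.scatter(x_test['Age'], y_test)
plt.scatter(x_test['Age'], lasso_y_predict)
plt.title('Lasso Regression')
plt.xlabel('age')
plt.ylabel('spending score')
plt.show()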
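To show why lasso can set coefficients exactly to zero, here is a small from-scratch sketch using coordinate descent with soft-thresholding. It is an illustration under stated assumptions, not what scikit-learn runs internally: it minimizes the sum of squared errors plus lambda times the sum of absolute coefficients, whereas scikit-learn's Lasso scales the squared-error term by 1/(2n), so its alpha is not numerically the same as this lambda. The data are synthetic.

```python
import numpy as np

def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning exactly 0 if |value| <= lam."""
    return np.sign(value) * max(abs(value) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for: sum of squared errors + lam * sum(|coef|)."""
    Xc = X - X.mean(axis=0)   # centering handles the unpenalized intercept
    yc = y - y.mean()
    coef = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # residual with feature j's current contribution added back
            partial_resid = yc - Xc @ coef + Xc[:, j] * coef[j]
            rho = Xc[:, j] @ partial_resid
            z = Xc[:, j] @ Xc[:, j]
            coef[j] = soft_threshold(rho, lam / 2.0) / z
    return coef

# Synthetic data: one strong feature, one weak feature, one irrelevant feature
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, 0.5, 0.0]) + rng.normal(size=100)

coef = lasso_cd(X, y, lam=300.0)
print(np.round(coef, 3))
```

With a penalty this large, the weak and irrelevant features are dropped entirely while the strong feature is kept with a shrunken coefficient.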

Both of these methods make use of a lambda value. As lambda increases, the bias of the model increases, but the variance decreases. The optimal value of lambda essentially produces as much bias as the user is willing to allow in order to decrease the variance.
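The effect of lambda can be seen directly with ridge regression, which has a closed-form solution. The sketch below uses NumPy and synthetic data (both are assumptions for illustration) and prints the coefficients for a few lambda values. Note that scikit-learn calls this parameter alpha.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, 0.5, 0.0]) + rng.normal(size=100)

norms = []
for lam in [0.0, 10.0, 100.0, 1000.0]:
    coef = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(coef))   # overall size of the coefficient vector
    print(lam, np.round(coef, 3))
```

As lambda grows, the coefficients shrink toward zero without reaching it. At lambda = 0 the result is the ordinary least squares fit.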

Variance Threshold

Using a variance threshold allows for the removal of features whose values vary very little across samples. The idea behind removing these features is that measures with low variance are likely to have low predictive power. To apply a variance threshold, the variance of each feature is calculated, and any feature that falls below the selected threshold or cutoff is dropped. Using VarianceThreshold() from sklearn.feature_selection, users can set the threshold to a value of their choice. If not specified, the default threshold is zero, which drops only features with zero variance (features that have the same value for all samples). Below is code showing a variance threshold used on this dataset. In the array, only four columns are included, showing that the “Gender” column was dropped, as it had very low variance.

from sklearn.feature_selection import VarianceThreshold
#perform variance threshold on the entire dataset
threshold_obj = VarianceThreshold(threshold=0.5)
x = threshold_obj.fit_transform(df)
#display the first five rows of data
print(x[:5])
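The same idea can be sketched by hand with NumPy, which makes it clear why a binary column such as gender is dropped at a threshold of 0.5: a 0/1 variable can never have a variance above 0.25. The helper variance_threshold and the synthetic columns below are hypothetical and for illustration only.

```python
import numpy as np

def variance_threshold(X, threshold=0.0):
    """Keep only the columns whose variance is greater than the threshold."""
    variances = X.var(axis=0)      # population variance of each column
    keep = variances > threshold
    return X[:, keep], keep

# Synthetic stand-ins for three of the columns (not the real dataset)
rng = np.random.default_rng(3)
gender = rng.integers(0, 2, 200).astype(float)   # encoded as 0/1
age = rng.uniform(18, 70, 200)
income = rng.uniform(15, 140, 200)
X = np.column_stack([gender, age, income])

reduced, kept = variance_threshold(X, threshold=0.5)
print(kept)            # boolean mask of the columns that survive
print(reduced.shape)
```

Because variance depends on the scale of a feature, a single cutoff treats features measured in different units very differently, which is worth keeping in mind when choosing the threshold.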

ANOVA Testing

Introduction to ANOVA

ANOVA, or Analysis of Variance, is a statistical technique used to see whether the means of two or more samples are significantly different from one another. The test can also be used to see the impact of one or more independent variables on the dependent variable being measured. In this article, we use the ANOVA test to see whether Age and Annual Income have an effect on the Spending Score of individuals at a mall. This is done by comparing the means of the two samples created from the Kaggle dataset. Alternatively, a t-test could be used for the same purpose. However, a t-test can only compare the means of two samples at a time. If multiple t-tests are conducted to compare more than two samples, the error rate of each test compounds, which can lead to skewed results and faulty conclusions.

Important Background for ANOVA Testing

The important terms that one needs to know before conducting ANOVA tests are as follows: Grand Mean, Hypothesis, Between Group Variability, Within Group Variability, and F-Statistic.

Grand Mean

There are two kinds of means that are used in the calculations of ANOVA. The first is the simple mean of each sample obtained from a dataset or set of observations, and the second is the Grand Mean. The Grand Mean is the mean of all the observations pooled together, which equals the average of the individual sample means when the samples are the same size. The grand mean is denoted as follows:
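One common way to write the grand mean is shown here. The notation is an assumption: there are k samples, sample i contains n_i observations x_{ij} with sample mean \bar{x}_i, and N is the total number of observations.

```latex
\bar{x}_G = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} = \frac{1}{N} \sum_{i=1}^{k} n_i \, \bar{x}_i , \qquad N = \sum_{i=1}^{k} n_i
```

When every sample has the same size, this reduces to the plain average of the sample means.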


Hypothesis

Depending on which variable we are testing, the mall situation described above has two possible pairs of hypotheses: either age will have an impact on the spending score or it will not, and either annual income will have an impact on the spending score or it will not. These statements are known as hypotheses, but they are written in a different form in the language of statistics. As with every other hypothesis test in statistics, there is a Null hypothesis and an Alternative hypothesis. The Null hypothesis asserts that all the sample means are the same. In other words, the null hypothesis states that the independent variable has no effect on the outcomes being measured. The alternative hypothesis, on the other hand, asserts that at least one of the sample means is different from the others. In other words, the alternative hypothesis asserts that the independent variable does have a significant impact on the outcomes being measured. The null and alternative hypotheses are written formally below:
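In symbols, with \mu_i denoting the population mean of sample i and k the number of samples (notation assumed here for illustration), the two hypotheses are usually written as:

```latex
H_0 : \mu_1 = \mu_2 = \cdots = \mu_k
\qquad
H_1 : \mu_i \neq \mu_j \ \text{for at least one pair } (i, j)
```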

Between Group Variability

Consider the distribution graphs of the Sample 1 and Sample 2 Spending Scores below. These two distribution plots overlap significantly and their individual means don’t differ greatly, so the differences between their means and the grand mean will also be very small.

However, in the distribution plot shown below, the distributions of the samples differ by a large margin and their individual means also differ significantly. Therefore, the differences between the individual sample means and the grand mean are greater than in the distribution plot shown above.

This variability between the distributions shown above is known as Between Group Variability, and it refers to the variability between two or more separate samples of data in a set of observations. Given the sample means and the grand mean, here is how the calculation for between group variability is done:

The mean sum of squares between groups is simply the calculation shown above divided by k-1, where k is the number of sample means.
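Written out, a standard form of the between group calculation is shown here. The notation is an assumption: k samples, n_i observations in sample i, sample mean \bar{x}_i, and grand mean \bar{x}_G.

```latex
SS_{\text{between}} = \sum_{i=1}^{k} n_i \left( \bar{x}_i - \bar{x}_G \right)^2 ,
\qquad
MS_{\text{between}} = \frac{SS_{\text{between}}}{k - 1}
```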

Within Group Variability

Within group variability refers to the variation caused by the differences between the data points within each sample. Each sample is looked at individually, and the variation between each data point and the mean of its own sample is measured. How the samples relate to one another is not considered. The sum of squares for within group variability is calculated below:

The mean sum of squares within groups is simply the calculation shown above divided by N-k, where k is the number of sample means and N is the total number of values across all the samples taken.
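Written out, a standard form of the within group calculation is shown here. The notation is an assumption: x_{ij} is observation j in sample i, \bar{x}_i is the mean of sample i, n_i is the size of sample i, k is the number of samples, and N denotes the total number of observations across all samples.

```latex
SS_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_i \right)^2 ,
\qquad
MS_{\text{within}} = \frac{SS_{\text{within}}}{N - k}
```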


F-Statistic

The F-Statistic or F-Ratio is the measure by which one concludes whether or not the means of different samples are significantly different. The lower the F-statistic, the more similar the sample means are, in which case we cannot reject the null hypothesis. If the F-statistic is higher, the sample means are less similar, and we can reject the null hypothesis in favor of the alternative hypothesis. The F-statistic is calculated below:

F = Mean sum of squares between groups / Mean sum of squares within groups

When calculating ANOVA, an F-Critical Value is typically calculated for you by the software. Based on this F-Critical value, we can decide whether or not to reject the null hypothesis. If the F-Statistic is higher than the F-Critical value, then we reject the null hypothesis; if the F-Statistic is lower than the F-Critical value, then we fail to reject the null hypothesis.
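The pieces described in this section can be put together by hand. The sketch below uses NumPy and two tiny groups of made-up spending scores (not values from the dataset), and the function name one_way_anova_f is hypothetical. The lookup of the F-Critical value is not shown.

```python
import numpy as np

def one_way_anova_f(*groups):
    """Compute the one-way ANOVA F-statistic from raw group values."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                              # number of samples
    n_total = sum(len(g) for g in groups)        # total number of observations
    grand_mean = np.concatenate(groups).mean()

    # between group variability: sample means versus the grand mean
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # within group variability: data points versus their own sample mean
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return msb / msw

younger = [60.0, 70.0, 80.0]   # made-up spending scores
older = [50.0, 60.0, 70.0]
f_stat = one_way_anova_f(younger, older)
print(f_stat)  # 1.5
```

Here the sample means are 70 and 60 and the grand mean is 65, so the between group sum of squares is 150 and the within group sum of squares is 400, giving F = (150 / 1) / (400 / 4) = 1.5.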

Main Type of ANOVA Test

The main type of ANOVA test being used is the One-way ANOVA test:

A one-way ANOVA tells us whether or not at least two groups are significantly different from each other. In the dataset that we chose, the One-way ANOVA test is used to test whether the two samples created (which contain Spending Scores grouped by Age) are significantly different. In other words, this test shows whether age plays a role in spending patterns and habits at malls. First, each statistic needed to come to a conclusion is calculated individually using pandas and numpy functions. Then, a built-in One-Way ANOVA test function is used to confirm the results.

As shown above, there is a relationship between age and spending score. However, the ANOVA test is needed to check whether this relationship is significant or not.

First, the describe() function is used to explore the data a bit more and obtain the mean of the dependent variable being measured (spending score).

Then, the sum of squares between groups, the mean sum of squares between groups, the sum of squares within groups, and the mean sum of squares within groups were calculated in order to finally calculate the F-Statistic. The F-Statistic turned out to be about 23.85453, which is very large and led me to reject the null hypothesis in favor of the alternate hypothesis. This means that age does play a significant role in spending patterns.

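from scipy import stats
#keep only the spending score column in each age group sample
sample18to35 = sample18to35.drop(['Gender', 'CustomerID', 'Age', 'AnnualIncome'], axis=1)
sample35to52 = sample35to52.drop(['Gender', 'CustomerID', 'Age', 'AnnualIncome'], axis=1)
#run the one-way ANOVA test on the two samples
OneWayAnovaTestSpendingCoeff = stats.f_oneway(sample18to35, sample35to52)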

Another way to conclude an ANOVA test is to compare the p-value with an alpha value. If the p-value is smaller than the declared alpha level, then the null hypothesis is rejected in favor of the alternative hypothesis; otherwise, we fail to reject the null hypothesis.

Although the One-way ANOVA test is a good test to use, it has some limitations. The main limitation is that although the test tells us that the factor impacts the dependent variable, it does not tell us which samples differ from one another.

Conclusion: The use of these methods in order to perform feature selection can be critical in constructing useful models. By reducing extraneous or unrelated features and variables, the best possible model can be created.

Key Terms: stepwise feature selection, backwards feature selection, forwards feature selection, interaction terms, lasso regression, ridge regression, lambda, variance threshold, ANOVA testing, grand mean, hypothesis, between group variability, within group variability, F-statistic
