Feature Selection Using Wrapper Methods in R

Kelly Szutu · Published in Analytics Vidhya · Apr 13, 2020

What is Feature Selection? A process to filter out irrelevant or redundant features from the dataset. By doing this, we can reduce the complexity of a model, make it easier to interpret, and also improve its accuracy if the right subset is chosen.

In this post, I focus on demonstrating feature selection with wrapper methods in R. Here, I use the “Discover Card Satisfaction Study” data as an example.

cardData = read.csv("Discover_step.csv", header=TRUE)  # load the survey data
dim(cardData)
head(cardData)

In this 244 × 15 data set, the second column, “q4”, is our dependent variable, which indicates overall satisfaction; the remaining columns are questions picked from the survey. (You can find the chosen survey questions below.)

Then we check for collinearity. Collinearity arises when multiple independent variables in a multiple regression are highly correlated with one another. This makes the coefficient estimates unstable and difficult to interpret.

library(psych)
IVs <- as.matrix(cardData[,3:15])  # independent variables only
corr.test(IVs)                     # correlation matrix of the predictors
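The printed matrix is large, so as a hypothetical convenience (not part of the original post), the strongly correlated pairs can also be pulled out programmatically from the matrix returned by corr.test():

r <- corr.test(IVs)$r                            # correlation matrix
high <- which(abs(r) > 0.6 & upper.tri(r), arr.ind=TRUE)
data.frame(var1 = rownames(r)[high[,1]],         # list each pair with |r| > 0.6
           var2 = colnames(r)[high[,2]],
           corr = round(r[high], 2))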

Several pairs of variables have correlations larger than 0.6, meaning they are strongly correlated. We’re still not sure whether this really influences the result, so let’s run the linear regression to check.

model <- lm(q4~., data=cardData[,-1])  # remove 1st column "id"
summary(model)
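As an optional cross-check (not part of the original post), the variance inflation factors quantify how much collinearity inflates each coefficient’s variance; this sketch assumes the car package is installed:

library(car)
vif(model)  # values well above 5-10 are a common warning sign of collinearity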

Indeed, the regression summary shows that only a few independent variables are significant (p < 0.05). This suggests that collinearity and overfitting are present in this model. Hence, we perform variable selection to pick out the key factors. There are three approaches we can use:

Forward Selection

Begin with no independent variables in the equation and successively add one at a time until no remaining variable makes a significant contribution.

library(MASS)
step_for <- stepAIC(model, direction="forward")
summary(step_for)

Forward selection does not seem good enough for this case: the result is no different from the original model, and the existing issue is still there. This is because stepAIC() was started from the full model, so a forward-only search has nothing left to add; a sketch of a forward search started from an intercept-only model follows below.
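A minimal sketch of such a search, assuming the same cardData and model objects from above (not part of the original post):

null_model <- lm(q4 ~ 1, data=cardData[,-1])          # intercept-only model
step_null_for <- stepAIC(null_model, direction="forward",
                         scope=list(lower=~1, upper=formula(model)))
summary(step_null_for)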

Backward Selection

Start with all potential independent variables in the model and delete the least significant one at each iteration, until removing any further variable would do more harm than good.

step_back <- stepAIC(model, direction="backward")
summary(step_back)

After the selection, we are left with four crucial questions: “q5f”, “q5g”, “q5h” and “q5m”. Although the R-squared is lower, this model performs better on BIC (BIC = 667.36) and adjusted R-squared (adjusted R-squared = 0.2).
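For reference, these figures can be read directly off the fitted object (a small sketch; the exact numbers depend on the data):

BIC(step_back)                     # Bayesian information criterion of the reduced model
summary(step_back)$adj.r.squared   # adjusted R-squared of the reduced model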

Stepwise Selection

Much like forward selection, except that it also considers possible deletions along the way: variables already in the model that turn insignificant can be dropped and replaced by others.

step_both <- stepAIC(model, direction="both")
summary(step_both)

Stepwise regression is a greedy algorithm that adds the best feature or deletes the worst feature at each round. Going back and forth like this usually helps us select a more suitable set of variables, which is why it has become the most popular form of feature selection in traditional regression analysis.
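If we want to compare the three searches side by side, AIC() and BIC() accept several fitted models at once (a sketch, assuming the step_for, step_back and step_both objects fitted above):

AIC(step_for, step_back, step_both)   # lower is better
BIC(step_for, step_back, step_both)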

In the next few posts, I will show other helpful methods: factor analysis and principal component analysis for feature extraction.

Chosen Survey Questions

About me

Hey, I’m Kelly, a business analytics graduate student with a journalism and communication background who likes to share the life of exploring data and interesting findings. If you have any questions, feel free to contact me at kelly.szutu@gmail.com.
