Methods of Feature Selection

Abhishek Kumar
8 min read · Apr 4, 2019

Data rarely come in a simple form. Even after applying EDA methods, the data are in a form we can understand, but they are still complex, because we do not yet know which features matter most, which matter a little, and which do not matter at all. We therefore need to dig deeper and build a better understanding of the features. This is where the notion of Feature Selection comes in: selecting the features that really matter. Not all features contribute enough to the meaning of the data, and too many features will not give you a good model; a model built with all of the features will not be a reliable one.

There are five methods of building models:

1) All-In

2) Backward Elimination

3) Forward Selection

4) Bi-directional Elimination

5) Score Comparison

(2), (3) & (4) are also known as stepwise Regression as they are really a stepwise methods. Let’s discuss each one in brief.

1. All-In

This is not a technical term; it is just used for simplicity to say that we take all the features. Why? Either because of prior knowledge, or because you have been asked to do so. Prior knowledge comes from domain expertise: you are sure that all the features are required to build the model.

2. Backward Elimination

step i) Select a significance level (SL) to stay in the model.

step ii) Fit the model with all the (remaining) variables.

step iii) Check the p-values of the variables and identify those that are above SL.

step iv) Remove only one variable: the one with the highest p-value above SL.

step v) If the p-value of any remaining variable is still greater than SL, repeat steps (ii) to (iv); otherwise go to step (vi). (A sketch of this whole loop is given after the steps.)

step vi) Stop
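Before walking through the project’s code, here is a minimal sketch of this loop in Python. It is an illustration only: the function name, the DataFrame X of features, the target y and the 5% threshold are assumptions, not the project’s actual code.

```python
import statsmodels.api as sm


def backward_elimination(X, y, sl=0.05):
    """Iteratively drop the feature with the highest p-value until every
    remaining feature has a p-value at or below the significance level."""
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()  # refit on remaining features
        pvalues = model.pvalues.drop("const")              # ignore the intercept term
        worst = pvalues.idxmax()                           # feature with the highest p-value
        if pvalues[worst] > sl:
            cols.remove(worst)                             # step (iv): remove that one feature
        else:
            break                                          # step (vi): all p-values within SL
    return cols


# Hypothetical usage, assuming df is the project's dataframe and "demand" its target:
# selected = backward_elimination(df.drop(columns=["demand"]), df["demand"])
```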

To understand the above steps, we will take the example of the Taxi-demand-prediction-analysis project’s code.

Following are the names of the 19 features in this dataframe, from which we will remove the features whose p-values are above 5%.

Loading and checking the data variables

First of all, we need to convert this pandas dataframe into an array so that we can apply the statsmodels API to this dataset and get the p-values of all the variables:
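The original conversion and fitting code appears in the article as a screenshot. A sketch of what it might look like is below; the file path, the "demand" target column and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load the prepared taxi-demand dataframe (hypothetical path and target column).
df = pd.read_csv("taxi_demand_features.csv")
X = df.drop(columns=["demand"]).values   # the 19 feature columns as a NumPy array
y = df["demand"].values

# Prepend a column of ones so the model has an intercept, then fit OLS on all
# 19 features; the P>|t| column of the summary holds each variable's p-value.
X_with_intercept = np.append(arr=np.ones((X.shape[0], 1)), values=X, axis=1)
X_opt = X_with_intercept[:, list(range(20))]   # intercept + all 19 features
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_ols.summary())
```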

Notice that here we’ve applied statsmodels to all the rows and columns for the first time, with SL set to 5%. The code produced the following output:

Here, we found that the x16 feature has the highest p-value, 52%, so we need to remove it. We copy the previous code and execute it after deleting the index of the x16 variable. We need to be careful to delete the correct index, since we are deleting it from the original dataset and indices in Python start at 0. That’s why I created a dummy Excel sheet to track each variable’s index. We are not required to remove all the features whose p-values are above 5% at once, because removing the feature with the highest p-value itself affects the p-values of the other features.

First deletion

“Lat (latitude)” was the first variable to be deleted from this dataset. Now we will again apply the statsmodels code to check the p-values of the variables. Notice in the code below that index “x15” is missing, which corresponds to the removed x16 feature with the highest p-value.
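The refitting code is again a screenshot in the article; under the same assumptions as the sketch above, it might look like this, with the array index of the removed feature simply left out of the column list (the exact indices are illustrative):

```python
# Column 0 is the intercept, so the 19 features sit at array indices 1-19.
# Dropping the feature with the highest p-value means omitting its index here.
remaining = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19]
X_opt = X_with_intercept[:, remaining]
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_ols.summary())
```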

Now we get the following results for the p-values of all the other variables:

Just removing the variable with the highest p-value (52%) in the previous step decreased the p-values of the other variables significantly. But some variables still have p-values far above the desired 5%. So we apply the statsmodels code again and again, removing one variable each time, until the p-values of all the variables are below 5%.

Here is how we deleted the correct index of the x14 feature, which had the highest p-value:

2nd deletion

Now, coming to the bottom line: for this example of 19 variables, we needed to apply the statsmodels code 8 times to reach a dataset of 12 variables, all with p-values below 5%, as shown below:

To see the details of each statsmodels run, feel free to visit my code for this project; the link has already been shared above.

Below is a snapshot of the Excel sheet I used to track the correct index of the variable to be removed:

Before applying Backward Elimination to this dataset, I had applied Logistic Regression, Random Forest and XGBoost models to it. And I got really excited when I got the same results after applying those same models once Backward Elimination had been done.

Before applying feature selection: Model and their error metric value
After applying feature selection: Model and their error metric value

So the moral of the story is that removing the variables which are not significant reduced the computation required when applying the models, and now we are more sure of the features which are really important for this dataset.

3. Forward Selection

step i) Select a significance level (SL) to enter the model.

step ii) Fit a model with each variable, taken one at a time, and select the feature with the lowest p-value.

step iii) Again fit models with two features at a time: the feature selected in the previous step plus one of the remaining features, and select as the second feature the one with the lowest p-value.

step iv) Repeat step (iii), but every time we come back to it we take one more feature together and check whether the p-value of the newly added feature is less than SL. If no feature remains to add, or the lowest p-value is more than SL, move to step (v). (A sketch of this loop follows the steps.)

step v) Stop
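As with Backward Elimination, here is a minimal sketch of the forward-selection loop, assuming a pandas DataFrame X of features and a target y; the names and the 5% threshold are illustrative, not the project’s actual code.

```python
import statsmodels.api as sm


def forward_selection(X, y, sl=0.05):
    """Greedily add, one feature at a time, the candidate with the lowest
    p-value, and stop once no remaining candidate is significant."""
    selected = []
    remaining = list(X.columns)
    while remaining:
        best_feature, best_p = None, 1.0
        for candidate in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [candidate]])).fit()
            p = model.pvalues[candidate]          # p-value of the newly added feature
            if p < best_p:
                best_feature, best_p = candidate, p
        if best_p >= sl:                          # no significant candidate left: stop
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```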

To understand the above steps, we will again take the example of the Taxi-demand-prediction-analysis project’s code.

Following are the names of the 19 features in this dataframe, from which we will select features using Forward Selection with an SL of 5%.

Loading and checking the data variables

First of all, we need to convert this pandas dataframe into an array so that we can apply the statsmodels API to this dataset and get the p-value of one variable at a time:
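As before, the original code is a screenshot; reusing the arrays from the backward-elimination sketch above, fitting a single variable might look like this (the chosen column index is illustrative):

```python
# Fit OLS with the intercept (column 0) plus one candidate feature (column 1)
# and inspect that feature's p-value.
X_opt = X_with_intercept[:, [0, 1]]
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_ols.pvalues)
```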

Below is the p-value we get for the first variable:

Similarly, we applied the above method to all 19 features, one at a time, and recorded their respective p-values; in this case it comes out to 0 (zero) for all of them.

We can select any of them as the first feature, so here I’m selecting the first one as our first feature with a p-value under SL. Now we need to apply the above method to two features taken at a time: the feature selected in the previous step plus one candidate, as below:
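Again as a rough sketch under the same assumptions, keeping the first selected feature and testing one candidate alongside it could look like this (column indices are illustrative):

```python
# Keep the already-selected feature (column 1) and test one candidate feature
# (column 2) alongside it; repeat this fit for every remaining candidate column.
X_opt = X_with_intercept[:, [0, 1, 2]]
regressor_ols = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_ols.pvalues)
```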

After applying this step to the rest of the features, we select as our second feature the one with the lowest p-value under SL. At the end of this step too, only one feature is selected.

Every time we repeat this step, we take one more feature than in the previous step, and again select only one new feature whose p-value is under SL.

So either we run out of features, or we get a p-value greater than SL. That is the time to stop: the feature-selection process is complete with the features selected in the previous step, and we can move ahead and build different models.

4. Bi-directional Elimination

It is the combination of Backward Elimination and Forward Selection.

step i) Select significance levels (SL) to enter and to stay in the model. There are two SLs: one used for the forward-selection step and the other for the backward-elimination step.

step ii) Apply the next step of Forward Selection. (Try the remaining variables one at a time, check their p-values, and add the one feature whose p-value is less than SL.)

step iii) Apply all the steps of Backward Elimination. (This step really kicks in once we have selected at least two variables in step (ii) above.) We check whether we can get rid of any variable selected in step (ii).

step iv) Repeat steps (ii) and (iii) until no new variable can be added and no old variable can exit the selected set, then go to step (v).

step v) Stop

Here, I’ll stress step (ii) a bit. Suppose that, at some point while executing step (ii), we have selected 4 variables with p-values under SL. Now apply step (iii) to these 4 selected variables: if the p-value of any of them has gone above SL, get rid of it. Then apply step (ii) again to add a new variable, and so on.
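A minimal sketch of this alternation, again assuming a pandas DataFrame X of features and a target y, with illustrative thresholds:

```python
import statsmodels.api as sm


def stepwise_selection(X, y, sl_enter=0.05, sl_stay=0.05):
    """Bi-directional elimination: alternately add the best new feature
    (forward step) and drop any selected feature that is no longer
    significant (backward step), until the selected set stops changing."""
    selected = []
    while True:
        changed = False
        # Forward step: add the remaining feature with the lowest p-value, if it may enter.
        remaining = [c for c in X.columns if c not in selected]
        if remaining:
            pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                     for c in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] < sl_enter:
                selected.append(best)
                changed = True
        # Backward step: drop the selected feature with the highest p-value, if it must leave.
        if len(selected) > 1:
            pvalues = sm.OLS(y, sm.add_constant(X[selected])).fit().pvalues.drop("const")
            worst = pvalues.idxmax()
            if pvalues[worst] > sl_stay:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected
```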

5. Score Comparison

It is the most resource-consuming approach.

step i) Select a criterion of goodness of fit.

step ii) Construct all possible regression models: 2^n − 1 total combinations, where n is the number of features.

step iii) Select the model with the best score on the criterion chosen in step (i) above.

But suppose we have 10 features in the data: there would then be 2^10 − 1 = 1023 models, which is a lot to consider. So this is not a good approach when we have a large number of features in the dataset.
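For completeness, here is a sketch of this exhaustive search using adjusted R-squared as the goodness-of-fit criterion; the criterion and names are assumptions, and any other score could be swapped in.

```python
from itertools import combinations

import statsmodels.api as sm


def best_subset(X, y, criterion=lambda m: m.rsquared_adj):
    """Fit every non-empty subset of the features (2**n - 1 models in total)
    and keep the subset with the best score on the chosen criterion."""
    best_score, best_features = float("-inf"), None
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            model = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            score = criterion(model)          # e.g. adjusted R-squared
            if score > best_score:
                best_score, best_features = score, list(subset)
    return best_features, best_score
```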

So, that’s all for today. I hope it helps somewhat in understanding these concepts.

Happy Analysing! :-)

References: https://www.udemy.com/machinelearning/

https://www.appliedaicourse.com/
