Feature Selection in Machine Learning

Sameer Kumar
4 min read · Oct 1, 2020


Introduction

In a Data Science project, there are seven major steps that are followed:

  1. Obtaining the right dataset
  2. Exploratory Data Analysis
  3. Feature Engineering
  4. Feature Selection
  5. Hyperparameter tuning
  6. Model creation
  7. Deployment of model

In this article, let us focus on Feature Selection, which is considered one of the most important steps in improving the accuracy of a model.

Feature Selection

“Too many cooks spoil the broth”

Let us say you have a model M1 which has 4 features in it, with an accuracy of A1. Now, as we keep adding features to our model, it is observed that the model accuracy keeps on increasing.

But after adding too many features, it is observed that the accuracy of the model starts decreasing. This is called the Curse of Dimensionality.

Why does this happen?

Up to a particular threshold, our model is able to extract more and more information from the data, but once the number of features grows too large, the extra features contribute more noise than useful signal and the model struggles to learn from them.
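To see this effect concretely, here is a minimal sketch (not from the original article) that appends purely random noise columns to the breast cancer dataset bundled with scikit-learn and measures the cross-validated accuracy of the same model. As the number of useless features grows, the accuracy tends to drop.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(42)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    for n_noise in (0, 50, 200, 1000):
        # append purely random columns that carry no signal about the target
        noise = rng.normal(size=(X.shape[0], n_noise))
        X_aug = np.hstack([X, noise])
        score = cross_val_score(model, X_aug, y, cv=5).mean()
        print(f"{n_noise:4d} noise features -> CV accuracy {score:.3f}")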

So how do we solve this problem?

We select only the relevant features, i.e. the ones which have a direct or indirect impact on the target, and omit the irrelevant features to prevent the curse of dimensionality.

This is called Feature Selection.

Feature Selection is simply the process of selecting the best subset of relevant features, those having an impact on the target variable. Selecting the right features ultimately improves the performance of the model.

Advantages of Feature Selection

  1. Reduces Overfitting
  2. Improves accuracy, since there is less misleading data
  3. Reduces training time, cost and complexity of the model

Techniques to select the right features

Let us discuss some of the important methods:

1) Feature importance

This technique gives us a score for each feature in the dataset: the higher the score, the more relevant the feature.

As an example, we can compute these scores on the mobile price classification dataset and plot them to compare the importance of all the features.
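Here is a minimal sketch of how these scores can be obtained using the feature_importances_ attribute of a tree-based model. The file name train.csv and the target column price_range are assumptions based on the mobile price classification dataset mentioned above.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.ensemble import ExtraTreesClassifier

    # 'train.csv' and the 'price_range' target column are assumed to match
    # the mobile price classification dataset referred to in this article
    df = pd.read_csv("train.csv")
    X = df.drop("price_range", axis=1)
    y = df["price_range"]

    model = ExtraTreesClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)

    # one score per feature: the higher the score, the more that feature contributed
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))

    importances.nlargest(10).plot(kind="barh")   # quick bar chart of the top features
    plt.show()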

2) Correlation with Heatmap

The correlation matrix gives us an insight into all the features and how they relate to the target variable. Compare the correlation coefficients of the independent features with the target (dependent) variable and decide on a threshold value (e.g. 0.2) above which you consider a feature relevant, as it has an impact on the target variable. Values closer to 1 indicate that the feature has a strong impact on the target, whereas values closer to zero indicate little or no impact.
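A minimal sketch of this approach with pandas and seaborn, assuming the same DataFrame df and target column price_range as in the previous snippet; the 0.2 threshold is just the example value from the paragraph above.

    import seaborn as sns
    import matplotlib.pyplot as plt

    corr = df.corr()                               # pairwise Pearson correlation matrix

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, annot=True, cmap="RdYlGn")   # visualise correlations as a heatmap
    plt.show()

    # keep features whose absolute correlation with the target exceeds the threshold
    threshold = 0.2
    target_corr = corr["price_range"].drop("price_range").abs()
    selected_features = target_corr[target_corr > threshold].index.tolist()
    print(selected_features)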

3) Forward Selection

Forward Selection is an iterative method in which we start with no features in our model. In each iteration, we add the feature which improves the accuracy of our model the most, and we stop when adding a new variable no longer improves the performance.

This method is very time-consuming, so it is preferred only for small datasets.
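A minimal sketch of forward selection using the SequentialFeatureSelector from the mlxtend library (one common implementation; scikit-learn also provides one). X and y are assumed to be the same as in the earlier snippets.

    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.ensemble import RandomForestClassifier

    sfs = SFS(
        RandomForestClassifier(n_estimators=50, random_state=0),
        k_features="best",   # keep the best-scoring subset found while adding features
        forward=True,        # start with no features and add one at a time
        scoring="accuracy",
        cv=3,
    )
    sfs = sfs.fit(X, y)

    print(sfs.k_feature_names_)   # names of the selected feature subset
    print(sfs.k_score_)           # cross-validated score of that subset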

These are some of the important techniques to select the right features.

Conclusion

There are many other methods to select the right features, such as univariate analysis, wrapper methods, and dropping constant features with the Variance Threshold method, which I will discuss in my next article :)

Until then, do read my article on Kaggle’s Titanic Problem Statement.

You can connect with me on LinkedIn :)

If you liked my article, do leave a clap guys:)

Thank you!
