Feature Selection in Machine Learning

Sameer Kumar
4 min read · Oct 1, 2020


Introduction

In a Data Science project, there are seven major steps that are followed:

  1. Obtaining the right dataset
  2. Exploratory Data Analysis
  3. Feature Engineering
  4. Feature Selection
  5. Hyperparameter tuning
  6. Model creation
  7. Deployment of model

In this article, let us focus on Feature Selection, which is considered one of the most important steps in improving the accuracy of a model.

Feature Selection

“Too many cooks spoil the broth”

Let us say you have a model M1 which has 4 features in it, with an accuracy of A1. Now, as we keep adding features to our model, it is observed that the model accuracy keeps on increasing.

But after adding too many features, it is observed that the accuracy of the model starts decreasing. This is called the Curse of Dimensionality.

Why does this happen?

Up to a particular threshold, our model is able to extract more and more information from the data, but once the number of features grows too large, the extra features contribute more noise than useful signal and the model struggles to learn from them.
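To see this effect concretely, here is a minimal sketch (not from the original article) that appends purely random noise columns to the breast cancer dataset bundled with scikit-learn and measures the cross-validated accuracy of the same model. As the number of useless features grows, the accuracy tends to drop.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(42)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    for n_noise in (0, 50, 200, 1000):
        # append purely random columns that carry no signal about the target
        noise = rng.normal(size=(X.shape[0], n_noise))
        X_aug = np.hstack([X, noise])
        score = cross_val_score(model, X_aug, y, cv=5).mean()
        print(f"{n_noise:4d} noise features -> CV accuracy {score:.3f}")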

So how do we solve this problem?

We select only the relevant features, i.e. the ones which have a direct or indirect impact on the target, and omit the irrelevant features to prevent the curse of dimensionality.

This is called Feature Selection.

Feature Selection is simply the process of selecting the best subset of relevant features, those having an impact on the target variable. Selecting the right features ultimately improves the performance of the model.

Advantages of Feature Selection

  1. Reduces Overfitting
  2. Improves accuracy, since there is less misleading data
  3. Reduces training time, cost and complexity of the model

Techniques to select the right features

Let us discuss some of the important methods:

1) Feature importance

This technique gives us a score for each feature in the dataset: the higher the score, the more relevant the feature.

As an example, we can compute these scores on the mobile price classification dataset and plot them to compare the importance of all the features.
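Here is a minimal sketch of how these scores can be obtained using the feature_importances_ attribute of a tree-based model. The file name train.csv and the target column price_range are assumptions based on the mobile price classification dataset mentioned above.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.ensemble import ExtraTreesClassifier

    # 'train.csv' and the 'price_range' target column are assumed to match
    # the mobile price classification dataset referred to in this article
    df = pd.read_csv("train.csv")
    X = df.drop("price_range", axis=1)
    y = df["price_range"]

    model = ExtraTreesClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)

    # one score per feature: the higher the score, the more that feature contributed
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))

    importances.nlargest(10).plot(kind="barh")   # quick bar chart of the top features
    plt.show()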

2) Correlation with Heatmap

The correlation matrix gives us an insight into all the features and how they relate to the target variable. Compare the correlation coefficients of the independent features with the target (dependent) variable and decide on a threshold value (e.g. 0.2) above which you consider a feature relevant, as it has an impact on the target variable. Values closer to 1 indicate that the feature has a strong impact on the target, whereas values closer to zero indicate little or no impact.
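A minimal sketch of this approach with pandas and seaborn, assuming the same DataFrame df and target column price_range as in the previous snippet; the 0.2 threshold is just the example value from the paragraph above.

    import seaborn as sns
    import matplotlib.pyplot as plt

    corr = df.corr()                               # pairwise Pearson correlation matrix

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, annot=True, cmap="RdYlGn")   # visualise correlations as a heatmap
    plt.show()

    # keep features whose absolute correlation with the target exceeds the threshold
    threshold = 0.2
    target_corr = corr["price_range"].drop("price_range").abs()
    selected_features = target_corr[target_corr > threshold].index.tolist()
    print(selected_features)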

3) Forward Selection

Forward Selection is an iterative method in which we start with no features in our model. In each iteration, we add the feature which improves the accuracy of our model the most, and we stop when adding a new variable no longer improves the performance.

This method is very time-consuming, so it is preferred only for small datasets.
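A minimal sketch of forward selection using the SequentialFeatureSelector from the mlxtend library (one common implementation; scikit-learn also provides one). X and y are assumed to be the same as in the earlier snippets.

    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.ensemble import RandomForestClassifier

    sfs = SFS(
        RandomForestClassifier(n_estimators=50, random_state=0),
        k_features="best",   # keep the best-scoring subset found while adding features
        forward=True,        # start with no features and add one at a time
        scoring="accuracy",
        cv=3,
    )
    sfs = sfs.fit(X, y)

    print(sfs.k_feature_names_)   # names of the selected feature subset
    print(sfs.k_score_)           # cross-validated score of that subset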

These are some of the important techniques to select the right features.

Conclusion

There are many other methods to select the right features, such as univariate analysis, wrapper methods, and dropping constant features with the Variance Threshold method, which I will discuss in my next article :)

Until then, do read my article on Kaggle’s Titanic Problem Statement.

You can connect with me on LinkedIn :)

If you liked my article, do leave a clap guys:)

Thank you!
