Why Feature Selection?

Role of Feature Selection in Machine Learning

Shreyal Gajare
Analytics Vidhya
Aug 25, 2019

Feature selection plays a vital role in machine learning as well as in predictive modelling. It is one of the techniques that fall under dimensionality reduction.

Introduction:

Feature selection is the process of selecting a subset of relevant features for modelling, without applying any transformation to them. It is also known as attribute selection or variable selection. It helps in picking the most appropriate features from those available, and it can be performed manually or automatically.

Importance:

  1. Features may be expensive to obtain, so selecting only the relevant ones helps keep data collection costs down
  2. If features undergo transformation, their original measurement units are lost; with feature selection the measurement units are preserved
  3. It helps improve the accuracy of the model
  4. It reduces the time required to train the model
  5. It discards irrelevant (garbage) data

Types of Feature Selection:

In this tutorial we will discuss the three main categories of feature selection, along with examples of each.

1. Filter Method:

Filter methods are generally univariate, i.e. each feature is considered on its own, either independently or only in relation to the dependent (target) variable, and is scored with a statistical measure. For instance, features with higher variance may be selected on the assumption that they contain more useful information. The disadvantage is that the relationship between the feature variables and the target variable is not always taken into account. A few examples of filter methods follow, along with a short code sketch after the list.

  • Chi square test — This method tests the independence of two events. From the dataset we obtain the observed frequencies and the expected frequencies, and the test measures how much the observed values deviate from the expected ones.
  • Variance Threshold — This method discards every feature whose variance does not meet a specified threshold.
  • Information Gain — This method measures how much information an attribute gives about the class, so that the attributes which best discriminate between the classes can be selected.
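
Below is a minimal sketch of two of these filter methods using scikit-learn; the iris dataset and the variance threshold of 0.2 are illustrative assumptions, not part of the original article.

# Filter methods: variance threshold and chi-square scoring with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop features whose variance falls below 0.2.
X_high_variance = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-square test: keep the 2 features most dependent on the target
# (chi2 requires non-negative feature values).
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

print(X.shape, X_high_variance.shape, X_chi2.shape)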

2. Wrapper Method:

Wrapper methods use a search strategy, typically greedy heuristics and sometimes stochastic search. A model is trained on a candidate subset of features, and based on its performance the method decides which features to keep and which to discard; new features are added (or removed) so as to increase the performance of the model. However, the model needs training and cross-validation for every feature-set combination, which makes this an expensive approach. A few examples of wrapper methods follow, along with a short code sketch after the list.

  • Recursive Feature Elimination (RFE) — This method fits a model and repeatedly removes the weakest feature until the specified number of features remains. The features are ranked according to the order in which they are eliminated.
  • Forward Selection — This method starts with a model that has no features and keeps adding the variable that most improves model performance. It stops adding variables once a further addition no longer improves performance.
  • Backward Selection — This is exactly the opposite of the above method. Here we start with all the features, remove the least relevant feature, and check the model performance at every iteration. The process continues until removing more features no longer improves the model.
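
Below is a minimal sketch of RFE and forward selection with scikit-learn; SequentialFeatureSelector is assumed to be available (scikit-learn 0.24 or later), and the iris dataset and the choice of keeping 2 features are illustrative assumptions.

# Wrapper methods: recursive feature elimination and forward selection.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE, SequentialFeatureSelector

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# RFE: repeatedly fit the model and drop the weakest feature until 2 remain.
rfe = RFE(estimator=estimator, n_features_to_select=2).fit(X, y)
print("RFE ranking:", rfe.ranking_)

# Forward selection: start with no features and add the one that most
# improves the cross-validated score, until 2 features are selected.
sfs = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                direction="forward", cv=5).fit(X, y)
print("Forward-selected mask:", sfs.get_support())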

3. Embedded Method:

Embedded methods learn which features contribute most to the accuracy of the model while the model itself is being trained. They try to combine the efficiency of filter methods with the quality of wrapper methods, since the variable selection mechanism is built into the learning algorithm. A few examples follow, along with a short code sketch after the list.

  • Lasso (Least Absolute Shrinkage & Selection Operator) Regression — It is also known as L1 regularization and is used with generalized linear models. It adds a penalty term against model complexity to reduce the problem of overfitting; regularization in general is the process of including such additional information in order to solve ill-posed problems or avoid overfitting. The objective function to minimize is the least-squares error plus an L1 penalty on the coefficients, ||y - Xw||² + α Σ|w_j|, which drives the coefficients of unimportant features exactly to zero.
  • Ridge Regression — It is also known as L2 regularization. Here the penalty is the sum of the squared magnitudes of the coefficients, which makes it more sensitive to large coefficient values. It does not give sparse solutions; instead it shrinks the coefficients of unimportant features to values close to (but not exactly) zero. The cost function is the least-squares error plus an L2 penalty, ||y - Xw||² + α Σ w_j².
  • Elastic Net Regression — This is trained with both the L1 & L2 penalties, which allows it to learn a sparse model where only a few weights are non-zero, similar to Lasso, while also maintaining the regularization properties of Ridge regression.
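
Below is a minimal sketch of these embedded methods with scikit-learn; the diabetes dataset and the alpha values are illustrative assumptions.

# Embedded methods: Lasso (L1), Ridge (L2) and Elastic Net regularization.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put the features on a common scale

lasso = Lasso(alpha=1.0).fit(X, y)                    # L1: sparse coefficients
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: small but non-zero coefficients
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mixes L1 and L2

print("Features kept by Lasso:      ", np.flatnonzero(lasso.coef_))
print("Non-zero Ridge coefficients: ", np.count_nonzero(ridge.coef_))
print("Features kept by Elastic Net:", np.flatnonzero(enet.coef_))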

Thus, using the above feature selection methods makes the chosen attributes easy to interpret. They also help in discarding variables of little significance, which increases the model's predictive accuracy and efficiency and reduces training time.
