Feature Selection for Dimensionality Reduction (Filter Method)

Abhigyan
Published in Analytics Vidhya
Jun 21, 2020 · 5 min read

In machine learning, selecting the important features in the data is a key part of the full model-building cycle.

Passing data with irrelevant features can hurt the performance of the model, because the model ends up learning from those irrelevant features.

Need for Feature Selection:

Removing irrelevant or redundant features reduces overfitting, improves model accuracy, and cuts down training time.

Methods for Feature Selection

There are three general methods of feature selection:

  1. Filter Method
  2. Wrapper Method
  3. Embedded Method

Filter Method

  1. This method is generally applied as a pre-processing step, before the data is passed on to build a model.
  2. Various statistical tests are performed, and features are selected on the basis of their scores.
  3. Filter methods are less accurate but faster to compute.
  4. They are preferable for larger datasets, precisely because they are fast to compute.
  5. Filter methods are good for building a theoretical framework and for understanding the structure of the data.

Some common filter techniques are:

  • Correlation method:
    → It is used as a measure of the linear dependency between two continuous variables X and Y.
    → It ranges between -1 and 1, where a value closer to 1 shows that the variables are highly positively correlated and a value closer to -1 indicates that they are negatively correlated.
    → The correlation method helps identify which variables closely resemble one another.
    → Different correlation methods include:
    * Pearson Correlation Coefficient.
    * Spearman Correlation Coefficient.

Pearson and Spearman are two quite different tests.
Pearson correlation finds the “linear relationship” between the variables, whereas Spearman correlation finds the “monotonic relationship” between them.
Pearson correlation is usually preferred; however, I like to run both. If the Spearman result is greater than the Pearson result, it shows that the variables have a more monotonic than linear relationship.
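
To make the difference concrete, here is a minimal sketch using SciPy's pearsonr and spearmanr on made-up data, where y is a monotonic but non-linear function of x (the data is hypothetical, purely for illustration):

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y is a monotonic but non-linear function of x.
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson_r, _ = pearsonr(df["x"], df["y"])
spearman_r, _ = spearmanr(df["x"], df["y"])

print(f"Pearson:  {pearson_r:.3f}")   # < 1: the relationship is not perfectly linear
print(f"Spearman: {spearman_r:.3f}")  # = 1: the relationship is perfectly monotonic
```

Here Spearman is exactly 1 while Pearson falls short of it, which is the signature of a monotonic, non-linear relationship.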

  • Chi-Square Test:
    → The chi-square test is used for categorical features in a dataset.
    → We calculate the chi-square statistic between each feature and the target, and select the desired number of features with the best chi-square scores.
    → It determines whether the association between two categorical variables in the sample reflects their real association in the population.
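
A minimal sketch of chi-square selection using scikit-learn's SelectKBest with the chi2 score function (the iris dataset here is just a stand-in for a real categorical dataset; chi2 requires non-negative feature values):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # all features are non-negative

# Keep the 2 features with the highest chi-square scores.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # chi-square score per feature
print(X_new.shape)       # (150, 2): only the 2 best-scoring features remain
```
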
  • Anova:
    → Analysis of Variance (ANOVA) is a statistical method used to check whether the means of two or more groups are significantly different from each other. It assumes the hypotheses:
    * Null: the means of all groups are equal.
    * Alternate: at least one group mean is different.
    It checks the impact of one or more factors by comparing the means of different samples.
    → ANOVA and the t-test perform basically the same when only two samples are compared. However, if more than two samples are compared, ANOVA is used, since running multiple t-tests has a compounded effect on the error rate.
    → Performing pairwise t-tests on more than two samples inflates the error rate to roughly 15% (with three groups there are three pairwise tests, and 1 − 0.95³ ≈ 14%), whereas ANOVA keeps it as low as 5% at a 95% CI (confidence interval).

A confidence interval is the range within which the true population parameter is expected to lie, at a given confidence level.
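
To make the ANOVA mechanics concrete, here is a minimal sketch of a one-way ANOVA across three hypothetical groups of measurements, using SciPy's f_oneway:

```python
from scipy.stats import f_oneway

# Hypothetical measurements from three groups.
group_a = [23, 25, 21, 24, 26]
group_b = [30, 32, 29, 31, 33]
group_c = [22, 24, 23, 25, 21]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# p < 0.05 -> reject the null: at least one group mean differs.
if p_value < 0.05:
    print("At least one group mean differs significantly.")
```
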

  • Variance Inflation Factor (VIF):
    → A variance inflation factor (VIF) provides a measure of multicollinearity among the independent variables in a multiple regression model.
    → Detecting multicollinearity is important because, while it does not reduce the explanatory power of the model, it does reduce the statistical significance of the independent variables.
    → A large VIF on an independent variable indicates a highly collinear relationship with the other variables, which should be accounted for in the structure of the model and the selection of independent variables.
    → VIF can be interpreted from the values we get:
    * 1 — no collinearity
    * 1 to 5 — some collinearity is present
    * >5 — high collinearity is present
    → It is determined with the help of the coefficient of determination (R-squared): VIF = 1 / (1 − R²), where R² comes from regressing that variable on all the other independent variables. The higher the R-squared value, the higher the VIF.

Here, 1 − R² is also called the tolerance.
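
A minimal sketch of computing VIF per feature with statsmodels, on made-up data where one feature is nearly a linear copy of another:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical features: x2 is almost a linear function of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=100),
    "x3": rng.normal(size=100),
})

Xc = sm.add_constant(X)  # include an intercept so the VIFs are meaningful
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    index=Xc.columns,
)
print(vif)  # expect very large VIFs for x1 and x2, and ~1 for x3 (ignore const)
```

Dropping either x1 or x2 and recomputing would bring the remaining VIFs back down toward 1.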

There are many filter methods for deciding which features to select, and understanding when to use which comes with practice. However, I suggest trying out different methods and seeing which one helps best in selecting features without having too much impact on the accuracy of the model.
The few basic methods given above should be understood thoroughly before moving on to other methods.

Coming up next week is the WRAPPER METHOD for feature selection.

HAPPY LEARNING!!!!

Like my article? Do give me a clap and share it, as that will boost my confidence. Also, I post new articles every Sunday, so stay connected for future articles in this basics of data science and machine learning series.

Also, if you want, connect with me on LinkedIn.

