Feature Selection For Dimensionality Reduction (Embedded Method)

Abhigyan · Published in Analytics Vidhya · Jul 5, 2020 · 5 min read

In machine learning, selecting the important features in the data is a key part of the full model-building cycle.

Passing data with irrelevant features can hurt the performance of the model, because the model also learns from those irrelevant features.

Need for Feature Selection:

Methods for Feature Selection

There are three general methods of feature selection:

  1. Filter Method
  2. Wrapper Method
  3. Embedded Method

Embedded Method

  1. In Embedded Methods, the feature selection algorithm is integrated as part of the learning algorithm.
  2. Embedded methods combine the qualities of filter and wrapper methods.
  3. They are implemented by algorithms that have their own built-in feature selection methods.
  4. The learning algorithm takes advantage of its own variable selection process and performs feature selection and classification/regression at the same time.
  5. The most common embedded techniques are tree algorithms such as RandomForest, ExtraTrees, and so on.
  6. Tree algorithms select a feature in each recursive step of the tree-growing process and divide the sample set into smaller subsets. The more the child nodes of a split belong to the same class, the more informative the feature is.
  7. Other embedded methods are Lasso with the L1 penalty and Ridge with the L2 penalty for constructing a linear model. Lasso shrinks many coefficients to exactly zero, while Ridge shrinks them close to (but not exactly) zero.

Advantages of Embedded Methods:

  • They take into consideration the interaction of features, as wrapper methods do.
  • They are fast, like filter methods.
  • They are more accurate than filter methods.
  • They find the feature subset for the specific algorithm being trained.
  • They are much less prone to over-fitting.

Approaches of Embedded Methods:

Regularization Approach:

The regularization approach includes Lasso (L1 regularization), Ridge (L2 regularization), and Elastic Net (a combination of L1 and L2).

→ LASSO (L1 Regularization)

  • Least Absolute Shrinkage and Selection Operator (Lasso) is a powerful method that performs both regularization and feature selection on the given data.
  • It penalizes the beta coefficients in a model.
  • The Lasso method puts a restriction on the sum of the absolute values of the model parameters.
  • The sum has to be less than a specified fixed value.
  • This shrinks some of the coefficients to exactly zero, indicating that the corresponding predictors/features are effectively multiplied by zero when estimating the target.
  • During this process, the variables that still have non-zero coefficients after the shrinkage are selected to be part of the model.
  • Equivalently, it adds a penalty term to the cost function of the model, with a lambda (λ) value that must be tuned.

This is how LASSO helps in reducing over-fitting as well as performing feature selection.
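As a sketch of the standard form, the Lasso objective (the "equation" referred to below) can be written as:

\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \; + \; \lambda \sum_{j=1}^{p} |\beta_j|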

When lambda (λ) is 0, the penalty term disappears and the equation reduces to ordinary least squares, so no parameters are eliminated.
An increase in λ increases the bias, while a decrease in λ increases the variance.
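A minimal sketch of Lasso-based feature selection with scikit-learn (the toy data, column names, and alpha value are only for illustration; in scikit-learn the λ from the text is called alpha):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data just for illustration: 10 features, only 3 of which carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

# Standardize first: the L1 penalty is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# A larger alpha shrinks more coefficients to exactly zero
lasso = Lasso(alpha=1.0).fit(X_scaled, y)

# The features that keep a non-zero coefficient are the selected ones
coef = pd.Series(lasso.coef_, index=X.columns)
print(coef[coef != 0].index.tolist())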

→ RIDGE (L2 Regularization)

  • RIDGE penalizes the beta coefficients for being too large; however, it does not shrink the coefficients all the way to zero, it only brings them close to zero.
  • It helps reduce the model complexity and takes care of any multi-collinearity in the data.
  • RIDGE is not preferable when the data contains a huge number of features of which only a few are actually important, as it may make the model simpler but the resulting model can have poor accuracy.
  • RIDGE decreases the complexity of a model but does not reduce the number of variables, since it never drives a coefficient exactly to zero, it only minimizes it. Hence, this method is not good for feature reduction.
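For comparison, a sketch of the standard Ridge objective, which replaces the absolute-value penalty with a squared one:

\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \; + \; \lambda \sum_{j=1}^{p} \beta_j^2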

→ Elastic Net (L1 and L2)

Lasso has been a popular algorithm for variable selection with high-dimensional data. However, it sometimes over-regularizes the data.

So the question arises: what if we could use both L1 and L2 regularization?
Elastic Net was introduced as a solution to this question.

Elastic Net balances the LASSO and RIDGE penalties.

Lasso will eliminate features, and reduce over-fitting in the linear model. Ridge will reduce the impact of features that are not important in predicting the target values.

This is done with the help of the hyper-parameter alpha (α). If α is 1, the model becomes LASSO; when α is 0, the model becomes RIDGE.

In order to tune the hyper-parameter alpha (α), cross-validation can be used.
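As a sketch, the combined Elastic Net penalty added to the cost function is commonly written as (with φ as the mixing weight, matching the note below):

\lambda \left[ \varphi \sum_{j=1}^{p} |\beta_j| \; + \; (1 - \varphi) \sum_{j=1}^{p} \beta_j^2 \right]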

φ here is the alpha (α) hyper-parameter.
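A minimal sketch of tuning it with cross-validation using scikit-learn's ElasticNetCV (toy data for illustration; note that scikit-learn calls the L1/L2 mix l1_ratio and uses alpha for the overall penalty strength):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Toy regression data just for illustration
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Cross-validate over several L1/L2 mixes and penalty strengths
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 1.0], cv=5).fit(X, y)

print(enet.l1_ratio_)  # chosen mix: values near 0 behave like RIDGE, near 1 like LASSO
print(enet.alpha_)     # chosen overall penalty strength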

Algorithm Based Approach:

  • This can be done using any kind of tree-based algorithm, like Decision Tree, RandomForest, ExtraTrees, XGBoost, and so on.
  • Within the algorithm, each split is made on a single feature, which is how the most useful variables are found.
  • The algorithm tries all possible splits for all the features and chooses the one that splits the data best. This is similar in spirit to a wrapper method, since many candidate splits are tried and the best one is picked.
  • For classification, the split is typically chosen using either Gini impurity or information gain (entropy); for regression, the split is chosen using variance reduction.
  • With the help of this method we can compute feature importances and remove features below a certain threshold (see the sketch after the code below).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# x is a DataFrame of features and y the target
model = RandomForestClassifier().fit(x, y)

important_features = pd.DataFrame(
    model.feature_importances_ * 100,
    index=x.columns,
    columns=['importance']
).sort_values('importance', ascending=False)

# The same can be done with an ExtraTreesClassifier too.
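To actually drop the low-importance features, one option is scikit-learn's SelectFromModel; a minimal sketch with toy data and an illustrative threshold:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy classification data just for illustration
X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)

# Keep only the features whose importance exceeds the chosen threshold
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold=0.05)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)   # number of features before and after selection
print(selector.get_support())           # boolean mask of the selected features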

These are a few embedded methods that help in selecting important features for building a better-performing model.
HAPPY LEARNING!!!!

Like my article? Do give me a clap and share it, as that will boost my confidence. Also, I post new articles every Sunday, so stay connected for future articles of the basics of data science and machine learning series.

Also, do connect with me on LinkedIn.

