Solve Machine Learning problems: Feature Selection (part 4)

Sri Harsha · Published in Analytics Vidhya · 5 min read · Apr 19, 2021

Introduction

When building a machine learning model, it is rare that all the variables in the dataset are useful. Feature selection is the process of selecting a subset of relevant features from all the available features for use in model building.

If we have few features, the model is easy to interpret and less likely to overfit, but it may give lower prediction accuracy.
If we have many features, the model is harder to interpret and more likely to overfit, but it may give higher prediction accuracy.
Feature selection helps us reduce the dimensionality of the feature space without losing much information.

Filter Method

Filter methods pick up the intrinsic properties of the features measured via univariate statistics instead of cross-validation performance. These methods are faster and less computationally expensive than wrapper methods. When dealing with high-dimensional data, it is computationally cheaper to use filter methods.

Variance Threshold

The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples. We assume that features with higher variance may contain more useful information, but note that this method does not take the relationship between feature variables, or between a feature and the target, into account.

The sketch below shows the change in the number of features before and after applying the variance threshold method.
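A minimal sketch of this step (the toy feature matrix here is an assumption, standing in for the post's dataset):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix standing in for the post's dataset (an assumption):
# the last two columns are constant and therefore carry no information.
rng = np.random.default_rng(0)
X = np.hstack([
    rng.normal(size=(100, 4)),   # informative, varying features
    np.zeros((100, 1)),          # zero-variance feature
    np.full((100, 1), 3.0),      # another constant feature
])

# Drop features whose variance does not exceed the threshold (default 0.0)
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print("Features before:", X.shape[1])          # 6
print("Features after :", X_reduced.shape[1])  # 4
```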

Univariate selection methods

Univariate feature selection methods work by selecting the best features based on univariate statistical tests such as ANOVA. They can be seen as a preprocessing step before fitting an estimator.

  1. SelectKBest
    SelectKBest selects features according to the k highest scores. For instance, we can score the samples with the mutual_info_regression test to retrieve only the best features from the dataset.

We input a scoring function that returns univariate scores and p-values (or only scores, in the case of SelectKBest and SelectPercentile). Common choices are:
chi2 is used when the features contain only non-negative values, such as booleans or frequencies, and measures dependence relative to the classes.
f_regression is used for regression with continuous variables; when the target is categorical (a classification problem), we use f_classif instead.
mutual_info_classif works with both continuous and discrete variables for classification models.
mutual_info_regression works with both continuous and discrete independent variables for regression models.

We used mutual_info_regression as the scoring function and extracted the top 4 features, as in the sketch below.
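A minimal sketch of SelectKBest with mutual_info_regression (k=4 follows the post's description; the toy regression data is an assumption):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Toy regression data standing in for the post's dataset (an assumption)
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       random_state=0)

# Keep the 4 features with the highest mutual information with the target
selector = SelectKBest(score_func=mutual_info_regression, k=4)
X_selected = selector.fit_transform(X, y)

print("Features before:", X.shape[1])           # 10
print("Features after :", X_selected.shape[1])  # 4
print("Selected column indices:", selector.get_support(indices=True))
```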

2. SelectPercentile
SelectPercentile selects features according to a percentile of the highest scores. It also takes a scoring function that returns univariate scores and p-values.

We used mutual_info_regression as the scoring function and kept the features whose scores fall in the top 3rd percentile, as in the sketch below.
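A minimal sketch of SelectPercentile (the toy data is an assumption; percentile=3 mirrors the "top 3rd percentile" mentioned above):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, mutual_info_regression

# Toy regression data standing in for the post's dataset (an assumption)
X, y = make_regression(n_samples=300, n_features=100, n_informative=5,
                       random_state=0)

# Keep the features whose scores fall in the top 3 percent
selector = SelectPercentile(score_func=mutual_info_regression, percentile=3)
X_selected = selector.fit_transform(X, y)

print("Features before:", X.shape[1])           # 100
print("Features after :", X_selected.shape[1])  # ~3
```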

Correlation-Matrix with Heatmap

Correlation is a measure of the linear relationship between two or more variables. Through correlation, we can predict one variable from another. The logic behind using correlation for feature selection is that good variables are highly correlated with the target; at the same time, variables should be uncorrelated among themselves.

If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only really needs one of them, as the second one does not add additional information.
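A minimal sketch of this idea using a correlation heatmap and a simple pairwise filter (the diabetes dataset and the 0.8 cutoff are assumptions, not from the post):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

# Example dataset standing in for the post's data (an assumption)
df = load_diabetes(as_frame=True).frame  # features plus a 'target' column

# Pearson correlation between every pair of variables, shown as a heatmap
corr = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Flag one feature from each highly correlated pair (0.8 is an assumed cutoff),
# keeping the one that is more correlated with the target.
threshold = 0.8
features = corr.columns.drop("target")
to_drop = set()
for i, f1 in enumerate(features):
    for f2 in features[i + 1:]:
        if abs(corr.loc[f1, f2]) > threshold:
            weaker = f1 if abs(corr.loc[f1, "target"]) < abs(corr.loc[f2, "target"]) else f2
            to_drop.add(weaker)
print("Candidate features to drop:", to_drop)
```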

Embedded Methods

LASSO Regularization

Regularization consists of adding a penalty to the different parameters of the machine learning model to reduce its freedom, i.e., to avoid over-fitting. In linear model regularization, the penalty is applied to the coefficients that multiply each of the predictors. Among the different types of regularization, Lasso (L1) has the property of being able to shrink some of the coefficients to zero, so those features can be removed from the model.

We can see the coefficients of some features reduced to zero: initially the dataset had 284 features, and after Lasso 255 features remain.
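A minimal sketch of Lasso-based selection with SelectFromModel (the toy data and the alpha value are assumptions; the post's 284-feature dataset is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the post's 284-feature dataset (an assumption)
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# Scale the features so the L1 penalty treats all coefficients comparably
X_scaled = StandardScaler().fit_transform(X)

# Lasso shrinks some coefficients exactly to zero; SelectFromModel then
# keeps only the features with non-zero coefficients.
lasso = Lasso(alpha=1.0)  # the penalty strength is an assumption, tune it
selector = SelectFromModel(lasso)
X_selected = selector.fit_transform(X_scaled, y)

print("Features before:", X.shape[1])
print("Features after :", X_selected.shape[1])
print("Coefficients shrunk to zero:", int(np.sum(selector.estimator_.coef_ == 0)))
```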

If the penalty is too high and important features are removed, we will notice a drop in the performance of the algorithm and realize that we need to decrease the regularization strength.

Random Forest Importance

Random forest is a bagging algorithm that aggregates a specified number of decision trees. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of a node, in other words by the decrease in impurity (Gini impurity) across all trees. Nodes with the greatest decrease in impurity occur at the start of the trees, while nodes with the least decrease in impurity occur at the end. By pruning trees below a particular node, we can create a subset of the most important features.

when training a tree, it is possible to compute how much each feature decreases the impurity. The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.

By default, SelectFromModel selects the features whose importance is greater than the mean importance of all the features, but we can alter this threshold if we want.

The sketch below shows the number of features before and after applying random forest importance.
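A minimal sketch of random-forest-based selection with SelectFromModel (the toy classification data is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy classification data standing in for the post's dataset (an assumption)
X, y = make_classification(n_samples=500, n_features=25, n_informative=5,
                           random_state=0)

# Fit a random forest and keep the features whose impurity-based importance
# is above the mean importance (SelectFromModel's default threshold)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="mean")
X_selected = selector.fit_transform(X, y)

print("Features before:", X.shape[1])
print("Features after :", X_selected.shape[1])
print("Importances:", selector.estimator_.feature_importances_.round(3))
```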

Conclusion

We have discussed a few techniques for feature selection. Feature selection is a wide, complicated field, and a lot of research has already been done to figure out the best methods. It is up to the machine learning engineer to combine and innovate on approaches, test them, and see what works best for the given problem.
