Feature Selection: A Key Technique to Improve Model Accuracy
Hello friends,
This is my first blog, and I will do my best to help you understand the key tools of data science 😉.
My first blog is based on feature selection.
Feature Selection is the process of selecting an optimal subset of features from a larger set of features. This process has several advantages, and there are various techniques available for carrying it out. In this blog, we will look at these advantages and at the main feature selection techniques.
Why do we do feature selection?
This question arises from many data scientists and from those who are interested in the field of data science. If you are familiar with the field, you already know that the main key tool of data science is machine learning, and you probably know how machine learning models are created, how they work, and why they are important in the field of data science.
If you are a beginner and want to dive into the sea of machine learning, I recommend checking his blogs to build strong knowledge of machine learning. He has done a fabulous job in his course, where he splits all the major machine learning techniques into parts so you can go through them one by one easily. I have adapted some tips from his blogs, and I would like to congratulate him on his excellent work.
As we were asking: why do we need feature selection? The answer is that we prepare our data in an informative form and build a machine learning model to predict the desired output, and we always want good accuracy and scalability from our ML model. But how do we get that? By selecting powerful features. If you have 10 features but some of them are not very helpful for the model and do not increase its accuracy much, then you need feature selection: we keep only the important features to increase the accuracy. There are many other advantages of using feature selection, which I list in the section below.
Advantages of selecting features:
- Improved accuracy
- Simple models are easier to interpret.
- Shorter training times
- Enhanced generalization by reducing Overfitting
- Easier for software developers to implement
- Reduced risk of data errors during model use
- Reduced variable redundancy
- Avoids the poor learning behaviour of models in high-dimensional spaces
Feature Selection — Techniques:
- Filter methods
- Wrapper methods
- Embedded methods
These are the 3 main feature selection techniques in the field of data science. Each of them has its own sub-types, so let's discuss them one by one.
1. Filter Methods
- Filter methods consist of various techniques, as given below:
- Basic methods
- Univariate methods
- Information gain
- Correlation Matrix with Heatmap or without Heatmap
These filter methods are basically used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable. The characteristics of these methods are as follows:
- These methods rely on the characteristics of the data (feature characteristics)
- They do not use machine learning algorithms.
- These are model agnostic.
- They tend to be less computationally expensive.
- They usually give lower prediction performance than wrapper methods.
- They are very well suited for a quick screen and removal of irrelevant features.
1.1 Basic Methods: under basic methods, we remove constant and quasi-constant features.
- Constant features are those that show the same value, just one value, for all the observations of the dataset. These features have zero variance because they hold the same value in every row and never change. They provide no information that allows a machine learning model to discriminate or predict a target.
- Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models. There are multiple ways to identify constant features, but a simple one is the VarianceThreshold function from sklearn.
- Here is the link to see code: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
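For example, here is a minimal sketch of flagging constant features with a variance threshold of 0 (assuming a pandas DataFrame train_data of input features; the names are placeholders):

from sklearn.feature_selection import VarianceThreshold

# threshold=0 keeps only features whose variance is greater than zero,
# so the features it rejects are exactly the constant ones
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(train_data)
print(train_data.columns[~constant_filter.get_support()])  # constant columns that can be dropped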
Under the quasi-constant method, we check for features with very low variance: a few rows may hold a different value, but only for a very small share of the observations. The quasi-constant check finds features that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so we should be careful when removing these types of features. Identifying and removing quasi-constant features is an easy first step towards feature selection and more easily interpretable machine learning models.
We use the same function as for constant features: VarianceThreshold from sklearn.
Code:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

train_data = pd.read_csv('any_file_train.csv')  # you can use any CSV file of numeric features

vt = VarianceThreshold(threshold=0.2)  # drop features whose variance is below the threshold (here 0.2)
vt.fit(train_data)                     # fit finds the features with low variance
vt.get_support()                       # boolean mask of the features that meet the threshold, i.e. the ones we keep

# Simple way of using it:
print(train_data.columns[vt.get_support()])  # the features we will keep
1.2 Univariate selection methods:
Univariate selection methods select the best features using statistical tests such as ANOVA. Scikit-learn exposes feature selection routines as objects that implement the transform method.
The methods based on F-test estimate the degree of linear dependency between two random variables. They assume a linear relationship between the feature and the target. These methods also assume that the variables follow a Gaussian distribution.
There are some methods which fall under this category:
SelectKBest, SelectPercentile and some others.
Source code: https://scikit-learn.org/stable/modules/feature_selection.html
SelectKBest source code: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
SelectPercentile: select features according to a percentile of the highest scores.
These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile).
For regression problems use: f_regression, mutual_info_regression.
For classification problems use: chi2, f_classif, mutual_info_classif.
The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
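As a quick illustration, here is a minimal sketch of univariate selection with SelectKBest and the ANOVA F-test for a classification problem (assuming the same train_data DataFrame and a target column y; k=5 is just an example):

from sklearn.feature_selection import SelectKBest, f_classif

skb = SelectKBest(score_func=f_classif, k=5)  # keep the 5 features with the highest F-scores
x_new = skb.fit_transform(train_data, y)      # y is the target column
print(train_data.columns[skb.get_support()])  # names of the selected features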
1.3 Information Gain:
Information gain is a technique used to measure the importance of a feature in the model. It checks how much a feature contributes to making correct predictions of the target.
Source code for discrete target: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif
Source code for continuous target: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression
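For example, a minimal sketch that scores each feature by its mutual information with a discrete target (assuming train_data and y as before):

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(train_data, y)       # information gain of each feature with respect to the target
mi = pd.Series(mi, index=train_data.columns)
print(mi.sort_values(ascending=False))        # higher values mean more informative features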
1.4 Correlation Matrix with Heatmap and without Heatmap
Correlation is a statistical method used to detect correlated features; in other words, it shows the linear relationship between two variables. The Pearson correlation coefficient ranges from -1 to +1. Important features are highly correlated with the target variable. On the other hand, variables that are highly correlated with other input variables should usually be dropped, because they mostly duplicate information that is already present; this dependency between inputs can bias the ML model and hurt accuracy.
The main point to explain here is: “Good feature subsets contain features highly correlated with the target, yet uncorrelated to each other”
If the correlation between two features is less than 0, increasing the values of one feature will decrease the values of the other (and the closer the correlation coefficient is to -1, the stronger this relationship between the two features).
A correlation heatmap displays the correlation matrix as a colour-coded grid. There are different kinds of heatmaps to show correlation, but one simple version, like the sketch below, is enough to understand the concept.
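Here is a minimal sketch that computes the correlation matrix and plots it as a heatmap (assuming train_data as before, plus the seaborn and matplotlib libraries):

import seaborn as sns
import matplotlib.pyplot as plt

corr = train_data.corr()                        # pairwise Pearson correlations between the features
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')  # colour-coded correlation matrix
plt.show()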
2. Wrapper Methods
- In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.
- Some common examples of wrapper methods are:
- Forward selection
- Backward elimination
- Exhaustive feature selection
- Recursive feature elimination
- Recursive feature elimination with cross-validation
2.1 Forward selection
This process starts with an empty set of features, and we gradually add the feature that best improves the model. The stopping criterion is that once adding a feature no longer improves the model's performance, we stop adding features.
Step forward feature selection starts by evaluating all features individually and selects the one that produces the best performing algorithm, according to a pre-set evaluation criterion. In the second step, it evaluates all possible combinations of the selected feature and a second feature, and selects the pair that produces the best performing algorithm based on the same criterion. A typical pre-set criterion is roc_auc for classification and R squared for regression.
The mlxtend package can be used for the forward selection method:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestClassifier
import numpy as np

mod = SFS(RandomForestClassifier(),
          k_features=5,       # stop once 5 features have been selected
          forward=True,       # forward selection: start empty and add features
          floating=False,
          verbose=2,
          scoring='roc_auc',
          cv=5)

# fit it
mod.fit(np.array(train_data), y)  # y is the target column, train_data holds the input features
2.2 Backward Elimination
It is the opposite of forward selection: we start with the full set of features, and in every iteration we find and remove the least significant feature, i.e. the one whose removal most improves (or least hurts) the performance of the model. We repeat this until no further improvement is found.
It also uses the mlxtend package, with the same function but a small change in the parameters:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestClassifier
import numpy as np

mod = SFS(RandomForestClassifier(),
          k_features=5,       # stop once 5 features are left
          forward=False,      # forward=False turns this into backward elimination
          floating=False,
          verbose=2,
          scoring='roc_auc',
          cv=5)

# fit it
mod.fit(np.array(train_data), y)  # y is the target column, train_data holds the input features
2.3 Exhaustive Feature Selection
This feature selection algorithm is a wrapper approach for brute-force evaluation of feature subsets; the best subset is selected by optimizing a specified performance metric given an arbitrary regressor or classifier. For instance, if the classifier is a logistic regression and the dataset consists of 4 features, the algorithm will evaluate all 15 feature combinations:
- all possible combinations of 1 feature
- all possible combinations of 2 features
- all possible combinations of 3 features
- all the 4 features
We use the mlxtend package for this method as well.
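A minimal sketch with mlxtend's ExhaustiveFeatureSelector (assuming train_data and y as before; LogisticRegression and the parameter values are only illustrative):

import numpy as np
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LogisticRegression

efs = EFS(LogisticRegression(),
          min_features=1,
          max_features=4,       # evaluate every subset of 1 to 4 features
          scoring='roc_auc',
          cv=5)
efs.fit(np.array(train_data), y)
print(efs.best_idx_)            # indices of the best-performing feature subset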
I have described the three wrapper methods that are most often used in their basic form, but to learn more about wrapper methods and the remaining techniques, I suggest checking these pages:
For Recursive feature elimination: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html#sphx-glr-auto-examples-feature-selection-plot-rfe-digits-py
For Recursive feature elimination with cross-validation: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py
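For a quick taste, here is a minimal sketch of recursive feature elimination with scikit-learn (assuming train_data and y as before; the estimator and the number of features are just examples):

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

rfe = RFE(RandomForestClassifier(), n_features_to_select=5)  # drop the weakest feature at each step
rfe.fit(train_data, y)
print(train_data.columns[rfe.support_])  # the 5 features that survive the elimination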
3. Embedded Methods
- Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract those features which contribute the most to the training in that iteration. Regularization methods are the most commonly used embedded methods; they penalize a feature given a coefficient threshold.
- Some of the most popular examples of these methods are LASSO and RIDGE regression which have inbuilt penalization functions to reduce overfitting.
3.1 LASSO Regression
It is used over ordinary regression methods for a more accurate prediction. This model uses shrinkage: coefficient values are shrunk towards a central point, such as the mean.
Source code: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
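A minimal sketch of using LASSO for feature selection through SelectFromModel (assuming train_data and a numeric target y; the alpha value is only illustrative):

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# coefficients that LASSO shrinks to exactly zero mark the features to drop
sel = SelectFromModel(Lasso(alpha=0.01))      # alpha controls the strength of the penalty
sel.fit(train_data, y)
print(train_data.columns[sel.get_support()])  # features with non-zero coefficients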
3.2 Random Forest Importance
Random Forest is one of the most important models in machine learning. It is inspired by the great technique of bagging. Random Forest works by letting many different decision trees produce their outputs, and the majority output is chosen as the final output of the Random Forest model.
- It’s more accurate than the decision tree algorithm.
- It provides an effective way of handling missing data.
- It can produce a reasonable prediction without hyper-parameter tuning.
- It solves the issue of overfitting in decision trees.
- In every random forest tree, a subset of features is selected randomly at the node’s splitting point.
Feature importance is an attribute of the Random Forest model; it becomes available after the fit function, i.e. after the Random Forest model has been trained.
from sklearn.ensemble import RandomForestClassifier

# I assume x_train, y_train, x_test and y_test are already prepared in clean form.
model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(x_train, y_train)
model.feature_importances_  # importance score of each feature
Here we can clearly see that odor_n has the highest feature importance while some features score very low, so we can easily choose a subset of the columns for further classification.
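To read the importances more easily, a small sketch (assuming the trained model and the x_train DataFrame from above) pairs them with the column names:

import pandas as pd

importances = pd.Series(model.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))  # most important features first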
I have mentioned these points on the basis of my own knowledge. If you are a beginner and want to do projects and polish your skills, I would recommend using the Kaggle platform to start your career in data science.
If you have any query regarding this topic or any other topic, feel free to ask in the comments.
References:
[1]:
[2]:
https://www.izen.ai/blog-posts/feature-selection-filter-method-wrapper-method-and-embedded-method/