Feature Importance — Everything you need to know

A machine learning model is only as good as the features it is trained on. But how do we find the best features for a given problem? This is where feature selection comes in.

Sandeep Ram
Oct 25 · 5 min read
[Photo by Anders Jildén on Unsplash]

In this article, we will explore the feature selection techniques you should be familiar with in order to get the best performance out of your model:

  • SelectKBest
  • Linear Regression
  • Random Forest
  • XGBoost
  • Recursive Feature Elimination
  • Boruta

SelectKBest is a method provided by sklearn to rank the features of a dataset by their “importance” with respect to the target variable. This “importance” is calculated using a score function, which can be one of the following:

  • f_classif: ANOVA F-value between label/feature for classification tasks.
  • f_regression: F-value between label/feature for regression tasks.
  • chi2: Chi-squared stats of non-negative features for classification tasks.
  • mutual_info_classif: Mutual information for a discrete target.

sklearn also offers related univariate selectors that use these same score functions but differ in how they decide which features to keep:

  • SelectPercentile: Select features based on a percentile of the highest scores.
  • SelectFpr: Select features based on a false positive rate test.
  • SelectFdr: Select features based on an estimated false discovery rate.
  • SelectFwe: Select features based on the family-wise error rate.
  • GenericUnivariateSelect: Univariate feature selector with configurable mode.

All of the above-mentioned score functions are statistical tests. For instance, f_regression computes an F-statistic and the corresponding p-value for each variable, and SelectKBest keeps the K columns with the highest scores, which amounts to keeping the columns with the smallest p-values. Features with a p-value of less than 0.05 are conventionally considered “significant”, and the idea is that only these features should be used in the predictive model.

Significant feature (p-value less than 0.05):

[A Great Fit. Image by Author]

Insignificant feature (p-value more than 0.05):

[A Bad Fit. Image by Author]

This is one of the simplest methods as it is very computationally efficient and takes just a few lines of code to execute.
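For instance, here is a minimal sketch of SelectKBest with f_regression on a synthetic dataset (the toy data and the choice of k=5 are placeholders for illustration, not a recommendation):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data: 10 candidate features, only 4 of which are informative
X, y = make_regression(n_samples=500, n_features=10, n_informative=4, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])

# Rank features by the f_regression score and keep the top k
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

# Inspect the scores and p-values behind the ranking
report = pd.DataFrame({
    "feature": X.columns,
    "f_score": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("f_score", ascending=False)
print(report)
print("Selected columns:", list(X.columns[selector.get_support()]))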

Why is the p-value not a perfect feature selection technique?

The p-value only measures how each independent variable is individually related to the target variable. However, a feature that looks insignificant on its own can still carry useful information in combination with other features. Let’s take an example to illustrate this.

Consider a regression model that tries to predict the price of a plot of land given its length and breadth. The p-value of each of these variables might be very large, since neither feature on its own is directly related to the price. However, a combination of the two variables, specifically their product, gives the area of the plot, and the area has a very strong relationship with the price. Thus both length and breadth are significant features that would be overlooked by p-value-based feature selection.

By comparing the coefficients of a linear model, we can infer which features are more important than others. Two caveats apply: the features must be on comparable scales (standardize them first), and the linear model itself must be a good fit for the dataset. If the model fits poorly, its coefficients say little about feature importance; as a rule of thumb, only rely on this method when the model’s accuracy is around 95% or higher.

[A linear model fitted to the data, with coefficients for x1, x2 and x3. Image from Source]

In the above example, the coefficients of x1 and x3 are much larger than that of x2, so dropping x2 might seem like a good idea here. This approach is valid in this example because the model is a very good fit for the given data.
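A minimal sketch of this idea on synthetic data, where x2 is deliberately given almost no influence on the target (the dataset and coefficients are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data: x1 and x3 drive the target, x2 is mostly noise
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
y = 4.0 * X["x1"] + 0.05 * X["x2"] + 3.0 * X["x3"] + rng.normal(scale=0.5, size=500)

# Standardize so that coefficient magnitudes are directly comparable
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_scaled, y)
importance = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(importance)  # x2's coefficient should be close to zero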

The Random Forest is a very elegant algorithm that usually gives highly accurate predictions, even with minimal hyperparameter tuning. However, this is not where its usefulness ends!

Random Forest, as implemented in sklearn, exposes a feature_importances_ attribute that gives the importance of each variable. This is a good way to gauge feature importance on datasets where a Random Forest fits the data with high accuracy.
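A minimal sketch, again on placeholder data (a regression task is assumed here; the classifier works the same way):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: 8 features, 3 of them informative
X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

# Fit the forest, then read the impurity-based importances computed during training
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)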

Just like Random Forests, XGBoost models also have an inbuilt way to directly get the feature importance. XGBoost’s feature importance is usually more reliable than the methods mentioned above, since:

  • Faster than Random Forests by far!
  • It is far more reliable than linear models, so its feature importances are usually much more accurate
  • The p-value test does not consider interactions between variables, so features with a p_value > 0.05 might actually be important and vice versa. XGBoost usually does a good job of capturing relationships between multiple variables when calculating feature importance
[Image by Author]
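A minimal sketch with the xgboost scikit-learn wrapper (the dataset is a placeholder; the importance_type shown is one of several options xgboost supports):

import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

# Placeholder data: 8 features, 3 of them informative
X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

# The sklearn-style wrapper exposes feature_importances_ just like Random Forest
xgb = XGBRegressor(n_estimators=200, random_state=0)
xgb.fit(X, y)

importances = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)

# The underlying booster can also report importance by 'weight', 'gain' or 'cover'
print(xgb.get_booster().get_score(importance_type="gain"))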

Recursive Feature Elimination (RFE) recursively calculates feature importances and drops the least important feature. It starts by computing the importance of each column, drops the column with the lowest importance score, and then repeats the process until the desired number of features remains.
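scikit-learn ships this as RFE (and RFECV, which also chooses the number of features via cross-validation). A minimal sketch, assuming a Random Forest as the underlying estimator and a target of 5 features:

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Placeholder data: 10 features, 4 of them informative
X, y = make_regression(n_samples=500, n_features=10, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])

# Drop the least important feature one at a time until 5 remain
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=0),
          n_features_to_select=5, step=1)
rfe.fit(X, y)

print("Selected:", list(X.columns[rfe.support_]))
print("Ranking (1 = kept):", dict(zip(X.columns, rfe.ranking_)))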

NOTE: This algorithm assumes that none of the features are correlated. It is not advisable to use a feature if it has a Pearson correlation coefficient of more than 0.8 with any other feature.
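One quick way to check for such pairs before running the elimination, using pandas (the 0.8 threshold follows the note above; the toy frame is made up, with x2 constructed as a near copy of x1):

import numpy as np
import pandas as pd

# Toy frame in which x2 is almost a duplicate of x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),  # nearly a copy of x1
    "x3": rng.normal(size=200),                  # independent feature
})

# Absolute Pearson correlation between every pair of features
corr = X.corr().abs()

# Keep only the upper triangle so each pair appears once, then flag pairs above 0.8
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = upper.stack()
print(high_corr[high_corr > 0.8])  # should report the (x1, x2) pair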

[Image By Author]

Unlike the previously mentioned algorithms, Boruta is an all-relevant feature selection method, whereas most algorithms are minimal-optimal. What this means is that Boruta tries to find all features carrying useful information, rather than a compact subset of features that gives a minimal error.

Steps to Build a Boruta Selector (a sketch of the full workflow follows this list):

  • Install Boruta:
!pip install boruta
  • Make the necessary imports
  • Establish a base score to build upon
  • Train the Boruta feature selector
  • Calculate scores on the shortlisted features and compare them!
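Putting those steps together, a minimal sketch with BorutaPy wrapped around a Random Forest (the synthetic dataset, tree depth, and iteration count are illustrative choices, not the exact setup used on the housing data mentioned below):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from boruta import BorutaPy

# Placeholder data: 20 features, only 5 of them informative
X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(20)])

def make_rf():
    # A shallow forest keeps the Boruta run fast and stable
    return RandomForestRegressor(n_estimators=100, max_depth=5, n_jobs=-1, random_state=0)

# Base score: cross-validated R^2 using every feature
base_score = cross_val_score(make_rf(), X, y, cv=5).mean()

# Train the Boruta selector (BorutaPy expects NumPy arrays, not DataFrames)
selector = BorutaPy(estimator=make_rf(), n_estimators="auto", max_iter=50, random_state=0)
selector.fit(X.values, y)

# Score the shortlisted features and compare
selected = X.columns[selector.support_]
short_score = cross_val_score(make_rf(), X[selected], y, cv=5).mean()
print(f"Kept {len(selected)} of {X.shape[1]} features")
print(f"Base CV score: {base_score:.4f} | Shortlisted CV score: {short_score:.4f}")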

I personally use this method in most of my work. More often than not, using Boruta significantly reduces the dimensionality while also providing a minor boost to accuracy.

When trained on the Housing Price Regression dataset, Boruta reduced the dimensionality from 80+ features to just 16, while also providing an accuracy boost of 0.003%!

Summary:

  • If the dataset is not too large, use Boruta for feature selection.
  • If XGBoost or Random Forest gives more than 90% accuracy on the dataset, we can directly use their inbuilt .feature_importances_ attribute
  • If you just want the relationship between any 2 variables rather than the whole dataset, it’s ideal to go for the p-value score or Pearson correlation.

I hope you found this article informative. It gives a surface-level understanding of many feature selection techniques; however, this is not an exhaustive list. Leave a comment if you feel any important feature selection technique is missing.
