Use These Methods To Select Features For Your ML Model

In this article, we discuss some feature selection techniques you can use to choose the features that will be fed to the Machine Learning model of your choice.

Published in INSAID · 7 min read · Dec 21, 2022

By: Daksh Bhatnagar

INTRODUCTION

In the data science community, it is often said that your predictions are only as good as your data, meaning that if your data has been pre-processed correctly and thoughtfully, nine times out of ten you are likely to get the results you are looking for. Some of the ways you can preprocess your data are:

  1. Bringing all the features to the same scale
  2. Converting Categorical Columns to Numerical Columns
  3. Selecting top-n columns (based on some criteria) for the model training
  4. Removing the outliers from the data
Source: bigdataanalyticsnews.com

We will be focusing on the third aspect: selecting columns, or features, based on some criterion so that the chosen features help our model make better predictions. This is known as feature selection, and it sits within the broader practice of feature engineering.

Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The goal is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.

Feature engineering differs from feature extraction: feature engineering is the process of manually coming up with columns that make sense for the model to be fed with, whereas feature extraction is what happens in Deep Learning, where a CNN learns features automatically by sliding a small kernel matrix over the image given to the neural network.

Below is how the kernels move over an image in a Convolutional Neural Network to extract its features (edges first, then higher-level features such as the face in a cat-vs-dog classification problem).

Source: towardsdatascience.com

FEATURE SELECTION TECHNIQUES

  1. Filter Methods: In filter methods, features are selected on the basis of their scores in various statistical tests (Correlation, Chi-Square, ANOVA) that measure their relationship with the outcome variable.
  2. Wrapper Methods — In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset.
  3. Embedded Methods — Embedded methods combine the qualities of filter and wrapper methods. It’s implemented by algorithms that have their own built-in feature selection methods. Some of the most popular examples of these methods are Lasso and Ridge regression which have inbuilt penalization functions to reduce overfitting.

Let’s dive a little deeper into what these are.

1. Filter Methods

With filter methods, you can use Pearson’s Correlation, Chi-Square, and ANOVA to get the scores of the features.

Pearson correlation is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1.
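
As a quick sketch, here is one way to rank features by their Pearson correlation with the target using pandas; the DataFrame and the column names used here are hypothetical placeholders, not part of the original example.

```python
# A minimal sketch: rank features by Pearson correlation with the target.
# `df` and the column name "target" are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.1, 1.9, 2.0, 2.2, 1.8],
    "target":    [1.2, 2.3, 2.9, 4.1, 5.2],
})

# Pearson correlation of every feature with the target, sorted by absolute value
corr_with_target = (
    df.corr(method="pearson")["target"]
      .drop("target")
      .abs()
      .sort_values(ascending=False)
)
print(corr_with_target)  # keep the top-n features from this ranking
```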

A chi-square (χ2) statistic is a test that measures how a model compares to actual observed data. The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample. For example, the results of tossing a fair coin meet these criteria.

ANOVA is an acronym for “analysis of variance” and is a parametric statistical hypothesis test for determining whether the means from two or more samples of data (often three or more) come from the same distribution or not.

For correlation, you would pick the features that have a high correlation with the target variable; for Chi-Square, on the other hand, you should not pick features with high p-values, since a high p-value indicates no significant relationship with the target.

Correlation Heatmap
Chi-Square Test p-values of features
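
Below is a minimal sketch of running the chi-square test with scikit-learn; the iris dataset is used purely for illustration, and note that chi2 expects non-negative feature values.

```python
# A minimal sketch of chi-square feature scoring for a classification problem.
# The dataset here is just an illustrative placeholder.
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
import pandas as pd

X, y = load_iris(return_X_y=True, as_frame=True)

scores, p_values = chi2(X, y)  # chi2 requires non-negative feature values
results = pd.DataFrame({"feature": X.columns, "chi2": scores, "p_value": p_values})

# As discussed above, prefer features with low p-values (high chi-square scores)
print(results.sort_values("p_value"))
```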

For ANOVA, the general guideline is to pick the features that have higher ANOVA scores. From the image below, we can see that the features on the right have higher ANOVA scores, hence we should choose the top-n columns from the right, not from the left.

ANOVA p-value score chart
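
A minimal sketch of ANOVA-based selection using scikit-learn's SelectKBest with f_classif; the dataset and the choice of k=2 are illustrative assumptions.

```python
# A minimal sketch of ANOVA (F-test) feature selection with SelectKBest.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True, as_frame=True)

selector = SelectKBest(score_func=f_classif, k=2)  # keep the top-2 features
X_selected = selector.fit_transform(X, y)

# Higher F-scores indicate features whose means differ more across classes
print(dict(zip(X.columns, selector.scores_)))
print("Kept:", list(selector.get_feature_names_out()))
```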

2. Wrapper Methods

In wrapper methods, the feature selection process is based on a specific machine learning algorithm that we are trying to fit on a given dataset.

It follows a greedy search approach, evaluating candidate combinations of features against an evaluation criterion. The evaluation criterion is simply a performance measure that depends on the type of problem: for regression it can be p-values, R-squared, or Adjusted R-squared, while for classification it can be accuracy, precision, recall, F1-score, and so on.

Finally, it selects the combination of features that gives the optimal results for the specified machine-learning algorithm.

How Wrapper Methods work to find the best features

You can use Recursive Feature Elimination or Forward Feature Selection, both of which fall under the wrapper methods of feature selection in machine learning.

a. Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a transformer estimator, which means it follows the familiar fit/transform pattern of Sklearn. It is a popular algorithm due to its easily configurable nature and robust performance. As the name suggests, it removes features one at a time based on the weights given by a model of our choice in each iteration.

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a specific attribute (such as coef_ or feature_importances_) or through a callable. Then, the least important features are pruned from the current set of features.

This procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
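
Here is a minimal sketch of RFE with scikit-learn; the logistic regression estimator, the dataset, and the choice of 5 features are illustrative assumptions, not a recommendation.

```python
# A minimal sketch of Recursive Feature Elimination with a linear model
# supplying the feature weights.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator=estimator, n_features_to_select=5, step=1)
rfe.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for selected features,
# larger values mean the feature was pruned earlier
print(list(X.columns[rfe.support_]))
X_reduced = rfe.transform(X)  # follows the usual fit/transform pattern
```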

b. Forward Feature Selection

It starts by evaluating each individual feature and selecting the one that results in the best-performing model. Next, all possible combinations of that selected feature with each remaining feature are evaluated and a second feature is selected, and so on, until the required predefined number of features has been selected.

Step backward feature selection is closely related, and as you may have guessed starts with the entire set of features and works backward from there, removing features to find the optimal subset of a predefined size.

Both approaches can be very computationally expensive: they may take too long to be of any use, or may be totally infeasible on large, high-dimensional data. That said, with a dataset of accommodating size and dimensionality, such an approach may well be your best option.
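
A minimal sketch of forward (and, by switching the direction parameter, backward) selection using scikit-learn's SequentialFeatureSelector; the dataset, model, and target of 3 features are illustrative assumptions.

```python
# A minimal sketch of sequential (forward/backward) feature selection.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
model = KNeighborsClassifier(n_neighbors=3)

# direction="forward" adds one feature at a time; direction="backward"
# starts from all features and removes them, as described above
sfs = SequentialFeatureSelector(model, n_features_to_select=3,
                                direction="forward", cv=5)
sfs.fit(X, y)

print("Selected:", list(X.columns[sfs.get_support()]))
```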

3. Embedded Methods

Embedded methods, as discussed above, combine the filter and wrapper approaches. Some algorithms perform feature selection as part of their training and can report feature importances, which you can then use so that the final model is fed only the most relevant features.

Random Forest, Decision Trees, XGBoost, etc. are some of the algorithms where feature selection is effectively built in. Lasso and Ridge regression apply regularization that penalizes features that add little value to the model; Lasso in particular can shrink the coefficients of unhelpful features all the way to zero, leaving the helpful features to drive the predictions.
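
A minimal sketch of embedded selection: reading feature importances from a Random Forest, and letting Lasso's L1 penalty zero out unhelpful features via SelectFromModel. The dataset and hyperparameters (n_estimators, alpha) are illustrative assumptions.

```python
# A minimal sketch of embedded feature selection.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Random Forest: importances come "for free" from training
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(dict(zip(X.columns, forest.feature_importances_.round(3))))

# Lasso: L1 regularization shrinks unhelpful coefficients to exactly zero,
# so SelectFromModel keeps only the features with non-zero coefficients
lasso_selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
print("Kept by Lasso:", list(X.columns[lasso_selector.get_support()]))
```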

CONCLUSION

  1. Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data
  2. Wrapper methods are best suited to small and medium-sized datasets, since the computational time and power they require grow quickly with larger data; filter methods are a much cheaper first pass.
  3. There is no best feature selection method. Just like there is no best set of input variables or best machine learning algorithm. At least not universally. Instead, you must discover what works best for your specific problem using careful systematic experimentation.
  4. Try a range of different models that fit on different subsets of features chosen via different statistical measures and discover what works best for your specific problem.

Final Thoughts and Closing Comments

There are some vital points many people overlook as they pursue their Data Science or AI journey. If you are one of them and are looking for a way to fill those gaps, check out the certification programs provided by INSAID on their website.

If you liked this article, I recommend you go with the Global Certificate in Data Science & AI, because this one will cover your foundations, machine learning algorithms, and deep neural networks (basic to advanced).
