Feature Selection Algorithms for Machine Learning

Choosing the right ones

Moosa Ali
CodeX
May 7, 2022

Feature selection is an optional yet important preprocessing step for your Machine Learning model. It is common practice to feed a Machine Learning model data exactly as you receive it. This is a classic rookie mistake: data from any source will contain errors. There are two important steps you must carry out on this data before moving on to training the models.

  1. Data Cleaning
  2. Feature filtering/selection

I have talked about data cleaning several times in other articles; if you would like to learn more about it, check the blog post linked below.

Feature Selection

In this post, however, we will talk about selecting the important input features from your dataset, so that dropping the rest has minimal impact on the outputs while reducing the complexity of the data.

First, let's talk about how correlation amongst input variables affects a Machine Learning model.

Correlation between features

Suppose you have a dataset with the input features and the output labeled Y shown in the diagram below.

Data set with redundant features

The variables x_2 and x_3 have exactly the same value throughout the dataset. A Machine Learning model learns the pattern in the input variables corresponding to each output variable. For variables with a high correlation (x_2 and x_3 in the diagram above), one of the two provides no valuable information to the model: since both follow the same pattern, it is as if they were a single variable.

Reduced data set with only important features

Since one of these variables adds no additional information to the model, we can remove that feature altogether. What’s the harm in keeping such a variable, you may ask?

Well…

  1. It unnecessarily increases the dimensionality of the data, which increases training time and can run us into the curse of dimensionality.
  2. Although keeping the feature may not make most models any “worse”, highly correlated variables affect different models differently; they can add confusion to the model and hence reduce performance. A quick way to spot such pairs is sketched below.
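
Below is a minimal sketch of that check using the pairwise correlation matrix in pandas; the DataFrame name df and the 0.95 cutoff are assumptions for illustration, not values taken from this article.

```python
import pandas as pd

# Assumed: df is a DataFrame holding only the numeric input features.
corr = df.corr().abs()

# Inspect each distinct pair of columns once and flag near-duplicates.
THRESHOLD = 0.95  # arbitrary cutoff, chosen for illustration
redundant_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > THRESHOLD
]
print(redundant_pairs)  # e.g. [('x_2', 'x_3', 1.0)] for the data above
```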

Algorithms

At this point it may seem quite simple: just remove one of the two features at random. In real life, however, you will rarely find variables that show exactly the same pattern or values across the dataset, so there are a few more things to consider before removing a variable, or even deciding whether two variables are highly correlated at all.

This is why there are specialized algorithms that decide for you which variables to keep and which to remove. We will talk about two such algorithms.

Boruta Feature Selection

The Boruta feature selection algorithm was first introduced as a package for R. It is a very useful algorithm that defines its own thresholds and returns the most relevant features from the provided dataset.

A complete explanation and implementation of Boruta can be found in a separate post; here is a summary of how it works.

Boruta shuffles each of the provided input feature columns separately and concatenates these shuffled copies (called shadow features) with the original data. A Random Forest classifier is then trained on this combined dataset, and it returns an importance score for every feature. Boruta sets the threshold at the strongest shadow feature.

Any real feature whose importance is lower than that of the most important shadow feature is dropped. Boruta has a Python package that does this for you. Below is a demonstration of how it works.

Now we load the dataset and clean it up a little: removing NaN values and converting categorical variables to a numerical representation.
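
A minimal sketch of this step follows; the file name healthcare-dataset-stroke-data.csv and the target column name stroke are assumptions based on the heart-stroke dataset shown below.

```python
import pandas as pd

# Assumed file name for the heart-stroke dataset.
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Remove rows containing NaN values.
df = df.dropna()

# Convert categorical (object) columns to integer codes.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Assumed target column; everything else is an input feature.
X = df.drop(columns=["stroke"])
y = df["stroke"]
```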

The final dataset looks like the one below.

Dataset after cleaning (heart-stroke dataset)

Now let’s run the Boruta algorithm.
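
A minimal sketch of such a run with the boruta Python package (BorutaPy), assuming the X and y built in the cleaning step above:

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# BorutaPy wraps a tree-based estimator; a Random Forest is the usual choice.
rf = RandomForestClassifier(n_jobs=-1, max_depth=5)

# n_estimators="auto" lets Boruta size the forest on each iteration.
selector = BorutaPy(rf, n_estimators="auto", random_state=42, verbose=2)

# BorutaPy expects NumPy arrays rather than DataFrames.
selector.fit(X.values, y.values)

# support_ is a boolean mask over the columns of X: the confirmed features.
print(X.columns[selector.support_].tolist())
```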

Result of the Boruta algorithm

So out of the 10 original features, Boruta believes that only the two features it returned are important enough to make any reasonable decision.

mRMR Feature Selection

mRMR stands for Maximum Relevance Minimum Redundancy. While Boruta simply looks for the most important features, mRMR makes sure that the selected features not only have minimal correlation amongst themselves but also correlate strongly with the output variable.

This algorithm was first introduced by Peng, Long, and Ding in their paper “Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy” (IEEE TPAMI, 2005).

mRMR works iteratively. It first asks you how many features you want to keep, and then, in every iteration, it picks the one feature that is most relevant to the output variable and least redundant with respect to the features already selected. Once a feature is picked, it is moved out of the candidate pool, and the next iteration begins, until K (the number of features we require) iterations are completed.
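
To make the loop concrete, here is a rough sketch of one common greedy variant, scoring each candidate by F-test relevance divided by its mean absolute correlation with the already-selected features. It only illustrates the idea; it is not the exact formulation from the paper or from the package used below.

```python
import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_sketch(X, y, K):
    # Relevance: F-statistic of each feature against the target.
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()

    selected, remaining = [], list(X.columns)
    for _ in range(K):
        if selected:
            # Redundancy: mean |correlation| with the features already picked.
            redundancy = corr.loc[remaining, selected].mean(axis=1)
        else:
            redundancy = pd.Series(1.0, index=remaining)
        best = (relevance[remaining] / redundancy).idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected
```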

I will explain the details of the algorithm in a separate post. For now, let’s look at its Python implementation.

Install the Python package using the following command.
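
The package is published on PyPI as mrmr_selection:

```
pip install mrmr_selection
```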

You can find the complete documentation for this package at its official GitHub repository.

The usage is quite straightforward.
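
Assuming the cleaned X (a pandas DataFrame) and y (a pandas Series) from earlier, the package’s mrmr_classif entry point does the selection in one call:

```python
from mrmr import mrmr_classif

# X: DataFrame of input features, y: Series holding the target.
selected_features = mrmr_classif(X=X, y=y, K=2)
print(selected_features)
```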

I have set K to 2 just to see whether the selected features match the ones returned by Boruta.

Features returned by mRMR with K=2

And yes, we get exactly the same features as the Boruta algorithm returned above. What makes mRMR flexible, however, is that if you believe two features are not enough to get a good result, you can choose to keep as many as you want.

Let’s carry out a few more runs.
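
These reuse the same call with larger values of K (again assuming the X and y from above):

```python
for k in (4, 6):
    print(k, mrmr_classif(X=X, y=y, K=k))
```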

Features returned by mRMR for K=4 and K=6

Conclusion

Feature selection is a lifesaver when you are low on memory resources and, at times, can even help improve the performance of your model. It is an essential step in building your machine learning model.
