Feature Selection Algorithms for Machine Learning
Choosing the right ones
Feature selection is an optional, yet important, preprocessing step for your Machine Learning model. It is common practice to feed a model data exactly as you receive it, but this is a common rookie mistake: data from almost any source is likely to contain errors. There are 2 important steps that you should carry out on this data before moving on to training the models.
- Data cleaning
- Feature filtering/selection
I have talked about data cleaning multiple times in other articles. If you would like to learn more about it, you can check the following blog post.
Applied Data Science with Python and Pandas - WritersByte
Data Science is a very important skill that has become a necessity of the 21st century. With the increase in data, the…
However, in this post, we will talk about selecting the important input features from your dataset, so that the complexity of the data is reduced with minimal impact on the outputs.
First, let's talk about how correlation amongst input variables affects a Machine Learning model.
Correlation between features
Suppose you have a dataset with the following input features and an output labeled Y, as shown in the diagram below.
Variables x_2 and x_3 have exactly the same values throughout the dataset. A Machine Learning model learns the patterns in the input variables that correspond to each output value. For variables that are highly correlated (x_2 and x_3 in the above diagram), one of them provides no additional information to the model: both follow the same pattern, so it is as if they were the same variable.
Since one of these variables adds no additional information to the model, we can remove that feature altogether. What’s the harm in keeping such a variable, you may ask?
- It increases the dimensionality of the data unnecessarily. This increases training times, and we may run into the curse of dimensionality.
- In most models, keeping the feature may not make the model any “worse”. However, highly correlated variables affect different models differently, and in some they can confuse the model and reduce performance.
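As a rough illustration of the idea above, here is a minimal sketch that flags one feature out of each highly correlated pair using pandas. The toy values, the column names, and the 0.95 threshold are assumptions made up for this example, not part of any dataset used in this post.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): x_3 is an exact copy of x_2.
df = pd.DataFrame({
    "x_1": [1.0, 4.0, 2.0, 8.0, 5.0],
    "x_2": [3.0, 1.0, 7.0, 2.0, 9.0],
})
df["x_3"] = df["x_2"]

# Absolute pairwise Pearson correlation between the input features.
corr = df.corr().abs()

# Keep only the entries above the diagonal so every pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed cut-off: treat anything above 0.95 as "highly correlated".
threshold = 0.95
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)
print("Remaining:", list(reduced.columns))
```

A fixed threshold like this is a blunt tool, because the right cut-off depends on the data and the model.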
If you find my work helpful, consider supporting me on Kofi.
At this point it may seem quite simple: just remove one of the 2 features at random. However, in real life you will rarely find variables that show exactly the same pattern or values across the dataset, so there are a few more things to consider before removing a variable, or even before deciding whether two variables are highly correlated.
This is why there are specialized algorithms that decide for you which variables to keep and which to remove. We will talk about 2 such algorithms.
Boruta Feature Selection
The Boruta feature selection algorithm was first introduced as a package for R. It is a very useful algorithm that defines its own thresholds and returns the most relevant features from the provided dataset.
A complete explanation and implementation of Boruta can be found here:
Boruta Feature Selection Explained in Python - WritersByte
This article aims to explain, the very popular, Boruta feature selection algorithm. Boruta automates the process of…
Boruta shuffles the provided input features (each feature column separately) and then concatenates these shuffled copies (called shadow features) with the original data. A Random Forest model is then trained on the combined dataset and returns a feature importance for every column, real and shadow. Boruta then sets its threshold at the importance of the strongest shadow feature.
Any real feature whose importance is lower than that of the most important shadow feature is dropped. Boruta has a Python package that runs this selection for you. Below is a demonstration of how it works.
# install the package
!pip install boruta

# import important libraries
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
import numpy as np
Now we load the dataset and clean it up a little, for example by removing NaN values and converting categorical variables to a numerical representation.
heart_data = pd.read_csv("healthcare-dataset-stroke-data.csv")

# converting to numeric: pd.factorize returns (codes, uniques), so keep only the codes
heart_data["gender"] = pd.factorize(heart_data["gender"])[0]
heart_data["ever_married"] = pd.factorize(heart_data["ever_married"])[0]
heart_data["work_type"] = pd.factorize(heart_data["work_type"])[0]
heart_data["Residence_type"] = pd.factorize(heart_data["Residence_type"])[0]
heart_data["smoking_status"] = pd.factorize(heart_data["smoking_status"])[0]
# additional cleaning: drop the id column and any rows that contain NaN values
heart_data.drop("id", axis=1, inplace=True)
heart_data.dropna(inplace=True)
heart_data.head()
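For readers unfamiliar with pd.factorize, here is a small sketch of what it returns. The values are made up for illustration and are not taken from the stroke dataset. The function yields a pair (integer codes, unique values), and only the codes are needed for a numeric column.

```python
import pandas as pd

# Made-up values for illustration; not taken from the stroke dataset.
gender = pd.Series(["Male", "Female", "Male", "Other"])

# factorize returns two things: one integer code per row,
# and the distinct values in order of first appearance.
codes, uniques = pd.factorize(gender)
print(codes)
print(uniques)

# Only the codes go back into the column.
encoded = pd.Series(codes, index=gender.index)
```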
The final dataset looks like the one below.
Now let’s run the Boruta algorithm.
X = heart_data.drop("stroke", axis=1)
y = heart_data["stroke"]

# we will use the random forest algorithm
forest = RandomForestRegressor(n_jobs=-1, max_depth=10)

# initialize boruta
boruta = BorutaPy(estimator=forest, n_estimators='auto', max_iter=50)

# Boruta accepts np.array
boruta.fit(np.array(X), np.array(y))

# get results
green_area = X.columns[boruta.support_].to_list()
blue_area = X.columns[boruta.support_weak_].to_list()
print('Selected Features:', green_area)
print('Blue area features:', blue_area)
So, out of the 10 original features, Boruta considers only the 2 returned features important enough to make a reasonable decision.
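To make the shadow-feature idea concrete, here is a minimal sketch of a single round of the procedure described earlier, written by hand with scikit-learn. The synthetic data, the column names, and the forest settings are assumptions for illustration. As far as I know, the boruta package repeats this over many iterations and applies a statistical test to the results, which this sketch leaves out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): only "signal" drives the label.
n = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = (X["signal"] > 0).astype(int)

# Shadow features: every column shuffled independently, which breaks
# any relationship between that column and the label.
shadows = pd.DataFrame({
    f"shadow_{col}": rng.permutation(X[col].to_numpy()) for col in X.columns
})

# Train one forest on the real and shadow columns together.
combined = pd.concat([X, shadows], axis=1)
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(combined, y)

importances = pd.Series(forest.feature_importances_, index=combined.columns)

# The threshold is the importance of the strongest shadow feature.
threshold = importances[list(shadows.columns)].max()
kept = [col for col in X.columns if importances[col] > threshold]
print("Kept after one round:", kept)
```

A noise column can beat the best shadow feature by chance in a single round, which is one reason repeating the procedure matters.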
mRMR Feature Selection
mRMR stands for Maximum Relevance Minimum Redundancy. While Boruta searches for the features that are most important to the model, mRMR selects features that are both highly correlated with the output variable and minimally correlated with one another.
This algorithm was first introduced in the following paper.
mRMR works iteratively. You first tell it how many features you want to keep (K). In every iteration it selects the one feature that is most relevant to the output variable and least redundant with the features already selected. Once a feature is selected, it is removed from the pool of candidates, and the next iteration begins, until K iterations are completed.
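The greedy loop described above can be sketched roughly as follows. This is not the package's implementation: the function name greedy_mrmr is hypothetical, and the sketch assumes absolute Pearson correlation for both relevance and redundancy, combined as a simple difference. The mrmr_selection package uses its own scoring, which may differ.

```python
import numpy as np
import pandas as pd

def greedy_mrmr(X, y, k):
    """Hypothetical helper: pick k columns by relevance minus redundancy."""
    relevance = X.corrwith(y).abs()      # correlation of each feature with y
    feature_corr = X.corr().abs()        # correlation between features
    selected = []
    candidates = list(X.columns)
    for _ in range(k):
        scores = {}
        for col in candidates:
            # Redundancy: mean correlation with the features already chosen.
            redundancy = feature_corr.loc[col, selected].mean() if selected else 0.0
            scores[col] = relevance[col] - redundancy
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)          # remove from the candidate pool
    return selected

# Synthetic data (made up for illustration): "a_noisy" nearly duplicates "a".
rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
b = rng.normal(size=n)
X = pd.DataFrame({"a": a, "a_noisy": a + 0.5 * rng.normal(size=n), "b": b})
y = pd.Series(a + 0.5 * b + 0.1 * rng.normal(size=n))

top_2 = greedy_mrmr(X, y, k=2)
print(top_2)
```

Ranking by relevance alone would pick "a" and "a_noisy" here. The redundancy penalty is what lets the weaker but independent feature "b" take the second slot.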
I will explain the details of the algorithm in a separate post. For now, let’s look at its Python implementation.
Install the Python package using the following command:
!pip install mrmr_selection
You can find the complete documentation for this package at its official GitHub repository here.
The usage is quite straightforward.
from mrmr import mrmr_classif
selected_features = mrmr_classif(X=X, y=y, K=2)
I have set K to 2 just to see whether the selected features match what Boruta returned.
And yes, we get exactly the same features as we got from the Boruta algorithm above. However, what makes mRMR flexible is that if you believe 2 features might not be enough for a good result, you can choose to keep as many as you want.
Let’s carry out a few more runs.
# top 4 features
top_4 = mrmr_classif(X=X, y=y, K=4)
# top 6 features
top_6 = mrmr_classif(X=X, y=y, K=6)

print("Best 4 features:", top_4)
print("Best 6 features:", top_6)
Feature selection is a lifesaver when you are low on memory resources and, at times, can even help improve the performance of your model. It is an important step in the process of building your machine learning model.