Feature Selection Algorithms for Machine Learning

Choosing the right ones

Feature Selection

Correlation between features

Data set with redundant features
Reduced data set with only important features
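To make the idea of redundant features concrete, here is a minimal sketch of a correlation filter. It is an illustration only: the `drop_correlated` helper, the 0.9 threshold, and the toy columns are assumptions made for this example and are not part of the original workflow.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose |Pearson correlation| exceeds threshold.

    Hypothetical helper written for this illustration.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (made up): "b" is an almost exact linear copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

reduced = drop_correlated(toy, threshold=0.9)
print(reduced.columns.tolist())  # prints ['a', 'c']
```

A filter like this only looks at pairs of features and ignores the target, which is one reason to reach for methods such as Boruta or mRMR.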
Boruta Feature Selection

# install the package
!pip install boruta
# import important libraries
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
import numpy as np
#load data
heart_data = pd.read_csv("healthcare-dataset-stroke-data.csv")
# converting to numeric
heart_data["gender"] = pd.factorize(heart_data["gender"])[0]
heart_data["ever_married"] = pd.factorize(heart_data["ever_married"])[0]
heart_data["work_type"] = pd.factorize(heart_data["work_type"])[0]
heart_data["Residence_type"] = pd.factorize(heart_data["Residence_type"])[0]
heart_data["smoking_status"] = pd.factorize(heart_data["smoking_status"])[0]
# additional cleaning
heart_data.dropna(inplace=True)
heart_data.drop("id", axis=1, inplace=True)
Dataset after cleaning
Heart Stroke dataset
X = heart_data.drop("stroke", axis = 1)
y = heart_data["stroke"]
# we will use the random forest algorithm
forest = RandomForestRegressor(n_jobs=-1, max_depth=10)
# initialize Boruta
boruta = BorutaPy(estimator=forest, n_estimators='auto', max_iter=50)
# Boruta accepts np.array, so convert X and y before fitting
boruta.fit(np.array(X), np.array(y))
# get results
green_area = X.columns[boruta.support_].to_list()
blue_area = X.columns[boruta.support_weak_].to_list()
print('Selected Features:', green_area)
print('Blue area features:', blue_area)
Result of the Boruta Algorithm
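The core idea behind Boruta is to compare each real feature against shuffled "shadow" copies. The sketch below shows a single round of that comparison on made-up data. It is a simplified illustration, not BorutaPy's implementation: the real algorithm repeats this over many iterations and applies a statistical test to the hit counts. The toy data, the variable names, and the choice of `RandomForestClassifier` are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Toy data (made up): column 0 determines the label, columns 1 and 2 are noise.
n = 500
X_toy = rng.normal(size=(n, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Shadow features: shuffle every column independently, which breaks any link with y.
X_shadow = np.column_stack(
    [rng.permutation(X_toy[:, j]) for j in range(X_toy.shape[1])]
)
X_all = np.hstack([X_toy, X_shadow])

model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
model.fit(X_all, y_toy)

n_real = X_toy.shape[1]
real_importance = model.feature_importances_[:n_real]
shadow_max = model.feature_importances_[n_real:].max()

# A real feature scores a "hit" in this round if it beats the best shadow feature.
hits = real_importance > shadow_max
print("Importances of real features:", real_importance.round(3))
print("Best shadow importance:", round(float(shadow_max), 3))
print("Hits this round:", hits)
```

Features that collect hits consistently across rounds correspond to the confirmed ("green area") features reported by `support_`, while borderline ones end up in `support_weak_`.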
mRMR Feature Selection

!pip install mrmr_selection
from mrmr import mrmr_classif
selected_features = mrmr_classif(X=X, y=y, K=2)
Features returned by mRMR with K=2
# top 4 features
top_4 = mrmr_classif(X=X, y=y, K=4)
# top 6 features
top_6 = mrmr_classif(X=X, y=y, K=6)
print("Best 4 features:", top_4)
print("Best 6 features:", top_6)
Features returned by mRMR for K=4 and K=6
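To show what "minimum redundancy, maximum relevance" means in practice, here is a minimal greedy sketch. It assumes relevance is measured with the ANOVA F-statistic and redundancy with the mean absolute Pearson correlation to the features already chosen, combined as a quotient. The `mrmr_sketch` function, the redundancy floor of 0.1, and the toy data are assumptions for this illustration and may differ from what `mrmr_classif` does internally.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def mrmr_sketch(X: pd.DataFrame, y: pd.Series, k: int, floor: float = 0.1) -> list:
    """Greedy mRMR sketch: score = relevance / redundancy (hypothetical helper)."""
    # Relevance: F-statistic of each feature against the class label.
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()
    selected = []
    remaining = list(X.columns)
    for _ in range(min(k, len(remaining))):
        if selected:
            # Redundancy: mean |correlation| with the features picked so far,
            # floored so near-zero correlations do not blow up the quotient.
            redundancy = corr.loc[remaining, selected].mean(axis=1).clip(lower=floor)
        else:
            redundancy = pd.Series(1.0, index=remaining)
        score = relevance[remaining] / redundancy
        best = score.idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): "signal_copy" duplicates "signal", "other" carries extra information.
rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)
other = rng.normal(size=n)
X_demo = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.05, size=n),
    "other": other,
    "noise": rng.normal(size=n),
})
y_demo = pd.Series((signal + 0.7 * other > 0).astype(int))

top2 = mrmr_sketch(X_demo, y_demo, k=2)
print("Top 2 features:", top2)
```

Ranking by relevance alone would place `signal` and `signal_copy` in the top two. The redundancy penalty is what pushes `other` ahead of the near-duplicate column.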
