Boruta Feature Selection Explained in Python
Implementation and explanation from scratch
This article aims to explain the very popular Boruta feature selection algorithm. Boruta automates the process of feature selection: it determines the importance thresholds automatically and returns the features that are most meaningful in your dataset. Boruta works on the “all-relevant” principle, meaning it gives you ALL the features that are relevant to your Machine Learning problem.
Need for Feature Selection?
Datasets can contain features that are completely irrelevant to your problem. These features increase the size of your dataset, add complexity to the model, and either have no impact on the output or actively worsen the results. It is important to identify these features and remove them before moving on to the training stage.
You can find a little more detail on Feature selection in the following article.
Boruta Algorithm
This algorithm was first introduced as a package for R. It comprises the following steps:
1. Create copies of the original features by randomly shuffling them (these are the Shadow Features) and concatenate these shadow features to the original dataset.
2. Train a Random Forest Classifier on this new dataset.
3. Check the feature importance of the highest-rated shadow feature.
4. All original features that are more important than the most important shadow feature are the ones that we want to keep.
5. Repeat steps 2 to 4 for some iterations (20 is a reasonable number) and keep track of the features that appear as important in every iteration.
6. Use a binomial distribution to finalize which features provide enough confidence to be kept in the final list.
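To make the loop concrete before we build it step by step, here is a minimal sketch of the procedure above. It is only an illustration, not the implementation we develop below; the function name `boruta_sketch`, the `shadow_` prefix, and the default of 20 iterations are my own illustrative choices.
# minimal sketch of the Boruta loop described above (illustrative only)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def boruta_sketch(X, y, n_iter=20):
    # 1. shadow features = shuffled copies of the originals, concatenated to X
    shadows = X.apply(lambda col: np.random.permutation(col.values))
    shadows.columns = [f"shadow_{c}" for c in X.columns]
    X_boruta = pd.concat([X, shadows], axis=1)

    hits = {col: 0 for col in X.columns}
    for _ in range(n_iter):
        # 2. train a Random Forest on originals + shadows
        rf = RandomForestClassifier().fit(X_boruta, y)
        imp = dict(zip(X_boruta.columns, rf.feature_importances_))
        # 3. importance of the best shadow feature
        best_shadow = max(v for k, v in imp.items() if k.startswith("shadow_"))
        # 4. count a hit for every original feature that beats it
        for col in X.columns:
            if imp[col] > best_shadow:
                hits[col] += 1
    # 5./6. the hit counts are then judged against a binomial distribution
    return hits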
Before moving on, if you are finding this article helpful, do consider supporting me on Ko-Fi.
Implementation
You can find the complete code in the repository: Boruta Feature Selection
Before running any algorithm, we obviously need some data to perform feature selection on. For this purpose, we will use the same dataset that we used in the last article about feature selection.
This dataset can be found at the following link on Kaggle.
Load and process data
# important libraries
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import scipy as sp
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
We will use all of the above libraries.
data = pd.read_csv("healthcare-dataset-stroke-data.csv")
data.head()
A lot of useless data, nothing we haven’t seen before.
Time for a cleanup. For a more detailed look at data cleaning and processing, refer to the following post.
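Before converting anything, it can help to see which columns are non-numeric and where the missing values sit. This quick check is an optional addition of mine, not part of the original walkthrough:
# optional: inspect column types and missing values before cleaning
print(data.dtypes)        # object columns are the ones we will factorize
print(data.isna().sum())  # shows which columns contain missing values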
# converting to numeric
data["gender"] = pd.factorize(data["gender"])[0]
data["ever_married"] = pd.factorize(data["ever_married"])[0]
data["work_type"] = pd.factorize(data["work_type"])[0]
data["Residence_type"] = pd.factorize(data["Residence_type"])[0]
data["smoking_status"] = pd.factorize(data["smoking_status"])[0]

# additional cleaning
data.dropna(inplace=True)
data.drop("id", axis=1, inplace=True)
data.reset_index(inplace=True, drop=True)
data.head()
All freshened up.
# separate input and output variables
X = data.drop("stroke", axis=1)
y = data["stroke"]
Separating inputs and outputs.
1. Creating Shadow features
For this, we just need to shuffle the original features and concatenate them to the original dataset.
# create a shuffled (shadow) copy of every original feature and append it to X
for col in X.columns:
    X[f"shadow_{col}"] = X[col].sample(frac=1).reset_index(drop=True)
2. Calculate Importance
def get_important_features(X, y):
    # Initialize Random Forest Classifier
    rf = RandomForestClassifier(max_depth=20)
    # Fit Random Forest on provided data
    rf.fit(X, y)
    # Create dictionary of feature importances
    importances = {feature_name: f_importance for feature_name, f_importance in zip(X.columns, rf.feature_importances_)}
    # Isolate importances of shadow features
    only_shadow_feat_importance = {key: value for key, value in importances.items() if "shadow" in key}
    # Get importance level of the most important shadow feature
    highest_shadow_feature = list(dict(sorted(only_shadow_feat_importance.items(), key=lambda item: item[1], reverse=True)).values())[0]
    # Get original features which fulfill the Boruta selection criterion
    selected_features = [key for key, value in importances.items() if value > highest_shadow_feature]
    return selected_features
This function trains a Random Forest classifier on our stroke dataset. The classifier exposes the importance it assigns to each feature in the attribute `feature_importances_`.
We then create a dictionary of each feature along with its importance and single out the most important shadow feature.
Finally, it returns a list of all original features whose importance score is greater than that of the singled-out shadow feature.
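To see what a single trial looks like, you can call the function once; the exact list will vary from run to run because the Random Forest itself is random:
# one trial; the returned list will differ slightly between runs
print(get_important_features(X, y))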
Now since one trial isn’t enough, we need to run multiple trials to make sure we get satisfactory results.
Multiple Trials
TRIALS = 50
feature_hits = {i: 0 for i in data.columns}

for _ in tqdm(range(TRIALS)):
    imp_features = get_important_features(X, y)
    for key, _ in feature_hits.items():
        if key in imp_features:
            feature_hits[key] += 1

print(feature_hits)
The results of our 50 runs are as follows
{'gender': 0, 'age': 50, 'hypertension': 0, 'heart_disease': 0, 'ever_married': 0, 'work_type': 0, 'Residence_type': 0, 'avg_glucose_level': 50, 'bmi': 1, 'smoking_status': 0, 'stroke': 0}
Age and avg_glucose_level come out as important in all 50 trials, while bmi comes out as important in just 1 trial. To decide whether appearing important in only 1 trial is enough to keep bmi, we will use a binomial distribution.
Binomial Distribution
The following line of code returns us the probabilities according to a binomial distribution.
# Calculate the probability mass function
pmf = [sp.stats.binom.pmf(x, TRIALS, .5) for x in range(TRIALS + 1)]
The reasoning is that a feature which is no better than random noise should beat the best shadow feature in any single trial with a probability of about 0.5, so its number of hits over 50 trials should follow a binomial distribution with p = 0.5. This distribution has a bell-shaped curve, and we treat roughly 5% of the probability mass at each end as a tail.
First, we need a function that gives us the number of iterations that form the tail.
# trials_in_green_zone
def get_tail_items(pmf):
    total = 0
    for i, x in enumerate(pmf):
        total += x
        if total >= 0.05:
            break
    return i
The rules are simple. If a feature's number of hits falls in the right tail, it is in the green zone (features that must be kept). If it falls under the middle of the bell shape, it is in the blue zone (features that can be experimented with). And if it falls in the left tail, it is in the red zone (features that should be dropped).
Let’s visualize the distribution we created.
# plot the binomial distribution
plt.plot([i for i in range(TRIALS + 1)], pmf, "-o")
plt.title(f"Binomial distribution for {TRIALS} trials")
plt.xlabel("No. of trials")
plt.ylabel("Probability")
plt.grid(True)
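If you want to see where the zones fall on this curve, you can mark the tail boundaries on the same plot. This is an optional addition of mine, not part of the original walkthrough; add these lines to the same plotting cell, and note that they assume `get_tail_items` and `pmf` from above:
# optional: mark the zone boundaries on the distribution plot
thresh = get_tail_items(pmf)  # left-tail cut-off
plt.axvline(thresh, color="red", linestyle="--", label="red/blue boundary")
plt.axvline(TRIALS - thresh, color="green", linestyle="--", label="blue/green boundary")
plt.legend()
plt.show()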
Final Selection
Now we just need to code the rules which we discussed above, about deciding which features fall into the Green, Blue, and Red zone.
# select features from n number of trials
def choose_features(feature_hits, TRIALS, thresh):
    # define boundaries
    green_zone_thresh = TRIALS - thresh
    blue_zone_upper = green_zone_thresh
    blue_zone_lower = thresh

    green_zone = [key for key, value in feature_hits.items() if value >= green_zone_thresh]
    blue_zone = [key for key, value in feature_hits.items() if (value >= blue_zone_lower and value < blue_zone_upper)]
    return green_zone, blue_zone
Now run the above functions in the following order
thresh = get_tail_items(pmf)
green, blue = choose_features(feature_hits, TRIALS, thresh)
green, blue
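The function above only returns the green and blue zones; by the rules described earlier, everything else is in the red zone. If you also want that list explicitly, it is just the complement (an optional addition of mine, not in the original code):
# optional: features in the red zone, i.e. hit counts that fall in the left tail
# (this will also list the target column 'stroke', since feature_hits was built from data.columns)
red_zone = [key for key, value in feature_hits.items() if value < thresh]
print(red_zone)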
As we can see, these are exactly the features that we got when we ran the Python implementation of Boruta in the other article.
If you enjoyed this, do check out my other blog posts at:
Also, don’t forget to buy me a Kofi.