Exploring Recursive Feature Elimination

Shraddha Anala · Published in The Startup · Jun 19, 2020

Python Code to automatically identify relevant feature attributes.

I’m expanding with more posts on ML concepts + tutorials over at my blog!

November 2022 Update: I’d originally written this article about 2 years ago when I was just starting out in Machine Learning. I’m now refreshing my basics again and have discovered that some of the information I’ve shared previously is incomplete and inaccurate, so I’m updating those sections to reflect the concepts more accurately, as well as leaving links to better resources at the end of the article. Thanks for reading!

Hello and welcome to another Machine Learning tutorial in my random dataset series. In this article, I will be talking about using Recursive Feature Elimination to select only the important features (or columns) for training our model.

Photo by Markus Spiske on Unsplash

The main motivation behind feature selection algorithms is to use fewer features, in an attempt to reduce model complexity and feature redundancy and to improve computational performance.

Some models are sensitive to outliers and to features that do not encode useful information about the target variable. Others, such as linear regression and logistic regression, are sensitive to correlated (multicollinear) features within the dataset. The aim of feature selection is therefore to keep only the necessary features, save on computation, and avoid introducing these kinds of errors.

The dataset we will be using is the Human Activity Recognition dataset from the UCI Machine Learning Repository.

Acknowledgements:

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium, 24–26 April 2013.

About the Dataset:

The Human Activity Recognition dataset consists of sensor readings for human activities such as walking, sitting, and standing, performed by a group of volunteers. Using the accelerometer and gyroscope embedded in a Samsung Galaxy S II, the researchers captured 561 features containing information such as 3-axial linear acceleration and angular velocity, along with derived statistics like their min, max, mean, and standard deviation.

As you can imagine, having 561 features to describe one activity is a lot of information, and it consumes a lot of computing resources. Not to mention the amount of noise present in the dataset.

It is important for us to select only the relevant features and, in the process, reduce overfitting, eliminate redundant features, and make subsequent analysis easier.

Recursive Feature Elimination —

Scikit-learn provides a Recursive Feature Elimination (RFE) class to choose the columns that have the most impact on the prediction of the target variable.

In our case, our target variables are the six activities performed by the volunteers and using RFE we can choose which of the columns, such as tBodyAcc-mean()-X or tGravityAccMag-max(), are important.

Recursive Feature Elimination is a wrapper-type feature selection algorithm that requires the user to specify the number of features to keep, as well as a machine learning model. RFE works by fitting that model on all of the features, ranking the features by their importance scores, and removing the least important ones. The model is then re-trained on the reduced feature set, the features are ranked again, and the next least important features are removed. This process continues until only the specified number of features remains.
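To make the procedure concrete, here is a minimal, purely illustrative sketch of that loop in plain Python (in practice we use scikit-learn’s ready-made RFE class, shown later in the tutorial). It assumes X is a NumPy array and that the model exposes either feature_importances_ or coef_; the function name is hypothetical.

```python
import numpy as np
from sklearn.base import clone

def recursive_feature_elimination(model, X, y, n_features_to_keep):
    """Illustrative RFE loop: repeatedly drop the least important feature."""
    remaining = list(range(X.shape[1]))          # start with every feature index
    while len(remaining) > n_features_to_keep:
        fitted = clone(model).fit(X[:, remaining], y)
        # Rank the surviving features with the model's own importance scores.
        if hasattr(fitted, "feature_importances_"):
            importances = fitted.feature_importances_
        else:
            importances = np.abs(fitted.coef_).sum(axis=0)  # e.g. linear models
        weakest = int(np.argmin(importances))
        remaining.pop(weakest)                   # eliminate the weakest feature
    return remaining                             # indices of the kept features
```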

Some models benefit more from RFE than others. Models like random forests and decision trees are well suited to RFE because they generate their own feature importance scores, which RFE uses to decide what to eliminate. For models that do not have an internal way to rank features, there are general, model-agnostic methods to compute importance scores, such as mean absolute SHAP values and permutation importance.
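For instance, permutation importance is available directly in scikit-learn. The snippet below is just a hedged sketch of that API on toy data, not part of this tutorial’s pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

# Toy data, purely to show the shape of the API.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
model = SVC().fit(X_demo, y_demo)

# Shuffle each feature in turn and measure the drop in the model's score.
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)
print(result.importances_mean)  # higher = more important to the model
```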

This highlights an important shortcoming of the method: RFE depends on a reliable feature importance measure, supplied either by the model itself or by an external technique, and the user must also decide up front how many features to keep.

Let’s jump straight into the tutorial now.

1) Building the DataFrame —

The dataset consists of training, testing, and attribute information files. As such, we will have to spend a little time arranging columns and extracting the column and activity labels from the respective text files.

The training and testing datasets are available in .txt format. We will have to convert them into CSV files manually in Excel before loading them into a Pandas DataFrame. After saving the CSV files, we will use Python to extract the 561 column labels from the features.txt file and the 6 activity labels from the activity_labels.txt file provided in the download folder.

Add Column names, Activity labels to your DataFrame.
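The gist above is not reproduced here, but a minimal sketch of this step might look like the following. The file paths are illustrative and assume the UCI download has been extracted into a local “UCI HAR Dataset” folder; note that pandas can also read the whitespace-separated .txt files directly, so the manual CSV conversion is optional.

```python
import pandas as pd

base = "UCI HAR Dataset"  # illustrative path to the extracted dataset

# features.txt holds "<index> <name>" per line -> the 561 column labels.
feature_names = pd.read_csv(f"{base}/features.txt", sep=r"\s+", header=None)[1].tolist()

# activity_labels.txt maps activity ids (1-6) to names like WALKING, SITTING, ...
labels = pd.read_csv(f"{base}/activity_labels.txt", sep=r"\s+", header=None)
activity_map = dict(zip(labels[0], labels[1]))

# The sensor readings are whitespace-separated, so pandas can load them directly.
X_train = pd.read_csv(f"{base}/train/X_train.txt", sep=r"\s+", header=None)
X_train.columns = feature_names
y_train = pd.read_csv(f"{base}/train/y_train.txt", header=None)[0].map(activity_map)

X_test = pd.read_csv(f"{base}/test/X_test.txt", sep=r"\s+", header=None)
X_test.columns = feature_names
y_test = pd.read_csv(f"{base}/test/y_test.txt", header=None)[0].map(activity_map)
```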

2) Recursive Feature Elimination —

Scikit-learn’s implementation of Recursive Feature Elimination can be used here; we have to specify how many features to retain, along with a machine learning model that is fitted on the training data to find the optimal subset of features.

After this, we will compute the performance of the model on the entire feature set, as well as the reduced feature set to see if RFE has done us any good.

Find optimal features and use them to train & evaluate model.
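The embedded code for this step is not shown here, so below is a hedged sketch of how it might look with scikit-learn’s RFE class, reusing X_train, X_test and y_train from step 1. The decision tree is just one possible ranking estimator, and the variable names (X, X_t, retained_columns) mirror the ones discussed below.

```python
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Any estimator exposing coef_ or feature_importances_ can drive RFE;
# a decision tree is used here purely for illustration.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=42),
          n_features_to_select=100)
rfe.fit(X_train, y_train)

# Reduce both splits to the 100 selected columns.
X = rfe.transform(X_train)
X_t = rfe.transform(X_test)

# Names of the columns RFE kept.
retained_columns = [col for col, kept in zip(X_train.columns, rfe.support_) if kept]
print(len(retained_columns))
```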

X and X_t, the transformed training, testing (alliteration alert! Yes I deliberately used a comma instead of ‘and’ for this exact purpose.) subsets now contain only 100 columns as specified in the n_features_to_select parameter.

The retained_columns variable is a list of the chosen columns and below is a screenshot displaying the first 21 out of the 100 retained columns.

Screenshot of 21 chosen columns. Image by the author.

3) Classification Model & Metrics —

The next step is to fit a classification model with the selected features. I have used the Support Vector Classifier for this problem, and it gives a very good accuracy of around 95%.
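A rough sketch of this step is shown below; it reuses X, X_t, y_train and y_test from the earlier steps, and the exact accuracy will vary with the ranking estimator and random state used upstream (the ~95% figure is from my run).

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svc = SVC()                # default RBF-kernel support vector classifier
svc.fit(X, y_train)        # train on the 100 RFE-selected features

predictions = svc.predict(X_t)
print("Accuracy with 100 RFE features:", accuracy_score(y_test, predictions))
```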

Now you might be wondering what the accuracy would look like if all 561 columns were used instead of the 100 columns chosen through RFE.

The model trained and tested on the entire set of 561 features showed an accuracy of around 96%.

Model Accuracy without RFE.

While the model trained and tested on the 100 most important features showed an accuracy of 95%.

There is very little difference in accuracy between the two feature sets, illustrating that in this case RFE has been effective at eliminating a large number of irrelevant features while preserving model performance.

With Recursive Feature Elimination, we attempt to reduce overfitting and noise by removing information that is redundant to the model. Computation is also faster, resulting in quicker training times.

Now let’s take a look at some other evaluation metrics to explore our model’s performance.

Classification Report of the SVC Model. Image by the author.

Precision, recall, and F1 score are additional metrics that help analyze how well the model has fared.
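A report like the one above can be produced with scikit-learn’s classification_report, assuming the predictions variable from the previous step:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for the six activities.
print(classification_report(y_test, predictions))
```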

As can be seen, the support vector classifier is a good fit for this particular problem and it gives an accurate prediction of the type of activity a volunteer is performing.

So that’s it! I hope you liked this tutorial introducing the concept of feature selection (by elimination) and found it useful to see how RFE can help in use cases where a large number of features needs to be narrowed down to a few.

Here are some resources that go more in-depth and also highlight some of the shortcomings of RFE. No tool or method is completely right all the time, and analysing a method’s strengths as well as its shortcomings is an important skill to develop as a Data Scientist.

You can check out other such interesting concepts in my series here and feel free to take a look at my GitHub Repo.

Thank you so much for reading. Happy Machine Learning!
