Evolutionary Feature Selection for Machine Learning
Learn how to perform feature selection for machine learning models using evolutionary algorithms
In a previous post, I discussed why we should use optimization-based algorithms to fine-tune hyperparameters for machine learning models. The same reasoning applies to feature selection; you can check that post here if you’re interested.
This post will show how to use sklearn-genetic-opt to quickly find the features that maximize cross-validation scores while removing non-informative variables.
In general, brute-force approaches are not a good way to optimize a model. In the case of feature selection, this means methods like forward selection or backward elimination, which can only vary one feature at a time and tend to struggle to capture how different subsets of features (of the same size) work together.
Even though we don’t have to code this from scratch, since packages like sklearn-genetic-opt already implement it, I want to explain the general idea of how we can use evolutionary algorithms to find the features. If you are not familiar with this kind of algorithm, you can check my other post, where I explain step by step how they work and how to use them for hyperparameter tuning; or, if you want to go straight to the code, you can skip to the next section.
1. Model Representation:
We can model the features as follows:
- Each individual of the population represents one candidate subset of features.
- Each gene of the individual represents one particular feature.
- Each gene’s value can be 0 or 1; zero means the algorithm did not select the feature, and one means the feature is included.
- Mutation flips the bit value at a randomly selected position, with a given mutation probability.
In this example chromosome, there are ten features (genes); the algorithm selected features 1, 3, 5, and 10, while the others are not included.
Then, with a low enough probability, a random mutation may happen: now feature six is also included, while feature three is removed. This helps the model avoid getting stuck in local optima and explore vaster regions (subsets of features) without checking all possible combinations of features.
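A minimal sketch of this representation and of bit-flip mutation (not the package’s internal implementation, just the idea):

import random

# One individual: a binary mask over the ten features of the example above.
# 1 = feature included, 0 = feature excluded; here features 1, 3, 5, and 10 are selected.
individual = [1, 0, 1, 0, 1, 0, 0, 0, 0, 1]

def mutate(individual, mutation_probability=0.1):
    # Flip each bit independently with a small probability, so the search
    # can reach new subsets of features without enumerating them all.
    return [1 - gene if random.random() < mutation_probability else gene
            for gene in individual]

print(mutate(individual))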
With this representation, we can now apply the regular evolutionary operators to find new solutions. These new solutions optimize the cross-validation score while minimizing the number of features; we achieve this by using a multi-objective fitness function.
2. Python Code:
For this experiment, I’m going to use a classification dataset, but I’m also going to add random noise as new “garbage features” that are not useful for the model and only add complexity. I expect the model to remove them, and possibly some of the originals. Hence, the first step is to import the data and create these new features:
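A sketch of this step, assuming the classification dataset is scikit-learn’s iris data (which has four informative features); the noise range and random seed are illustrative:

import numpy as np
from sklearn.datasets import load_iris

# Load a small classification dataset with four informative features.
X, y = load_iris(return_X_y=True)

# Append five columns of uniform random noise ("garbage features")
# that carry no information about the target.
rng = np.random.default_rng(seed=0)
noise = rng.uniform(0, 10, size=(X.shape[0], 5))
X = np.hstack([X, noise])

print(X.shape)  # (150, 9): four original features plus five dummies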
From the previous code, you can see there are nine features: four originals and five dummies. We can plot them to check how they are related to the “y” variable, which we want to predict. Each color represents one of the categories.
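One way to produce such a plot (a simple sketch; the layout and styling are my own choices, not necessarily the original figure):

import matplotlib.pyplot as plt

# Scatter each of the nine features against the target, colored by class,
# to eyeball which ones separate the categories.
fig, axes = plt.subplots(3, 3, figsize=(12, 9))
for i, ax in enumerate(axes.ravel()):
    ax.scatter(X[:, i], y, c=y, s=15)
    ax.set_title(f"feature {i + 1}")
plt.tight_layout()
plt.show()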
We can see that the original features help to discriminate the observations of each class, with a boundary that separates them, while the new features (dummies) don’t add value since they cannot “split” the data per category, just as expected.
Now, we will split the data into train and test sets and import the base model we want to use to select the features; in this case, a decision tree.
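A sketch of this step (the split size and random seed are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a test set and define the base estimator whose
# cross-validation score the feature search will optimize.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)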
As a next step, let’s import and fit the feature selection model. As mentioned, it uses evolutionary algorithms to select the features, with a multi-objective function that optimizes the cross-validation score while also minimizing the number of features used.
Make sure to install sklearn-genetic-opt
pip install sklearn-genetic-opt
We are setting the model to choose the features based on cross-validation accuracy, but any scikit-learn metric is available; we also set the cv strategy to stratified k-fold.
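A minimal sketch of the fit using sklearn-genetic-opt’s GAFeatureSelectionCV; the population size, number of generations, and random seeds below are illustrative values, not necessarily the ones used in the run described here:

from sklearn.model_selection import StratifiedKFold
from sklearn_genetic import GAFeatureSelectionCV

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

evolved_estimator = GAFeatureSelectionCV(
    estimator=clf,
    cv=cv,
    scoring="accuracy",   # any scikit-learn scorer name works here
    population_size=30,   # illustrative values; adjust to your problem
    generations=20,
    n_jobs=-1,
    verbose=True,
)

evolved_estimator.fit(X_train, y_train)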
When you start fitting the model, a log is displayed on the screen so you can follow the optimization progress; here is a sample of it. In this case, the “fitness” is equivalent to the average cross-validation score across the generation (row); we also get its standard deviation, maximum, and minimum values.
Once the model is done, we can check which variables it chose by using the best_features_ property; it returns an array of booleans, where True means the feature at that index was selected.
evolved_estimator.best_features_
[False  True  True  True False False False False False]
In this particular run, the model got around 97% accuracy. It chose features 2, 3, and 4, which are part of the original, informative features; it discarded all the dummy features as well as the first original one.
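A quick way to double-check the selected subset on the held-out data is to keep only the chosen columns and score the base tree on them (a sanity check, not necessarily how the accuracy above was computed):

from sklearn.metrics import accuracy_score

# Keep only the selected columns and evaluate on the test set.
features = evolved_estimator.best_features_
clf.fit(X_train[:, features], y_train)
y_pred = clf.predict(X_test[:, features])
print(accuracy_score(y_test, y_pred))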
If you want to learn more about this package, let me know; you can also check the documentation and source code here: