In machine learning, Feature Selection is the process of choosing the features that are most useful for your prediction. Although it sounds simple, it is one of the most complex problems in the work of creating a new machine learning model.
In this post, I will share with you some of the approaches that were researched during the last project I led at Fiverr.
You will get some ideas on the basic method I tried and also the more complex approach, which got the best results — removing over 60% of the features, while maintaining accuracy and achieving more stability for our model. I’ll also be sharing our improvement to this algorithm.
Why is it SO IMPORTANT to do Feature Selection?
If you build a machine learning model, you know how hard it is to identify which features are important and which are just noise.
Removing the noisy features will help with memory, computational cost and the accuracy of your model.
Also, by removing features you will help avoid the overfitting of your model.
Sometimes, you have a feature that makes business sense, but it doesn’t mean that this feature will help you with your prediction.
You need to remember that a feature can be useful in one algorithm (say, a decision tree) but carry little signal in another (like a regression model); not all features are born alike :)
Irrelevant or partially relevant features can negatively impact model performance. Feature Selection and Data Cleaning should be the first and most important step in designing your model.
Feature Selection Methods:
There are many techniques for Feature Selection, such as backward elimination and lasso regression. In this post, I will share 3 methods that I have found most useful for doing better Feature Selection; each method has its own advantages.
“All But X”
The name “All But X” was given to this technique at Fiverr. This technique is simple, but useful.
- You run your train and evaluation in iterations
- In each iteration, you remove a single feature.
If you have a large number of features, you can remove a “family” of features instead. At Fiverr, we usually aggregate features over different time windows: 30-day clicks, 60-day clicks, etc. This is a family of features.
- Check your evaluation metrics against the baseline.
The goal of this technique is to see which feature families don’t affect the evaluation, or whether removing one even improves it.
The problem with this method is that by removing one feature at a time, you don’t see the effect features have on each other (non-linear interactions). Maybe the combination of feature X and feature Y is producing the noise, not feature X alone.
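The steps above can be sketched as follows. This is a minimal illustration, not the production code: it uses a synthetic dataset and scikit-learn's RandomForestClassifier as stand-ins for the real data and model, and drops single features rather than families.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=42)
feature_idx = list(range(X.shape[1]))

def evaluate(cols):
    """Mean cross-validated accuracy using only the given feature columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    return cross_val_score(model, X[:, cols], y, cv=3).mean()

baseline = evaluate(feature_idx)

# "All But X": in each iteration, drop one feature (or one feature family)
# and compare the evaluation metric against the baseline.
scores = {}
for i in feature_idx:
    kept = [j for j in feature_idx if j != i]
    scores[i] = evaluate(kept)

# Features whose removal doesn't hurt (or even helps) are candidates to drop.
droppable = [i for i, s in scores.items() if s >= baseline]
```

To remove a family instead of a single feature, `kept` would exclude a whole group of column indices per iteration.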
Feature Importance + Random Features
Another approach we tried is using the feature importance that most machine learning model APIs provide.
Instead of just taking the top N features from the feature importance list, we added 3 random features to our data:
- Binary random feature (0 or 1)
- Uniform random feature between 0 and 1
- Integer random feature
After building the feature importance list, we kept only the features that ranked higher than the random features.
It is important to take different distributions of random features, as each distribution can have a different effect.
In trees, the model “prefers” continuous features (because of the splits), so those features will be located higher in the hierarchy. That’s why you need to compare each feature to a random feature with a matching distribution.
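A minimal sketch of this idea, using a synthetic dataset and scikit-learn's `feature_importances_` (the post used the importance APIs of its own models). The threshold choice here, keeping only features that beat the highest-ranked random feature, is one reasonable variant, not necessarily the exact rule used at Fiverr.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
rng = np.random.default_rng(0)
n = X.shape[0]

# Three random features with different distributions.
rand_binary = rng.integers(0, 2, size=(n, 1))    # 0 or 1
rand_uniform = rng.uniform(0, 1, size=(n, 1))    # continuous in [0, 1]
rand_int = rng.integers(0, 100, size=(n, 1))     # integers

X_aug = np.hstack([X, rand_binary, rand_uniform, rand_int])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_aug, y)

importances = model.feature_importances_
# The last three columns are the random features; keep only real
# features whose importance beats the best random feature.
threshold = importances[-3:].max()
selected = [i for i in range(X.shape[1]) if importances[i] > threshold]
```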
Boruta
Boruta is a feature ranking and selection algorithm that was developed at the University of Warsaw. This algorithm is based on random forests, but can also be used with XGBoost and other tree algorithms.
At Fiverr, I used this algorithm with some improvements to XGBoost ranking and classifier models that I will elaborate on briefly.
This algorithm is a kind of combination of both approaches I mentioned above.
1. Create a “shadow” feature for each feature in the dataset, with the same feature values but shuffled between the rows.
2. Run in a loop, until one of the stopping conditions is met:
2.1. We are not removing any more features.
2.2. We removed enough features (say we want to remove 60% of our features).
2.3. We ran N iterations; we limit the number of iterations to avoid getting stuck in an infinite loop.
3. In each iteration, run X model fits (we used 5, to reduce the randomness of the model):
3.1. Train the model with the regular features and the shadow features.
3.2. Save the average feature importance score for each feature over the X fits.
3.3. Remove every feature whose importance is lower than that of its shadow feature.
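The loop above can be sketched like this. It is a simplified illustration of the Boruta-style shadow-feature loop under stated assumptions: a synthetic dataset, scikit-learn's RandomForestClassifier instead of XGBoost, 3 fits per round instead of 5, and only the "nothing removed" and "max iterations" stopping conditions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=1)
rng = np.random.default_rng(1)
active = list(range(X.shape[1]))
max_rounds = 5                      # stopping condition 2.3

for _ in range(max_rounds):
    Xa = X[:, active]
    # Step 1: shadow features share each column's values, shuffled between rows.
    shadows = rng.permuted(Xa, axis=0)
    X_all = np.hstack([Xa, shadows])

    # Step 3: average importance over several fits to damp randomness.
    n_fits = 3
    imp = np.zeros(X_all.shape[1])
    for seed in range(n_fits):
        m = RandomForestClassifier(n_estimators=50, random_state=seed)
        m.fit(X_all, y)
        imp += m.feature_importances_
    imp /= n_fits

    k = len(active)
    real, shadow = imp[:k], imp[k:]
    # Step 3.3: drop every feature beaten by its own shadow.
    keep = [active[i] for i in range(k) if real[i] > shadow[i]]
    if len(keep) == len(active):    # stopping condition 2.1: nothing removed
        break
    if not keep:                    # guard: don't continue with zero features
        break
    active = keep
```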
Here is the best part of this post: our improvement to Boruta.
We ran Boruta with a “short version” of our original model. By taking a sample of the data and a smaller number of trees (we used XGBoost), we improved the runtime of the original Boruta without reducing accuracy.
As another improvement, we ran the algorithm with the random features mentioned before. This is a good sanity check and stopping condition: it verifies that we have removed all the random features from our dataset.
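The “short version” idea can be sketched as follows. The sample size and tree count here are illustrative only, and RandomForestClassifier stands in for the XGBoost model used in the post; the point is that the selection loop runs on a cheap model while the final model trains on the full data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Full dataset; the "short version" runs selection on a fraction of it.
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=4, random_state=7)
rng = np.random.default_rng(7)

# Row sample plus far fewer trees: cheap enough to run the selection
# loop many times, while still ranking features roughly the same way.
sample = rng.choice(len(X), size=500, replace=False)
X_short, y_short = X[sample], y[sample]

short_model = RandomForestClassifier(n_estimators=20, random_state=7)
short_model.fit(X_short, y_short)

# The selection loop (Boruta, "All But X", ...) uses short_model;
# only the final model is trained with the full data and tree count.
ranked = np.argsort(short_model.feature_importances_)[::-1]
```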
With the improvement, we didn’t see any change in model accuracy, but we did see an improvement in runtime. By removing features, we went from 200+ features to fewer than 70. The model was stable across different numbers of trees and different training periods.
We also saw an improvement in the distance between the loss of the training and the validation set.
The advantage of this improvement, and of Boruta in general, is that you are running your own model. The problematic features it finds are problematic for your model, not for some other algorithm.
In this post, you saw 3 different techniques for doing Feature Selection on your datasets and for building an effective predictive model. You saw our implementation of Boruta, its runtime improvements, and the addition of random features as sanity checks.
With these improvements, our model was able to run much faster, with more stability and a maintained level of accuracy, using only 35% of the original features.
Choose the technique that suits you best. Remember, Feature Selection can help improve accuracy, stability, and runtime, and avoid overfitting. More importantly, the debugging and explainability are easier with fewer features.