Feature Importances: Which features really matter to my model?

Yuiti Ara
Legiti
Oct 19, 2020

In this post, we will present a little bit about the overall intuition behind Permutation Importance, a simple but very efficient technique that we have been using here at Legiti. This approach allows us to evaluate the impact of each feature on the performance of our models. It has been an invaluable tool to understand which features are helping the most in our fight against fraud.

Why you should care

Getting a first trained model that achieves good performance on historical data is an important step; however, it is far from the end of our work. In many cases, ours included, multiple model iterations will still be needed after deploying the initial model to production. At Legiti, it is a continuous process that never really ends. To guide those iterations, it is very useful to know how much each feature contributes to model performance. For us, those insights are also critical when we weigh the development and computation costs of adding new features to the production models. We are an anti-fraud solution, so our model inferences happen in an online setting under tight response-time constraints. As a consequence, we need to be very careful about each new feature we decide to add, not only regarding its impact on model performance but also its potential effect on our inference response time.

How you can do it

A very common approach to evaluating feature importance is to rely on the coefficients of a linear model: a straightforward method where you simply interpret their absolute values as importances. However, models based on ensembles of trees have become ubiquitous, and it is common for data scientists to experiment with different classes of models. Because of that, a model-agnostic method is highly preferable, so we can apply the same procedure regardless of the specific model we decide to use.

Permutation Importance is a model-agnostic technique that solves this problem for us. To understand the intuition behind it, it helps to first look at a simpler but very similar approach: Leave One Feature Out.

Leave One Feature Out

Intuitively, the technique just tries to answer the following question:

How much worse would the model be if a given feature was not present?

From that, we interpret that the contribution of a feature to the model is proportional to how much worse the model performs without it. Notice that answering this question can also reveal the opposite: the absence of a feature may improve model performance, which we would interpret as a negative contribution. More concretely, Leave One Feature Out answers the question with the following algorithm:

1. Using all features, train the model and evaluate its performance; this value will be our “baseline”

2. For each feature:

- Exclude only that feature from the dataset, then train and evaluate the model

- Take the feature’s importance as the baseline performance minus the new performance (assuming a higher-is-better metric, an important feature yields a positive value, while a feature whose absence improves the model yields a negative one)

3. At the end, sort the features by their importance values to rank their relative importance
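The steps above can be sketched in a few lines of Python. Note that the dataset, model, and metric here are illustrative assumptions for the sake of a runnable example, not Legiti's actual setup:

```python
# Leave One Feature Out sketch: re-train the model once per excluded feature.
# Dataset, model, and metric below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def fit_and_score(cols):
    model = RandomForestClassifier(random_state=0)
    model.fit(X_train[:, cols], y_train)
    return accuracy_score(y_val, model.predict(X_val[:, cols]))

all_cols = list(range(X.shape[1]))
baseline = fit_and_score(all_cols)

importances = {}
for i in all_cols:
    # Re-train without feature i; its importance is the drop in performance
    remaining = [j for j in all_cols if j != i]
    importances[i] = baseline - fit_and_score(remaining)

# Rank features from most to least important
ranking = sorted(importances, key=importances.get, reverse=True)
```

Notice the cost: one full model training per feature, plus the baseline, which is exactly the limitation discussed next.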

At first, it seemed that this algorithm would already provide what we needed.

However, its computational cost makes it an impractical alternative for us. A single backtest run that trains and evaluates a model on all our historical data takes several minutes to complete. Given that our models usually use a couple hundred features, looping through all of them would be prohibitively time-consuming.

Another limitation of this method arises when two or more features are very highly correlated: they may simply replace each other in the model and yield very low importances, even if they are in fact very important. In the extreme case of two identical features, both could end up with importances near zero.

Fortunately, these are both constraints that the Permutation Importance can solve for us.

Permutation Importance

Permutation Importance still uses the same general approach as Leave One Feature Out: it estimates a feature’s importance by how much worse the model would be without it. However, it differs in how it handles feature exclusion. To avoid the taxing computation cost, instead of excluding the feature and re-training the whole model, it simply makes the feature column non-informative by randomizing its values. The idea is not to eliminate the feature but to break its existing relationship with the target variable, so it ceases to provide useful information for the prediction task. We rephrase the question slightly:

How much worse would the model be if a given feature became non-informative?

The only additional issue that needs to be taken care of is the randomization itself. The model was trained assuming a very specific distribution of values for each feature, which means values are expected to fall within a specific domain (e.g. a label-encoded categorical feature with integer values from 0 to 4 should not be assigned a value of 42). To respect that, given that a dataset has multiple observation rows, we simply permute the values within that feature column at random. Each row then receives a new pseudo-random value, while the domain of values stays correct. Thus our general algorithm becomes:

1. Using all features, train the model and evaluate its performance; this value will be our “baseline”

2. For each feature:

- Randomly permute the values in that feature’s column, predict with the permuted data, and evaluate the model (notice that no re-training is needed here)

- Take the feature’s importance as the baseline performance minus the new performance (again assuming a higher-is-better metric)

3. At the end, sort the features by their importance values to rank their relative importance
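A from-scratch sketch of this permutation version, again with an illustrative dataset and model; the key difference from Leave One Feature Out is that the model is trained only once:

```python
# Permutation Importance sketch: a single model training, then one
# column shuffle + evaluation per feature. Dataset/model are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_val, model.predict(X_val))

rng = np.random.default_rng(0)
importances = {}
for i in range(X_val.shape[1]):
    X_perm = X_val.copy()
    # Shuffling within the column keeps the value domain intact
    # while breaking the feature's relationship with the target
    X_perm[:, i] = rng.permutation(X_perm[:, i])
    permuted_score = accuracy_score(y_val, model.predict(X_perm))
    importances[i] = baseline - permuted_score  # drop in performance

ranking = sorted(importances, key=importances.get, reverse=True)
```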

Now we can still compute feature importance estimates, but at the cost of a single backtest run for the whole feature set (each per-feature evaluation only requires new predictions, which are much cheaper than re-training). Also, highly correlated features will not nullify each other’s importances: in the extreme case of two identical features, the total importance will be distributed between the two of them.

Practical considerations

Because of the stochastic nature of this technique, the feature importances will vary somewhat between executions (or between different seed values, if you set them when generating random numbers). To gain better confidence in the estimates, we can obtain a more stable measure by running the algorithm multiple times (with different random seeds) and averaging the importances. The computation time will increase accordingly, so there is a trade-off to be made between metric stability and additional computation cost.

Another important thing to remember is to use separate training and validation sets for this procedure, and to evaluate the feature importances only on the validation set. Otherwise, we would not be generating estimates that generalize to unseen data in production, which is usually the goal for this whole method.

Finally, if you happen to be using only linear models, it might be worth relying on the linear coefficients instead, as they incur zero extra computation cost and their relationship with the outputs can be somewhat simpler to understand.

Existing implementations

For concrete usage in Python, there are good open-source implementations of Permutation Importance that are well tested and supported. ELI5 is a package focused on model interpretation techniques and includes a module for Permutation Importance. More recently, scikit-learn has also added a permutation_importance function in its inspection module; this is the implementation we chose, given how much we already use that package. Because of the algorithm’s simplicity, implementing it from scratch is another reasonable option, especially if you want to customize some aspects of it.
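For reference, a minimal usage sketch of scikit-learn's `permutation_importance` (the dataset and model are illustrative; the function is available since scikit-learn 0.22):

```python
# Using scikit-learn's built-in implementation from sklearn.inspection.
# Dataset and model below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# n_repeats averages over several shuffles per feature for stability;
# importances are computed on the held-out validation set
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0)
print(result.importances_mean)  # mean importance per feature
print(result.importances_std)   # variability across the repeats
```

Evaluating on the held-out validation set, as done here, mirrors the practical consideration above: importances estimated on training data would not reflect generalization.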

How we use it at Legiti

Currently, the permutation feature importances are the main feedback mechanism we use at Legiti for decisions regarding features. Both to evaluate which features would be most beneficial to add to our production models, and to validate our hypotheses regarding our intuitions on new features we are exploring.

Another interesting usage we have been considering is integrating it into our feature selection process with Optimus. Optimus is our in-house Auto-ML module for feature selection and hyper-parameter optimization. If you are interested in knowing more, you are welcome to check out the article we wrote about it.

Summary

In this post, we gave an overview of the Permutation Importance technique. We have mostly focused on the overall intuition behind the algorithm, but if you are still interested you can find the complete details in the original paper. Checking the code and documentation of the ELI5 and scikit-learn packages might also help build a more concrete understanding of the mechanisms.

This has been an exceptionally useful tool in fighting fraud here at Legiti, but we believe it would be just as useful for any other predictive challenge.
