Feature Importance for Any Model using Permutation

Taylor Jensen
9 min read · Sep 23, 2022


The simplest model-agnostic approach to evaluating feature importance in machine learning models.


Why Feature Importance?

Feature importance is a major part of any model building and evaluation. Without determining which model inputs are the most important, any model is a “black-box” model. Even in the simplest modeling problems, we need some ability to learn which inputs affect model predictions and which do not.

There are several ways to estimate feature importance from models. However, few are model-agnostic, and fewer still are easy to understand. Permutation importance, by contrast, is not only model-agnostic but also intuitive and easy to implement.

Permutation feature importance measures the change in model error (such as MAE, R², or accuracy) after a single model feature’s values have been permuted (i.e., shuffled).

Before explaining exactly how permutation importance works, let’s review the problems with feature importance in some common models.

Problems with Feature Importance in Common Models

The general form of a linear regression: y = β₀ + β₁x₁ + ⋯ + βₙxₙ + ε

Linear Regression

You can calculate a representation of feature importance for a linear regression by standardizing features and taking a look at the coefficients of the model.

While this gives you an idea of the relative feature importance of one feature vs. another, this misses out on connecting the impact of the features in terms that can be used to compare against other models, like a Support Vector Machine with a non-linear kernel. In addition, this “coefficient importance” is only in terms of the linear model itself, not in terms of a model metric, such as mean squared error (MSE).
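As a quick illustration of this “coefficient importance” idea, here is a minimal sketch on synthetic data (the data and coefficients are my own, chosen so one feature clearly dominates):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data: y depends strongly on the first feature, weakly on the second.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize features so coefficient magnitudes are comparable.
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

# The larger |coefficient| marks the more influential feature --
# but only relative to this linear model, not to any error metric.
coef_importance = np.abs(model.coef_)
print(coef_importance)
```

The first coefficient comes out roughly six times larger than the second, which matches the relative weights used to generate the data, but nothing here tells us how much the model’s error would change without each feature.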

Support Vector Machine

Simple visualization of a Support Vector Machine with a non-linear kernel

The support vector machine (SVM), while powerful, can lack the direct ability to calculate feature importance. There is one key word in the last sentence — “can”. If you use a support vector machine with a linear kernel, the way to calculate feature importance is the same as for linear regression: by taking the coefficients of the model, the relative feature importance of the variables comes to light. However, this suffers from the same drawback as linear regression, in that the feature importance is relative to the model itself, not to any specific error metric.

If a non-linear kernel is used, such as the polynomial or radial basis function (RBF) kernel, extracting coefficients is no longer possible. With non-linear kernels, the support vector machine maps data to a higher-dimensional feature space, where the data is more likely to be linearly separable. By mapping features to a higher dimension, the linear nature of the model is broken, meaning that the coefficients are not available.

Simple artistic rendering of a Random Forest

Trees & Random Forests

Feature importance in trees is typically defined as the features that contribute most to a decrease in the impurity metric used for the model, such as Gini impurity.
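To make this concrete, here is a small sketch of impurity-based importance on synthetic data (a regression tree, where the impurity criterion is MSE rather than Gini; data and depth are my own choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: only the first of three features actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances: the total reduction in the tree's
# impurity criterion attributed to splits on each feature, normalized
# to sum to 1.
print(tree.feature_importances_)
```

Nearly all the importance lands on the first feature here, but this number reflects how the tree happened to split, which is exactly where the biases below creep in.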

However, fitting a decision tree or random forest to a dataset can be tricky because decision trees are notorious for over-fitting on training data.

In addition, the feature importance of a decision tree tends to favor features with a large number of possible values (known as high cardinality). This bias can skew the feature importance and give results that may not make sense, especially if the model is over-fitting on the training data.

The scikit-learn documentation has a really good example of Permutation Importance vs. Random Forest Feature Importance in action, if you’re interested in learning more.

The Benefit of Permutation Feature Importance

Does not require model retraining. I am interested in how the model is using the current column, instead of retraining the model to find new relationships. Permutation feature importance uses only the model that was originally trained.

There are methods that recommend dropping a column and then retraining a model to get an idea of the feature importance. However, not only does this take time to retrain a model, but it allows the model to “cheat” and identify potentially different relationships than it did in the original training, which is not useful in the analysis of feature importance for an existing model.

Captures variable interactions. By randomly shuffling only one column at a time, I see what the model has learned about how variables work together.

This means that when I shuffle a column that another column depends on through an interaction, I will see a larger change in performance than if the columns were unrelated. Permutation importance thus captures the holistic model impact of changing only one column’s values.

Model agnostic. Computing permutation feature importance across models gives us a better understanding of how a variable impacts each model in real terms. Since permutation feature importance calculates the change in error, it lets us compare how important a feature is across many models.

For example, if there are two models with given error rates, we can see how much those error rates change as different features are permuted. We also get an idea of what the models have learned, regardless of model type. Neural networks, linear regressions, and all other classes of model are analyzed the same way, as long as we have a target metric.

Considerations for Permutation Feature Importance

Performance with correlated variables. If the model contains correlated features, then permutation importance can cause confusing results.

If the model has learned to use either of two correlated features interchangeably in its predictions, then when one is shuffled, the other acts as a substitute for the shuffled data. As a result, permutation importance will report a lower importance score for both of the correlated features.

This can mislead you into thinking that an important feature is unimportant, even though its information is hidden inside another correlated feature.

Poor performance on over-fit models. When a model is over-fit, it can memorize noise in the dataset. If the relationships the model has learned are not representative of real life, then the permutation feature importance will not be either.

A sign of overfitting is very different permutation feature importance values between the training and testing sets. So, make sure the model performs comparably across the training and testing sets.

Only useful for supervised learning. For permutation importance to work, there needs to be a metric to compare the permutations against. If the problem is completely unsupervised, there is no metric with which to measure the effect of permuting a column.

Calculating Permutation Feature Importance

Overall Approach

Calculating permutation feature importance is pretty straightforward, which makes it appealing to use. Permutation feature importance works as follows:

  1. Pick a column
  2. Randomly shuffle the column
  3. Predict results using current model
  4. Compare the existing model performance metrics vs. the model with shuffled column
  5. Repeat for new columns or new iterations
General feature permutation process
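The five steps above can be sketched in just a few lines. Here is a minimal illustration on synthetic data (the model choice and data are my own, not from the article):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Toy data: only the first feature drives the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
baseline_mse = mean_squared_error(y, model.predict(X))

importances = []
for col in range(X.shape[1]):          # 1. pick a column
    X_perm = X.copy()
    rng.shuffle(X_perm[:, col])        # 2. randomly shuffle the column
    preds = model.predict(X_perm)      # 3. predict with the existing model
    # 4. importance = increase in error after shuffling vs. the baseline
    importances.append(mean_squared_error(y, preds) - baseline_mse)
    # 5. repeat for the next column (and optionally more iterations)

print(importances)
```

Shuffling the first column produces a large jump in MSE, while the two irrelevant columns barely move it, which is exactly the signal permutation importance reports.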

With Sci-kit Learn

Lucky for us, scikit-learn provides a handy function, permutation_importance in the sklearn.inspection module, that calculates feature importance in a single line of code for any model developed with scikit-learn or compatible with it.

A simple example of its use is below. This assumes you have a trained estimator and the data prepared already.

Sci-kit Learn permutation importance snippet
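Since the original snippet is an image, here is a hedged reconstruction of what that call typically looks like, using synthetic stand-in data and a trained estimator:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Stand-in for "a trained estimator and the data prepared already".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
est = LinearRegression().fit(X, y)

# One call handles the shuffle/score/repeat loop for us.
result = permutation_importance(
    est, X, y,
    n_repeats=10,
    scoring="neg_mean_squared_error",
    random_state=0,
)

# importances_mean is the average drop in score per shuffled feature.
print(result.importances_mean)
```

The scoring argument accepts any scikit-learn scorer, so the same call works for classification metrics like accuracy as well.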

Case Study: Insurance Medical Costs

Let’s review a real-life example using the Medical Cost Personal Dataset with the goal to predict the medical charges that someone might be billed by health insurance.

There are 6 variables of various data types in the dataset and one target column.

The columns are:

  • Beneficiary age
  • Gender
  • Body Mass Index (BMI)
  • Number of dependents covered
  • If the insured person is a smoker
  • Region of the US where the beneficiary resides
  • Target variable: Medical Costs billed by insurer

First, I import the necessary Python libraries:
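The original import cell is an image, so the exact list below is my reconstruction based on the models and plots used later in the article:

```python
# Assumed imports for this workflow (reconstructed, not the original cell).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error, r2_score
```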

Then, I import the data and perform some processing to make it suitable for model training. In this case, I one-hot encode the beneficiary residence region into 4 columns and encode the binary gender and smoker columns.
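A minimal sketch of that encoding step, on a few hypothetical rows mimicking the dataset’s schema (the column names follow the Kaggle dataset, but the rows here are invented for illustration):

```python
import pandas as pd

# Hypothetical rows mimicking the Medical Cost Personal Dataset schema.
df = pd.DataFrame({
    "age": [19, 33, 28, 45],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 33.0, 25.2],
    "children": [0, 1, 3, 2],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [16884.92, 1725.55, 4449.46, 21098.55],
})

# One-hot encode the region into one indicator column per value,
# and encode the binary sex and smoker columns as 0/1 flags.
df = pd.get_dummies(df, columns=["region"])
df["sex_male"] = (df["sex"] == "male").astype(int)
df["smoker_yes"] = (df["smoker"] == "yes").astype(int)
df = df.drop(columns=["sex", "smoker"])

print(sorted(df.columns))
```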

The final output of the simple pipeline looks like this:

To save us some time and get right to the modeling comparisons, I’ve done some pre-selection of model types and hyperparameters. I test a linear regression, an SVM, and a random forest, and achieve models that do not overfit severely on the dataset.

For this exercise, a similar R² across the data splits is enough to signal that the models are not overfitting. The training code for the models is below:
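The original training cell is an image; the sketch below shows the general shape of it, on synthetic stand-in data and with hyperparameters that are my assumptions, not the article’s:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the processed insurance data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Pre-selected model types; hyperparameters are illustrative assumptions.
models = {
    "linear": LinearRegression(),
    "svm": SVR(kernel="rbf", C=10.0),
    "forest": RandomForestRegressor(n_estimators=100, max_depth=5,
                                    random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # Compare train vs. test R^2 to check the gap stays small.
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```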

I’m almost positive I could do more to achieve better performance, but these models serve as illustrative examples, so I just want to make sure the results are reasonable. In this case, each model’s R² differs by less than 4% between the training and testing splits.

Modeling Performance Across Training and Testing Sets

After the models are trained, I gather the permutation importance using the mean squared error (MSE) for each model, using the scikit-learn function from earlier:
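That per-model loop might look roughly like the sketch below, again on stand-in data rather than the article’s trained insurance models (negated MSE is the scikit-learn convention for error-based scorers):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Stand-in data; in the article this runs on the trained insurance models.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=300)

models = {
    "linear": LinearRegression().fit(X, y),
    "forest": RandomForestRegressor(random_state=0).fit(X, y),
}

# With scoring="neg_mean_squared_error", importances_mean is the
# average increase in MSE caused by shuffling each feature.
perm_results = {
    name: permutation_importance(m, X, y, n_repeats=10,
                                 scoring="neg_mean_squared_error",
                                 random_state=0).importances_mean
    for name, m in models.items()
}

print(perm_results)
```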

Then, I combine the results into a visualization so I can compare the permutation importance across each of the three models.
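One way to build that comparison, sketched here with made-up importance values (the article’s actual numbers come from its trained models; these are placeholders only), is a grouped bar chart:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted plotting
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical importance values for illustration -- NOT the article's
# real results.
features = ["age", "bmi", "children", "smoker_yes"]
importance = {
    "Linear Regression": [6.8, -0.1, 0.3, 1.2],
    "SVM": [3.1, 1.5, 0.2, 5.4],
    "Random Forest": [2.9, 1.8, 0.1, 6.0],
}

# One group of bars per feature, one bar per model within each group.
x = np.arange(len(features))
width = 0.25
fig, ax = plt.subplots(figsize=(8, 4))
for i, (name, vals) in enumerate(importance.items()):
    ax.bar(x + i * width, vals, width, label=name)
ax.set_xticks(x + width)
ax.set_xticklabels(features)
ax.set_ylabel("Increase in error after shuffling")
ax.legend()
fig.savefig("permutation_importance.png")
```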

The visualization code outputs the below visuals for the training and testing data, respectively.

Training Data Permutation Importance
Test Data Permutation Importance

Analysis of Results

The results of the permutation feature importance are telling. Since the models are not overfitting on the dataset, the results between the training and testing sets are consistent.

It appears that for each model, two variables are the biggest drivers of model predictions: age and whether the insured is a smoker. In all three models, shuffling the data in these columns causes the model mean squared error to increase to varying degrees.

For the age variable, the model that depends on it the most appears to be the simple linear regression, with an increase in RMSE of almost 7, a significant jump compared to the rest of the models. The SVM and random forest models also pick up on the importance of this feature, but do not assign it as much importance as the linear regression does. This is most likely because the non-linear models pick up on different relationships inside the dataset, such as with the smoker_yes column.

For the smoker_yes variable, the story is the opposite. The linear regression places little importance on the column, as can be seen from the tiny increase in model error. The SVM and random forest, on the other hand, see RMSE increase significantly more.

One interesting thing to note is the improvement in linear regression model error with the permutation of the bmi column. Although this improvement in performance is rather small, it is important to consider that the model may have learned some spurious relationship while looking for linear relationships in the dataset. In this case, it would be important to test whether model performance changes significantly with the removal of this feature, as well as other features that had little impact on model error.

Summary

In this article, I reviewed some of the problems with gathering model-specific feature importance. Then, I discussed the benefits permutation importance adds to cross-model feature importance, along with the things to remember when using it, like checking for correlated variables, to make sure it works correctly. Finally, I ran a quick data experiment to prove out the approach and compare how three different models choose their important features and how those features impact each model.

I would also encourage you to check out other model explanation methods like SHAP and LIME (both model agnostic) and integrated gradients (for neural networks).

Interested in more data science content? Follow me on Medium or connect with me on LinkedIn.


Taylor Jensen

Data Scientist and dedicated nerd in Chicago. All views are my own. LinkedIn: https://bit.ly/3Mq2DYI