Feature importance — what’s in a name?

Sven Stringer
Jul 23, 2018 · 11 min read

By Sven Stringer for BigData Republic

Photo by Maria Molinero on Unsplash

As data scientists we often focus on optimizing model performance. However, sometimes it is just as important to understand how the features in our model contribute to its predictions. Unfortunately, the better-performing models are often also the more opaque ones. Several methods exist to get some insight into these black-box models. In this post I will discuss several feature importance methods and their interpretation, and point you to some useful resources if you need more details.

Why bother?

Even if your code does not contain bugs, the data might. An example of ‘buggy data’ is unrepresentative or biased data. Training data can contain biases that are picked up by the model and will hurt its performance on the test set. Even more problematic is biased test data, since it biases the model evaluation itself. In that case model interpretation can be used to debug disappointing performance in production.

In addition, some model biases are socially or legally unacceptable. For example, if your model works well in the real world because it implicitly profiles people based on ethnicity, you might run into conflict with your privacy compliance officer. Similarly, in the context of sensitive automated decision making the General Data Protection Regulation (GDPR) applicable in the EU stipulates the right to know what an automated decision was based on.

In short, if your model is designed to do something useful in the real world, invest some time in understanding how the features in your model contribute to model prediction.

Different types of feature importance

One useful distinction is model-specific vs model-agnostic feature importance measures. As the name implies, model-agnostic feature importance measures can in principle be used for any model. Unfortunately, implementations of these methods are often not completely general, so it is worth investigating in advance which importance measures are available for your model.

Another important distinction is global vs local feature importance. Local measures focus on the contribution of features to a specific prediction, whereas global measures take all predictions into account. The first is relevant, for example, when we want to explain why a specific person was denied a loan by a loan assignment model.

In this post we will see four examples spanning these classes: ensemble-tree feature importance (global, model-specific), permutation feature importance (global, model-agnostic), LIME (local, model-agnostic), and Shapley values (local, model-agnostic).


Model training

The notebook includes the installation of non-standard dependencies and should work out-of-the-box on Google Colab. After downloading the data and some minimal preprocessing we train a random forest model and observe pretty good predictive performance on the test set (99% area under the curve; 93% accuracy).
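As a sketch of that setup, the following trains a comparable model. It assumes the Wisconsin breast cancer data set bundled with scikit-learn; the split and hyperparameters are illustrative and may differ from the notebook's exact choices.

```python
# Minimal reproduction of the setup described above. Assumes the Wisconsin
# breast cancer data shipped with scikit-learn; split and hyperparameters
# are illustrative, not necessarily those used in the notebook.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
acc = accuracy_score(y_test, model.predict(X_test))
```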


Model-specific feature importance

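Scikit-learn's tree ensembles expose such impurity-based importances directly on the fitted model. A minimal sketch, again assuming the scikit-learn breast cancer data:

```python
# Impurity-based feature importance from a random forest: for each feature,
# the total decrease in node impurity from splits on that feature, averaged
# over all trees and normalized to sum to one.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data.data, data.target)

# Rank features by impurity-based importance, highest first.
order = np.argsort(model.feature_importances_)[::-1]
top5 = [data.feature_names[i] for i in order[:5]]
```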

Permutation feature importance

To allow comparison with the tree-specific feature importance, we compute permutation feature importance on the training set as well. As you can see in the figure below, the differences with the previous plot are striking.


Although the top feature is the same in both methods (maximum number of concave points in a collection of cells), the second most important feature is now ‘maximum value in symmetry’ instead of ‘standard deviation in concave points’. The maximum value and the standard deviation of ‘number of concave points’ are correlated, which suggests that permutation feature importance is less likely to select two correlated features as top features. It more accurately reflects the added value of a feature given the presence of all other (possibly correlated) features in the model. In addition, it focuses on our chosen performance measure (AUC) instead of node impurity.

A second advantage of permutation is that it gives us a measure of uncertainty (the 95% confidence intervals in the figure above) for the estimated feature importance. In our example we observe that only four of the thirty features have a significant feature importance. Note that this does not imply that only these four features are important to the model. Although the non-significant features might not add much individually given the other features, removing several of them at once may still decrease the AUC. For example, of two completely correlated important features, each contributes nothing given the other, but removing both can significantly decrease performance.

The main downside of permutation is that it can be computationally demanding if the number of features is large. Otherwise it is preferable to the feature importance methods typically built into tree ensembles such as random forests and gradient boosting: permutation feature importance is model-agnostic, can be evaluated on new data sets, and focuses on the performance measure of interest.
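With scikit-learn, permutation feature importance is available out of the box via sklearn.inspection.permutation_importance. A sketch scoring on AUC, with the same assumed data set and model as above:

```python
# Permutation importance: shuffle one feature at a time and measure the
# drop in the chosen score (here AUC). Repeats give an uncertainty estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=42)
# result.importances_mean and result.importances_std hold the per-feature
# mean AUC drop and its spread over the repeats.
```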


Local interpretable model-agnostic explanations (LIME)

The main idea of LIME is to compute a local surrogate model. A surrogate model is an easily interpretable model, such as a linear model or a decision tree, trained to mimic the behavior of the more complex model of interest. LIME fits such surrogate models locally. The procedure is as follows. For a specific prediction you want to explain, LIME slightly perturbs the feature values to create new, similar data points. By feeding these perturbed data points to the complex model, a relation between the perturbed features and the model prediction emerges, which is then captured by the surrogate model.
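The core of this procedure fits in a few lines. The following is a deliberately stripped-down illustration of the idea, not the actual lime library (whose sampling, feature selection, and kernel weighting are more sophisticated); the data set, perturbation scale, and kernel are assumptions.

```python
# LIME's core idea in miniature: perturb one instance, query the black-box
# model, and fit a distance-weighted linear surrogate around it.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = load_breast_cancer(return_X_y=True)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
x0 = X[0]                                  # the prediction to explain
scale = X.std(axis=0)
noise = rng.normal(0.0, 0.1 * scale, size=(500, X.shape[1]))
neighbours = x0 + noise                    # perturbed copies of x0

# Weight neighbours by proximity to x0 (the kernel width is one of the
# arbitrary locality choices), then fit the surrogate on the black-box output.
dist = np.linalg.norm(noise / scale, axis=1)
weights = np.exp(-dist ** 2)
surrogate = Ridge(alpha=1.0)
surrogate.fit(neighbours, black_box.predict_proba(neighbours)[:, 1],
              sample_weight=weights)
# surrogate.coef_ approximates each feature's local effect on the prediction.
```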

Below you see the output of LIME for the first subject in the test set. The fact that mean fractal dimension is larger than 0.16 seems to be a major contributor to classifying this subject's tissue as malignant. Similarly, the fact that mean compactness is larger than 0.03 actually decreased the likelihood of classifying this tissue as malignant. It is striking that maximum concave points is only the seventh feature.


Although we plotted all 30 features here, the LIME model is often restricted to just a few features since many model stakeholders prefer parsimonious explanations.

LIME shows us what would happen to an individual prediction if the feature values were different. For example, the plot above suggests that if this individual had had a ‘mean fractal dimension’ < 0.16, the model would have been much less likely to predict a malignant tumor. This provides insight into how the model works. It is important, though, not to interpret this causally. Changing the morphology of malignant tissue will not miraculously make it benign, since a tumor's morphology is a symptom and not a cause of breast cancer. LIME explanations should therefore not be used uncritically when designing interventions in a business process.

In general, LIME explanations should be interpreted with care. There are some arbitrary parameters that define how ‘local’ the surrogate model is fitted. Extremely local models can be more precise, but only if the complex model has enough support from training points in that neighbourhood. In practice it is therefore difficult to assess exactly how reliable LIME's explanations are. However, to answer the question “What would happen to the model prediction if we changed this value?”, LIME's explanations can be a good starting point for follow-up.

Shapley additive explanations

To explain the concept of Shapley values, let's start with a sports analogy. The 2018 FIFA World Cup just finished. France beat Croatia in the final (4–2), and we might wonder how much each player contributed to France's victory so we can fairly distribute a winner's bonus. It's clear that the four players who scored had an important part to play. It's equally clear that the team could not have been successful if all the other players had been exchanged for random players. After all, it's the (sometimes complex) interaction of all players that determines a team's strength. Therefore it's difficult to determine an individual player's added value in the context of a specific team.

So how do we design a fair payout scheme? Shapley values are designed to solve exactly this problem. They come from game theory and assume a game with players and a team score. Suppose we could take every possible subset of players, replay the game, and observe the resulting team score. We could then assign each player a portion of the total payout based on his average added value across all possible subteams he was added to. This individual payout is the player's Shapley value. It is the only payout scheme that is proven to be:

  1. Efficient: the sum of the Shapley values of all players should sum up to the total payout
  2. Symmetric: two players should get the same payout if they add the same value in all team combinations
  3. Dummy-sensitive: a player should get a Shapley value of zero if adding him never improves a subteam's performance.
  4. Additive: in the case of a combined payout (say we add up two game bonuses), a player's combined Shapley value across the two games is the sum of the individual games' Shapley values. This criterion has no relevant analogy in the context of model interpretability.
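For a small number of players this payout can be computed exactly. A stdlib-only sketch with three players and an invented coalition-score function v, averaging each player's marginal contribution over all join orders (all numbers are made up for illustration):

```python
# Exact Shapley values for a made-up three-player game. v maps a coalition
# (frozenset of players) to its team score; the scores are invented.
from itertools import permutations
from math import factorial

players = ["A", "B", "C"]
v = {frozenset(): 0, frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 0,
     frozenset("AB"): 40, frozenset("AC"): 10, frozenset("BC"): 20,
     frozenset("ABC"): 60}

def shapley(player):
    # Average the player's marginal contribution over every join order.
    total = 0.0
    for order in permutations(players):
        before = frozenset(order[:order.index(player)])
        total += v[before | {player}] - v[before]
    return total / factorial(len(players))

values = {p: shapley(p) for p in players}
# Efficiency: the three values sum to v({A, B, C}) = 60, the total payout.
```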

Instead of a game with players, we can use the same payout mechanism for a machine learning model with features. The team score in this context is the output of a (sub)model. The total payout is the difference between a base value (the prediction of the null model) and the actual prediction. This difference is then divided over all features according to their relative contributions.

Obviously, looking at all possible subsets of features is computationally prohibitive in most realistic models with many features. Instead, Shapley value approximations can be computed based on sampling. For Python, the shap package implements such approximations as well as a couple of (partly interactive) plots to explore Shapley values across different features and/or subjects.
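The sampling idea can be illustrated directly. Below is a simplified Monte-Carlo scheme in the spirit of what shap does far more efficiently; the data set, model, and feature index are assumptions for illustration.

```python
# Monte-Carlo approximation of one feature's Shapley value for a single
# prediction: draw a random background instance and a random feature order,
# and average the marginal effect of revealing the feature of interest.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def sampled_shapley(model, X, x0, feature, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        z = X[rng.integers(len(X))]           # random background instance
        order = rng.permutation(X.shape[1])   # random feature ordering
        pos = int(np.where(order == feature)[0][0])
        with_f, without_f = z.copy(), z.copy()
        with_f[order[:pos + 1]] = x0[order[:pos + 1]]  # reveal feature too
        without_f[order[:pos]] = x0[order[:pos]]       # feature stays hidden
        total += (model.predict_proba(with_f[None])[0, 1]
                  - model.predict_proba(without_f[None])[0, 1])
    return total / n_samples

phi = sampled_shapley(model, X, X[0], feature=27)  # 27: 'worst concave points'
```

Since the payout here is a probability difference, each approximated Shapley value lies between -1 and 1.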

The figure below shows, for each feature, the distribution of Shapley values across subjects in the test set. Again it shows that maximum concave points is the most discriminative feature. The majority of subjects have a non-zero Shapley value. The sign shows whether the feature value moved the prediction toward malignant (positive SHAP) or benign (negative SHAP). The color reflects the order of magnitude of the feature value (low, medium, high). This way the plot immediately reveals that low (blue) values on the top ten features decrease the likelihood of a malignant tumor prediction.


The accompanying notebook shows and explains some additional interactive plots that you may want to play with. For more information about all the plots implemented in the shap package, you can read its documentation.


LIME and Shapley values are two local, model-agnostic methods. LIME's strength is that it allows exploration of the feature space in specific neighbourhoods of interest. Shapley values are your best bet when you need to explain why specific predictions were made.

Note that although three of the four methods discussed are model-agnostic in principle, their Python implementations are often restricted to a (sometimes large) subset of models. Check out their respective documentation for more details.

There is much more to say about feature importance and model interpretability. If you want to dive deeper, I highly recommend reading Christoph Molnar's book Interpretable Machine Learning. It's freely available on his GitHub page. This blog post was largely inspired by his book.
