By Sven Stringer for BigData Republic
As data scientists we often focus on optimizing model performance. However, sometimes it is just as important to understand how the features in our model contribute to prediction. Unfortunately, the better-performing models are often also the more opaque ones. Several methods exist to get some insight into these black-box models. In this post I will discuss several feature importance methods and their interpretation, and point you to some useful resources if you need more details.
Before we unpack the concept of ‘feature importance’: why is it even important to understand how a model works, especially if it performs well on a test set? In the real world, excellent performance on a test set is just a nice start. A skeptical business manager, safety officer, or privacy compliance officer might rightfully demand some insight into why our model is so successful.
Even if your code does not contain bugs, the data might. An example of ‘buggy data’ is unrepresentative or biased data. Training data can contain biases that are picked up by the model and will hurt its performance on the test set. Even more problematic is biased test data, since it biases the model evaluation itself. In that case model interpretation can be used to debug disappointing performance in production.
In addition, some model biases are socially or legally unacceptable. For example, if your model works well in the real world because it implicitly profiles people based on ethnicity, you might run into conflict with your privacy compliance officer. Similarly, in the context of sensitive automated decision making the General Data Protection Regulation (GDPR) applicable in the EU stipulates the right to know what an automated decision was based on.
In short, if your model is designed to do something useful in the real world, invest some time in understanding how the features in your model contribute to model prediction.
Different types of feature importance
Now that we know why it is important to understand our model’s predictions, we can start unpacking the what. Feature importance measures come in several forms.
One useful distinction is model-specific vs. model-agnostic feature importance measures. As the name implies, model-agnostic feature importance measures can in principle be used for any model. Unfortunately, implementations of these methods are often not completely general, so it is worth investigating in advance which importance measures are available for your model.
Another important distinction is global vs. local feature importance. Local measures focus on the contribution of features to a specific prediction, whereas global measures take all predictions into account. The first is relevant, for example, when we want to explain why a specific person was denied a loan by a loan assignment model.
In this post we will see examples from these categories: tree-ensemble feature importance (global, model-specific), permutation feature importance (global, model-agnostic), LIME (local, model-agnostic), and Shapley values (local, model-agnostic).
It is time to get our hands dirty with a concrete example. Suppose we want to classify images of breast tissue as benign or malignant. To train a model to do this we use the Wisconsin diagnostic breast cancer data available at the UCI machine learning repository. It includes the diagnosis (benign vs. malignant) as well as 30 descriptive cell features for 569 tissue samples. Since we are focusing on feature importance, we will keep things simple and train a single random forest model and observe its performance. In a more realistic setting, some sort of cross-validation scheme would be advisable given the small sample size.
To be able to focus on interpretation and keep this post uncluttered I have decided to only include output plots. However, I do encourage you to investigate this Jupyter notebook code hosted on Google Colab as you read along. The nice thing about Google Colab is that you can run the code even if you just have a browser available.
The notebook includes the installation of non-standard dependencies and should work out-of-the-box on Google Colab. After downloading the data and some minimal preprocessing we train a random forest model and observe a pretty good predictive capability on the test set: 99% area under the curve and 93% accuracy.
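For readers who do not want to open the notebook, the training step can be sketched in a few lines. This uses scikit-learn’s bundled copy of the Wisconsin data and an arbitrary split and seed, so the exact scores may differ slightly from the notebook’s:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# The UCI Wisconsin diagnostic breast cancer data ships with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# A single random forest, no tuning
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test AUC: {auc:.2f}, test accuracy: {acc:.2f}")
```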
Model-specific feature importance
The first type of feature importance we compute is the one implemented by the random forest algorithm in scikit-learn. This is a tree-specific measure: it computes the average reduction in impurity across all trees in the forest due to each feature. Features that tend to split nodes closer to the root of a tree will receive a larger importance value. When we plot the importance of all features below, we see that the most important feature according to the built-in algorithm is the maximum number of concave points found within a tissue sample. Node splits based on this feature on average result in a large decrease in node impurity. The second most important feature is the standard deviation in concave points, which is by definition somewhat correlated with the first feature. After all, large maximum values in the number of concave points will result in larger standard deviations.
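In scikit-learn these impurity-based importances are exposed as the fitted model’s feature_importances_ attribute. A minimal sketch (refitting a forest on the bundled data here so the snippet stands alone):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(data.data, data.target)

# Mean decrease in impurity per feature, averaged over all trees;
# the values are normalized to sum to one.
importances = model.feature_importances_
for i in np.argsort(importances)[::-1][:5]:
    print(f"{data.feature_names[i]}: {importances[i]:.3f}")
```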
Permutation feature importance
A model-agnostic approach is permutation feature importance. The idea is simple: after evaluating the performance of your model, you permute the values of a feature of interest and re-evaluate model performance. The observed mean decrease in performance (in our case, area under the curve) indicates feature importance. This performance decrease can be computed on the training set as well as the test set; only the test-set decrease tells us something about generalizable feature importance.
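In scikit-learn this procedure is available as permutation_importance; a self-contained sketch, where the repeat count and the split are illustrative choices rather than the notebook’s settings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

# Permute each feature 10 times and record the mean drop in AUC;
# swap in (X_test, y_test) to measure generalizable importance instead.
result = permutation_importance(
    model, X_train, y_train, scoring="roc_auc",
    n_repeats=10, random_state=42)
print(result.importances_mean)  # one mean AUC drop per feature
```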
For comparison with the tree-specific feature importance, we are computing feature importance on the training set. As you can see in the figure below, the differences with the previous plot are striking.
Although the top feature is the same in both methods (maximum number of concave points in a collection of cells), the second most important feature is now ‘maximum value in symmetry’ instead of ‘standard deviation in concave points’. The maximum value and standard deviation of ‘number of concave points’ are correlated, which suggests that permutation feature importance is less likely to select two correlated top features. It more accurately reflects the added value of a feature given the presence of all other (possibly correlated) features in the model. In addition, it focuses on our chosen performance measure (AUC) instead of node impurity.
A second advantage of permutation is that it gives us a measure of uncertainty (95% confidence interval in the figure above) of the estimated feature importance. In our example we observe that only four of the thirty features have a significant feature importance. Note that this does not imply that only these four features are important for the model. Although non-significant features might not add much individually given the other features, removing multiple features is likely to decrease the AUC nonetheless. For example, two completely correlated important features individually contribute nothing given the other, but removing both can significantly decrease performance.
The main downside of permutation is that it can be computationally demanding if the number of features is large. Otherwise it is preferable to the feature importance methods typically built into tree ensemble methods like random forests and gradient boosting: permutation feature importance is model-agnostic, can be evaluated on new data sets, and focuses on the performance measure of interest.
Local interpretable model-agnostic explanations (LIME)
The two previous methods assess feature importance across all individuals in a dataset. Local interpretable model-agnostic explanations (LIME) is a technique that aims to explain which features are most important in specific regions of the feature space. It can be used to assess, for a specific subject, which features contributed most to its prediction.
The main idea of LIME is to compute a local surrogate model. A surrogate model is an easily interpretable model, such as a linear model or a decision tree, trained to mimic the behavior of the more complex model of interest. LIME fits such surrogate models locally. The procedure is as follows. For a specific prediction you want to explain, LIME slightly perturbs the feature values to create similar new data points. By feeding these perturbed data points to the complex model, a relation between the perturbed features and the model prediction emerges, which is then captured by the surrogate model.
Below you see the output of LIME for the first subject in the test set. The fact that the mean fractal dimension is larger than 0.16 seems to be a major contributor to classifying this subject’s tissue as malignant. Conversely, the fact that the mean compactness is larger than 0.03 actually decreased the likelihood of classifying this tissue as malignant. It is striking that maximum concave points is only the seventh most important feature here.
Although we plotted all 30 features here, the LIME model is often restricted to just a few features since many model stakeholders prefer parsimonious explanations.
LIME shows us what would happen to an individual prediction if the feature values had been different. For example, the plot above suggests that if this individual had had a ‘mean fractal dimension’ < 0.16, the model would have been much less likely to predict a malignant tumor. This provides insight into how the model works. It is important, though, not to interpret this causally. Changing the morphology of malignant tissue will not miraculously make it benign, since a tumor’s morphology is a symptom and not a cause of breast cancer. LIME predictions should therefore not be used uncritically when designing interventions in a business process.
In general, LIME explanations should be interpreted with care. There are some arbitrary parameters that define how ‘local’ the surrogate model is fitted. Extremely local models can be more precise, but only if the complex model has enough support from training points in that neighbourhood. In practice it is therefore difficult to assess exactly how reliable LIME’s explanations are. However, to answer the question “What would happen to the model prediction if we changed this value?”, LIME’s explanations can be a good starting point to follow up on.
Shapley additive explanations
Like LIME, Shapley values can be used to assess local feature importance. Although they can be used to explain which feature(s) contribute most to a specific prediction, Shapley values are not designed to answer the “what would happen if” questions that LIME’s local surrogate models were designed for.
To explain the concept of Shapley values, let’s start with a sports analogy. The 2018 FIFA World Cup just finished. France beat Croatia in the final (4–2), and we might wonder how much each player contributed to France’s victory so we can fairly distribute a winner’s bonus. It’s clear that the four players who scored had an important part to play. It’s equally clear that the team could not have been successful if all the other players had been exchanged for random players. After all, it’s the sometimes complex interaction of all players that determines a team’s strength. Therefore it’s difficult to determine an individual player’s added value in the context of a specific team.
So how do we design a fair payout scheme? Shapley values are designed to solve this exact problem. They come from game theory and assume a game with players and a team score. Suppose we could look at every possible combination of (a subset of) players, replay the game, and observe the resulting team score. We could then assign each player a portion of the total payout based on their average added value across all possible subteams to which they could be added. This individual payout is the player’s Shapley value. It is the only payout scheme that is proven to be:
- Efficient: the sum of the Shapley values of all players should sum up to the total payout
- Symmetric: two players should get the same payout if they add the same value in all team combinations
- Dummy-sensitive: a player should get a Shapley value of zero if adding them never improves a subteam’s performance
- Additive: in case of a combined payout (say, the sum of two game bonuses), a player’s combined Shapley value across the two games is the sum of the individual games’ Shapley values. This criterion has no relevant analogy in the context of model interpretability.
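The standard formula behind this payout scheme (not spelled out in the original post) makes the averaging explicit. For a set of players N, a value function v, and a player i, the Shapley value is

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}
  \Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr)
```

that is, a weighted average of player i’s added value v(S ∪ {i}) − v(S) over every subteam S they could join, where the weight counts the orderings in which that subteam could have formed before i arrived.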
Instead of a game with players, we can use the same payout mechanism for a machine learning model with features. The team score in this context is the model’s prediction given a subset of the features. The total payout is the difference between a base value (the prediction of the null model without any features, i.e., the average prediction) and the actual prediction. This difference is then divided over all features according to their relative contributions.
Obviously, looking at all possible subsets of features is computationally prohibitive in most realistic models with many features. Instead, Shapley value approximations can be computed based on sampling. For Python, the shap package implements Shapley values as well as a couple of (partly interactive) plots to explore Shapley values across different features and/or subjects.
The figure below shows the distribution of Shapley values across subjects in the test set for each feature. Again it shows that maximum concave points is the most discriminative feature; the majority of subjects have a non-zero Shapley value for it. The sign shows whether the feature value moved the prediction toward malignant (positive SHAP value) or benign (negative SHAP value). The color reflects the magnitude of the feature value (low, medium, high). This way the plot immediately reveals that low (blue) values on the top ten features decrease the likelihood of a malignant tumor prediction.
The accompanying notebook shows and explains some additional interactive plots that you may want to play with. For more information about all the plots implemented in the shap package, you can read its documentation.
As we have seen there are many types of feature importance measures. Although many tree-based ensemble methods have conveniently implemented feature importance, be careful when interpreting them. They are biased towards the training set, typically do not focus on your performance measure of interest, and do not always reflect added value in case of correlated features. Permutation feature importance, if not computationally too demanding, can be a good alternative.
LIME and Shapley values are two local model-agnostic methods. LIME’s strength is that it allows exploration of the feature space in specific neighbourhoods of interest. Shapley values are your best bet when you need to explain why specific predictions were made.
Note that although three of the four models discussed were model-agnostic in principle, their implementations in Python often are restricted to a (sometimes large) subset of models. Check out their respective documentation for more details.
There is much more to say about feature importance and model interpretability. If you want to dive deeper, I highly recommend reading Christoph Molnar’s book Interpretable Machine Learning. It’s freely available on his GitHub page. This blog post was largely inspired by his book.
About the author
Sven Stringer is a data scientist at BigData Republic, a data science consultancy company in the Netherlands. We hire the best of the best in BigData Science and BigData Engineering. If you are interested in using feature importance measures and other machine learning techniques on practical use cases, feel free to contact us at email@example.com.