Interpreting complex models with SHAP values

Note: This post was originally published on the Canopy Labs website, and describes work I’ve been lucky to do as a data scientist there.

An important question in the field of machine learning is why an algorithm made a certain decision.

This is important for a variety of reasons. As an end user, I am more likely to trust a recommendation if I understand why it was exposed to me. As an organization, understanding that customers made a purchase because this campaign was particularly effective can allow me to tailor my future outreach efforts.

However, this is a challenging and still developing field in machine learning. In this post, I am going to discuss exactly what it means to interpret a model, and explore a novel technique called SHAP (https://github.com/slundberg/shap) which is particularly effective at allowing us to take the hood off complex algorithms.

What does it mean to interpret a model (and why is it so hard)?

Let’s start by defining exactly what it means to interpret a model. At a very high level, I want to understand what motivated a certain prediction.

For instance, let’s reuse the problem from the XGBoost documentation, where given the age, gender and occupation of an individual, I want to predict whether or not they will like computer games.

In this case, my input features are age, gender and occupation. I want to know how these features impacted the model’s prediction that someone would like computer games.

However, there are two different ways to interpret this:

  1. On a global level. Looking at the entire dataset, which features did the algorithm find most predictive? XGBoost’s get_score() function, which by default counts how many times a feature was used to split the data, is an example of considering global feature importance, since it looks at what was learned from all the data.
  2. On a local level. Maybe, across all individuals, age was the most important feature, and younger people are much more likely to like computer games. But if Frank is a 50-year-old who works as a video game tester, it’s likely that his occupation is going to be much more significant than his age in determining whether he likes computer games. Identifying which features were most important for Frank specifically involves finding feature importances on a ‘local’ – individual – level.

With this definition out of the way, let’s move on to one of the big challenges in model interpretability:

Trading off between interpretability and complexity

Let’s consider a very simple model: a linear regression. The output of the model is

In the linear regression model above, I assign each of my features x_i a coefficient ϕ_i, and add everything up to get my output. In the case of my computer games problem, my input features would be (x_Age, x_Gender, x_Job).

In this case, it’s super easy to find the importance of a feature; if ϕ_i has a large absolute value, then feature x_i had a big impact on the final outcome (e.g. if ∣ϕ_Age∣ is large, then age was an important feature). However, there is also a drawback, which is that this model is so simple that it can only uncover linear relationships.

For instance, maybe age is an important feature, and if you’re between 12 and 18 you’re much more likely to like computer games than at any other age; since this is a non-linear relationship, a linear regression wouldn’t be able to uncover it.
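To make this concrete, here is a minimal sketch in plain Python. The data is hypothetical (everyone aged 12 to 18 likes computer games, nobody else does), and `fit_line` is an illustrative helper, not part of any library. It fits a one-feature least-squares line and shows that the line cannot represent the bump.

```python
# Hypothetical toy data: ages 5 to 60, where only 12-18 year olds like games.
ages = list(range(5, 61))
likes = [1.0 if 12 <= a <= 18 else 0.0 for a in ages]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(ages, likes)

# Predictions for a 5-year-old (true label 0) and a 15-year-old (true label 1).
pred_5 = intercept + slope * 5
pred_15 = intercept + slope * 15
print(f"slope={slope:.4f}, pred(5)={pred_5:.3f}, pred(15)={pred_15:.3f}")
```

On this data the fitted slope is negative, so the line scores a 5-year-old higher than a 15-year-old. The single coefficient can only say "younger means more likely", which misses the 12 to 18 window entirely.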

In order to uncover this more complicated relationship, I’ll need a more complicated model.

However, as soon as I start using more complicated models, I lose the ease of interpretability which I got with this linear model. In fact, as soon as I try to start uncovering non-linear, or even interwoven relationships — e.g. what if age is important depending on your gender? — then it becomes very tricky to interpret the model.

This decision, between an easy-to-interpret model which can only uncover simple relationships and a complex model which can find very interesting patterns that may be difficult to interpret, is the trade-off between interpretability and complexity.

This is additionally complicated by the fact that I might be interpreting a model because I’m hoping to learn something new and interesting about the data. If this is the case, a linear model may not cut it, since I may already be familiar with the relationships it would uncover.

The ideal case would therefore be to have a complex model which I can also interpret.

How can we interpret complex models?

Thinking about linear regressions has yielded a good way of thinking about model interpretations:

I’ll assign to each feature x_i a coefficient ϕ_i which describes — linearly — how the feature affects the output of the model. We’ve already discussed the shortcomings of this model, but bear with me:

Across many data points, the coefficients ϕ will fail to capture complex relationships. But on an individual level, they’ll do fine, since for a single prediction, each variable will truly have impacted the model’s prediction by a constant value.

For instance, consider the case of Frank, the 50-year-old video game tester who loves computer games. For him, ϕ_Job will be high and ϕ_Age will be low.

But then, for Bobby, a 14-year-old, ϕ_Age will be high since the model has seen that 14-year-olds tend to love computer games because they are 14 years old.

What we’ve done here is take a complex model, which has learnt non-linear patterns in the data, and broken it down into lots of linear models which describe individual data points. It’s important to note that these explanation coefficients ϕ are not the output of the model, but rather what we are using to interpret this model. By aggregating all of these simple, individual models together, we can understand how the model behaves across all the customers.

So, to sum up:

Instead of trying to explain the whole complex model, I am just going to try and explain how the complex model behaved for one data point. I’ll do this using a linear explanation model; let’s call it g.

In addition, to further simplify my simple model, I won’t multiply the coefficients ϕ by the original feature value, x. Instead, I’ll multiply it by 1 if the feature is present, and 0 if it is not.

In the case of predicting who loves computer games, what I therefore get is the following:

g_Frank = ϕ_Frank Age + ϕ_Frank Gender + ϕ_Frank Job

where g_Frank=p_Frank, the original prediction of the model for Frank.

Note that the coefficients apply only to Frank; if I want to find how the model behaved for Bobby, I’ll need to find a new set of coefficients. In addition, since Bobby doesn’t have a job, I multiply ϕ_Bobby Job by 0 (since there isn’t an x_Bobby Job). His simple model will therefore be

g_Bobby = ϕ_Bobby Age + ϕ_Bobby Gender

I’ll do this for all the data points and aggregate it to get an idea of how my model worked globally.
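The explanation model can be sketched in a few lines of Python. Every number below is hypothetical and chosen only for illustration. The text doesn't say how to aggregate, so the mean absolute contribution used in `global_importance` is an assumption, just one simple option.

```python
# Hypothetical coefficients, for illustration only (not from a real model).
phi = {
    "Frank": {"age": 0.1, "gender": 0.2, "job": 0.6},
    "Bobby": {"age": 0.7, "gender": 0.15, "job": 0.4},
}

# 1 if the feature is present for that person, 0 if it is missing.
present = {
    "Frank": {"age": 1, "gender": 1, "job": 1},
    "Bobby": {"age": 1, "gender": 1, "job": 0},  # Bobby has no job
}


def g(person):
    """Linear explanation model: sum of phi_i * z_i, with z_i in {0, 1}."""
    return sum(phi[person][f] * present[person][f] for f in phi[person])


def global_importance():
    """Assumed aggregation: mean absolute contribution per feature."""
    features = ["age", "gender", "job"]
    return {
        f: sum(abs(phi[p][f] * present[p][f]) for p in phi) / len(phi)
        for f in features
    }


print(g("Frank"), g("Bobby"))
print(global_importance())
```

Because Bobby's job indicator is 0, his job coefficient drops out of both his own explanation and the aggregate.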

Now that I have this framework within which to interpret complex models, I need to think about exactly what properties I want ϕ to capture to be useful.

Shapley values (or, how can I calculate ϕ?)

The solution to finding the values of ϕ predates machine learning. In fact, it has its foundations in game theory.

Consider the following scenario: a group of people are playing a game. As a result of playing this game, they receive a certain reward; how can they divide this reward between themselves in a way which reflects each of their contributions?

There are a few things which everyone can agree on; meeting the following conditions will mean the game is ‘fair’ according to Shapley values:

  1. The sum of what everyone receives should equal the total reward
  2. If two people contributed the same value, then they should receive the same amount from the reward
  3. Someone who contributed no value should receive nothing
  4. If the group plays two games, then an individual’s reward from both games should equal their reward from their first game plus their reward from the second game

These are fairly intuitive rules to have when dividing a reward, and they translate nicely to the machine learning problem we are trying to solve. In a machine learning problem, the reward is the final prediction of the complex model, and the participants in the game are features. Translating these rules into our previous notation:

  1. g_Frank should be equal to p_Frank, the probability the complex model assigned to Frank of liking computer games.
  2. If two features x contributed the same value to the final prediction, then their coefficients ϕ should have the same value
  3. If a feature contributed nothing to the final prediction (or if it is missing), then its contribution to g should be 0
  4. If I add up g_(Frank+Bobby) then this should be equal to g_Frank+g_Bobby

It’s worth noting that so far, our simple model by default respects rules 3 and 4.

It turns out that there is only one method of calculating ϕ so that it will also respect rules 1 and 2. Lloyd Shapley introduced this method in 1953 (which is why values of ϕ calculated in this way are known as Shapley values).

The Shapley value for a certain feature i (out of n total features), given a prediction p (this is the prediction by the complex model) is

ϕ_i(p) = Σ_{S ⊆ N∖{i}} [ ∣S∣! (n−∣S∣−1)! / n! ] · ( p(S ∪ {i}) − p(S) )

There’s a bit to unpack here, but this is also much more intuitive than it looks. At a very high level, what this equation does is calculate what the prediction of the model would be without feature i, calculate the prediction of the model with feature i, and then calculate the difference:

effect of feature i = p(with feature i) − p(without feature i)

This is intuitive; I can just add features and see how the model’s prediction changes as it sees new features. The change in the model’s prediction is essentially the effect of the feature.

However, the order in which you add features is important to how you assign their values. Let’s consider Bobby’s example to understand why; it’s the fact that he is both 14 and male that means he has a high chance of liking computer games. This means that whichever feature we add second will get a disproportionately high weighting, since the model will see that Bobby is a really likely candidate for liking computer games only when it has both pieces of information.

To better illustrate this, let’s imagine that we are trying to assign feature values to the decision tree from the XGBoost documentation. Different implementations of decision trees have different ways of dealing with missing values, but for this toy example, let’s say that if a value the tree splits on is missing, it calculates the average of the leaves below it.

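As a reminder, here is the decision tree (with Bobby labelled): the root splits on age, sending younger people such as Bobby left to a second split on gender, whose leaves score 2 (male) and 0.1 (not male). Everyone else goes right to a single leaf scoring −1.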

First, we’ll see Bobby’s age, and then his gender.

When the model sees Bobby’s age, it will take him left on the first split. Then, since it doesn’t have a gender yet, it will assign him the average of the leaves below, or (2 + 0.1) / 2 = 1.05. So the effect of the age feature is 1.05.

Then, when the model learns he is male, it will give him a score of 2. The effect of the gender feature is therefore 2−1.05=0.95.

So in this scenario, ϕ_Age Bobby=1.05 and ϕ_Gender Bobby=0.95.

Next, let’s say we see his gender, and then his age.

In the case where we only have a gender, the model doesn’t have an age to split on. It therefore has to take an average of all the leaves below the root.

First, the average of the depth 2 leaves: (2 + 0.1) / 2 = 1.05. This result is then averaged with the other depth 1 leaf: (1.05 + (-1)) / 2 = 0.025. So, the effect of the gender feature is 0.025.

Then, when the model learns he is 14, it gives him a score of 2. The effect of the age feature is then 2−0.025=1.975.

So in this scenario, ϕ_Age Bobby=1.975 and ϕ_Gender Bobby=0.025.

Which value should we assign ϕ_Age Bobby? If we assign ϕ_Age Bobby a value of 1.975, does this mean we assign ϕ_Gender Bobby a value of 0.025 (since, by rule 1 of Shapley fairness, the total coefficients must equal the final prediction of the model for Bobby, in this case 2)?

This is far from ideal, since it ignores the first sequence, in which ϕ_Gender Bobby would get 0.95 and ϕ_Age Bobby would get 1.05.
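The two orderings above can be reproduced with a small sketch of the toy tree. The function names are hypothetical, and the age threshold of 15 is an assumption: the text only says that Bobby, at 14, goes left. As in the walk-through, a missing split value returns the average of the leaves below that split, and the first feature's effect is measured from zero.

```python
AGE_SPLIT = 15  # assumed threshold; the text only says Bobby (14) goes left


def tree_predict(features):
    """Toy tree: a missing split value averages the leaves below it."""
    if "age" not in features:
        # Root split missing: average the left subtree's mean with the right leaf.
        return ((2.0 + 0.1) / 2 + (-1.0)) / 2
    if features["age"] >= AGE_SPLIT:
        return -1.0
    if "is_male" not in features:
        # Gender split missing: average its two leaves.
        return (2.0 + 0.1) / 2
    return 2.0 if features["is_male"] else 0.1


def sequential_effects(person, order):
    """Reveal features one at a time and record each change in prediction."""
    seen, previous, effects = {}, 0.0, {}  # starts from zero, as in the text
    for name in order:
        seen[name] = person[name]
        current = tree_predict(seen)
        effects[name] = current - previous
        previous = current
    return effects


bobby = {"age": 14, "is_male": True}
age_first = sequential_effects(bobby, ["age", "is_male"])
gender_first = sequential_effects(bobby, ["is_male", "age"])
print(age_first)
print(gender_first)
```

Up to floating-point rounding, the first ordering gives age about 1.05 and gender about 0.95, while the second gives gender about 0.025 and age about 1.975, matching the two scenarios above.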

What a Shapley value does is consider both orderings, calculating a weighted sum to find the final value. This is why the equation for ϕ_i(p) must sum over all possible subsets S of the features (excluding the feature i we are interested in). This is what S ⊆ N∖{i} below the summation describes, where N is the set of all features.

How are the weights assigned to each component of the sum? Each weight counts how many orderings of the features are consistent with the subset S: the features already in S can be arranged in ∣S∣! ways, and the features which have yet to be added can be arranged in (n−∣S∣−1)! ways. Finally, everything is normalized by n!, the total number of orderings of all n features.
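As a quick sanity check on these weights, the sketch below computes ∣S∣! (n−∣S∣−1)! / n! for each subset size. The helper names are hypothetical. It also confirms that, once every subset of each size is counted, the weights add up to 1, which is what makes the result a weighted average.

```python
from math import comb, factorial


def shapley_weight(subset_size, n):
    """Weight of one subset S with |S| = subset_size, out of n features."""
    return factorial(subset_size) * factorial(n - subset_size - 1) / factorial(n)


def total_weight(n):
    """Sum of weights over all subsets of the other n - 1 features."""
    # There are comb(n - 1, k) subsets of size k, each with the same weight.
    return sum(comb(n - 1, k) * shapley_weight(k, n) for k in range(n))


weights_n2 = [shapley_weight(k, 2) for k in range(2)]
weights_n3 = [shapley_weight(k, 3) for k in range(3)]
print(weights_n2, weights_n3, total_weight(5))
```

With two features both subsets get a weight of 1/2. With three features, the empty set and the full set of other features each get 1/3, and each single-feature subset gets 1/6.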

Calculating a Shapley value

For Bobby, what would the Shapley value be for his age?

First, I need to construct my sets S. These are all possible combinations of Bobby’s features, excluding his age. Since he only has one other feature — his gender — this yields two sets: {x_Gender}, and an empty set {}.

Next, I need to calculate the contribution to ϕ_i(p) for each of these sets, S. Note that as I have 2 features, n=2.

In the case where S = {}:

The prediction of the model when it sees no features is the average of all the leaves, which we have calculated to be 0.025. We’ve also calculated that when it sees only the age, it is 1.05, so

p(S ∪ {x_Age}) − p(S) = 1.05 − 0.025 = 1.025

This yields

(0! · 1! / 2!) · 1.025 = 0.5125

In the case where S = {x_Gender}:

We’ve calculated that the prediction of the model with only the gender is 0.025, and then when it sees both his age and his gender is 2, so

p(S ∪ {x_Age}) − p(S) = 2 − 0.025 = 1.975

So

(1! · 0! / 2!) · 1.975 = 0.9875

Adding these two values together yields

ϕ_Age Bobby = 0.5125 + 0.9875 = 1.5

Note that this value makes sense; it falls roughly midway between the two values (1.05 and 1.975) we got when we calculated feature importance just by adding features one by one.

In summary, Shapley values calculate the importance of a feature by comparing what a model predicts with and without the feature. However, since the order in which a model sees features can affect its predictions, this is done in every possible order, so that the features are fairly compared.
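The whole calculation can be checked with a brute-force sketch that enumerates every subset S. This is a naive illustration, not how the SHAP library computes its values. The function names are hypothetical and the age threshold of 15 is an assumption, since the text only says that Bobby (14) goes left.

```python
from itertools import combinations
from math import factorial

AGE_SPLIT = 15  # assumed threshold; the text only says Bobby (14) goes left


def tree_predict(features):
    """Toy tree: a missing split value averages the leaves below it."""
    if "age" not in features:
        return ((2.0 + 0.1) / 2 + (-1.0)) / 2
    if features["age"] >= AGE_SPLIT:
        return -1.0
    if "is_male" not in features:
        return (2.0 + 0.1) / 2
    return 2.0 if features["is_male"] else 0.1


def shapley_value(predict, person, feature):
    """Weighted sum of p(S with feature) - p(S) over all subsets S."""
    others = [f for f in person if f != feature]
    n = len(person)
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            without = {f: person[f] for f in subset}
            with_feature = {**without, feature: person[feature]}
            total += weight * (predict(with_feature) - predict(without))
    return total


bobby = {"age": 14, "is_male": True}
phi_age = shapley_value(tree_predict, bobby, "age")
phi_gender = shapley_value(tree_predict, bobby, "is_male")
print(phi_age, phi_gender)
```

Up to floating-point rounding, this gives about 1.5 for age, matching the hand calculation, and about 0.475 for gender. The two values sum to 1.975, which is the full prediction (2) minus the prediction with no features (0.025).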

SHAP values

Unfortunately, going through all possible combinations of features quickly becomes computationally infeasible: with n features, the sum for a single Shapley value runs over 2^(n−1) subsets.
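To give a feel for how fast this grows, the sketch below counts the subsets a brute-force calculation would visit for a single feature. The helper name is hypothetical.

```python
def subsets_per_feature(n_features):
    """Number of subsets S of the other n - 1 features."""
    return 2 ** (n_features - 1)


for n in (2, 10, 20, 30):
    print(n, subsets_per_feature(n))
```

Two features need only 2 subsets, as in Bobby's example, but 30 features already need 536,870,912 subsets per feature, each requiring model predictions with and without that feature.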

Luckily, the SHAP library introduces optimizations which allow Shapley values to be used in practice. It does this by developing model-specific algorithms, which take advantage of different models’ structures. For instance, SHAP’s integration with gradient boosted decision trees takes advantage of the hierarchy in a decision tree’s features to calculate the SHAP values.

This allows the SHAP library to calculate Shapley values significantly faster than if a model prediction had to be calculated for every possible combination of features.

Conclusion

Shapley values, and the SHAP library, are powerful tools for uncovering the patterns a machine learning algorithm has identified.

In particular, by considering the effects of features on individual data points, instead of on the whole dataset (and then aggregating the results), the interplay of combinations of features can be uncovered. This allows far more powerful insights to be generated than with global feature importance methods.

Sources

S. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions, 2017