# Explainability Technique — Shapley Value

Gone are the days when building good models was the only thing we cared about. With readily consumable open-source projects and automated machine learning, we are moving towards auditing ML models, where explainability is an important aspect. In this article, we will explore a game-theoretic explainability technique: the **Shapley Value**. Before we get to that part, though, we will take a quick detour to understand the basics of game theory.

**The fundamentals**

Game theory, to put it simply, is the field that mathematically studies the behavior of logical decision-makers. It has a wide range of applications across multiple disciplines — from explaining day-to-day phenomena to shedding light on questions of gargantuan scale. At its core, game theory consists of three elements:

- A *game*
- The *players*
- The *rewards*

A game is a situation where the players face a series of choices, each associated with rewards (and costs). The goal is to arrive at the optimal choice **given** the choices made by other players. There are of course many variations of a game, and things can become complicated very quickly. But it is impossible to talk about game theory without ever mentioning the **prisoners’ dilemma** — an example of a simultaneous non-cooperative game where players make decisions, well, simultaneously. This however does not mean they have to make these decisions at the **exact** same time. The key is that each player cannot observe the choices the others make. The classic prisoners’ dilemma has the following conditions:

- There are two players who are being interrogated for a crime
- Each player has only two options: either to *stay silent* or to *betray* his/her accomplice
- Each action has its own reward conditioned on the action chosen by the other party

In this situation, there are four possible permutations (2²):

- Both players stay silent which will reward them with a lighter sentence
- Player A betrays and Player B stays silent which will allow A to go free but B will have to serve a sentence
- Player B betrays and Player A stays silent which will allow B to go free but A will have to serve a sentence
- Both players snitch on each other and so, both of them will get a heavy sentence

The best-case scenario is for both players to stay silent. But snitching on each other is the **best strategy**. Suppose we are A. If we choose to stay silent, we know that B will have an incentive to betray us because, well, who does not want to go free and blame everything on us? If we choose to betray, we know that, logically speaking, B will not choose to stay silent, because that would mean we go free and pin everything on B. So, regardless of what we choose, B will betray us, and hence we should choose to betray. The same analysis applies if we are B instead. This best strategy is called the **Nash Equilibrium (NE)**. A strategy is an NE if no player has an incentive to switch their decision. Simply put, the players are in a deadlock. However, it is possible for the best strategy **not** to be the best-case scenario, just like in our example. In this case, the best strategy is not **Pareto efficient** because, well, it is not efficient.
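The reasoning above can be checked mechanically. The sketch below encodes a prisoners’ dilemma payoff matrix (the specific sentence lengths are illustrative assumptions, not from the text) and searches for the Nash Equilibrium by testing whether either player can improve by unilaterally switching:

```python
# Prisoners' dilemma payoffs as years served (lower is better).
# The specific numbers are illustrative assumptions.
# Key: (action_A, action_B) -> (years_A, years_B)
payoffs = {
    ("silent", "silent"): (1, 1),  # both get a lighter sentence
    ("betray", "silent"): (0, 5),  # A goes free, B serves a sentence
    ("silent", "betray"): (5, 0),  # B goes free, A serves a sentence
    ("betray", "betray"): (3, 3),  # both get a heavy sentence
}

actions = ["silent", "betray"]

def is_nash_equilibrium(a, b):
    """Neither player can reduce their own sentence by unilaterally deviating."""
    years_a, years_b = payoffs[(a, b)]
    a_can_improve = any(payoffs[(alt, b)][0] < years_a for alt in actions)
    b_can_improve = any(payoffs[(a, alt)][1] < years_b for alt in actions)
    return not a_can_improve and not b_can_improve

equilibria = [(a, b) for a in actions for b in actions if is_nash_equilibrium(a, b)]
print(equilibria)  # [('betray', 'betray')]
```

Note that mutual silence fails the check: each player could cut their sentence by switching to betrayal, which is exactly why the equilibrium is not Pareto efficient.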

**Cooperative games — summary**

Now that we have covered the most fundamental example of a game, we will raise the difficulty a notch — we will dive into cooperative games: situations where players are allowed to forge alliances to maximise their payoffs. For example, in a group project, we may choose to form a group with the top-performing students to get a higher score. However, since cooperation is possible, one question comes up:

How can we create a solution that yields a fair payout for all of the participating players?

Going back to the group project example, normally there will be peer review sessions where our contributions to the team will be assessed at the very end. The result of the assessment will then be taken into consideration for our final grade. So, in our example, a peer review ensures each participant gets a fair payoff. But things can get a little bit more complicated in other examples. And so, a general solution is needed to ensure players in cooperative games can get fair payoffs. The perfect solution for the problem is one that satisfies four characteristics:

- *Efficiency*: The sum of each player’s payoff should equal the total available payoff
- *Symmetry*: Two players who contribute equally should receive the same payoff
- *Dummy*: A player who contributes nothing to the coalition should receive a payoff of zero
- *Additivity*: For a player, the sum of his/her payoffs across the different subgames should equal his/her total payoff for the game

And the **Shapley Value**¹ is the **only** solution concept that satisfies all of the above.
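For reference, here is the standard form of the Shapley Value formula, where *p* is the number of players, *S* ranges over the coalitions that exclude player *i*, and *v(S)* is the payoff of coalition *S*:

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \, \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

The two factorials in the weight count permutations of the players who join before and after player *i*, respectively.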

Intuitively, the Shapley Value is the weighted average of a player’s marginal contribution over all possible permutations of the coalitions. In a cooperative game, the order in which a player joins a team matters. With this context in mind, there are two permutations that we must consider:

- The permutations of the existing players in the team before player *i* joins, which is captured by the factorial |S|!
- The permutations of the players that join the team after player *i*, which is described by the factorial (p - |S| - 1)!

We must also know the marginal contribution of a particular player. If, for example, the total payoff of a team is 60, and when player *i* joins the payoff increases to 100, then his marginal contribution is 100 - 60 = 40. Let us use an example to aid our understanding. Suppose there is a game with the following conditions:

- There are 3 players: A, B, and C
- We are interested in finding the Shapley value for C
- When {C} joins {A,B}, his marginal contribution is 30 — **permutation A**
- When {C} joins {A}, his marginal contribution is 40 — **permutation B**
- When {C} joins {B}, his marginal contribution is 20 — **permutation C**
- When {C} creates his own team first, his marginal contribution is 50 — **permutation D**

When we are dealing with only three players, for each player there are only four unique permutations to consider out of the six (3!) orderings when computing the Shapley Value. This is because of symmetry. Take permutation A for example. From the perspective of C, it does not matter whether A forms the team first or B does; as far as his marginal contribution is concerned, it will still be 30. The same goes for permutation D: it does not matter whether A or B joins C first, his marginal contribution will still be 50. Weighting the four permutations above according to the formula (permutations A and D each account for two of the six orderings, B and C for one each), the Shapley Value of C is then (2 × 30 + 1 × 40 + 1 × 20 + 2 × 50) / 6 = 220/6 ≈ 36.67.
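This calculation can also be done by brute force, averaging C’s marginal contribution over all 3! join orders. In the sketch below, the characteristic function `v` is an assumption constructed only to be consistent with the marginal contributions stated in the example (the payoffs of coalitions without C are otherwise arbitrary):

```python
from itertools import permutations

# Coalition payoffs consistent with the marginal contributions in the example.
# C's marginals: 30 after {A,B}, 40 after {A}, 20 after {B}, 50 when alone.
# The values for coalitions without C are assumptions chosen freely.
v = {
    frozenset(): 0,
    frozenset("A"): 10,
    frozenset("B"): 15,
    frozenset("C"): 50,
    frozenset("AB"): 35,
    frozenset("AC"): 10 + 40,   # joining {A} adds 40
    frozenset("BC"): 15 + 20,   # joining {B} adds 20
    frozenset("ABC"): 35 + 30,  # joining {A,B} adds 30
}

def shapley(player, players="ABC"):
    """Average the player's marginal contribution over all join orders."""
    total = 0
    orders = list(permutations(players))
    for order in orders:
        before = frozenset(order[: order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

print(round(shapley("C"), 2))  # 36.67
```

Enumerating all six orderings makes the symmetry explicit: the orderings ABC and BAC both contribute 30, CAB and CBA both contribute 50, leaving only four distinct marginal contributions.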

# How is it linked to XAI?

One of the main goals of XAI is to explain why certain predictions are made and we can transform this into a game with the following details:

- The *game*: determining which features are important for a prediction
- The *players*: the features used in the model training phase
- The *rewards*: Shapley Values

We give features the freedom to form coalitions and “play” against each other. The most important feature for a particular prediction will be the feature with the highest (absolute) Shapley Value. One important step is to determine the size of the pie — in other words, the total payoff that can be shared among the features. In the machine learning context, the size of the pie is the **difference** between the prediction for a particular instance and the average prediction over the dataset. In case you are wondering, the Shapley Value works for both classification and regression use cases and is model-agnostic. Due to its solid theoretical foundation, the Shapley Value is certainly an attractive option for ML explainability. Fortunately, if you want to use it, there is already an open-source implementation.
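To make the game concrete, here is a minimal, dependency-free sketch of this idea. The toy model, the background data, and the convention of replacing an “absent” feature with its dataset mean are all assumptions made for illustration; real implementations are considerably more sophisticated:

```python
from itertools import permutations
from statistics import mean

# Toy illustration: explain one prediction of a simple additive model
# by treating the three features as players. Model and data are assumed.
def model(x1, x2, x3):
    return 2 * x1 + 3 * x2 + x3

# Small background dataset; "absent" features take their dataset mean.
background = [(1, 2, 3), (3, 0, 1), (2, 4, 5), (0, 2, 3)]
means = [mean(col) for col in zip(*background)]

instance = (4, 1, 2)  # the prediction we want to explain

def predict_with(present):
    """Prediction when only the features in `present` take the instance's value."""
    x = [instance[i] if i in present else means[i] for i in range(3)]
    return model(*x)

features = range(3)
phi = [0.0] * 3
orders = list(permutations(features))
for order in orders:
    present = set()
    for f in order:
        before = predict_with(present)
        present.add(f)
        phi[f] += (predict_with(present) - before) / len(orders)

base = predict_with(set())  # the "average" prediction (empty coalition)
print(phi)                  # per-feature contributions
print(base + sum(phi), model(*instance))  # efficiency: they match
```

The last line illustrates the “size of the pie”: the Shapley Values sum exactly to the gap between this instance’s prediction and the baseline prediction, which is the efficiency property from the previous section.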

# Limitations

While the theoretical foundation is solid, the Shapley Value is not perfect. It has its own set of limitations, which can be a turn-off in certain applications. In my previous work, for example, rather than using the Shapley Value, I would opt for simple and explainable algorithms instead, such as linear regressions and tree-based models. In this subsection, we will explore some of its limitations.

**Slow computational speed**

As mentioned, the Shapley Value considers all possible permutations of the coalitions. As you can imagine, this can be a bottleneck, especially if we work with complicated data that has a lot of features. The traditional Shapley Value computation takes exponential time in general. However, a few improvements have been made, most notably by exploiting the structure of tree-based models to compute Shapley Values in polynomial time (TreeSHAP). Recently, even faster versions of the TreeSHAP algorithm have been developed². We will have to see if these algorithms can be applied to large datasets with many features.

**Misleading interpretation**

The interpretation of the Shapley Value is trickier than it seems. The Shapley Value of a feature is **not** the direct effect of the feature on the final prediction outcome. The correct interpretation is: given the features that we have, the Shapley Value of a feature is the contribution of that feature to the **difference** between the **actual prediction** and the **mean prediction**. This means that we cannot use the Shapley Value for analysis such as:

How would the prediction for this particular person change if I changed feature A?

To do such an analysis, we will need a *counterfactual explanation*, not the Shapley Value.

**Assuming features independence**

Perhaps one of the most important limitations of the Shapley Value lies in its assumption that the features are independent of one another (zero correlation). This is hardly ever true in practice. As a result, when features are correlated, the Shapley Values may not reflect the correct explanation and can hence be misleading. There has been an attempt to tackle this problem³. However, at the time of writing, I have yet to see an implementation of the algorithm in the existing open-source packages. And so, implementation from scratch is required — which, by the way, can be a fun side project!

## Conclusion

While Shapley Value has its allure, we must apply it in the right context. As a personal guideline, I would rather choose simple and explainable ML algorithms whenever possible. However, when I find myself in a situation where simple ML algorithms are not enough, Shapley Value certainly can help. The size of the data will also influence my decision. If I am working with a large dataset that requires complex algorithms, I can always consider another explainability approach: **Local Interpretable Model-agnostic Explainability (LIME)**⁴.