Understanding the impact of features and data through Shapley Values

André Segalla
DataLab Log · Oct 28, 2019

Introduction

With the huge growth in our capability to extract, store and process ever larger amounts of data in recent years, the ideal stage for Machine Learning (ML) is set. That is because data is the fuel that powerful models use to learn to solve a wide variety of tasks by extracting useful information, the so-called insights. From medical prognosis and diagnosis to self-driving cars and face recognition, ML models have been able to surpass human capacity in many fields, leading to their increased adoption to assist humans in several important decision-making processes.

There are still obstacles to advancing in this direction, though. Naturally, in order for humans to rely on ML models to make important decisions that might drastically affect their lives, trust has to be built. Humans, in turn, can only trust algorithms if they understand them, or more precisely, if they understand why those algorithms make the predictions they do. The problem is that this is difficult, especially for powerful ML models, which are typically so complex that they are regarded as black-box models⁶, meaning that we have little or no clue about why they make a certain prediction instead of another. As a recent example that strongly affects Experian, there is the increasing need to explain credit scores to end users, especially due to the transparency requirements of the GDPR. Since this is a central issue in the market, many credit bureaus still use very simple models (such as Logistic Regression or Decision Trees) to build credit scores because they are easy to explain, despite their typically much lower accuracy compared to powerful methods such as Deep Neural Networks and Boosting algorithms. Obviously, credit bureaus are eager to use better models to build credit scores, and the only way to advance in this direction is to develop methods that can satisfactorily explain their predictions. Other areas besides credit where model explanations are crucial include medicine, self-driving cars, law, or any other area you can think of that can strongly impact human lives. In light of the relevance of this matter, much effort has been put into explaining the predictions of ML models, giving rise to an area called Explainable AI (XAI).

Besides explaining model predictions, that is, explaining how each input feature impacts a model’s output, another important recent research topic that has been gaining traction is understanding the impact of the data itself on a model’s performance. Let us again take Experian as an example. In Brazil, Serasa Experian spends a lot of money to buy credit information from Registries across the whole country to build credit models. Obviously, the goal is to build models with the highest possible performance, and since data is the fuel for powerful models, one could simply adopt the strategy of getting as much data as possible, since, at least in principle, the more data the better. However, this is a very short-sighted way to look at the problem. One could ask, for instance, whether the data available from some Registry is indeed valuable, and therefore needed, compared to the data from other Registries. More specifically, Serasa Experian could pose questions like: if I buy the data from a given Registry, will the performance of my model increase? If so, is the price of the data “fair”? By measuring the value of data, the company could adopt more efficient strategies to gather data, potentially reducing costs significantly and even building better models, since low-quality data might be avoided.

Perhaps rather surprisingly, although the problems of model explainability and data valuation are fundamentally different in their goals, they can still be treated under a common framework that has been gaining a lot of attention in the ML community over the last two years: the so-called Shapley values⁵.

Shapley Values

The Shapley value is a concept introduced in the 1950s by Lloyd Shapley in the context of cooperative game theory, and it has been improved and adapted to different contexts in game theory since then. The idea is the following. In a cooperative game, that is, a situation where players have to cooperate in order to maximize the group’s gain (as opposed to a non-cooperative game, where there is no incentive or rule for the players to cooperate with each other), it is important to establish a way to fairly distribute the total gain among the players according to their contribution to the game’s result. From a game-theoretic perspective, a player can be any entity capable of taking actions in some context (like humans in a football game, countries in a war, wolves in nature or even something very abstract) in order to maximize the output of a so-called characteristic function, commonly referred to as the payoff. A cooperative game is thus defined by a set of players and by a characteristic function that assigns a real number (payoff) to each possible group of players, also called coalitions. Relevant questions about such games are which coalitions will form and what the contribution of each player is to all possible coalitions, or simply to the game itself.

Let N be a set of players and v be a characteristic function that assigns a real number (payoff) to each coalition S ⊆ N. In order to attribute a value to each player (ϕ_i, i=1,…,|N|), Shapley postulated a simple set of desiderata⁵ that these values should satisfy, which uniquely fixes them to be a weighted average of their marginal contributions to each coalition:
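$$\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\Bigl(v\bigl(S \cup \{i\}\bigr) - v(S)\Bigr), \qquad i = 1, \dots, |N|$$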

where the marginal contribution of a player to a coalition is simply the difference between the payoff obtained when this player is included in the coalition and when it is excluded from it. We see, for instance, from the expression above that if a player does not affect the payoff of any coalition, all of its marginal contributions are zero, and therefore its contribution to the game is zero. Such a player is called a null (or zero) player, and assigning it zero value is one of the desiderata postulated by Shapley. It is also interesting to point out that a marginal contribution can be negative, meaning that a player can even harm a coalition (like a football player scoring an own goal). To conclude this part, an important obstacle to computing the Shapley values is the typically huge number of terms in the sum, which is exponential in the number of players: for each player one has to compute 2^(|N|-1) marginal contributions, and evaluating all of them requires the payoff of each of the 2^|N| possible coalitions. For instance, a cooperative game with 100 players involves 2¹⁰⁰ ≈ 1.27×10³⁰ coalitions, more than the estimated number of stars in the observable universe (~10²²–10²⁴). In light of this exponential complexity, exact computation of the Shapley values can typically only be carried out in games with a few players, which is an important limitation in the context of ML and therefore requires creative solutions, as we will see shortly.
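To make this concrete, here is a minimal brute-force computation of the exact Shapley values for a made-up three-player game (the payoffs below are purely illustrative):

```python
from itertools import combinations
from math import factorial

# Toy cooperative game: 3 players and an illustrative payoff for every coalition.
players = [1, 2, 3]
payoff = {
    (): 0,
    (1,): 10, (2,): 20, (3,): 30,
    (1, 2): 50, (1, 3): 60, (2, 3): 70,
    (1, 2, 3): 100,
}

def shapley_value(i):
    """Exact Shapley value of player i: weighted average of its marginal contributions."""
    n = len(players)
    others = [p for p in players if p != i]
    value = 0.0
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            value += weight * (payoff[tuple(sorted(S + (i,)))] - payoff[S])
    return value

values = {i: shapley_value(i) for i in players}
print(values)                # ≈ {1: 23.3, 2: 33.3, 3: 43.3}
print(sum(values.values()))  # efficiency: the values sum to the grand coalition's payoff (100)
```

Even in this tiny example every one of the 2³ coalitions has to be evaluated, which is exactly what becomes prohibitive as the number of players grows.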

SHAP — Shapley Additive Explanations

Going back to ML, first we want to understand how we can tackle the problem of explainability with the Shapley values. This section is based on two main papers¹,². Since the Shapley value is a concept applied to cooperative games, one has to somehow identify such a scenario in the context of ML. To do that, imagine a supervised learning setting, where a model is described by a parametrized function f(x;θ) that maps points from a high-dimensional feature space into the real numbers, covering both regression and classification. Then, given a dataset of inputs and corresponding labels D={(x_i, y_i)}, x_i ∈ ℝ^N, y_i ∈ ℝ, i=1,…,n, a machine learning algorithm A learns the parameters θ of the model by minimizing some cost function over the training set D, a procedure known as empirical risk minimization. Once the model parameters are learned, one can compute predictions for new input samples. Since an input typically has several components, where each component is a so-called feature, the problem of explainability can be formulated as measuring the impact of each feature on the corresponding prediction. Now, here comes the trick: this ML scenario can be described as a cooperative game. How? Just look at the following analogy between ML and game-theoretic concepts:
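Player ↔ a feature of the input sample x
Coalition ↔ a subset of the features of x
Payoff ↔ the model’s prediction f(x)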

In words, given a sample input x (coalition), we want to compute the impact of each feature of x (player) on the model’s prediction (payoff). For the importance of feature i of the input sample x, the expression of the Shapley values then naturally follows:
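$$\phi_i(f, x) \;=\; \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\bigl(|F|-|S|-1\bigr)!}{|F|!}\,\Bigl(f\bigl(x_{S \cup \{i\}}\bigr) - f\bigl(x_S\bigr)\Bigr)$$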

where F denotes the set of all features and f(x_S) is the model evaluated on the subset S of the components of x. Since ML models are generally not defined over subsets of the components of the input space, the authors² compute these terms through expected values. The application of Shapley values in the context of model explainability became known as SHAP values, an abbreviation for SHapley Additive exPlanations. The word “additive” reflects the fact that the prediction is additive in the SHAP values, as required by the Shapley postulates:
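$$f(x) \;=\; \phi_0 + \sum_{i=1}^{|F|} \phi_i(f, x)$$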

This means that the prediction for a new input sample x is given by the sum of its SHAP values plus a bias term (ϕ_0), which corresponds to the model’s average prediction. Observe that the SHAP values depend on the model f (obviously!) and on the input sample itself. We say that this type of explanation is local, since each sample has its own explanation, as opposed to global explanations, which reflect a sort of average behavior of the model (such as the weights of a logistic regression).

Not surprisingly, the problem of computing the SHAP values exactly persists due to the exponential complexity of the sum, and here comes a key achievement of the authors². They developed an algorithm, Tree SHAP, that computes the SHAP values exactly for tree-based models in low-order polynomial (rather than exponential) time. That is huge. To get a feeling for what that means, just look at the result of a simulation from their paper² comparing the computation time of the SHAP values using Tree SHAP with brute force (computing the sum above explicitly):

We see that while with brute force the computation time grows by about 5 orders of magnitude, with Tree SHAP it grows by only about one order of magnitude over the same range of the number of input features. As I said, that is huge. Moreover, tree-based models are among the models that achieve state-of-the-art performance in several contexts (such as XGBoost). Beyond trees, the authors propose several other implementations¹ to compute the SHAP values for different kinds of algorithms by adapting other known explainability methods, such as LIME and DeepLIFT, to satisfy the Shapley postulates. Along with the theoretical development, there is a repository available with the SHAP implementations in Python, which also includes a rich set of visualization methods and insightful examples. We strongly recommend taking a look at it³.
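To give a flavor of how this looks in practice, below is a minimal sketch using the shap package³ with an XGBoost regressor. The dataset and model settings are illustrative choices, not part of the original papers, and the package API may have evolved since this was written:

```python
# Minimal sketch of Tree SHAP with the shap package³ (illustrative dataset and model).
import numpy as np
import shap
import xgboost
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = xgboost.XGBRegressor(n_estimators=100, max_depth=4).fit(X, y)

explainer = shap.TreeExplainer(model)     # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X)    # one row of SHAP values per sample

# Additivity check: bias + sum of SHAP values should reproduce the model output.
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.abs(model.predict(X) - reconstructed).max())

# Global picture: which features matter most, and in which direction.
shap.summary_plot(shap_values, X)
```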

Data Shapley

Let’s now approach the second problem: quantifying the value of data in a supervised learning setting. This second part is based on a more recent paper⁴. To start, we must realize that the value of data has to be somehow related to our final goal, which is to build a model with the best possible performance. In this sense, the higher the impact of a data point on a model’s performance, the higher its value should be. Let us formalize that idea. Let D denote a training dataset with N sample pairs d_i, with inputs x_i and corresponding output labels y_i:
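$$D = \{\, d_i = (x_i, y_i) \,\}_{i=1}^{N}$$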

In a supervised learning setting, a model is learned through a learning algorithm A that usually consists in minimizing a cost function defined over the training set. So A can be regarded as a function that takes the training set D as input and outputs a model f(x;θ). The model’s performance is then evaluated on a separate, fixed test set. We denote the performance of a model trained on D through algorithm A (and evaluated on the test set) by V(A, D), or just V(D). A naive way to attribute a value to a data point d_i would then be to compare the performance of the model f trained with and without it; the greater this difference, the more valuable the data point. This method is known as Leave-One-Out (LOO), but it has an important shortcoming, as the authors⁴ point out. Suppose the training set contains n identical copies of each sample and consider using a KNN (with k ≤ n-1) to make predictions from this data. If we remove any single sample from the training set, the KNN’s performance would not change, since there are still n-1 identical copies of the removed sample in the dataset. In this case, LOO would attribute zero value to all data points. A way to circumvent this problem is to compute the performance differences obtained by including and excluding a data point from different subsets of the training set and taking some average. But this is precisely the idea behind the Shapley values. This means that we can describe this problem from a game-theoretic perspective as well, by making the following analogy:
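Player ↔ a single data point d_i of the training set
Coalition ↔ a subset of the training set D
Payoff ↔ the model’s performance V on the test set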

In words, given a training set D (coalition), we want to quantify the impact of each data point (player) of D on the model’s performance V (payoff). I am sure you can guess what comes next. Yes, that ugly sum again:
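$$\phi_i \;=\; C \sum_{S \subseteq D \setminus \{d_i\}} \frac{V\bigl(S \cup \{d_i\}\bigr) - V(S)}{\binom{N-1}{|S|}}$$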

The ϕ_i’s in this context are called Data Shapley values. C is an arbitrary constant (because the authors⁴ do not adopt exactly all of Shapley’s postulates), which they end up fixing to C=1/N!. Now, again we have to compute this sum somehow, but as opposed to the case of SHAP, there are still no very efficient methods available. This can be understood by looking closely at the cost of computing V(S): to obtain the performance V(S) it is necessary to train the model f on the dataset S, so computing the sum above requires training a model a number of times that grows exponentially with the number of samples in the dataset, which in the common modern context of ML is unfeasible given the typically very large number of training samples (remember, even 100 players already makes the exact sum intractable). Nevertheless, the authors⁴ used Monte Carlo sampling and several approximations to obtain approximate results at reasonable cost. They came up with two methods: TMC-Shapley (Truncated Monte Carlo Shapley) and G-Shapley (Gradient Shapley), where the former smartly truncates the sum and the latter makes further approximations on top of TMC for gradient-based learning algorithms to reduce computation time even more. They conducted two experiments with real data (Breast Cancer and Skin Cancer) and compared both TMC-Shapley and G-Shapley to LOO and a random selection baseline. Both are classification problems, and in each case the respective cancer was predicted using Logistic Regression. The idea was to evaluate how the model’s performance changes in four different scenarios: successively removing high-value data, removing low-value data, adding high-value data and adding low-value data, with the training points ranked according to each of the methods being compared. Below we see the figure with the results shown in the paper⁴:

For instance, if we look at the first column, corresponding to removing high-value data, for both datasets the performance of the model falls much faster when data are successively removed in the order of the values assigned by TMC-Shapley and G-Shapley than with LOO or random selection. Similarly, in the third column we see that by successively adding new high-value samples to the training set, the model’s performance increases faster when the data are ranked according to TMC-Shapley and G-Shapley, again for both datasets. This method could therefore give us important hints on how to strategically collect new data to improve a model’s performance, by both seeking high-value data and avoiding low-value data.
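To make the truncated Monte Carlo idea more concrete, here is a highly simplified sketch of TMC-Shapley. It is not the authors’ implementation (which is available in their repository⁷); the function names, the truncation rule and the toy usage below are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def tmc_shapley(n_points, fit_and_score, n_permutations=200, truncation_tol=0.01):
    """Highly simplified Truncated Monte Carlo Shapley sketch.

    fit_and_score(idx) must train the model on the training points indexed by idx
    and return its performance V on a fixed test set (chance level for empty idx).
    """
    values = np.zeros(n_points)
    full_score = fit_and_score(np.arange(n_points))          # V(D), used for truncation
    for t in range(1, n_permutations + 1):
        perm = np.random.permutation(n_points)                # a random ordering of the data
        prev_score = fit_and_score(np.array([], dtype=int))   # V of the empty coalition
        for k, i in enumerate(perm):
            if abs(full_score - prev_score) < truncation_tol:
                marginal = 0.0                                 # truncate: remaining marginals ~ 0
            else:
                score = fit_and_score(perm[:k + 1])
                marginal = score - prev_score
                prev_score = score
            values[i] = ((t - 1) * values[i] + marginal) / t   # running average over permutations
    return values

# Illustrative usage on a small synthetic classification problem.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fit_and_score(idx):
    if len(idx) == 0 or len(np.unique(y_tr[idx])) < 2:
        return 0.5                                             # cannot train: chance-level accuracy
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    return clf.score(X_te, y_te)

data_values = tmc_shapley(len(X_tr), fit_and_score, n_permutations=50)
print(np.argsort(data_values)[::-1][:10])                      # indices of the 10 highest-value points
```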

To conclude: as opposed to SHAP, a shortcoming of Data Shapley is its still very high computational cost, which might make it unfeasible in several applications. A smart way to mitigate this problem, as the authors do in their work⁴, is to compute the Data Shapley values for a small set of data points and then train a regressor to learn the relation between those points and their computed values, generalizing the valuation to other data points.

Conclusion

This was a brief overview of the recent use of an important and long-known concept from cooperative game theory, the Shapley value, in the context of ML, both to explain a complex model’s predictions, through SHAP values, and to quantify the value of data, through Data Shapley. SHAP has been widely adopted since its inception, both in research and in applications. Data Shapley is a more recent technique and there is still a lot of room for further development, at least in the direction of more efficient computation methods that would enable more widespread adoption in both research and applications.

References

[1] Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.

[2] Lundberg, S. M., Erion, G. G., and Lee, S.-I. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.

[3] https://github.com/slundberg/shap

[4] Ghorbani, A. and Zou, J. Data Shapley: Equitable Valuation of Data for Machine Learning. In Proceedings of Machine Learning Research, vol. 97, 2019. http://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf

[5] Shapley, L. S. A value for n-person games. In Contributions to the Theory of Games, 2(28), pp. 307–317, 1953.

[6] https://towardsdatascience.com/the-black-box-metaphor-in-machine-learning-4e57a3a1d2b0

[7] https://github.com/amiratag/DataShapley
