Two minutes NLP — Explain predictions with SHAP values

Game Theory, Shapley values, SHAP values, and local interpretability

Fabio Chiusano
NLPlanet
5 min read · Feb 18, 2022


Charts representing sample SHAP values. Image by the author.

Hello fellow NLP enthusiasts! Today we will see how to use an approach from the explainable AI branch (and game theory) called SHAP values, which allows us to understand why our models make certain predictions. Enjoy! 😄

SHAP (SHapley Additive exPlanations) is an approach inspired by game theory to explain the output of any black-box function (such as a machine learning model), using its inputs. The approach is part of the Explainable AI branch of Artificial Intelligence.

SHAP values are based on Shapley values from game theory. In game theory jargon, considering a game and its players, Shapley values offer a way of measuring the relative contribution of each player to the outcome of the game. In a very similar way, in machine learning jargon, considering a model that predicts an outcome from an input sample with its features, SHAP values offer a way of measuring the relative contribution of each feature to the output produced by the model. As a result, SHAP values are about the local interpretability of a predictive model, i.e. they explain how each feature influenced an individual prediction.

But how is the importance of each feature actually computed?

SHAP values are based on the idea that the output produced by each possible combination of features (where each feature can either be included or left out) should be considered to determine their importance. This means that, in principle, a separate model should be trained for each possible combination of the available features, always with the same hyperparameters and training data.

Once we have all these models, SHAP values can be computed by analyzing the difference of the model outputs when using and not using each specific feature. I suggest this excellent article by Samuele Mazzanti for an in-depth example of SHAP values computation.
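To make the idea concrete, here is a brute-force sketch of the Shapley formula itself (an illustration of the definition, not how the shap library works). The callback predict_on_subset is hypothetical: it is assumed to return the model output obtained when only the features in the given subset are used.

    from itertools import combinations
    from math import factorial

    def exact_shapley_values(predict_on_subset, features):
        # predict_on_subset(subset) is a hypothetical callback that returns the
        # model output when only the features in `subset` are available.
        n = len(features)
        shapley = {}
        for feature in features:
            others = [f for f in features if f != feature]
            phi = 0.0
            for size in range(n):
                for subset in combinations(others, size):
                    # Weight of a coalition of this size in the Shapley formula
                    weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                    # Marginal contribution of `feature` to this coalition
                    phi += weight * (predict_on_subset(subset + (feature,)) - predict_on_subset(subset))
            shapley[feature] = phi
        return shapley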

Computing SHAP values exactly requires training a huge number of model variations, which is not always possible. However, the shap Python library implements smart approximations and sampling strategies to make the job feasible.

Let’s delve into practice and see how the shap library can be used.

The shap library

You can install the library with pip.
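
    pip install shap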

In this example, we’ll classify the sentiment of movie reviews from the IMDB dataset, which is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDB) labeled as positive or negative. We’ll do this by applying a TfidfVectorizer (from sklearn) to the reviews and training a RandomForestClassifier on the TF-IDF scores.

Let’s import the libraries.
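A minimal set of imports for this walkthrough (a sketch; the names follow the scikit-learn and shap APIs):

    import shap
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split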

We then get the IMDB dataset directly from the shap library, split it into train and test set, apply the vectorizer and fit the random forest model.
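A sketch of this step: shap.datasets.imdb() returns the raw reviews and their binary labels, while the split ratio, min_df and forest size below are illustrative choices of this sketch, not the only reasonable ones.

    # Load the 50,000 IMDB reviews and their positive/negative labels
    corpus, y = shap.datasets.imdb()

    # Hold out part of the data as a test set
    corpus_train, corpus_test, y_train, y_test = train_test_split(
        corpus, y, test_size=0.2, random_state=7
    )

    # Turn each review into a vector of TF-IDF scores (one feature per word)
    vectorizer = TfidfVectorizer(min_df=10)
    X_train = vectorizer.fit_transform(corpus_train)
    X_test = vectorizer.transform(corpus_test)

    # Fit a random forest on the TF-IDF features
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)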

Next, we create an Explainer object from the model and the training data, which serves as background data for approximating the effect of missing features. The Explainer object is then used to compute the SHAP values of the features (i.e. the single words of the reviews) on the test set. This step might take a while.
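Here is a sketch of what this step could look like. The exact behavior depends on the shap version: with a tree-based model, shap.Explainer typically dispatches to its tree explainer, the small dense background sample below is an assumption of this sketch, and for a binary classifier recent versions return one column of SHAP values per class.

    # One feature name per word (use get_feature_names() on older scikit-learn)
    feature_names = vectorizer.get_feature_names_out()

    # A small dense sample of the training data used as background data
    background = X_train[:100].toarray()

    explainer = shap.Explainer(model, background, feature_names=feature_names)

    # Compute SHAP values for the test reviews (this might take a while;
    # you can pass a slice of the test set to speed things up)
    shap_values = explainer(X_test.toarray())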

Finally, we can plot the SHAP values of the features for each prediction. Let’s consider a positive review first.
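As a sketch, assuming the indexing above (the review index matches the figure below, and the trailing 1 selects the positive class):

    # Waterfall plot of the SHAP values for a single, positive review
    shap.plots.waterfall(shap_values[6, :, 1])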

SHAP values of the words contained in the review number 6, which is a positive review. Image by the author.

Here is how to interpret the plot:

  • At the bottom, indicated with E[f(X)], we see the average prediction of the model over the test set, which is 0.493 (values below 0.5 correspond to negative reviews, values above 0.5 to positive reviews).
  • At the top, indicated with f(x), we see the prediction of the model for the specific sample, which is 0.73 (i.e. a positive review).
  • How did the model go from the average prediction 0.493 to the specific prediction 0.73 for this specific sample? The increment/decrement brought by each word is shown in the plot. For example, the word “amazing”, with its TF-IDF score of 0.086, contributed to a 0.06 increment of the prediction.
  • Analyzing the increments and decrements brought by all the features, we obtain an explanation of why the model predicted a specific value, starting from the average value; the SHAP values add up exactly to the gap between the two, as the quick check below shows.
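Since SHAP values are additive, the base value plus the sum of all the word contributions reproduces the model output for this review. A quick sanity check, using the same hypothetical indexing as above:

    explanation = shap_values[6, :, 1]
    # average prediction + sum of SHAP values ≈ model prediction for this review (0.73)
    print(explanation.base_values + explanation.values.sum())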

Let’s do the same with a negative review.
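The call is the same sketch as before, only with the index of this review:

    shap.plots.waterfall(shap_values[1222, :, 1])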

SHAP values of the words contained in the review number 1222, which is a negative review. Image by the author.

This time the model output is shifted from the average prediction 0.493 down to 0.28. The words “worst” and “poor” contributed a lot to this prediction, accounting for 0.06 and 0.05 of the difference, respectively.

It’s also possible to plot the SHAP value of a specific feature for every sample in the test set. Let’s do that for the word “best”.
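Explanation objects can be sliced by feature name, so a sketch of this plot could look like the following (again assuming one column of SHAP values per class):

    # SHAP values toward the positive class, one column per word
    positive_shap_values = shap_values[:, :, 1]

    # SHAP value of "best" (y-axis) against its TF-IDF score (x-axis), one point per test review
    shap.plots.scatter(positive_shap_values[:, "best"])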

SHAP values of the word “best” for each sample in the test set. Image by the author.

Each point represents the word “best” in a different sample of the test set. Its x-coordinate is the TF-IDF score of “best” in that sample, while its y-coordinate is the corresponding SHAP value.

Almost every time the word “best” is present in a sample (i.e. when its TF-IDF score is greater than zero), it contributes to a shift of the prediction towards the positive label.

It’s interesting to note that the SHAP value of “best” is often negative when the word has zero TF-IDF score: this means that the RandomForestClassifier has learned to give a small shift towards the negative label when the word “best” is not present in the review.

Let’s do the same for the word “worst”.
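The same sketch, just switching the feature name:

    shap.plots.scatter(positive_shap_values[:, "worst"])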

SHAP values of the word “worst” for each sample in the test set. Image by the author.

As expected, the word “worst” has a negative SHAP value when it is present in a review and a small positive SHAP value when it is not. Interestingly, the SHAP values of “worst” have a larger magnitude than those of “best”.
