# Explaining the predictions— Shapley Values with PySpark

## Interpreting Isolation Forest’s predictions, and not only those

# The problem: how to interpret Isolation Forest’s predictions

More specifically, how to tell **which features contribute most** to the predictions. Since Isolation Forest **is not a typical Decision Tree** (see Isolation Forest characteristics here), after some research, I ended up with **three possible solutions**:

1) Train **another, similar algorithm** on the same dataset, one that has feature importance implemented and is more easily interpretable, like Random Forest.

2) **Reconstruct the trees**, as a graph for example. **The most important features should be the ones on the shortest paths of the trees.** This is because of how Isolation Forest works: the anomalies are few and distinct, so it should be easier to single them out in fewer steps, thus, shortest paths.

3) Estimate **the Shapley values**: the marginal feature contributions, which is a more standard way of producing a feature importance ranking.

Options 1 and 2 were not deemed the best solutions to the problem, mainly because of the difficulty of picking a suitable algorithm: Random Forest, for example, operates differently from Isolation Forest, so it would not be wise to use its feature importance output. Additionally, option 2 does not generalize, and it would amount to doing the algorithm’s work all over again.

For option 3, while there are a few libraries, like shap and shparkley, I had many issues using them with *spark IForest*, regarding both usability and performance.

The solution was to implement Shapley values’ estimation using Pyspark, based on the Shapley calculation algorithm described below.

The implementation takes **a trained pyspark model, the spark dataframe with the features, the row to examine, the feature names, the features column name and the column name to examine, e.g. prediction**. The output is **the ranking of features, from the most to the least helpful**.
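Based on that description, the function’s signature looks roughly like the following sketch (the argument names are hypothetical, assumed by me, and not necessarily those in the actual code):

```python
# hypothetical signature, assembled from the description above
def calculate_shapley_values(
    df,                               # spark dataframe with the features
    model,                            # a trained pyspark model
    row_of_interest,                  # the row to examine
    feature_names,                    # the ordered feature names
    features_col="features",          # the features column name
    column_to_examine="prediction",   # e.g. 'prediction' or 'anomalyScore'
):
    """Return the features ranked from the most to the least helpful."""
    ...
```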

# What are the Shapley values?

> The Shapley value provides a principled way to explain the predictions of nonlinear models common in the field of machine learning. By interpreting a model trained on a set of features as a value function on a coalition of players, Shapley values provide a natural way to compute which features contribute to a prediction.[14] This unifies several other methods including Locally Interpretable Model-Agnostic Explanations (LIME),[15] DeepLIFT,[16] and Layer-Wise Relevance Propagation.[17]

source: https://en.wikipedia.org/wiki/Shapley_value#In_machine_learning

Or as the python shap package states:

> A game theoretic approach to explain the output of any machine learning model.

In other words, it is the average of the marginal contributions across all feature permutations. The following article has a good way of explaining what the marginal contribution is:

You can find the relevant code here.

From the article below, with an example from game theory: *“finding each player’s marginal contribution, averaged over every possible sequence in which the players could have been added to the group”*. Thus: finding each feature’s marginal contribution, averaged over every possible sequence in which the features could have been selected. I’ll provide an algorithm and a visual example shortly :)

# Estimating Shapley Values with Pyspark

## An approximation with Monte-Carlo sampling

First, select an instance of interest x, a feature j, and the number of iterations M. For each iteration:

1) A random instance z is selected from the data and a random order of the features is generated.

2) Two new instances are created by combining values from the instance of interest x and the sample z.

3) The instance x+j is the instance of interest, but all values in the order before feature j are replaced by feature values from the sample z.

4) The instance x−j is the same as x+j, but in addition has feature j replaced by the value for feature j from the sample z.

5) The difference in the prediction from the black box is computed.

All these differences are averaged, and the result is the estimated Shapley value for feature j.
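In symbols, the averaged differences give the standard Monte-Carlo estimate (notation mine, following the steps above):

$$\hat{\phi}_j = \frac{1}{M}\sum_{m=1}^{M}\left(\hat{f}\left(x^{m}_{+j}\right) - \hat{f}\left(x^{m}_{-j}\right)\right)$$

A minimal pure-Python sketch of that loop, where `predict` stands in for the black-box model (all names here are illustrative):

```python
import random

def estimate_shapley_j(predict, X, x, j, M, rng=random):
    """Monte-Carlo estimate of the Shapley value of feature j for instance x.

    predict: callable taking a list of feature values, returning a number
    X: the dataset, a list of instances; x: the instance of interest
    j: the feature index; M: the number of iterations
    """
    total = 0.0
    for _ in range(M):
        z = rng.choice(X)                 # random instance z
        order = list(range(len(x)))
        rng.shuffle(order)                # random feature order
        pos = order.index(j)
        x_plus, x_minus = list(x), list(x)
        for k in order[:pos]:             # features before j take z's values
            x_plus[k] = z[k]
            x_minus[k] = z[k]
        x_minus[j] = z[j]                 # x-j additionally takes z's value for j
        total += predict(x_plus) - predict(x_minus)
    return total / M
```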

The advantages of using Shapley values:

- *“The difference between the prediction and the average prediction is **fairly distributed** among the feature values of the instance. […] In situations where the law requires explainability, like the EU’s “right to explanations”, the Shapley value might be the only legally compliant method, because it is based on a solid theory and distributes the effects fairly.”* This is a very important advantage, I believe.
- You can use a sample of the dataset to compute the Shapley values.
- The Shapley value is the only explanation method with a **solid theory** behind it.

## The disadvantages of using Shapley values:

- As one can imagine, the calculation of Shapley values requires **a lot of computing time**. Decreasing the number of iterations (M) reduces computation time, but **increases the variance** of the Shapley value. It makes sense that **the number of iterations should be large enough to accurately estimate the values, but small enough to complete the computation in a reasonable time**. This is one of the main reasons for a scalable solution. This article won’t focus on the ways to choose the right M, but it will provide a possible way to use the whole dataset, or the greater part of it, for the calculation in a relatively reasonable time.
- You need both the **model** and the **data** (feature vectors) to calculate them.
- **It can be misinterpreted**. This is one issue I struggled with myself. *“The Shapley value of a feature value is not the difference of the predicted value after removing the feature from the model training.”* Instead, one should view the values as the contribution of a feature value to **the difference between the actual prediction and the mean prediction**.
- It is **not suitable for sparse solutions**, meaning sparse vectors with few features within a sample. **All the features are needed.**

# Using Pyspark

## First of all, why Pyspark?

Because of the following statement:

> “… the number of iterations should be large enough to accurately estimate the values, but small enough to complete the computation in a reasonable time …”

Estimating Shapley values requires **a lot of computing time**, and if you are using Spark models, there’s a good chance your dataset ranges from big-ish to huge. So it only makes sense to use Spark to spread the computation across different workers and make things as efficient as possible. Sure, we’ll be using a udf for part of the calculation, but even that will run on different workers, in parallel.


## Ok then, how?

To calculate the Shapley values for all the features, following the algorithm description above, the pyspark implementation works as described below.

Let’s start with a dataframe that has an ID column and features F1, F2, F3, F4.

The ID doesn’t have to be sequential, but it has to be unique, and it can be generated in the ways described here. The reason for the ID column is that we need a `row of interest`, so we have to identify it somehow within the dataset. We can also use an `is_selected` column (as you’ll notice in the code), but let’s go on with the explanation of the basic algorithm first.

The features can reside in one column, e.g. `features`, or in separate columns; it doesn’t really matter. In our example, let’s keep them separate so that it is easier to see the values.

So, `M` consists of **all the rows**; each row is a `z`, so we have `z1, z2 ... zM`. Let’s randomly choose `zM-3`, the fourth row from the end, to be our row of interest `x`.

Note: `x` is filtered out, as it doesn’t make sense to calculate `xj` for it.

Since we are going through all `M` rows, we need a feature permutation for each one. The following function does exactly that: given an array of feature names and a dataframe `df`, it creates a column `features_permutation` with a feature permutation (a list of the features in some order) for each row.

Once we get the permutations, we can start iterating over the features to calculate `x+j` and `x-j`. Because we’ll be using a udf for this, we’ll calculate both in one udf, under one column `xj`, for performance reasons (udfs can be expensive in pyspark).

Let’s start with j = F1.

The udf needs **the name of the selected feature**, `feature j`, the values of `z1`, and the feature permutation that corresponds to `z1`. It also needs the ordered feature names and the row of interest values, `x`. The last two can be broadcasted.

So, for the following `z1`, `x`, and feature permutation `[F2, F3, F1, F4]`, the `xj` for `j=F1` is calculated as follows.

The udf practically iterates the current feature permutation from where `j=F1` is, so `[F2, F3, F1, F4]` in our example becomes `[F1, F4]`; thus `x+j = [x_F1, z1_F2, z1_F3, x_F4]` and `x-j = [z1_F1, z1_F2, z1_F3, x_F4]`.

A list with two elements, a dense vector for `x-j` and a dense vector for `x+j`, is returned.

One important note here: the output of the udf is a list containing the `x-j` and `x+j` vectors **in this order**. This is important because of the way we are going to calculate the marginal contribution, using Spark’s `lag` function, which we’ll see in a minute.

Once we have `xj` calculated, we can **explode** `xj` (which means the df will double in number of rows), so that we have a **features column** that contains either `x-j` or `x+j`. Rows that belong to the same `z` have **the same id** in the exploded df, **which is why we can use lag over id next**, to calculate the marginal contribution for each `z`.

We use the model to predict on `x_df` first:

And next we calculate each `z`’s marginal contribution:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# marginal contribution is calculated using a window and a lag of 1.
# the window is partitioned by id because x+j and x-j for the same row
# will have the same id
x_df = x_df.withColumn(
    'marginal_contribution',
    (
        F.col(column_to_examine) - F.lag(
            F.col(column_to_examine), 1
        ).over(Window.partitionBy('id').orderBy('id'))
    ),
)
```

## The complete code

# How to use it

The following code generates a random dataset of 6 features, `F1, F2, F3, F4, F5, F6`, with labels `[0, 1]`, and uses the `calculate_shapley_values` function above to figure out which features seem to work best for a random row of interest. It then keeps only the top 3 features and calculates accuracy again to see whether it improves. After that, it uses the last 3 features in the ranking, presumably the ones that perform worst, to check whether that is indeed true.

This is **a very naive and simple example**, and since the dataframe consists of random values, **the accuracy and test error are around 0.5, i.e. coin-toss probability**. This also means that **different runs yield different results**: while most of the time the Shapley feature ranking does provide better results, this is not always the case. It is here just to demonstrate how to use the `calculate_shapley_values` function. Also note that it currently uses the `prediction` column, **but for IForest, it would be more beneficial to use the `anomalyScore` column**.

Final output sample on the random dataset:

```
Feature ranking by Shapley values:
--------------------
#0. f4 = 0.0020181634712411706
#1. f3 = 0.0010090817356205853
#2. f1 = 0.0
#3. f2 = -0.0010090817356205853
#4. f0 = -0.006054490413723511
#5. f5 = -0.007063572149344097
```

**To be implemented**: use **rawPrediction or probabilities** to evaluate models.

## CLI

Disclaimer: This work was based upon open-source work done for the Baskerville project for eQualitie, as a Lead Developer on the project; an article for Baskerville will follow :). The Baskerville code was expanded to work for any pyspark model for this article’s purposes. You can find the Baskerville implementation here:

# Useful links

The python shap package has some very interesting examples in their docs:

Examples of how to use the python shap library can be found here:

The python shap library also has very beautiful visualizations.

# Notes

Part of the reason for writing this article here is **to get a review of the process**, in case I missed anything. So please feel free to send me your comments and thoughts about this. I’d love to know if there is a better way of implementing the Shapley values calculation, and whether there is already something out there that does a better job.

Another reason is that we, in Baskerville, have **tested this in production and it has helped us a lot in getting better results**. You can see Baskerville’s performance stats here:

https://deflect.ca/solutions/deflect-labs/baskerville/

Thanks and take care!