Explaining the predictions: Shapley Values with PySpark

Interpreting Isolation Forest’s predictions, and those of other models

Maria Karanasou
Mar 20 · 10 min read

The problem: how to interpret Isolation Forest’s predictions

The solution was to implement the estimation of Shapley values using PySpark, based on the Shapley calculation algorithm described below.

What are the Shapley values?

Estimating Shapley Values with PySpark

An approximation with Monte-Carlo sampling

First, select an instance of interest x, a feature j and the number of iterations M.

For each iteration, a random instance z is selected from the data and a random order of the features is generated.

Two new instances are created by combining values from the instance of interest x and the sample z.

The instance x+j is the instance of interest, but all values in the order before feature j are replaced by feature values from the sample z.

The instance x−j is the same as x+j, but in addition has feature j replaced by the value for feature j from the sample z.

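The difference in the prediction from the black box is computed for iteration m:

$$\phi_j^{m} = \hat{f}(x^{m}_{+j}) - \hat{f}(x^{m}_{-j})$$

All these differences are averaged over the M iterations and result in:

$$\phi_j(x) = \frac{1}{M} \sum_{m=1}^{M} \phi_j^{m}$$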
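The sampling procedure described above can be sketched in plain Python before moving to Spark. This is only an illustration under stated assumptions: `estimate_shapley`, `predict` and the toy linear model are hypothetical names that do not come from the article's code, and the replacement rule follows the wording above (features ordered before j take their values from z).

```python
import random


def estimate_shapley(predict, x, data, j, n_iter, seed=0):
    """Monte-Carlo estimate of the Shapley value of feature index j for instance x.

    predict: callable taking a list of feature values and returning a number
    x:       the instance of interest (list of feature values)
    data:    list of instances to draw the random sample z from
    n_iter:  the number of iterations M
    """
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(n_iter):
        z = rng.choice(data)                # random instance z
        order = list(range(n_features))
        rng.shuffle(order)                  # random order of the features
        position = order.index(j)

        # x+j: features ordered before j are taken from z, the rest from x
        x_plus = list(x)
        for k in order[:position]:
            x_plus[k] = z[k]

        # x-j: same as x+j, but feature j is also taken from z
        x_minus = list(x_plus)
        x_minus[j] = z[j]

        # marginal contribution of feature j for this iteration
        total += predict(x_plus) - predict(x_minus)
    return total / n_iter


# Toy usage with a hypothetical linear "black box"
weights = [1.0, 10.0, 100.0]


def linear_model(row):
    return sum(w * v for w, v in zip(weights, row))


background = [[0.0, 0.0, 0.0]]
phi_1 = estimate_shapley(linear_model, [1.0, 2.0, 3.0], background, j=1, n_iter=50)
print(phi_1)
```

For a linear model, each iteration's contribution reduces to the weight of feature j times the difference between x and z on that feature, which makes the sketch easy to sanity-check by hand.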

The disadvantages of using Shapley values:

This article won’t focus on the ways to choose the right M, but it will provide a possible solution to use the whole dataset, or the greater part of it, for the calculation in a relatively reasonable time.

Using PySpark

First of all, why PySpark?

“… the iterations number should be large enough to accurately estimate the values, but small enough to complete the computation in a reasonable time …”

It therefore makes sense to use Spark to spread the computation across different workers and make things as efficient as possible.

Ok then, how?

Example of the initial dataframe with a feature permutations column (created by Author)

Let’s start with j = F1.

How xj (x−j and x+j) is calculated (source: Author)
Example output of the dataframe with the xj column calculated, and of the exploded xj afterwards. Notice that every two rows share the same id.
From the dataframe df with the xj column to the exploded x_df (source: Author)
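As a rough illustration of what happens per row, the pair x+j and x−j can be built with a small pure-Python helper. `build_xj_pair` and its arguments are hypothetical names, not the article's code, and the replacement rule is the one stated earlier in the article (values ordered before j come from z). In Spark, a function like this could plausibly be wrapped in a UDF whose two results are then exploded into two rows sharing the same id, but that wiring is an assumption and is not shown here.

```python
def build_xj_pair(x, z, feature_order, feature_j):
    """Return (x_plus_j, x_minus_j) as dicts keyed by feature name.

    x:             instance of interest, e.g. {'F1': 1, 'F2': 2}
    z:             randomly sampled instance with the same keys
    feature_order: one random permutation of the feature names
    feature_j:     the feature whose contribution is being estimated
    """
    position = feature_order.index(feature_j)
    taken_from_z = set(feature_order[:position])  # features ordered before j

    # x+j keeps feature j (and everything not before it) from x
    x_plus_j = {f: (z[f] if f in taken_from_z else x[f]) for f in x}

    # x-j is identical to x+j except that feature j also comes from z
    x_minus_j = dict(x_plus_j)
    x_minus_j[feature_j] = z[feature_j]
    return x_plus_j, x_minus_j


# Toy usage with j = F1
x = {'F1': 1, 'F2': 2, 'F3': 3}
z = {'F1': 10, 'F2': 20, 'F3': 30}
x_plus_j, x_minus_j = build_xj_pair(x, z, ['F3', 'F1', 'F2'], 'F1')
print(x_plus_j)
print(x_minus_j)
```

The two instances differ only in feature j, which is why the difference between their predictions isolates that feature's contribution for this particular z and permutation.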
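from pyspark.sql import Window
from pyspark.sql import functions as F

# The marginal contribution is calculated using a window and a lag of 1.
# The window is partitioned by id because x+j and x-j for the same row
# share the same id.
# Caveat: ordering by the partition key alone does not pin down which of the
# two rows comes first, so this relies on the rows keeping their exploded order.
id_window = Window.partitionBy('id').orderBy('id')
x_df = x_df.withColumn(
    'marginal_contribution',
    F.col(column_to_examine)
    - F.lag(F.col(column_to_examine), 1).over(id_window),
)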
Output of the marginal contribution for each z calculation
Marginal Contribution calculation (source: Author)
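To make the effect of a lag of 1 within each id concrete, here is a plain-Python imitation of that step followed by the averaging. It is a sketch under assumptions: the function names are hypothetical, and the rows are assumed to arrive with the x−j prediction first and the x+j prediction second for each id, so that the difference equals f(x+j) − f(x−j). If the exploded order were the other way round, the sign would flip.

```python
def marginal_contributions(rows):
    """Imitate 'value - lag(value, 1)' within each id.

    rows: iterable of (row_id, prediction) pairs in exploded order.
    The first row of each id has no previous value, so it gets None,
    just as a lag over a window yields null for the first row.
    """
    previous = {}
    result = []
    for row_id, prediction in rows:
        if row_id in previous:
            result.append((row_id, prediction - previous[row_id]))
        else:
            result.append((row_id, None))
        previous[row_id] = prediction
    return result


def shapley_estimate(contributions):
    """Average the non-null marginal contributions."""
    values = [c for _, c in contributions if c is not None]
    return sum(values) / len(values)


# Toy usage: two ids, each with an (x-j, x+j) pair of predictions
rows = [(0, 0.5), (0, 0.75), (1, 0.25), (1, 0.25)]
contributions = marginal_contributions(rows)
print(contributions)
print(shapley_estimate(contributions))
```

Only every second row carries a marginal contribution, so the average has to skip the empty ones rather than count them as zero.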

The complete code

Shapley Values estimation with PySpark

How to use it

Feature ranking by Shapley values:
--------------------
#0. f4 = 0.0020181634712411706
#1. f3 = 0.0010090817356205853
#2. f1 = 0.0
#3. f2 = -0.0010090817356205853
#4. f0 = -0.006054490413723511
#5. f5 = -0.007063572149344097
Example of how to use the Shapley calculation function
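A ranking in the format shown above can be produced from a mapping of feature names to estimated Shapley values. The helper below is a hypothetical sketch, not the article's function, and the values in the usage example are made up for illustration.

```python
def format_ranking(shapley_values):
    """Return a text ranking of features, highest Shapley value first.

    shapley_values: dict mapping feature name -> estimated Shapley value
    """
    ranked = sorted(shapley_values.items(), key=lambda item: item[1], reverse=True)
    lines = ['Feature ranking by Shapley values:', '-' * 20]
    for position, (name, value) in enumerate(ranked):
        lines.append(f'#{position}. {name} = {value}')
    return '\n'.join(lines)


# Toy usage with made-up values
report = format_ranking({'f0': -1.0, 'f1': 0.0, 'f2': 0.5})
print(report)
```

Sorting by the raw value, as here, puts positive contributions first and negative ones last; sorting by absolute value would instead rank features by the size of their effect regardless of direction.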

CLI

CLI example

Useful links

Notes

MLearning.ai

Data Scientists must think like an artist when finding a solution

Maria Karanasou

Written by

A mom and a Software Engineer who loves to learn new things & all about ML & Big Data. Buy me a coffee to help me keep going buymeacoffee.com/mkaranasou

