# Explaining the Predictions: Shapley Values with PySpark

## Interpreting Isolation Forest’s predictions, and beyond

Mar 20

# The problem: how to interpret Isolation Forest’s predictions

Isolation Forest flags anomalous records, but it does not explain which features made a given record anomalous. The solution was to implement Shapley value estimation with PySpark, based on the Shapley calculation algorithm described below.

# Estimating Shapley Values with PySpark

## An approximation with Monte-Carlo sampling

First, select an instance of interest x, a feature j and the number of iterations M.

For each iteration, a random instance z is selected from the data and a random order of the features is generated.

Two new instances are created by combining values from the instance of interest x and the sample z.

The instance x+j is the instance of interest, but all values in the order after feature j are replaced by feature values from the sample z.

The instance x−j is the same as x+j, but in addition has feature j replaced by the value for feature j from the sample z.

The difference in the prediction from the black box is computed: φ_j^m = f(x+j) − f(x−j).

All these differences are averaged over the M iterations, resulting in the Shapley value estimate: φ_j(x) = (1/M) Σ_{m=1..M} φ_j^m.
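The steps above can be sketched in plain Python. This is an illustrative sketch, not the article’s implementation: `f` stands in for the black-box model, and the function and variable names are my own.

```python
import random

def estimate_shapley(f, x, data, j, M, rng):
    """Monte-Carlo estimate of the Shapley value of feature j for instance x."""
    n = len(x)
    total = 0.0
    for _ in range(M):
        z = rng.choice(data)              # random instance z from the data
        order = list(range(n))
        rng.shuffle(order)                # random order of the features
        pos = order.index(j)              # position of feature j in that order
        # x_plus takes feature j (and everything before it in the order)
        # from x, and everything after it from z
        x_plus = [x[k] if order.index(k) <= pos else z[k] for k in range(n)]
        # x_minus is the same as x_plus, but feature j also comes from z
        x_minus = list(x_plus)
        x_minus[j] = z[j]
        total += f(x_plus) - f(x_minus)   # marginal contribution
    return total / M

# Toy usage: for a linear model f(v) = 2*v0 + v1, the estimate for feature 0
# converges to w0 * (x0 - mean(z0)) = 2 * (1 - 0.5) = 1.0
data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f = lambda v: 2 * v[0] + v[1]
rng = random.Random(0)
phi0 = estimate_shapley(f, [1.0, 1.0], data, j=0, M=2000, rng=rng)
```

For a real Isolation Forest, `f` would be the model’s anomaly-score function applied to the perturbed instance.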

## The disadvantages of using Shapley values

The main disadvantage is computational cost: the exact Shapley calculation is exponential in the number of features, which is why the Monte-Carlo approximation with M iterations is used. This article won’t focus on how to choose the right M, but it will show how to use the whole dataset, or most of it, for the calculation in a reasonable time.

# Using PySpark

## First of all, why PySpark?

“… the number of iterations M should be large enough to accurately estimate the values, but small enough to complete the computation in a reasonable time …”

So it makes sense to use Spark to distribute the computation across workers and make it as efficient as possible.

## Ok then, how?

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# The marginal contribution is calculated using a window and a lag of 1.
# The window is partitioned by id because x+j and x-j for the same row
# will have the same id.
x_df = x_df.withColumn(
    'marginal_contribution',
    (
        F.col(column_to_examine) - F.lag(
            F.col(column_to_examine), 1
        ).over(Window.partitionBy('id').orderBy('id'))
    )
)
```
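To see what the window-and-lag computation does, here is the same logic in plain Python, assuming (hypothetically) that for each id the x−j prediction row comes right before the x+j row, so current-minus-previous yields f(x+j) − f(x−j):

```python
from itertools import groupby

# Hypothetical (id, prediction) rows, sorted by id,
# with the x-j row listed before the x+j row for each id.
rows = [
    (1, 0.30),  # x-j for id 1
    (1, 0.42),  # x+j for id 1
    (2, 0.55),  # x-j for id 2
    (2, 0.50),  # x+j for id 2
]

marginal = {}
for row_id, group in groupby(rows, key=lambda r: r[0]):
    preds = [p for _, p in group]
    # lag of 1 within the id partition: current prediction minus the previous one
    marginal[row_id] = preds[1] - preds[0]
```

Each id ends up with one marginal contribution, exactly what the `F.lag` window expression produces per partition.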

# How to use it

```
Feature ranking by Shapley values:
--------------------
#0. f4 = 0.0020181634712411706
#1. f3 = 0.0010090817356205853
#2. f1 = 0.0
#3. f2 = -0.0010090817356205853
#4. f0 = -0.006054490413723511
#5. f5 = -0.007063572149344097
```
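A ranking like the one above can be produced by averaging the per-iteration marginal contributions for each feature and sorting. A minimal sketch with made-up numbers (the feature names and values here are purely illustrative):

```python
# Hypothetical marginal contributions collected per feature
contributions = {
    'f0': [-0.01, -0.002],
    'f1': [0.0, 0.0],
    'f3': [0.002, 0.0],
}

# Average per feature: the Shapley value estimate, then sort descending
shapley = {feat: sum(vals) / len(vals) for feat, vals in contributions.items()}
ranking = sorted(shapley.items(), key=lambda kv: kv[1], reverse=True)

print('Feature ranking by Shapley values:')
for i, (feat, value) in enumerate(ranking):
    print(f'#{i}. {feat} = {value}')
```

Features with the most positive average contribution push the prediction up the most; negative values push it down.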

# Notes

## MLearning.ai

Data Scientists must think like an artist when finding a solution

Written by

## Maria Karanasou

A mom and a Software Engineer who loves to learn new things & all about ML & Big Data. Buy me a coffee to help me keep going buymeacoffee.com/mkaranasou
