The problem: how to interpret Isolation Forest’s predictions
More specifically, how to tell which features are contributing more to the predictions. Since Isolation Forest is not a typical Decision Tree (see Isolation Forest characteristics here), after some research, I ended up with three possible solutions:
1) Train on the same dataset another similar algorithm that has feature importance implemented and is more easily interpretable, like Random Forest.
2) Reconstruct the trees as a graph for example. The most important features should be the ones on the shortest paths of the trees. This is because of how Isolation Forest works: the anomalies are few and distinct, so it should be easier to single them out — in fewer steps, thus, shortest paths.
3) Estimate the Shapley values: the marginal feature contribution, which is a more standard way of identifying feature importance ranking.
Options 1 and 2 were not deemed the best solutions to the problem, mainly because of the difficulty of picking a suitable algorithm — Random Forest, for example, operates differently from Isolation Forest, so it wouldn’t be very wise to rely on its feature importance output. Additionally, option 2 does not generalize, and it would amount to redoing the algorithm’s work all over again.
The solution was to implement Shapley values’ estimation using Pyspark, based on the Shapley calculation algorithm described below.
The implementation takes a trained pyspark model, the spark dataframe with the features, the row to examine, the feature names, the features column name, and the name of the column to examine, e.g. prediction. The output is a ranking of the features, from the most to the least helpful.
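To make the inputs and output concrete, the entry point could look like the following stub. This is only an illustration of the description above — the parameter names and defaults are my assumptions, not the actual signatures from the linked code.

```python
# Hypothetical signature matching the description above; names and
# defaults are assumptions for illustration, not the real implementation.
def calculate_shapley_values(
    df,                              # spark dataframe with the features
    model,                           # trained pyspark model
    row_of_interest,                 # the row to examine
    feature_names,                   # list of feature names
    features_col='features',         # the features column name
    column_to_examine='prediction',  # the column to examine
):
    """Returns the features ranked from most to least helpful."""
    ...
```

The actual implementation is in the repository linked below.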
What are the Shapley values?
The Shapley value provides a principled way to explain the predictions of nonlinear models common in the field of machine learning. By interpreting a model trained on a set of features as a value function on a coalition of players, Shapley values provide a natural way to compute which features contribute to a prediction. This unifies several other methods including Locally Interpretable Model-Agnostic Explanations (LIME), DeepLIFT, and Layer-Wise Relevance Propagation.
Or as the python shap package states:
A game theoretic approach to explain the output of any machine learning model.
In other words, it is the average of the marginal contributions across all feature permutations. The following article has a good way of explaining what the marginal contribution is:
Explain Your Model with the SHAP Values
Use the SHAP Values to Explain Any Complex ML Model
You can find the relevant code here.
From the article below with an example from game theory: “finding each player’s marginal contribution, averaged over every possible sequence in which the players could have been added to the group”. Thus, finding each feature’s marginal contribution, averaged over every possible sequence in which the features could have been selected. I’ll provide an algorithm and a visual example shortly :)
One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values
Estimating Shapley Values with Pyspark
An approximation with Monte-Carlo sampling
First, select an instance of interest x, a feature j and the number of iterations M.
For each iteration, a random instance z is selected from the data and a random order of the features is generated.
Two new instances are created by combining values from the instance of interest x and the sample z.
The instance x+j is the instance of interest, but all values in the order before feature j are replaced by feature values from the sample z.
The instance x−j is the same as x+j, but in addition has feature j replaced by the value for feature j from the sample z.
The difference in the prediction from the black box is computed
All these differences are averaged, resulting in the Shapley value estimate for feature j: φ_j(x) = (1/M) * Σ_{m=1}^{M} ( f(x+j) − f(x−j) ), where f is the model’s prediction function.
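The sampling loop above can be sketched in plain Python. This is a minimal, hypothetical sketch — f is any prediction function and instances are dicts of feature name to value, whereas the real pyspark version below works on dataframe rows.

```python
import random

def estimate_shapley_j(f, x, data, j, M, features):
    """Monte-Carlo estimate of feature j's Shapley value for instance x.

    f: prediction function taking a dict of feature -> value
    x and the rows of data: dicts of feature -> value
    """
    total = 0.0
    for _ in range(M):
        z = random.choice(data)                          # random instance z
        order = random.sample(features, len(features))   # random feature order
        pos = order.index(j)
        before = set(order[:pos])                        # features before j
        # x+j: values before j in the order come from z, j and the rest from x
        x_plus = {k: (z[k] if k in before else x[k]) for k in features}
        # x-j: same as x+j, but j itself also comes from z
        x_minus = dict(x_plus)
        x_minus[j] = z[j]
        total += f(x_plus) - f(x_minus)
    return total / M
```

For a model that only looks at one feature, the estimate for that feature equals the full prediction difference and the estimate for any ignored feature is zero, which is a quick sanity check.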
5.9 Shapley Values | Interpretable Machine Learning
The advantages of using Shapley values:
- “The difference between the prediction and the average prediction is fairly distributed among the feature values of the instance. […] In situations where the law requires explainability — like EU’s “right to explanations” — the Shapley value might be the only legally compliant method, because it is based on a solid theory and distributes the effects fairly.” This is a very important advantage, I believe.
- You can use a sample of the dataset to compare the Shapley values.
- The Shapley value is the only explanation method with a solid theory behind it.
The disadvantages of using Shapley values:
- As one can imagine, the calculation of Shapley values requires a lot of computing time. Decreasing the number of iterations (M) reduces computation time, but increases the variance of the Shapley value estimate. The number of iterations should be large enough to estimate the values accurately, but small enough to complete the computation in a reasonable time. This is one of the main reasons a scalable solution is needed.
This article won’t focus on how to choose the right M, but it will provide a possible way to use the whole dataset, or the greater part of it, for the calculation in a relatively reasonable time.
- You need both the model and the data (feature vectors) to calculate them.
- It can be misinterpreted. This is one issue I struggled with myself. “The Shapley value of a feature value is not the difference of the predicted value after removing the feature from the model training.” Instead, one should view the values as the contribution of a feature value to the difference between the actual prediction and the mean prediction.
- It is not suitable for sparse inputs — meaning sparse vectors with few features present in a sample. All the features are needed.
First of all, why Pyspark?
Because of the following statement:
“… the iterations number should be large enough to accurately estimate the values, but small enough to complete the computation in a reasonable time … “
Estimating Shapley values requires a lot of computing time, and if you are using Spark models, there’s a good chance your dataset ranges from big-ish to huge. So it only makes sense to use Spark to spread the computation across different workers and make things as efficient as possible. Sure, we’ll be using a udf for part of the calculation, but even that will run on different workers, so in parallel.
Ok then, how?
To calculate the Shapley values for all features in pyspark, following the algorithm description above, the steps below were used:
Let’s start with a dataframe that has an ID column and features F1, F2, F3, F4.
The ID doesn’t have to be sequential but it has to be unique, and it can be generated with the ways described here. The reason for the ID column is that we need a row of interest, so we need to identify it somehow within the dataset. We can also use an is_selected column (as you’ll notice in the code), but let’s go on with the explanation of the basic algorithm first.
The features can reside in one column, e.g. features, or in separate columns; it doesn’t really matter. In our example, let’s keep them separate so that it is easier to see the values.
M consists of all the rows; each row is a z, so we have z1, z2 ... zM. Let’s randomly choose zM-3, the fourth row from the end, to be our row of interest x. It is then filtered out, as it doesn’t make sense to calculate xj for it.
Since we are going through all M rows, we need a feature permutation for each. The following function does exactly that: given an array of feature names and a dataframe df, it creates a column features_permutation with a feature permutation (a list of the features in random order) for each row.
Once we get the permutations, we can start iterating over the features to calculate x-j and x+j. Because we’ll be using a udf for this, we’ll be calculating both in one udf, under one column xj, for performance reasons (udfs can be expensive in pyspark).
Let’s start with j = F1.
The udf needs the name of the selected feature j, the values of z1, and the feature permutation that corresponds to z1. It also needs the ordered feature names and the row-of-interest values x. The last two can be broadcasted.
So, for the following z1, x, feature permutation [F2, F3, F1, F4], the xj for j=F1 is calculated as seen below:
The udf, as seen below, practically iterates the current feature permutation from the position of j=F1, so [F2, F3, F1, F4] in our example becomes x+j = [x_F1, z1_F2, z1_F3, x_F4] and x-j = [z1_F1, z1_F2, z1_F3, x_F4]. A list with two elements, a dense vector for x-j and a dense vector for x+j, is returned.
One important note here: the output of the udf is a list containing the x-j and x+j vectors, in this order. This is important because of the way we are going to calculate the marginal contribution, using Spark’s lag function, which we’ll see in a minute.
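The construction just described can be sketched in plain Python — lists instead of dense vectors, and a hypothetical helper rather than the actual udf:

```python
def calculate_xj(feature_j, z_values, permutation, ordered_features, x_values):
    """Build x-j and x+j for one sampled row z.

    z_values / x_values: dicts mapping feature name -> value
    """
    j_pos = permutation.index(feature_j)
    x_plus_j, x_minus_j = [], []
    for feature in ordered_features:
        pos = permutation.index(feature)
        # x+j: features before j in the permutation come from z, the rest from x
        x_plus_j.append(z_values[feature] if pos < j_pos else x_values[feature])
        # x-j: additionally, j itself comes from z
        x_minus_j.append(z_values[feature] if pos <= j_pos else x_values[feature])
    # x-j first, then x+j: the order the lag-based calculation relies on
    return [x_minus_j, x_plus_j]
```

Running it on the example above, with permutation [F2, F3, F1, F4] and j=F1, reproduces the two vectors shown.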
Once we have xj calculated, we can explode it (which means the df will double in number of rows), so that we have a features column that contains either x-j or x+j. Rows that belong to the same z have the same id in the exploded df, which is why we can use lag over id next to calculate the marginal contribution for each z. We use the model to predict on the exploded dataframe, and then we calculate each z’s marginal contribution:
# marginal contribution is calculated using a window and a lag of 1.
# the window is partitioned by id because x+j and x-j for the same row
# will have the same id; ordering by the explode position (here assumed
# to be a 'pos' column from posexplode) keeps x-j first, so the lag
# subtraction yields prediction(x+j) - prediction(x-j)
window = Window.partitionBy('id').orderBy('pos')
x_df = x_df.withColumn(
    'marginal_contribution',
    F.col(column_to_examine) - F.lag(F.col(column_to_examine), 1).over(window)
)
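What the window-and-lag computation achieves, plus the final averaging step, can be illustrated in plain Python with hypothetical numbers:

```python
# each pair holds (prediction(x-j), prediction(x+j)) for one sampled z;
# in the dataframe these are consecutive rows sharing an id, which is
# exactly what lag(1) over the id-partitioned window subtracts
predictions = [(0.30, 0.55), (0.40, 0.50), (0.20, 0.45)]
marginal_contributions = [plus - minus for minus, plus in predictions]
# the Shapley value estimate for feature j is the average contribution
shap_j = sum(marginal_contributions) / len(marginal_contributions)
```

Note that the lag over the x-j rows themselves produces nulls (there is no preceding row in each id partition), so only the rows carrying an actual difference are kept for the average.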
The complete code
How to use it
The following code generates a random dataset of 6 features, F1 to F6, with labels [0, 1], and uses the shapley_spark_calculation above to figure out which features seem to work best for a random row of interest. It then uses only the top 3 features and calculates accuracy again, to see whether it improves. After that, it uses the last 3 features in the feature ranking, presumably those that perform worst, to see whether this is indeed true.
This is a very naive and simple example, and since the dataframe consists of random values, the accuracy and test error are around 0.5 — coin-toss probability. This also means that different runs yield different results: while most of the time it works as expected and the Shapley feature ranking indeed provides better results, this is not always the case. It is here just to demonstrate how to use the calculate_shapley_values function. Also note that it currently uses the prediction column, but for IForest it would be more beneficial to use the anomaly score instead of the binary prediction.
Final output sample on the random dataset:
Feature ranking by Shapley values:
#0. f4 = 0.0020181634712411706
#1. f3 = 0.0010090817356205853
#2. f1 = 0.0
#3. f2 = -0.0010090817356205853
#4. f0 = -0.006054490413723511
#5. f5 = -0.007063572149344097
To be implemented: use rawPrediction or probabilities to evaluate models.
Disclaimer: This work was based upon open-source work done for the Baskerville project for eQualitie, where I worked as Lead Developer (an article about Baskerville will follow :)). The Baskerville code was expanded to work with any pyspark model for this article’s purposes. You can find the Baskerville implementation here:
Security Analytics Engine — Anomaly Detection in Web Traffic — equalitie/baskerville
The python shap package has some very interesting examples in their docs:
An introduction to explainable AI with Shapley values — SHAP latest documentation
Examples of how to use the python shap library can be found here:
Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources
The python shap library also has very beautiful visualizations.
Isolation Forest and Spark
Main characteristics and ways to use Isolation Forest in PySpark
Part of the reason for writing this article is to get a review of the process, in case I missed anything. So please feel free to send me your comments and thoughts. I’d love to know if there’s a better way of implementing the Shapley values calculation, and whether there is already something out there that does a better job.
Another reason is that we, in Baskerville, have tested this in production and it has helped us achieve better results. You can see Baskerville’s performance stats here:
Thanks and take care!