Be careful! Some model explanations can be fooled.

Wojtek Kretowicz
Published in ResponsibleML
Jan 8, 2021

Attack on LIME and SHAP explanations.

Image from Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. It shows, for each feature, the percentage of data points for which that feature appeared among the top 3 most important according to LIME and SHAP, respectively.

Black boxes can often achieve better performance than their older brothers: statistical models. However, nothing is for free, and here the cost is hidden in explainability. Although we can inspect a machine learning model in as much detail as we want, its inner complexity makes interpreting its decision process practically impossible.

This is where Explainable Artificial Intelligence (XAI) comes in. It provides many methods that approximate a black box's decision process. Everything seems great until we realize that XAI methods are rather complex black boxes themselves, which opens a door for malicious attacks. An attack is a method that changes the explanations, and often the conclusions drawn from them. Such attacks may modify the data set or the model to achieve their goal.

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods, published in 2019, describes an attack on LIME and SHAP, while, for example, Explanations can be manipulated and geometry is to blame introduces attacks on Computer Vision explanations. We will focus on the first one.

LIME and SHAP

LIME and SHAP are local explanations called feature attribution methods. They assign a numerical importance to each variable used to make a particular decision. You can read more about them in other ResponsibleML blogs: BASIC XAI with DALEX — Part 5: Shapley values and BASIC XAI with DALEX — Part 6: LIME method.
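To make this concrete, here is a minimal sketch of obtaining both kinds of attributions for a single prediction with the lime and shap packages. It assumes a fitted scikit-learn classifier model, training data X_train as a NumPy array, and a list feature_names; none of these come from the article.

import shap
from lime.lime_tabular import LimeTabularExplainer

x = X_train[0]  # the data point whose prediction we want to explain

# LIME: fit a local surrogate model on perturbed samples around x
lime_explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                      discretize_continuous=True)
lime_exp = lime_explainer.explain_instance(x, model.predict_proba, num_features=5)
print(lime_exp.as_list())        # list of (feature, weight) pairs

# SHAP (KernelSHAP): Shapley-value attributions estimated from perturbed coalitions
shap_explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X_train, 100))
print(shap_explainer.shap_values(x))   # one attribution per feature (per class)

Both methods rely on perturbed samples, and that is exactly the weak spot described next.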

Intuition

From the perspective of the attack, however, they are just local approximations of the black box that use the neighborhood of the explained data point to fit their parameters. They take the input data set and perturb it to obtain this neighborhood. Some of the perturbed points may lie in dense areas next to many original input data points, but some may lie in empty regions. Even a model with very good performance can behave in unexpected, almost random ways in such regions, because it received no information about them during training. This is exactly what the attacks on LIME and SHAP take advantage of.

Image from the Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods article. It shows how perturbed samples may lie outside the data distribution.
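The effect is easy to check on tabular data. The rough sketch below (not the paper's code) adds LIME-style Gaussian noise to a standardized data matrix X, assumed to be a NumPy array, and counts how many perturbed points end up far from every original point.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
perturbed = X + rng.normal(scale=1.0, size=X.shape)   # LIME-style Gaussian perturbation

# typical spacing between real points vs. distance of perturbed points to the data
dist_orig, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
typical = dist_orig[:, 1].mean()
dist_pert, _ = NearestNeighbors(n_neighbors=1).fit(X).kneighbors(perturbed)
print((dist_pert[:, 0] > 3 * typical).mean())   # fraction of clearly out-of-distribution samples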

Attack

Let’s say someone has a model f that is biased in some way, for example racist. He or she would like to attack, or fool, the explanation to hide this behavior.

The idea is to create another model e that behaves very similarly to the original model f on in-distribution data and at the same time fools the LIME or SHAP explanation.

The authors proposed a framework built around two additional models: g and is_ood. Model g does not have the same flaw as f. The classifier is_ood tells us whether a query point is in or out of the distribution. It is fitted by perturbing the input data set, merging the perturbed samples with the original ones, and labeling the original samples as 0 and the perturbed ones as 1.
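A minimal sketch of this step, assuming the original data is a NumPy array X and using plain Gaussian noise as a stand-in for the paper's perturbation scheme:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_perturbed = X + rng.normal(scale=1.0, size=X.shape)

# original samples labeled 0, perturbed samples labeled 1
X_ood = np.vstack([X, X_perturbed])
y_ood = np.concatenate([np.zeros(len(X)), np.ones(len(X_perturbed))])

is_ood = RandomForestClassifier(n_estimators=100).fit(X_ood, y_ood)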

The final adversarial model e is then constructed as follows:

e(x) = f(x) if not is_ood(x) else g(x)
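In code, assuming f, g, and the is_ood classifier from above each expose a scikit-learn style predict method, the wrapper could be sketched as:

import numpy as np

def e_predict(X):
    ood = is_ood.predict(X).astype(bool)    # which query points look perturbed?
    preds = f.predict(X)                    # in distribution: behave exactly like the biased f
    preds[ood] = g.predict(X[ood])          # out of distribution: answer with the innocuous g
    return preds

Because LIME and SHAP query the model mostly on perturbed, out-of-distribution points, they end up explaining g, while predictions on real data still come from f.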

Experimental results

Here is the example from the original article. The authors used the COMPAS data set, which contains information about criminal records. The goal is to predict the chance of becoming a recidivist. Race is a sensitive feature, so we would like a model that does not use it to make predictions.

Setup

To present the attack, they constructed a biased (racist) model f that uses only information about race. They used a random forest with 100 trees as the is_ood classifier. Two unbiased classifiers were fitted: the first on one artificial feature uncorrelated with the target, the second on two such features.
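As a purely hypothetical illustration of the f and g assumed in the wrapper sketch above (race_idx and unrelated_idx are placeholder column indices, not taken from the paper):

import numpy as np

class BiasedModel:                       # f: predicts "high risk" from race alone
    def predict(self, X):
        return (X[:, race_idx] == 1).astype(int)

class InnocuousModel:                    # g: uses only the artificial uncorrelated feature
    def predict(self, X):
        return (X[:, unrelated_idx] > 0).astype(int)

f, g = BiasedModel(), InnocuousModel()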

Result

Before the attack, the bias of classifier f is uncovered by both LIME and SHAP. However, after the creation of the adversarial classifier e using one artificial feature, LIME and SHAP tell a completely different story: the most important feature in model e's predictions is the uncorrelated artificial feature, and the use of race appears marginal.

With two uncorrelated artificial features, the LIME attack assigned them roughly equal importance while the sensitive feature was neglected. In the case of the SHAP attack, race was still quite significant, but much less so than before.

Image from Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. It shows, for each feature, the percentage of data points for which that feature appeared among the 3 most important according to LIME and SHAP, respectively.

Conclusions

Although XAI methods give us huge insight into black boxes, they are black boxes themselves. Thus you should use them with some reflection and understanding. Maybe the model you are validating was attacked? Maybe it was not attacked, but it is vulnerable and its explanations are not especially stable?

However, these attacks are method-specific. It seems rather hard to fool many explanation methods at the same time. Even if one of them has been attacked, the others may still tell you the truth.

It is definitely a good idea to use many XAI methods at the same time.

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.
