Fooling Partial Dependence via Data Poisoning
TL;DR: We highlight that Partial Dependence can be maliciously altered, e.g. bent and shifted, with adversarial data perturbations.
This blog post is based on the preprint available at https://arxiv.org/abs/2105.12837, which is joint work with Wojtek Kretowicz and Przemyslaw Biecek. Code is available at https://github.com/MI2DataLab/fooling-partial-dependence.
Explainable machine learning holds many promises for developers and auditors working with black-box predictive models. Alarmingly, recent studies show that many explanations are not trustworthy and can be manipulated in an adversarial manner. Hence, it is necessary to evaluate post-hoc explanations as critically as we evaluate model performance.
In the paper, we present techniques for attacking Partial Dependence (plots, profiles, PDP), which is among the most popular methods of explaining any predictive model trained on tabular data. This is especially crucial in financial or medical applications, where auditability has become a must-have trait supporting decisions made by black-boxes. Partial Dependence, like many other explanations, is a function of both the model and the data, and each of these can be a single point of failure of an explanation. We poison the data to bend and shift explanations in the desired direction using genetic and gradient algorithms, while the model itself remains unchanged.
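For intuition, recall that the Partial Dependence of a feature is the model's prediction averaged over a reference dataset with that feature fixed at consecutive grid values. A minimal sketch, assuming a scikit-learn-like `model.predict` and a NumPy array `X` (an illustration, not the paper's implementation), could look like this:

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Average model prediction over the reference data X, with the explained
    feature fixed at each grid value in turn. The result depends on both the
    model and the data -- the attack perturbs only the data."""
    pd_values = []
    for z in grid:
        X_mod = X.copy()
        X_mod[:, feature] = z                          # fix the explained feature at z
        pd_values.append(model.predict(X_mod).mean())  # average prediction over all rows
    return np.array(pd_values)
```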
Figure 1 presents an example of fooling Partial Dependence of age in the SVM model prediction of a heart attack. We observe two contradictory explanations of the variable's dependence, while only 2 out of 12 variables have their distributions shifted. These results raise awareness of the importance of context in model explanations, e.g. the reference data distribution.
The primary motivation of this work is to highlight the potential of fooling Partial Dependence to conceal a model's suspected behaviour, which various stakeholders might otherwise raise as an issue during a model audit (Figure 2).
Moreover, one can use the introduced algorithms to evaluate explanations with respect to a given dataset and predictive task. Figure 3 presents two fooling strategies: a targeted attack changes the dataset so that the explanation comes as close as possible to a predefined desired function (Dombrowski et al., 2019; Heo et al., 2019); a robustness check changes the dataset so that the explanation moves as far as possible from the original one, which corresponds to a sanity check (Adebayo et al., 2018). Both allow assessing the explanations' limitations.
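To make the two strategies concrete, they can be phrased as losses over the poisoned dataset. The sketch below reuses the `partial_dependence` function from above and assumes a squared distance between explanation curves; the exact formulation used in the paper is given in the preprint.

```python
import numpy as np

def targeted_loss(model, X_poisoned, feature, grid, target_pd):
    # Targeted attack: bring the explanation computed on the poisoned data
    # as close as possible to the predefined desired function target_pd.
    pd = partial_dependence(model, X_poisoned, feature, grid)
    return np.mean((pd - target_pd) ** 2)

def robustness_loss(model, X_poisoned, feature, grid, original_pd):
    # Robustness check: push the explanation as far as possible from the
    # original one, i.e. minimize the negative distance.
    pd = partial_dependence(model, X_poisoned, feature, grid)
    return -np.mean((pd - original_pd) ** 2)
```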
In the paper, two optimization approaches are considered: a genetic-based algorithm, which makes no assumptions about the black-box model and is therefore model-agnostic (sketched below), and a gradient-based algorithm designed specifically for models with differentiable outputs, e.g. neural networks. Figure 4 presents another example of fooling Partial Dependence of sex in the SVM model prediction of a heart attack. One can consider both scenarios: manipulating the explanation to provide evidence of a suspected model behaviour, or to conceal an unfair one.
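Coming back to the genetic-based algorithm: it only queries the model's predictions through the loss, which is what makes it model-agnostic. A rough sketch of such a loop follows; population size, mutation scale, and the number of iterations are illustrative values, not the settings from the paper. The gradient-based variant instead differentiates a similar loss with respect to the data through the model, which is why it requires differentiable outputs.

```python
import numpy as np

def genetic_attack(model, X, feature, grid, loss_fn, loss_target,
                   n_iter=500, pop_size=20, sigma=0.05):
    """Evolve a population of perturbed copies of the dataset and keep those
    with the lowest explanation loss. The explained feature is left untouched,
    so only the distribution of the remaining variables gets poisoned."""
    population = [X.copy() for _ in range(pop_size)]
    for _ in range(n_iter):
        offspring = []
        for Xp in population:
            noise = sigma * np.random.randn(*Xp.shape)  # Gaussian mutation
            noise[:, feature] = 0.0                     # keep the explained variable fixed
            offspring.append(Xp + noise)
        # elitist selection: keep the pop_size best datasets overall
        candidates = population + offspring
        candidates.sort(key=lambda Xp: loss_fn(model, Xp, feature, grid, loss_target))
        population = candidates[:pop_size]
    return population[0]  # best poisoned dataset found

# Example usage (targeted attack on feature 0 with a desired curve target_pd):
# X_adv = genetic_attack(model, X, feature=0, grid=grid,
#                        loss_fn=targeted_loss, loss_target=target_pd)
```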
Overall, Partial Dependence is quite a complex function whose number of input variables corresponds to the data dimension N×P. This intuition entails the possibility of manipulating the explanation via data poisoning, which should be considered in critical applications where proper evaluation is required for a stable analysis.
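To make the dimensionality claim explicit, for a feature j and grid value z the Partial Dependence can be written as a function of the entire reference dataset (notation below is ours, not taken verbatim from the preprint):

```latex
PD_j(z; X) = \frac{1}{N} \sum_{i=1}^{N} f\!\left(x_i^{\,j|=z}\right),
\qquad X = (x_1, \dots, x_N)^\top \in \mathbb{R}^{N \times P}
```

Here x_i^{j|=z} denotes the i-th observation with its j-th coordinate replaced by z. Viewed this way, the explanation takes the N×P-dimensional dataset as an input, which is exactly the degree of freedom that data poisoning exploits.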
We welcome feedback on this line of work. Details of the algorithms and a discussion of the experimental results can be found in the preprint on arXiv.
If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.