Manipulating explainability and fairness in machine learning

Hubert Baniecki · Published in ResponsibleML · Oct 25, 2021 · 4 min read

toward trustworthy AI…

This blog is a follow-up to the well-received text on Adversarial attacks on Explainable AI, which is based on a live survey of related work available on GitHub and open for contributions. I am excited to be presenting this topic at the ML in PL Conference 2021.


Introduction

As explanation and fairness methods have become widely adopted in various machine learning applications, a crucial discussion has arisen about their validity. Precise measures and evaluation approaches are required for a trustworthy adoption of explainable machine learning techniques (Vilone & Longo, 2021). Evaluating explanations should be as obvious as evaluating machine learning models, especially when working with various stakeholders (Barredo-Arrieta et al., 2019).

A careless adoption of these methods becomes irresponsible. Historically, adversarial attacks have exploited machine learning models in diverse ways; hence, defense mechanisms were proposed, e.g. using model explanations (Liu et al., 2021). Nowadays, ways of manipulating explainability and fairness in machine learning have become more evident. They might be used to achieve adversarial goals, but more often they highlight the explanations’ shortcomings and the need for evaluation.

Throughout 2018, several works presented at top-tier conferences (Ancona et al., Adebayo et al., Alvarez-Melis & Jaakkola, Kindermans et al.*, Ghorbani et al.**) proposed novel ways of investigating feature attribution explanations, which were then considered state-of-the-art, especially in computer vision. These studies raised crucial awareness among machine learning practitioners that explaining models does not by itself entail trust in AI.

I have previously discussed exemplary ways of manipulating explainability by changing the model or the data. How about fairness?

Manipulating fairness via model change

Aivodji et al. (2019) propose the LaundryML framework for crafting a seemingly fair surrogate model that approximates an unfair black-box model. In practice, a classification rule list approximating a random forest is considered. The authors evaluate the surrogate model by balancing:

  1. fidelity: the accuracy of predicting the same outcomes as the black-box model,
  2. unfairness: the demographic parity measure of bias with respect to a sensitive attribute, e.g. gender or race,
  3. complexity: the number of rules in the list, which serves as regularization.

This leads to the following objective:

Loss ~ (1 − β) * (1 − fidelity) + β * unfairness + λ * complexity

optimized with the LaundryML algorithm, which is described in detail in the paper.
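
To make the objective more tangible, here is a minimal Python sketch, with function and variable names of my own, that only scores a single candidate rule list; the actual LaundryML algorithm additionally searches over rule lists, and the default β and λ values below are arbitrary.

```python
import numpy as np

def demographic_parity_gap(y_pred, sensitive):
    """Absolute difference in positive-prediction rates between the two groups."""
    return abs(y_pred[sensitive == 0].mean() - y_pred[sensitive == 1].mean())

def fairwashing_objective(y_surrogate, y_blackbox, sensitive, n_rules,
                          beta=0.5, lam=0.01):
    """Score one candidate rule list with the objective above (illustrative only).

    y_surrogate -- binary predictions of the candidate rule list
    y_blackbox  -- binary predictions of the unfair black-box model
    sensitive   -- binary sensitive attribute (e.g. gender), aligned with y_*
    n_rules     -- number of rules in the list (complexity)
    """
    fidelity = np.mean(y_surrogate == y_blackbox)   # agreement with the black-box
    unfairness = demographic_parity_gap(y_surrogate, sensitive)
    complexity = n_rules
    return (1 - beta) * (1 - fidelity) + beta * unfairness + lam * complexity

# toy usage: a surrogate agreeing with the black-box on ~95% of predictions
rng = np.random.default_rng(0)
sensitive = rng.integers(0, 2, size=1000)
y_blackbox = rng.integers(0, 2, size=1000)
y_surrogate = np.where(rng.random(1000) < 0.95, y_blackbox, 1 - y_blackbox)
print(fairwashing_objective(y_surrogate, y_blackbox, sensitive, n_rules=5))
```

Minimizing this loss pushes the surrogate toward agreeing with the black-box while looking fair under demographic parity and staying short.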

Aivodji et al. “Fairwashing: the risk of rationalization” ICML 2019

Experiments are conducted on two well-known datasets: Adult Income, considering gender bias, and ProPublica Recidivism, considering race bias. The above results show that the risk of fairwashing black-box models is real: one can craft a high-fidelity surrogate model that looks fairer than the black-box it explains. Thus, machine learning developers should be cautious when interpreting models through such surrogates.

Manipulating fairness via data change

In contrast, Fukuchi et al. (2020) propose a stealthily biased sampling algorithm, which aims to find a subset of the dataset on which fairness metrics detect no model bias. In practice, the algorithm samples a seemingly fair benchmark dataset while minimizing the Wasserstein distance (WD) between the original data distribution and the sampled one, which measures how indistinguishable the two distributions are. Demographic parity (DP) is used to measure fairness; a large DP indicates that the decision is unfair because the model favours predicting a given class for one group.
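
As a rough sketch of the two quantities being traded off, the snippet below (again with names of my own) only evaluates a given candidate subset; the actual algorithm of Fukuchi et al. solves an optimization problem to choose the subset and considers the distance between joint distributions, whereas a simple per-feature Wasserstein average is used here as a proxy.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def demographic_parity_gap(y_pred, sensitive):
    """Absolute difference in positive-prediction rates between the two groups."""
    return abs(y_pred[sensitive == 0].mean() - y_pred[sensitive == 1].mean())

def evaluate_benchmark_subset(X, y_pred, sensitive, idx):
    """Evaluate a candidate benchmark subset (row indices idx) on the two
    criteria that stealthily biased sampling trades off:
      * how fair the model looks on the subset (demographic parity gap),
      * how far the subset drifts from the full data (mean 1-D Wasserstein
        distance per feature, a rough proxy for indistinguishability).
    """
    dp_on_subset = demographic_parity_gap(y_pred[idx], sensitive[idx])
    wd_per_feature = [wasserstein_distance(X[:, j], X[idx, j])
                      for j in range(X.shape[1])]
    return dp_on_subset, float(np.mean(wd_per_feature))

# toy usage with synthetic data and deliberately biased decisions
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
sensitive = rng.integers(0, 2, size=1000)
y_pred = (X[:, 0] + 0.8 * sensitive > 0).astype(int)   # favours group 1
idx = rng.choice(1000, size=300, replace=False)         # a candidate subset
print(evaluate_benchmark_subset(X, y_pred, sensitive, idx))
```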

Fukuchi et al. “Faking Fairness via Stealthily Biased Sampling” AAAI 2020

Experiments using conventional models are conducted on the same two datasets; note that the ProPublica Recidivism task is often referred to as COMPAS (after the recidivism scoring model) and vice versa. The above results show that faking fairness without changing the model is possible, and we should carefully analyse the estimated fairness measures in the context of the data distribution.

There are more examples of manipulating explainability and fairness that support the call to reconsider evaluation approaches toward trustworthy AI. If you have further propositions of related works in the domain of adversarial and explainable machine learning, please consider reaching out or contributing to the list on GitHub.

* rejected at ICLR’18, later published in a book in 2019
** rejected at ICLR’18, later published at AAAI’19
