Noncompliance in Algorithmic Audits and Defending Auditors

Algorithmic audits can reveal harmful biases, but what happens when an auditee tries to hide its decision-making?

Zachariah Carmichael
13 min read · Apr 11, 2023
An artificial intelligence represented as a robot opens its hands to reveal a digital gavel.
Image courtesy of Flickr (CC BY-NC-ND 2.0)

Algorithms are omnipresent in our daily lives and automatically influence (or even make) our decisions. These decisions can range from being fairly inconsequential (e.g., recommending you a comedy, determining which route you drive, or automatically sorting your mail) to quite substantive (e.g., controlling the news you see, how much money you can borrow, or whether you get a job). When we see algorithms make mistakes, such as withholding welfare from people with disabilities or wrongfully arresting people, we naturally become curious as to why (or why not) some decision was made. In this article, I will discuss how algorithmic audits can address this problem and a new threat: auditees that can hide the decision-making process of their algorithm.

For those who are unfamiliar with black box algorithms or want a refresher, I highly recommend this article:

Algorithmic Audits

In the present day, algorithms make high-stakes decisions on our behalf. Unfortunately, there have been concerning trends in the real world: welfare cut for people with disabilities, synthetically created debt, wrongful imprisonment, and many more. Naturally, corporations have a special interest in risk mitigation and legal compliance. In addition, regulators and other sentinels need to ensure that corporations are compliant with applicable law. This all raises the question:

How can we audit black box algorithms?

Algorithmic audits can help achieve these goals and have received academic and commercial interest for several years now. However, there remain many challenges with such audits:

  • There is no canonical, agreed-upon definition of “algorithmic audit.” This leads to ambiguity, non-rigorous assessments, and the possibility of concerns being overlooked.
  • Some audits require direct access to the algorithm, not just its inputs and outputs. With current regulations, this cannot be guaranteed.
  • The results of some audits are not interpretable if the data is not interpretable.

Nonetheless, given concise and appropriate objectives and methodology, audits can prove useful in discovering biases and other concerns. In the following, I will define what I mean by algorithmic audit in terms of the inputs, outputs, applicable ordinance, and methodology.

  • Input: tabular data with human-comprehensible features. These features map to concepts that are either sensitive (protected by law, e.g., race or gender) or otherwise. Example: a loan application.
  • Output: a discrete, binary decision. Example: a loan application decision (approved or denied).
  • Ordinance: a definition of which input features are sensitive and how they may be used to make a decision. For simplicity, we say that no sensitive features can be used in a decision. Example: the ethnicity of a loan applicant cannot affect their loan decision.
  • Methodology: for each decision, we determine whether the sensitive input features were used. We do this by generating an explanation, which tells us how much each input feature contributed to the decision. We are interested in explanations where the sum of every feature contribution gives us the decision value (e.g., less than 0 if an application is denied and more than 0 otherwise). These local feature-additive explanations make it easy to determine compliance — if the contribution of any sensitive feature is not zero, then we know that the algorithm is noncompliant.
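The compliance check described in the methodology can be sketched in a few lines of Python. This is only an illustration: the feature names, the contribution values, and the tolerance `tol` are assumptions made for the example, not part of any particular explainer.

```python
# Hypothetical sensitive features, for illustration only.
SENSITIVE_FEATURES = {"ethnicity", "gender"}


def audit_decision(contributions, sensitive=SENSITIVE_FEATURES, tol=1e-9):
    """Audit one decision from a local feature-additive explanation.

    `contributions` maps feature name -> signed contribution. Their sum is
    the decision value (below 0 means denied, above 0 means approved).
    """
    decision_value = sum(contributions.values())
    # Any sensitive feature with a nonzero contribution violates the ordinance.
    violations = {
        name: value
        for name, value in contributions.items()
        if name in sensitive and abs(value) > tol
    }
    return {
        "approved": decision_value > 0,
        "compliant": not violations,
        "violations": violations,
    }


# Two made-up explanations of denied loan applications.
compliant_case = {"income": -0.6, "late_payments": -0.9, "ethnicity": 0.0}
noncompliant_case = {"income": 0.2, "late_payments": -0.1, "ethnicity": -1.4}

print(audit_decision(compliant_case))
print(audit_decision(noncompliant_case))
```

Both applications are denied, but only the second explanation assigns a nonzero contribution to a sensitive feature. A small tolerance is used here because fitted explanations rarely produce exact zeros; how strict that tolerance should be is a judgment call for the auditor.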

The following image visualizes this process for two different algorithms. In both cases, the loan application is denied for the individual, but only the second (bottom) algorithm is noncompliant. This is due to the ethnicity of the applicant (which is sensitive) being used to predominantly influence the decision as indicated by its contribution (as represented in the right-hand side box).

An example of an audit using an explainer for two algorithms.
An example of an audit using an explainer. The first auditee (top) is compliant, but the second (bottom) is noncompliant. Image by the author.

In practice, these explainers have only black-box access to the algorithm: for regulatory and proprietary reasons, we are not guaranteed access to its internals, only to its inputs and outputs. To understand how the algorithm uses the inputs, we can vary the inputs and model the resulting changes in the output. This variation process is called perturbation; I will explain its relevance and further details later.

The advantages of such an auditing scheme are that it is trustless, does not rely on subjective user reports, and does not require a company to divulge intellectual property to another party.

Adversarial Auditees and Noncompliance

A VW TDI vehicle that bypassed emissions regulation by fooling regulators.
Photo by Mariordo Mario Roberto Duran Ortiz — Own work, CC BY-SA 3.0. Source.

In the previous section, we saw that our explainer could uncover whether an algorithm uses sensitive features illegally. So—

Is the auditing problem solved?

No. There is the unfortunate possibility that the auditee is trying to hide its true decision-making process. To better understand this, we can look at the Volkswagen emissions scandal. In 2008, Volkswagen announced their new “Clean Diesel” TDI cars. However, the cars could not meet emission standards without significantly hindering fuel economy and performance. Rather than re-engineering the cars, Volkswagen modified the car software to detect whether a car was undergoing emissions testing. This involved determining whether vehicle inputs were unusual, including the steering wheel position, vehicle speed, throttle position, duration of engine operation, and barometric pressure. Depending on these inputs, the car would toggle between a legal and an illegal operation mode. It was not until 2015 that the US Environmental Protection Agency (EPA) discovered that the TDI nitrogen oxide (NOx) emissions were up to 40x higher than in testing. Thereafter, the company was sued, fined, and charged worldwide from 2015 to 2020.

The relevance of this scandal will become clear shortly.

How These Explainers Work

I mentioned earlier that the explainers we consider rely on perturbation to generate explanations. What does this mean and why is this done? Recall that we want to generate a local feature-additive explanation — generally, this can be done by fitting a linear model to a set of perturbations (data samples that are synthetically created) around a local region of a function. The following image provides a visualization of this process.

(Left) Linear models have global fidelity of linear functions. (Center) Linear models do not have global fidelity of nonlinear functions. (Right) Linear models can have local fidelity of nonlinear functions. Image by the author.

You should think of the true function as the algorithm that we are auditing — x1 and x2 are the input features and y is the algorithm decision. In each figure, we show a set of sampled points (the explainer-generated perturbations). The main takeaways are —

  • (Left) A linear model can be fitted to a linear function with high fidelity. Recall that we do not have access to the true function as the auditor. However, since the function is linear, our linear model yields a good result.
  • (Center) A linear model cannot fit nonlinear functions well. Qualitatively, we can see that the contour of our linear fit is drastically different from the contour of the true function. Most practical algorithms are (highly) nonlinear, so the assumption of linearity does not hold in the real world.
  • (Right) A linear model can actually fit local regions of a nonlinear function reasonably well. This is because, in general, functions “appear” to be linear in a small enough neighborhood. We can see that the contours of the local neighborhood of the true function and the linear model are fairly close.

This exploitation of local linearity is exactly what is behind this class of explanation methods. Explainers that fall into this class include LIME and SHAP. The details of these specific algorithms are not important — we only care that they both generate data perturbations to explain black-box algorithm decisions.
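To make the idea of local linearity concrete, here is a minimal sketch that fits a linear model to Gaussian perturbations around a single point of a nonlinear function. It is not LIME or SHAP, which sample and weight perturbations in their own ways. The stand-in `black_box` function, the perturbation scale, and the sample count are all assumptions chosen for illustration.

```python
import numpy as np


def black_box(X):
    """Stand-in nonlinear 'algorithm'; the auditor can only call it."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2


def local_linear_explanation(f, x, scale=0.05, n_samples=500, seed=0):
    """Fit y ~ w . (x' - x) + b on Gaussian perturbations x' around x."""
    rng = np.random.default_rng(seed)
    perturbations = x + scale * rng.standard_normal((n_samples, x.size))
    y = f(perturbations)
    # Design matrix: centered features plus a column of ones for the intercept.
    A = np.hstack([perturbations - x, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]  # per-feature slopes, intercept


x0 = np.array([0.5, 1.0])
weights, intercept = local_linear_explanation(black_box, x0)
print("local slopes:", weights)
print("intercept:", intercept)
```

With a small perturbation scale, the fitted slopes should land close to the true local slopes of the function at `x0` (here, cos(0.5) for x1 and 2 for x2). Increasing `scale` widens the neighborhood and the linear fit degrades, which is the behavior shown in the center panel of the figure.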

Adversarial Exploitation

Akin to Volkswagen fooling emissions regulators, it is natural to wonder if a similar situation is possible in an algorithmic setting.

Can an algorithm under audit behave adversarially toward the auditor?

In 2019, it was first demonstrated that the explainers that we are considering are susceptible to an adversarial attack. The authors of the paper, “Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods,” noticed that the explainer-generated perturbations (that we discussed earlier) do not “look like” normal data points. Consider the following figure—

2D PCA manifold of the COMPAS data set. A clear separation between natural and perturbed samples can be seen.
Natural samples and explainer perturbations are projected onto the first two principal components (PCA) of the COMPAS recidivism data set. Image from “Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods.”

As we can see, there is a clear separation between points from the original data and the perturbations. Similar to Volkswagen detecting specific patterns on the driving controls, the authors of this work devise a way of detecting explainer-generated data points. Whereas Volkswagen used this information to reduce its emissions, the authors switch their algorithm to one that is legally compliant. This process is illustrated in the following figure—

An illustration of the “scaffolding attack.” An adversary can toggle between a legal and illegal algorithm (classifier) by detecting if the input data came from an auditor. Image by the author.

As shown, an adversary can toggle between a legal and illegal algorithm (classifier) by detecting if the input data came from an auditor (i.e., is an explainer perturbation). This toggling is controlled by an out-of-distribution (OOD) detector. In this example, an applicant (top) is determined to be normal (in-distribution) due to their FICO, late payments, and income being jointly realistic. However, the second applicant (bottom) is determined to be anomalous (out-of-distribution) as they have a low FICO and many late payments for someone with such an income. By detecting enough of these out-of-distribution samples, the auditee can effectively manipulate the result of the audit by the explainer. The authors of the paper demonstrate that the attack works across several tabular data sets. I will refer to this approach as the “scaffolding attack” for the rest of the article.
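The toggling logic can be sketched as follows. This is a toy illustration under loud assumptions: the reference applicants, the two models, and the nearest-neighbor distance threshold used as the OOD detector are all hypothetical stand-ins, and the detector in the original attack is a learned one rather than this simple rule.

```python
import math

# Hypothetical "real" applicants: (FICO score, late payments, ethnicity code).
REAL_APPLICANTS = [(720, 0, 0), (680, 1, 1), (750, 0, 1), (640, 3, 0)]


def is_out_of_distribution(x, reference=REAL_APPLICANTS, threshold=60.0):
    """Toy OOD detector: flag inputs far from every known real applicant."""
    nearest = min(math.dist(x[:2], r[:2]) for r in reference)
    return nearest > threshold


def biased_model(x):
    """Illegal: the decision depends only on the sensitive ethnicity code."""
    return 1 if x[2] == 0 else 0


def innocuous_model(x):
    """Legal: the decision depends only on the FICO score."""
    return 1 if x[0] >= 660 else 0


def scaffolded_model(x):
    """Send suspected auditor queries to the legal model, the rest to the biased one."""
    return innocuous_model(x) if is_out_of_distribution(x) else biased_model(x)


realistic = (680, 1, 1)     # matches a known applicant, so the biased model decides
unrealistic = (820, 12, 1)  # jointly implausible values, so the innocuous model decides
print(scaffolded_model(realistic), scaffolded_model(unrealistic))
```

An explainer that mostly queries the model with unrealistic perturbations will mostly see the innocuous model, so its explanation will attribute the decision to FICO rather than ethnicity.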

There is a myriad of reasons why an auditee may want to obfuscate the behavior of their algorithm. These include the desire to avoid bad PR (e.g., if the algorithm is prejudiced), to protect intellectual property, fiscal incentives, personal motivation, etc. Ultimately, the goal is for their product or service to appear legal and not be subject to scrutiny.

Defending the Auditor

Knowing that this scaffolding attack is a possibility, it is only natural to wonder if we can modify our explainer-based auditing approach to address this case.

How can we tell if the auditee is acting adversarially? Moreover, what can we do if we know this information?

This is the problem that our work formalizes and provides a solution to. To summarize:

  • We want to audit an algorithm using an explainer
  • The algorithm may or may not be lying to our explainer
  • We want to detect whether the algorithm is lying
  • To do so, we need a way of determining whether the algorithmic behavior changes depending on whether the input is normal (real) data or explainer-generated perturbations

While an adversarial auditee detects anomalous inputs, the auditor defense detects anomalous behavior of the algorithm under audit. Image by the author.

The various moving parts of this problem are shown above. While an adversarial auditee detects anomalous inputs, the auditor defense detects anomalous behavior of the algorithm under audit. Furthermore, while the presence of an anomaly informs the auditee which algorithm it should use (legal or illegal), it informs the auditor how an explanation should be formed.

Note that the detection on the audit defense side involves both the input data and the prediction — whereas the scaffolding attack employs anomaly detection, we propose to use conditional (or contextual) anomaly detection. To better understand the distinction, consider the following one-dimensional example—

Examples of an anomaly and a conditional anomaly. Image by the author.

We can say that the point marked on the left is anomalous because we have never seen an income at that level before. In contrast, we have seen incomes in the range of the point marked as a conditional anomaly. However, we have never seen an income at that level with that loan risk level before. This is the main distinction between these two types of anomalies, and why we must consider the context of a decision to determine if there is adversarial behavior of an auditee.

To perform conditional anomaly detection, we propose a new algorithm, k-nearest neighbors conditional anomaly detection (KNN-CAD). If you want to understand the details, read on. Otherwise, you can skip to the results in the next section. KNN-CAD mainly comprises the following steps:

  1. Fit a KNN model on a set of samples representing normality (e.g., normal bank loan applications). Each point should be labeled with the prediction of the audited algorithm (e.g., the bank loan decision). This data will serve as a reference for comparing with new data points (explainer perturbations or other normal samples).
  2. For a given data point query (e.g., a new bank loan application and its decision), the KNN model is queried for the k nearest neighbors. These neighbors should be grouped according to their label (e.g., rejected or accepted).
  3. Next, two average distances are computed: the average distance from the queried point to the neighbors that share its label, and the average distance from the queried point to all k neighbors.
  4. The likelihood of the queried point can be expressed as the ratio between these two average distances.

Consider the following example on the toy iris data set.

The test/query point is an Iris virginica. Of the k = 5 nearest neighbors, three of them are also Iris virginica and two of them are Iris setosa. In turn, KNN-CAD gives this point a likelihood of approximately 3/5.
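The four steps above can be sketched with a brute-force nearest-neighbor search. This follows only the prose description in this article, not the exact scoring function from the paper. In particular, the direction of the ratio (all-neighbor distance over same-label distance), the zero score when no neighbor shares the label, and the toy reference data are assumptions made for this example.

```python
import numpy as np


def knn_conditional_score(X_ref, y_ref, x, y, k=5, eps=1e-12):
    """Score how typical label y is for input x, given labeled reference data.

    ASSUMPTION: score = (mean distance to all k neighbors) divided by
    (mean distance to the neighbors sharing label y); 0 if none share it.
    """
    distances = np.linalg.norm(X_ref - x, axis=1)
    nearest = np.argsort(distances)[:k]        # step 2: k nearest neighbors
    d_all = distances[nearest]
    d_same = d_all[y_ref[nearest] == y]        # neighbors with the query's label
    if d_same.size == 0:
        return 0.0                             # label never seen nearby
    return float(d_all.mean() / (d_same.mean() + eps))  # steps 3 and 4


# Step 1: toy reference set, labeled with the audited algorithm's decisions.
X_ref = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0]])
y_ref = np.array([0, 0, 0, 0, 1, 1, 1, 1])

query = np.array([0.5, 0.5])
score_expected_label = knn_conditional_score(X_ref, y_ref, query, 0)
score_unusual_label = knn_conditional_score(X_ref, y_ref, query, 1)
print(score_expected_label, score_unusual_label)
```

The query point sits among reference points labeled 0, so pairing it with label 0 yields a higher score than pairing it with label 1. A point that looks normal but arrives with an unusual decision is exactly the conditional anomaly described earlier.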

Finally, conditional anomaly detection can be used to:

  1. Detect if there is adversarial behavior. In short, we can estimate if the empirical likelihood of explainer-generated perturbations is significantly lower than that of normal data. If so, it is quite likely that the audited algorithm behaves differently when an explainer is in use.
  2. Defend the explainer so that it is “unfooled.” Simply, we can remove explainer perturbations that have a low likelihood according to our conditional anomaly detector. This leaves us with a set of perturbations that the adversary cannot distinguish from normal samples.
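Both uses can be sketched given likelihood scores from any conditional anomaly detector. The scores below are made up, the fixed gap between means stands in for a proper statistical comparison, and taking the lowest normal score as the cutoff is just one assumed way to pick a threshold.

```python
from statistics import mean


def detect_adversary(normal_scores, perturbation_scores, min_gap=0.2):
    """Flag an audit when perturbations score much lower than normal data.

    ASSUMPTION: a fixed gap between mean scores stands in for a proper
    statistical test.
    """
    return mean(normal_scores) - mean(perturbation_scores) > min_gap


def filter_perturbations(perturbations, perturbation_scores, threshold):
    """Keep only perturbations whose score is at least the threshold."""
    return [p for p, s in zip(perturbations, perturbation_scores) if s >= threshold]


# Hypothetical likelihood scores from a conditional anomaly detector.
normal_scores = [0.9, 0.8, 0.85, 0.95]
perturbation_scores = [0.9, 0.1, 0.2, 0.8]
perturbations = ["p0", "p1", "p2", "p3"]

threshold = min(normal_scores)  # assumed cutoff: lowest score seen on normal data
attack_suspected = detect_adversary(normal_scores, perturbation_scores)
kept = filter_perturbations(perturbations, perturbation_scores, threshold)
print(attack_suspected, kept)
```

The explainer would then fit its local model using only the kept perturbations, which are the ones the audited algorithm treated the same way it treats normal data.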

While any conditional anomaly detector can be used in the detection and defense approaches enumerated above, we use KNN-CAD in our experiments as it performs well compared to other approaches, especially if the data contains categorical features. KNN-CAD is also preferable to deep learning alternatives as it is far more data-efficient.

Defense Results

We evaluate KNN-CAD with the defense approach on several real-world data sets and multiple explainers. The following figure shows an example of the efficacy of our methods.

A comparison of our proposed defense to an unprotected explainer on the COMPAS data set. The auditee is trying to hide the fact that it relies on race in scoring recidivism risk. When an attack is employed, the adversary switches to a legal algorithm that relies on features that are uncorrelated with race. Our defense successfully detects and mitigates the attack. Image by the author.

The above figure demonstrates the effectiveness of our proposed defense for the LIME explainer on the COMPAS data set. In this experiment, the auditee is trying to hide the fact that it relies on race in scoring recidivism risk. When an attack is employed, the adversary switches to a legal algorithm that relies on features that are uncorrelated with race. When the attack is not deployed, LIME determines that race is the most influential feature for recidivism risk in the audited algorithm. However, when attacks are deployed (the two use different combinations of the uncorrelated features), LIME cannot reliably determine that race is influencing recidivism risk. When our defense approach is deployed, we can determine that the audited algorithm relies on race to score recidivism risk even when the attacks are in use.

Our work on this problem, entitled “Unfooling Perturbation-Based Post Hoc Explainers,” was published in the Proceedings of the 37th AAAI Conference on Artificial Intelligence (2023). If you would like to see further details, results, and analyses, I recommend you give it a look over. Our defense is shown to be effective for both LIME and SHAP across multiple real-world data sets, attack configurations, and against a multifaceted evaluation of fidelity.


While the proposed approach demonstrably mitigates the adversary, there are some limitations that we must discuss.

  • The explainers that we consider are limited. It is known that they can have issues with fidelity, the degree of which is affected by the explanation algorithm, the type of explanation (e.g., counterfactual, salience map, rule set, etc.), the type of data, and the class of predictor.
  • The adversary can deploy the attack irregularly or consider other pieces of information in detecting an auditor (e.g., IP addresses, the rate of queries, etc.). We do not consider such cases at the moment.
  • Gathering ground truth data for the defense can come at some cost. However, we demonstrate in the paper that our approach still works with very few samples.
  • Only the case of binary decisions is considered here. However, it is possible to extend the approach to any order of decision, as outlined in the paper.

A Call for Regulatory Development

Algorithmic regulation is still in its infancy, and existing (and proposed) regulation is quite wanting. For instance, the EU GDPR is often claimed to guarantee a “right to explanation.” However, this only applies to data privacy and not automatic decision-making systems. The EU AI Act remains unclear as to how it may be enforced or implemented at the member state level. The US Algorithmic Accountability Act is argued to lack specificity and does not even apply to many organizations. Moreover, it is unclear how an adversarial auditee fits into current regulations.

I end this article with a few points on regulation:

  • We need clarity as to what is being audited, how existing law should be extended to consider algorithms, how such audits and law can be complied with, and how algorithmic regulation can be grounded in application-specific requirements.
  • Regulation should not be overburdening, hindering the development of this demonstrably transformative technology. For instance, some scholars have suggested that the National Environmental Policy Act (NEPA) should be followed. However, NEPA assessments average over four years to complete and four assessments took more than 17 years to complete between 2010 and 2017.
  • It has been argued that mandatory algorithmic auditing can lead to political meddling, which could be realized by partisan interpretations of ambiguous laws, the politicization of legitimate terms (e.g., “safety” or “security”), or other means.
  • Nonetheless, regulation is a largely missing component in automated decision-making systems and can lead to a reduction in harmful incidents. Audits are already employed to effectively address environmental concerns, safety, financial accountability, human rights issues, and more. Whether algorithmic regulation grows organically or through well-crafted proposals, it is clear that it is needed.

Further Reading

If you are interested in our work on this problem, our paper, “Unfooling Perturbation-Based Post Hoc Explainers,” was published in the Proceedings of the 37th AAAI Conference on Artificial Intelligence (2023).

If you have any questions or comments, feel free to leave a comment on this page or shoot me a message on my website at

I leave you with a picture of my cat.

My cat, Qbit. He’s the best cat.
My cat, Qbit. Image by the author.



Zachariah Carmichael

I make explainable and trustworthy AI for social betterment and scientific progress. PhD Candidate at the University of Notre Dame. Contact at