Witnessing Algorithms at Work: Toward a Typology of Audits

No one form of witnessing is sufficient to assemble accountability, but auditing can help make algorithmic systems accountable.

Jacob Metcalf
Data & Society: Points
Aug 11, 2022

--

By Jacob Metcalf, Ranjit Singh, Emanuel Moss, and Elizabeth Anne Watkins

Photo by Tianyi Ma on Unsplash

How do we trust algorithmic systems? This question has been a central thread that binds a tapestry of emerging conversations and research in the field of data science writ large. It is asked and answered differently by those involved in building, deploying, researching, and living with algorithmic systems. As one approach to answering it, there are increasing efforts to assess algorithmic systems’ effectiveness and potential consequences through internal and external audits. These audits have deep implications for legitimizing algorithmic systems as trustworthy, ethical, and accountable.

A longstanding practice within and beyond computing, auditing has become a core feature of the rapidly evolving field of algorithmic governance, and shares methods with social science audits rooted in social justice research. The methods and techniques used for such audits are, in effect, strategies of “algorithmic witnessing”: the process of creating conditions through which an audience can evaluate the workings of an algorithm. To witness an algorithmic system is to be able to understand the affordances and limits of its operation and to map the potential benefits and harms that it may produce.

Witnessing is deeply intertwined with trust

As the proverbial understanding that “seeing is believing” suggests, we often rely on our own senses, lived experience, and expertise, rather than on what somebody else tells us, to decide what to trust. To take one example of this commonsense understanding of trust in action: in the 17th century, early experimental science relied on “gentlemen” — men of noble birth and virtuous character — as “reliable truth-tellers” in witnessing how scientific experiments were conducted. Empiricism, the philosophical commitment to grounding knowledge in only what our senses can directly experience, arose along with practices of formally documenting precise experiences. The results of an experiment and, by extension, the scientist who conducted it, could be trusted because they were witnessed by trustworthy and credible external observers. This strategy took on a new form as scientists began to describe their experimental setup in painstaking detail to ensure that anybody reading the description (and with adequate expertise) could reproduce the experiment and its results, thus allowing for “virtual witnessing” of their experimental results — and establishing scientific “objectivity” as equivalent to replicability by anyone, anywhere. Feminist historians of science have pointed out that constructing scientific practice around achieving a “view from nowhere” was not as simple as it seems, as the social practices of science often limited who could be included in efforts to attain objective knowledge. In response, feminist approaches to scientific practice tend to emphasize the inclusion of social and epistemic positions — aka “standpoints” — within the accounting of objective knowledge.

Thus, scientific practice and the value of the knowledge it generates has always been entwined with questions of whether a “witness” is trustworthy, and what types of accounting can become a proxy for witnessing ourselves. If we are to claim objectivity — and others are to trust our claims — certain markers of trust and accountability must be achieved.

An algorithmic audit follows a similar pattern: we are expected to trust or distrust algorithmic systems based on witness accounts of their workings, provided by either developers or auditors. In computational contexts, auditing is the practice of comparing the workings of a system or an organization’s software development process against a benchmark and judging whether variance between them falls within acceptable parameters, and is thus justified. That benchmark could be a technical description provided by the developer, an outcome prescribed in a contract, a procedure defined by a standards organization such as IEEE or ISO, commonly accepted best practices, or a regulatory mandate. In the nascent algorithmic auditing industry, the standards used by auditors can be highly variable. Ideally, trustworthy audits are performed by experts with the capacity to render such judgment, and with a degree of independence. Much like “scientific witnessing,” the value of audits therefore rests not just in how the audit is conducted, but in who does the auditing and how they relate to other parties in an accountability structure.
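To make the shape of that comparison concrete, here is a minimal, hypothetical sketch in Python: it measures one property of a system (per-group selection rates), compares the result to a benchmark, and judges whether the variance falls within an acceptable parameter. The data are invented, and the four-fifths rule of thumb from disparate impact analysis is used only as a stand-in benchmark; a real audit would use whatever standard the developer, contract, standards body, or regulator specifies.

```python
# Minimal, hypothetical sketch of an audit check: measure, compare to a
# benchmark, and judge whether the variance is acceptable. The data and
# threshold are stand-ins, not any particular standard.
from collections import defaultdict

# Invented audit log of (applicant group, selected?) outcomes.
outcomes = [
    ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", False), ("group_b", False), ("group_b", True),
]

ACCEPTABLE_RATIO = 0.8  # stand-in benchmark: the four-fifths rule of thumb

def selection_rates(records):
    """Selection rate per group: selected / total."""
    totals, selected = defaultdict(int), defaultdict(int)
    for group, chosen in records:
        totals[group] += 1
        selected[group] += int(chosen)
    return {g: selected[g] / totals[g] for g in totals}

rates = selection_rates(outcomes)
impact_ratio = min(rates.values()) / max(rates.values())
verdict = "within" if impact_ratio >= ACCEPTABLE_RATIO else "outside"
print(f"selection rates: {rates}")
print(f"impact ratio {impact_ratio:.2f} is {verdict} the acceptable parameter ({ACCEPTABLE_RATIO})")
```

The point is not the arithmetic but the structure: a measurement, a benchmark, and a judgment about whether the gap between them is justified.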

A typology of audits

Grounded in the relationship between auditors and developers, we have devised a typology of these audits — first-, second-, and third-party — that helps explore and explain how they each produce trust.

First-party audits are internally conducted by developers who build algorithmic systems, akin to a scientist writing detailed descriptions of the experiment they performed. These audits are conducted due to a range of concerns, from common elements of responsible AI practice like transparency and fairness to utilitarian reasons, such as hedging against disparate impact lawsuits. They are structurally closest to the actual design of an algorithmic system and its specifications, and thus may afford a deeper understanding of its workings. Andrew Selbst and Solon Barocas point out that the core challenge of governing and trusting algorithmic systems is not explaining how a model works, but why the model was designed to work that way. Internal audit can serve this purpose of asking and documenting why, and introduces opportunities to reflect on the proper balance between end goals, core values, and technical trade-offs. As Inioluwa Deborah Raji et al. have argued: “At a minimum, the internal audit process should enable critical reflections on the potential impact of a system, serving as internal education and training on ethical awareness in addition to leaving what we refer to as a ‘transparency trail’ of documentation at each step of the development cycle.”

The issue of creating a transparency trail for algorithmic systems is a non-trivial problem: machine learning models tend to shed their ethically relevant context. Each step in the technical stack (layers of software that are “stacked” to produce a model in a coordinated workflow), from datasets to a deployed model, results in ever more abstraction from the context of data collection. Technical research has developed documentation methods that retain ethically relevant context throughout the development process. For example, Timnit Gebru et al. proposed “datasheets for datasets,” a form of documentation that could travel with datasets as they are reused and repurposed. Similarly, Margaret Mitchell et al. have described “model cards for model reporting,” a documentation process that retains information about benchmarked evaluations of the model in relevant domains of use, excluded uses, and factors for evaluation, among other details. However, no matter how thorough and well-meaning these developers-as-internal-auditors are, such reporting mechanisms are not yet “accountable.” Internal auditing as a form of algorithmic witnessing structurally lacks the authority of an external audience to ensure the validity of the claims made through, and methods used in, the audit. No matter how qualified the auditor or how thorough the audit, it is structurally limited to informing the developer about the concerns of the developer alone.
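As a rough illustration of how such documentation could travel with an artifact, the sketch below records intended uses, excluded uses, evaluation factors, and the provenance of the training data as a structured object that can be serialized alongside the model. The field names and example values are hypothetical simplifications, not the schemas actually proposed in the datasheets or model cards papers.

```python
# Simplified, hypothetical sketch of documentation that travels with a
# model through the technical stack. Fields and values are illustrative
# only, not the actual datasheet or model card schemas.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DatasetSheet:
    name: str
    collection_context: str                # who collected the data, how, and why
    known_gaps: list[str] = field(default_factory=list)

@dataclass
class ModelCard:
    model_name: str
    intended_uses: list[str]
    excluded_uses: list[str]
    evaluation_factors: list[str]          # e.g., subgroups the model was benchmarked on
    training_data: DatasetSheet            # the dataset's context travels with the model

card = ModelCard(
    model_name="example-screening-model",
    intended_uses=["decision support for trained reviewers"],
    excluded_uses=["fully automated decisions without human review"],
    evaluation_factors=["performance disaggregated by demographic subgroup"],
    training_data=DatasetSheet(
        name="example-administrative-records",
        collection_context="records collected by one agency for operational purposes",
        known_gaps=["people who never interacted with the agency are absent"],
    ),
)

# In practice this record would be serialized and shipped alongside the
# model artifact rather than left behind in a notebook.
print(json.dumps(asdict(card), indent=2))
```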

Second-party audits are conducted by external consultants hired by developers. Structurally more distant from the algorithm developers, this form of algorithmic witnessing offers a middle ground where the internal dynamics of system design can still be accessible to auditors. They constitute a kind of external audience that can offer a restricted form of assurance and validation. A good example here is the second-party ethics audit of the Allegheny Family Screening Tool (AFST) — an algorithmic system used to assist child welfare call screening — arguably the most thoroughly audited algorithmic system in use by a public agency in the US. This audit centered on the question of whether implementing AFST was likely to create the best outcomes among available alternatives, including proceeding with the status quo without any predictive services.

Since a second-party auditor is primarily responsive to the developers who hired them, and in many cases may not be able to share proprietary information relevant to the public interest, second-party audits can often proceed without public consultation or public access to the auditing process. In AFST’s case, for example, the second-party auditors assumed, in evaluating the risk of algorithmic bias toward non-white families, that interventions based on the tool would be experienced primarily as supportive rather than punitive. But a third-party audit conducted by Virginia Eubanks found evidence to the contrary. Eubanks revealed that many vulnerable families experienced a perverse incentive to forgo voluntary, proactive support so as to avoid creating another contact with service providers, which in turn increased their risk scores.

Second-party audits become even more ethically complicated when developers determine the scope of the audit. Mona Sloane has pointed out that these “collaborative audits” can set a dangerous precedent that lets developers set the standards their products should be held to. As an example, she points to an audit conducted collaboratively with Pymetrics, a company that uses gameplay to assess job applicants. This audit interrogated the Pymetrics system for racial and gender bias, but left unexamined the claim that gameplay performance was predictive of job performance. If the core claims of a system are not interrogated as part of the audit — that is, if the product doesn’t work as advertised — then an audit demonstrating that members of different protected classes are not differentially affected by a system amounts to little more than a rubber stamp. It could be the case that such gameplay does indeed measure job performance, but the parameters of the audit don’t include that core question because the developer lacks an incentive to ask, and therefore is unlikely to ask the second-party auditors to do so. Proving that a system doesn’t produce one kind of harm shouldn’t be a green light for a technology that might lead to other types of harm — in this case, the exclusion of worthwhile job applicants based on their skill at an unrelated task.
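A hypothetical sketch of that scoping problem, with entirely invented data, is below: the parity check a narrowly scoped audit would run comes back clean, while the question of whether the gameplay score predicts job performance at all, the claim the product rests on, never enters the audit’s scope.

```python
# Hypothetical illustration: a score can pass a group-parity check while
# carrying no signal about the outcome it claims to predict. All data
# are invented; requires Python 3.10+ for statistics.correlation.
import random
from statistics import correlation, mean

random.seed(0)

# Invented applicants: a group label, a gameplay score, and the job
# performance we would only observe later (and the audit never examines).
applicants = [
    {"group": random.choice(["a", "b"]),
     "game_score": random.random(),          # unrelated to performance here
     "job_performance": random.random()}
    for _ in range(1000)
]

# Check 1 (in the audit's scope): do groups pass the screen at similar rates?
CUTOFF = 0.5
pass_rates = {
    g: mean(int(p["game_score"] >= CUTOFF) for p in applicants if p["group"] == g)
    for g in ("a", "b")
}
print("pass rates by group:", {g: round(r, 2) for g, r in pass_rates.items()})

# Check 2 (outside the audit's scope): does the score predict performance at all?
r = correlation([p["game_score"] for p in applicants],
                [p["job_performance"] for p in applicants])
print(f"score vs. performance correlation: {r:.2f}")   # near zero by construction
```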

Second-party audits are already quite common in the computing industry, but they are typically conducted against a standard that is widely accepted. For example, software development lifecycle standards define common terminology, documentation, and tasks needed to control the functionality of software; such process standards provide the background against which a second-party auditor could assure both a client and vendor that a piece of software is likely to be reliable. The field of algorithmic auditing largely lacks such widely accepted standards, although increasingly there are efforts to create them.

Third-party audits are conducted by independent auditors who have no formal relationship with developers and are structurally independent from them. They often do not have access to the internal workings of an algorithmic system or its design specifications, but use their own techniques as algorithmic witnesses, evaluating system performance by carefully manipulating inputs and assessing outputs. Third-party audits are often adversarial, intended to bring an undesirable outcome to public attention (though an adversarial stance is not essential).

Adversarial audits have been a primary driver of public attention to algorithmic harms, and a motivating force for the development of first-party audits. Notable examples include ProPublica’s analysis of the Northpointe COMPAS recidivism prediction algorithm (led by Julia Angwin), the Gender Shades project’s analysis of race and gender bias in the facial recognition APIs offered by multiple companies (led by Joy Buolamwini), and Virginia Eubanks’ account of algorithmic decision systems employed by social service agencies. In each of these cases, external experts analyzed algorithmic systems primarily through the outputs of deployed systems, without access to the backend controls or models. This is the core feature of adversarial third-party algorithmic audits: the auditor lacks access to the backend controls and design records of the system and therefore is limited to understanding the outputs of these black-boxed systems. Without such access, a third party conducting quantitative analysis needs to rely on records of how the system operates in the field, from the epistemic position of an external observer rather than an engineer.
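A rough sketch of what this output-only position looks like, loosely in the spirit of (but in no way a reproduction of) the Gender Shades methodology: the auditor curates a hand-labeled probe set, queries the deployed system as a black box, and disaggregates the resulting error rates by subgroup. The deployed_gender_classifier function below is an invented stand-in for a vendor API, and its behavior is made up purely for illustration.

```python
# Hypothetical black-box audit sketch: the auditor sees only labeled
# inputs and the system's outputs, never the model or its design records.
from collections import defaultdict

def deployed_gender_classifier(image_id: str) -> str:
    """Opaque stand-in for a vendor's deployed API. Its invented behavior
    misgenders part of one subgroup so the audit has something to find."""
    subgroup, index = image_id.split(":")
    true_gender = "female" if "female" in subgroup else "male"
    if subgroup == "darker_female" and int(index) < 4:
        return "male"                        # misclassification
    return true_gender

# Auditor-curated, hand-labeled probe set: (subgroup, true gender, image id).
subgroups = ("lighter_male", "lighter_female", "darker_male", "darker_female")
probes = [(sg, "female" if "female" in sg else "male", f"{sg}:{i}")
          for sg in subgroups for i in range(10)]

errors, totals = defaultdict(int), defaultdict(int)
for subgroup, true_gender, image_id in probes:
    prediction = deployed_gender_classifier(image_id)   # the auditor's only access
    totals[subgroup] += 1
    errors[subgroup] += int(prediction != true_gender)

for subgroup in subgroups:
    print(f"{subgroup:15s} error rate: {errors[subgroup] / totals[subgroup]:.0%}")
```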

This external position can also become a weakness: when the targets of these audits have engaged in rebuttals, for example in response to ProPublica’s analysis or to Virginia Eubanks’ portrayal of AFST, their technical analyses have invoked knowledge of the systems’ design parameters that an adversarial third-party auditor could not have had access to. The reliance on such technical analyses in response to audits that point out socio-political harms all too often falls into the trap of the specification dilemma, which prioritizes technical explanations for why a system might function as intended, while ignoring that accurate results might themselves be the source of harm. An inaccurate match made by a facial recognition system, for example, may not by itself be an algorithmic harm, but the exclusionary consequences that flow from such misrecognition certainly meet the definition of algorithmic harms.

Third-party audits have brought to light how little the public knows about the actual workings of algorithmic systems that render major decisions about our lives through algorithmic prediction and classification, but they cannot alone constitute accountability. Nor can any kind of audit: none of these forms of algorithmic witnessing are sufficient by themselves to assemble accountability, but together they are necessary to make algorithmic systems accountable. True accountability emerges only when an audit is embedded in a larger ecosystem of regulatory checks and balances — one that also specifies what should happen when things go wrong.
