On Recent Research Auditing Commercial Facial Analysis Technology

Mar 26, 2019

Concerned researchers

Over the past few months, there has been increased public concern over the accuracy and use of new face recognition systems. A recent study conducted by Inioluwa Deborah Raji and Joy Buolamwini, published at the AAAI/ACM conference on Artificial Intelligence, Ethics, and Society, found that the version of Amazon’s Rekognition tool that was available in August 2018 had much higher error rates when classifying the gender of darker-skinned women than of lighter-skinned men (31% vs. 0%). In response, two Amazon officials, Matthew Wood and Michael Punke, wrote a series of blog posts attempting to refute the results of the study [1, 2]. In this piece we highlight several important facts that reinforce the importance of the study, and we discuss the manner in which Wood and Punke’s blog posts misrepresented the technical details of the work and the state of the art in facial analysis and face recognition.

There is an indirect or direct relationship between modern facial analysis and face recognition (depending on the approach). So in contrast to Dr. Wood’s claims, bias found in one system is cause for concern in the other, particularly in use cases that could severely impact people’s lives, such as law enforcement applications.
Raji and Buolamwini’s study was conducted within the context of Rekognition’s use. This means using an API that was publicly available at the time of the study, considering the societal context under which it was being used (law enforcement), and the amount of documentation, standards and regulation in place at the time of use.
The data used in the study can be obtained for non-commercial use through a request to https://www.ajlunited.org/gender-shades, and has been replicated by many companies based on the details provided in the paper available at http://gendershades.org/.
There are no laws or required standards to ensure that Rekognition is used in a manner that does not infringe on civil liberties.

We call on Amazon to stop selling Rekognition to law enforcement

A study published by Inioluwa Deborah Raji and Joy Buolamwini at the AAAI/ACM conference on Artificial Intelligence, Ethics, and Society examined the extent to which public pressure on companies helped address the bias in their products [3]. The study showed that companies audited in the Gender Shades project [4] (Microsoft, IBM and Face++) greatly improved their gender classification systems.¹ For women of color, Amazon and Kairos, which were not among those audited companies, had errors of approximately 31% and 22% respectively on the task of gender classification: determining whether a person in an image is “male” or “female.” Notably, current gender classification methods use only a “male” and “female” binary; non-binary genders are not represented in these systems. While we do not condone gender classification, it is important to recognize the role the Raji & Buolamwini work has played in highlighting the poor state of the art in current facial analysis technology, and the lack of insight companies have into this problem.

In response, Amazon Web Services’ (AWS) general manager of artificial intelligence, Matthew Wood, and vice president of global public policy, Michael Punke, attempted to refute the research by calling it “misleading” and claiming that it drew “false conclusions”. While others have refuted some of the points raised by Dr. Wood and Mr. Punke [5], we would like to discuss three of their points which are particularly concerning.

1. Facial analysis vs. recognition — Facial analysis and recognition have an important relationship and the racial and gender bias in the Raji & Buolamwini audit is cause for concern.

Dr. Wood writes:

Facial analysis and facial recognition are completely different in terms of the underlying technology and the data used to train them. Trying to use facial analysis to gauge the accuracy of facial recognition is ill-advised, as it’s not the intended algorithm for that purpose.

This statement is problematic on multiple fronts. First, despite Dr. Wood’s claim, “face recognition” and “facial analysis” are closely related, and it is common for machine learning researchers to categorize “recognition” as a type of analysis [6, 7]. For example, Escalera et al. [8] define all aspects of facial analysis from a computer vision perspective as:

Including, but not limited to: recognition, detection, alignment, reconstruction of faces, pose estimation of faces, gaze analysis, age, emotion, gender, and facial attributes estimation, and applications among others.

To help better contextualize the potential for confusion and the relationship between facial analysis, face recognition, and related technologies, it is useful to take a step back and understand the overall landscape of current research in human-centric computer vision tasks for images. We provide a basic (but not exhaustive) breakdown in the table below.

Table: Human-Centric Computer Vision Landscape for Single Images (example human-centric vision tasks for single images).

Beyond this common categorization in facial analysis research, well-known methods for face recognition are based on other aspects of facial analysis, and vice versa: Faces may be detected before they are recognized [9], and face recognition datasets may be used as pre-training for other facial analysis tasks [10, 11].

Out of all of the tasks categorized as face recognition or facial analysis, classifying faces into two binary gender options, as performed by Amazon’s and others’ gender classification APIs, is technically simplistic (without accounting for the social complexity). That is because it involves training a model to classify an image into one of only two categories: “male” or “female”. Thus, this task is particularly well suited for analyzing and demonstrating some of the biases that exist in human-centric computer vision tools. Hence, recent works have used it to inform the public, governments, and companies about the risks of using such tools in high-stakes scenarios where there are currently no required standards, transparent governance, or federal regulation in place.

Second, even if a researcher were to frame facial analysis and face recognition as different tasks, Dr. Wood’s statement also incorrectly implies that biases in one task should not be cause for concern about potential biases in the other, especially when considering products used in law enforcement. Despite slightly different outputs, it is possible for the gender classification task Raji and Buolamwini examined in their study to use the same base models and identical training data [10, 12] as a face recognition task that could be used by law enforcement.

Raji and Buolamwini also made no claim to be “gauging” accuracy; instead, they sought to highlight the need for rigorous intersectional analysis of commercial face recognition systems. One of the key lessons from Buolamwini’s prior work is the importance of disaggregated evaluation: breaking down the evaluation of a model’s performance along different categories. As she shows, it is not enough to simply evaluate on single categories. Looking at combinations, such as gender and skin type, exposes weaknesses in human-centric vision technology that have largely been overlooked.
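To make the idea of disaggregated evaluation concrete, here is a minimal Python sketch. The records, the group labels, and the `error_rates` helper are all hypothetical and invented for illustration; they are not data or code from the study. The sketch only shows how an error that is diluted in an aggregate or single-category score can become visible once results are broken down by combinations of attributes.

```python
from collections import defaultdict

# Hypothetical audit records: (skin_type, true_gender, predicted_gender).
# These toy values are made up for illustration; they are NOT study data.
records = [
    ("lighter", "male", "male"),
    ("lighter", "male", "male"),
    ("lighter", "female", "female"),
    ("lighter", "female", "female"),
    ("darker", "male", "male"),
    ("darker", "male", "male"),
    ("darker", "female", "male"),    # the only misclassification
    ("darker", "female", "female"),
]

def error_rates(records, key):
    """Return the fraction of misclassified records in each group.

    `key` maps a record to the group it belongs to.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        if rec[1] != rec[2]:  # true label differs from prediction
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Aggregate score: one number for the whole test set.
overall = error_rates(records, key=lambda r: "all")
# Single-category breakdowns.
by_skin_type = error_rates(records, key=lambda r: r[0])
by_gender = error_rates(records, key=lambda r: r[1])
# Intersectional breakdown: skin type and gender together.
intersectional = error_rates(records, key=lambda r: (r[0], r[1]))

for name, table in [("overall", overall), ("by skin type", by_skin_type),
                    ("by gender", by_gender), ("intersectional", intersectional)]:
    print(name, table)
```

In this toy data the aggregate error is 12.5%, each single-category breakdown tops out at 25%, and the intersectional breakdown shows a 50% error rate concentrated in one subgroup.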

Due in large part to this influential work, many industries and disciplines have started to deeply question and examine errors that may fall disproportionately on some subpopulations. For example, Dr. Dhruv Khullar, a physician at New York-Presbyterian Hospital and an assistant professor in the Weill Cornell Department of Healthcare Policy and Research, wrote a New York Times Op-ed [13] discussing the ways in which AI could worsen disparities in healthcare, citing Gender Shades and asking “What happens when we rely on such algorithms to diagnose melanoma on light versus dark skin?”

So while Dr. Wood is correct that some people frame facial analysis and face recognition as different tasks, he misrepresents their close relationship and how demonstrated biases in one motivate concern about biases in the other. Caution, concern, and rigorous evaluation, sensitive to the intersecting demographics that affect human-centric computer vision for images, are even more pressing when considering products that are used in scenarios that severely impact people’s lives, such as law enforcement.

2. Societal Context Matters — Raji and Buolamwini’s paper investigates Amazon’s products within the context and society they are used.

Dr. Wood and Mr. Punke repeatedly mention that the study did not test Amazon’s product in the manner in which it was intended to be used. For example:

The research paper in question does not use the recommended facial recognition capabilities, does not share the confidence levels used in their research, and we have not been able to reproduce the results of the study.

And

To understand, interpret, and compare the accuracy of machine learning systems, it’s important to understand what is being predicted, the confidence of the prediction, and how the prediction is to be used, which is impossible to glean from a single absolute number or score.

It is important to test systems like Amazon’s Rekognition in the real world, in the ways they are likely to be used. This includes “black box” scenarios, where users do not interact with inner details of the system such as the model, training data, or settings. Products like Amazon’s Rekognition are embedded in sociotechnical systems. As described by Selbst et al. [14]:

The field of Science and Technology Studies (STS) describes systems that consist of a combination of technical and social components as “sociotechnical systems.” Both humans and machines are necessary in order to make any technology work as intended.

Such systems should be studied in the context of the society in which they are used, the knowledge and motivations of the operators, the standards and documentation available to them, and the types of mechanisms in place to prevent misuse. Currently, many of these pieces are either not in place or inadequate. According to Gizmodo [15], one of the few known Rekognition customers has mentioned inadequate training, and does not use the 99% confidence threshold that Dr. Wood says Amazon recommends.

[W]hen asked by Gizmodo if they adhere to Amazon’s guidelines where this strict confidence threshold is concerned, the WCSO Public Information Officer (PIO) replied, “We do not set nor do we utilize a confidence threshold”….The PIO further informed Gizmodo that, while Amazon did supply documentation and other support on the software end, no direct training was given to the investigators who continue to use the suite.
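Why the absence of a confidence threshold matters can be shown with a small Python sketch. This is not Amazon’s API: the candidate list, the similarity scores, and the `filter_matches` helper are hypothetical, and the 99% value is simply the threshold mentioned above. Under those assumptions, the sketch shows that an operator who sets no threshold is handed every candidate the system returns, including low-confidence ones.

```python
# Hypothetical candidates returned for one probe image:
# (candidate_id, similarity score in percent). Values are invented.
candidates = [
    ("person_a", 99.4),
    ("person_b", 92.1),
    ("person_c", 81.7),
]

def filter_matches(candidates, threshold=None):
    """Keep candidates whose score is at or above `threshold`.

    With threshold=None nothing is filtered, which mirrors an operator
    who neither sets nor uses a confidence threshold.
    """
    if threshold is None:
        return list(candidates)
    return [c for c in candidates if c[1] >= threshold]

# No threshold: all three candidates are presented as possible matches.
print(filter_matches(candidates))
# A 99% threshold: only the highest-scoring candidate remains.
print(filter_matches(candidates, threshold=99.0))
```

Even with a threshold, a high score is not proof of identity; the sketch only illustrates how much the operator’s settings shape what the tool reports.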

In The Perpetual Line-Up [16], Clare Garvie, Alvaro Bedoya and Jonathan Frankle of the Center on Privacy & Technology at Georgetown Law, a think tank that studies law enforcement’s use of face recognition, discuss the real-world implications of misuse of face recognition tools: that the wrong people may be put on trial due to cases of mistaken identity. They highlight that law enforcement operators often do not know the parameters of these tools, nor how to interpret some of their results. Decisions from such automated tools may also seem more correct than they actually are, a phenomenon known as “automation bias”, or may prematurely limit human-driven critical analyses. In his New York Times Op-ed, Dr. Khullar writes:

In my practice, I’ve often seen how any tool can quickly become a crutch — an excuse to outsource decision making to someone or something else. Medical students struggling to interpret an EKG inevitably peek at the computer-generated output at the top of the sheet. I myself am often swayed by the report provided alongside a chest X-ray or CT scan. As automation becomes pervasive, will we catch that spell-check autocorrected “they’re” to “there” when we meant “their”?

We do not currently have a mechanism to catch the types of errors mentioned by Dr. Khullar.

Dr. Wood continues:

And, to date (over two years after releasing the service), we have had no reported law enforcement misuses of Amazon Rekognition.

There are currently no laws in place to audit Rekognition’s use, and Amazon has disclosed neither who its customers are nor what the error rates are across different intersectional demographics. How, then, can we be sure that this tool is not being used improperly, as Dr. Wood states? What we can rely on are audits by independent researchers, such as Raji and Buolamwini, with concrete numbers and clearly designed, explained, and presented experimentation, that demonstrate the types of biases that exist in these products. This critical work rightly raises the alarm on using such immature technologies in high-stakes scenarios without a public debate and legislation in place to ensure that civil rights are not infringed.

3. Companies such as IBM and Microsoft have reproduced comparable data and results using the guidelines written in the Gender Shades work, which reiterates that ethnicity cannot be used as a proxy for skin type when performing disaggregated testing.

Finally, Mr. Punke writes that

These groups have refused to make their training data and testing parameters publicly available.

The Gender Shades project website makes the data publicly available with certain licensing agreements, and those who could not agree to the terms (researchers in companies such as IBM and Microsoft) have instead reproduced comparable data and results using the guidelines written in the paper.

Dr. Wood writes that Amazon did not see any disparities by ethnicity on the task of gender classification when performing a test on a dataset of 12,000 images. As also discussed in Gender Shades, ethnicity and race are social constructs that are unstable across time and space. This is why the authors instead partnered with a dermatologist to label images using the Fitzpatrick skin-type classification system. Dr. Wood’s failure to recognize this important distinction while discrediting peer-reviewed research that undeniably motivates caution is disconcerting.

We call on Amazon to stop selling Rekognition to law enforcement as legislation and safeguards to prevent misuse are not in place.

Overall, we find Dr. Wood and Mr. Punke’s response to the peer-reviewed research findings disappointing. We hope that the company will instead thoroughly examine all of its products and question whether they should currently be used by police. Mr. Punke writes that the company supports legislation to ensure that its products are not used in a manner that infringes civil liberties. We call on Amazon to stop selling Rekognition to law enforcement as such legislation and safeguards are not in place.

¹Researchers such as Os Keyes and Morgan Klaus Scheuerman have written about the potential harms of automatic gender recognition for transgender communities, given that the effects of systems which use automated gender recognition can range from the humiliating (serving an ad in public which implicitly misgenders the individual) to the violent (rejecting identification or entry to a public bathroom).