On Recent Research Auditing Commercial Facial Analysis Technology

Concerned Researchers

Over the past few months, there has been increased public concern over the accuracy and use of new face recognition systems. A recent study conducted by Inioluwa Deborah Raji and Joy Buolamwini, published at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, found that the version of Amazon’s Rekognition tool that was available in August 2018 had much higher error rates when classifying the gender of darker-skinned women than of lighter-skinned men (31% vs. 0%). In response, two Amazon officials, Matthew Wood and Michael Punke, wrote a series of blog posts attempting to refute the results of the study [1, 2]. In this piece we highlight several important facts that reinforce the importance of the study, and we discuss the ways in which Wood and Punke’s blog posts misrepresent the technical details of the work and the state of the art in facial analysis and face recognition.

  1. There is an indirect or direct relationship between modern facial analysis and face recognition (depending on the approach). So in contrast to Dr. Wood’s claims, bias found in one system is cause for concern in the other, particularly in use cases that could severely impact people’s lives, such as law enforcement applications.
  2. Raji and Buolamwini’s study was conducted within the context of Rekognition’s use. This means using an API that was publicly available at the time of the study, considering the societal context under which it was being used (law enforcement), and the amount of documentation, standards and regulation in place at the time of use.
  3. The data used in the study can be obtained, for non-commercial uses, through a request at https://www.ajlunited.org/gender-shades, and it has been replicated by many companies based on the details provided in the paper available at http://gendershades.org/.
  4. There are no laws or required standards to ensure that Rekognition is used in a manner that does not infringe on civil liberties.

We call on Amazon to stop selling Rekognition to law enforcement

A study published by Inioluwa Deborah Raji and Joy Buolamwini at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society examined the extent to which public pressure on companies helped address the bias in their products [3]. The study showed that the companies audited in the Gender Shades project [4] (Microsoft, IBM, and Face++) greatly improved their gender classification systems.¹ In contrast, for women of color, Amazon and Kairos had error rates of approximately 31% and 22%, respectively, on the task of gender classification: determining whether a person in an image is “male” or “female.” Notably, current gender classification methods use only a “male” and “female” binary — non-binary genders are not represented in these systems. While we do not condone gender classification, it is important to recognize the role the Raji & Buolamwini work has played in highlighting the poor state of the art of current facial analysis technology, and the lack of insight companies have into this problem.

In response, Amazon Web Services’ (AWS) general manager of artificial intelligence, Matthew Wood, and vice president of global public policy, Michael Punke, attempted to refute the research, calling it “misleading” and saying it drew “false conclusions.” While others have refuted some of the points raised by Dr. Wood and Mr. Punke [5], we would like to discuss three points made by the authors that are particularly concerning.

1. Facial analysis vs. recognition — Facial analysis and recognition have an important relationship and the racial and gender bias in the Raji & Buolamwini audit is cause for concern.

Dr. Wood writes:

This statement is problematic on multiple fronts. First, despite Dr. Wood’s claim, “face recognition” and “facial analysis” are closely related, and it is common for machine learning researchers to categorize “recognition” as a type of analysis [6, 7]. For example, Escalera et al. [8] define all aspects of facial analysis from a computer vision perspective as:

Including, but not limited to: recognition, detection, alignment, reconstruction of faces, pose estimation of faces, gaze analysis, age, emotion, gender, and facial attributes estimation, and applications among others.

To help better contextualize the potential for confusion and the relationship between facial analysis, face recognition, and related technologies, it is useful to take a step back and understand the overall landscape of current research in human-centric computer vision tasks for images. We provide a basic (but not exhaustive) breakdown in the table below.

Human-Centric Computer Vision Landscape for Single Images

[Table: example human-centric vision tasks for single images.]

Beyond this common categorization in facial analysis research, well-known methods for face recognition are based on other aspects of facial analysis, and vice versa: Faces may be detected before they are recognized [9], and face recognition datasets may be used as pre-training for other facial analysis tasks [10, 11].

Out of all of the tasks categorized as face recognition or facial analysis, classifying faces into two binary gender options, as performed by Amazon’s and others’ gender classification APIs, is technically simplistic (without accounting for the social complexity). That is because it involves training a model to classify an image into one of only two categories: “male” or “female.” This task is therefore particularly well suited for analyzing and demonstrating some of the biases that exist in human-centric computer vision tools. Hence, recent works have used it to inform the public, governments, and companies about the risks of using such tools in high-stakes scenarios where there are currently no required standards, transparent governance, or federal regulation in place.

Second, even if a researcher were to frame facial analysis and face recognition as different tasks, Dr. Wood’s statement also incorrectly implies that biases in one task should not be cause for concern on potential biases in the other — especially when considering products used in law enforcement. Despite slightly different outputs, it is possible for the gender classification task Raji and Buolamwini examined in their study to use the same base models and identical training data [10, 12] as a face recognition task that could be used by law enforcement.
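To make this concrete, consider the following minimal sketch (our illustration, not Amazon’s or any vendor’s actual system) of how a two-class gender classifier can sit directly on top of the kind of embedding network used for face recognition. The ResNet-50 backbone, the 512-dimensional embedding, and the checkpoint path in the comments are assumptions made purely for the example.

```python
# Minimal sketch: a "facial analysis" task (binary gender classification)
# reusing a backbone of the kind trained for face recognition.
# Backbone choice, embedding size, and checkpoint path are hypothetical.
import torch
import torch.nn as nn
import torchvision.models as models

EMBEDDING_DIM = 512  # assumed size of the face-recognition embedding

backbone = models.resnet50()  # stand-in for a face-recognition network
backbone.fc = nn.Linear(backbone.fc.in_features, EMBEDDING_DIM)
# backbone.load_state_dict(torch.load("face_recognition_checkpoint.pt"))  # hypothetical weights

# The gender classifier is just a 2-way head over the same identity features.
gender_head = nn.Linear(EMBEDDING_DIM, 2)

def classify_gender(face_batch: torch.Tensor) -> torch.Tensor:
    """face_batch: (N, 3, 224, 224) cropped face images; returns 0/1 labels."""
    with torch.no_grad():
        embeddings = backbone(face_batch)   # features a recognition system would match on
        logits = gender_head(embeddings)    # the "analysis" task reuses those features
    return logits.argmax(dim=1)

preds = classify_gender(torch.randn(4, 3, 224, 224))
```

The point is not this particular architecture but the shared substrate: the same features that support identity matching can feed an attribute classifier, which is why bias surfaced in one task is informative about the other.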

Raji and Buolamwini also made no claim to “gauging” accuracy; instead, they sought to highlight the need for rigorous intersectional analysis of commercial face recognition systems. One of the key lessons from Buolamwini’s prior work is the importance of disaggregated evaluation: breaking down the evaluation of a model’s performance along different categories. As she shows, it is not enough to evaluate on single categories alone; looking at combinations, such as gender and skin type, exposes weaknesses in human-centric vision technology that have largely been overlooked.
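The sketch below shows what such a disaggregated evaluation looks like in practice. The tiny hand-constructed table and its column names are purely illustrative stand-ins for a properly balanced benchmark like the one assembled for Gender Shades.

```python
# Illustrative disaggregated (intersectional) evaluation: report error rates
# per subgroup rather than a single aggregate number. Data are made up.
import pandas as pd

results = pd.DataFrame({
    "true_gender":      ["female", "female", "male", "male", "female", "male"],
    "predicted_gender": ["male",   "female", "male", "male", "female", "male"],
    "skin_type":        ["darker", "lighter", "darker", "lighter", "darker", "lighter"],
})
results["error"] = results["true_gender"] != results["predicted_gender"]

# A single aggregate score hides subgroup failures...
print("overall error rate:", results["error"].mean())

# ...while grouping on intersecting attributes (gender x skin-type bucket)
# shows where the errors concentrate.
print(results.groupby(["true_gender", "skin_type"])["error"].mean())
```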

Due in large part to this influential work, many industries and disciplines have started to deeply question and examine errors that may fall disproportionately on some subpopulations. For example, Dr. Dhruv Khullar, a physician at New York-Presbyterian Hospital and an assistant professor in the Weill Cornell Department of Healthcare Policy and Research, wrote a New York Times Op-ed [13] discussing the ways in which AI could worsen disparities in healthcare, citing Gender Shades and asking “What happens when we rely on such algorithms to diagnose melanoma on light versus dark skin?”

So while Dr. Wood is correct that some people frame facial analysis and face recognition as different tasks, he misrepresents their close relationship and how demonstrated biases in one motivate concern about biases in another. Caution, concern, and rigorous evaluation — sensitive to the intersecting demographics that affect human-centric computer vision for images — are even more pressing when considering products that are used in scenarios that severely impact people’s lives, such as law enforcement.

2. Societal Context Matters — Raji and Buolamwini’s paper investigates Amazon’s products within the societal context in which they are used.

Dr. Wood writes:

The research paper in question does not use the recommended facial recognition capabilities, does not share the confidence levels used in their research, and we have not been able to reproduce the results of the study.

And

To understand, interpret, and compare the accuracy of machine learning systems, it’s important to understand what is being predicted, the confidence of the prediction, and how the prediction is to be used, which is impossible to glean from a single absolute number or score.

It is important to test systems like Amazon’s Rekognition in the real world, in the ways they are likely to be used. This includes “black box” scenarios, in which users do not interact with the inner details of the system, such as the model, training data, or settings. Products like Amazon’s Rekognition are embedded in sociotechnical systems. As described by Selbst et al. [14]:

The field of Science and Technology Studies (STS) describes systems that consist of a combination of technical and social components as “sociotechnical systems.” Both humans and machines are necessary in order to make any technology work as intended.

Such systems should be studied in the context of the society in which they are used, the knowledge and motivations of their operators, the standards and documentation available to those operators, and the types of mechanisms in place to prevent misuse. Currently, many of these pieces are either not in place or inadequate. According to Gizmodo [15], one of the few known Rekognition customers has reported inadequate training and does not utilize the 99% confidence score that Dr. Wood says Amazon recommends.

[W]hen asked by Gizmodo if they adhere to Amazon’s guidelines where this strict confidence threshold is concerned, the WCSO Public Information Officer (PIO) replied, “We do not set nor do we utilize a confidence threshold”….The PIO further informed Gizmodo that, while Amazon did supply documentation and other support on the software end, no direct training was given to the investigators who continue to use the suite.
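For concreteness, here is a sketch of what honoring that threshold would look like for a Rekognition customer calling the CompareFaces API through the standard boto3 SDK. The image files are placeholders, and, as the reporting above illustrates, nothing in the API forces an operator to set this parameter, document it, or be trained on what it means.

```python
# Sketch of a face-matching call that applies the 99% similarity threshold
# Amazon says it recommends for law enforcement use. File names are placeholders.
import boto3

rekognition = boto3.client("rekognition")

with open("probe_face.jpg", "rb") as probe, open("candidate_face.jpg", "rb") as candidate:
    response = rekognition.compare_faces(
        SourceImage={"Bytes": probe.read()},
        TargetImage={"Bytes": candidate.read()},
        SimilarityThreshold=99,  # nothing obliges a customer to set this
    )

# Only matches at or above the threshold come back in FaceMatches; an agency
# that never sets a strict threshold (as in the Gizmodo report) will also see
# lower-similarity matches.
for match in response["FaceMatches"]:
    print("similarity:", match["Similarity"])
```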

In the Perpetual Line-Up [16], Clare Garvie, Alvaro Bedoya, and Jonathan Frankle of the Center on Privacy & Technology at Georgetown Law, a think tank that studies law enforcement’s use of face recognition, discuss the real-world implications of misuse of face recognition tools: the wrong people may be put on trial due to cases of mistaken identity. They highlight that law enforcement operators often do not know the parameters of these tools, nor how to interpret some of their results. Decisions from such automated tools may also seem more correct than they actually are, a phenomenon known as “automation bias,” or may prematurely limit human-driven critical analyses. In his New York Times Op-Ed, Dr. Khullar writes:

In my practice, I’ve often seen how any tool can quickly become a crutch — an excuse to outsource decision making to someone or something else. Medical students struggling to interpret an EKG inevitably peek at the computer-generated output at the top of the sheet. I myself am often swayed by the report provided alongside a chest X-ray or CT scan. As automation becomes pervasive, will we catch that spell-check autocorrected “they’re” to “there” when we meant “their”?

We do not currently have a mechanism to catch the types of errors mentioned by Dr. Khullar.

Dr. Wood continues:

And, to date (over two years after releasing the service), we have had no reported law enforcement misuses of Amazon Rekognition.

There are currently no laws in place requiring audits of Rekognition’s use, and Amazon has not disclosed who its customers are or what its error rates are across different intersectional demographics. How, then, can we ensure that this tool is not being improperly used, as Dr. Wood asserts? What we can rely on are audits by independent researchers, such as Raji and Buolamwini, with concrete numbers and clearly designed, explained, and presented experiments that demonstrate the types of biases that exist in these products. This critical work rightly raises the alarm about using such immature technologies in high-stakes scenarios without public debate and legislation in place to ensure that civil rights are not infringed.

3. Companies such as IBM and Microsoft have reproduced comparable data and results using the guidelines written in the Gender Shades work, which reiterates that ethnicity cannot be used as a proxy for skin type when performing disaggregated testing.

These groups have refused to make their training data and testing parameters publicly available.

The Gender Shades project website makes the data publicly available with certain licensing agreements, and those who could not agree to the terms (researchers in companies such as IBM and Microsoft) have instead reproduced comparable data and results using the guidelines written in the paper.

Dr. Wood writes that Amazon did not see any disparities by ethnicity on the task of gender classification when performing a test on a dataset of 12,000 images. As also discussed in Gender Shades, ethnicity and race are social constructs that are unstable across time and space. This is why the authors instead partnered with a dermatologist to label images using the Fitzpatrick skin-type classification system. Dr. Wood’s failure to recognize this important distinction while discrediting peer-reviewed research that undeniably motivates caution is disconcerting.

We call on Amazon to stop selling Rekognition to law enforcement as legislation and safeguards to prevent misuse are not in place.

¹Researchers such as Os Keyes and Morgan Klaus Scheuerman have written about the potential harms of automatic gender recognition for transgender communities, given that the harms of systems which use automated gender recognition can range from the humiliating (serving an ad in public that implicitly misgenders the individual) to the violent (rejecting identification or entry to a public bathroom).

Concerned Researchers

  1. Noura Al Moubayed, Durham University
  2. Miguel Alonso Jr, Florida International University
  3. Anima Anandkumar, Caltech (formerly Principal Scientist at AWS)
  4. Akilesh Badrinaaraayanan, MILA/University of Montreal
  5. Esube Bekele, National Research Council fellow
  6. Yoshua Bengio, MILA/University of Montreal
  7. Alex Berg, UNC Chapel Hill
  8. Miles Brundage, OpenAI; Oxford; Axon AI Ethics Board
  9. Dan Calacci, Massachusetts Institute of Technology
  10. Pablo Samuel Castro, Google
  11. Stayce Cavanaugh, Google
  12. Abir Das, IIT Kharagpur
  13. Hal Daumé III, Microsoft Research and University of Maryland
  14. Maria De-Arteaga, Carnegie Mellon University
  15. Mostafa Dehghani, University of Amsterdam
  16. Emily Denton, Google
  17. Lucio Dery, Facebook AI Research
  18. Priya Donti, Carnegie Mellon University
  19. Hamid Eghbal-zadeh, Johannes Kepler University Linz
  20. El Mahdi El Mhamdi, Ecole Polytechnique Fédérale de Lausanne
  21. Paul Feigelfeld, IFK Vienna, Strelka Institute
  22. Jessica Finocchiaro, University of Colorado Boulder
  23. Andrea Frome, Google
  24. Field Garthwaite, IRIS.TV
  25. Timnit Gebru, Google
  26. Sebastian Gehrmann, Harvard University
  27. Oguzhan Gencoglu, Top Data Science
  28. Marzyeh Ghassemi, University of Toronto, Vector Institute
  29. Georgia Gkioxari, Facebook AI Research
  30. Alvin Grissom II, Ursinus College
  31. Sergio Guadarrama, Google
  32. Alex Hanna, Google
  33. Bernease Herman, University of Washington
  34. William Isaac, Deep Mind
  35. Phillip Isola, Massachusetts Institute of Technology
  36. Alexia Jolicoeur-Martineau, MILA/University of Montreal
  37. Yannis Kalantidis, Facebook AI
  38. Khimya Khetarpal, MILA/McGill University
  39. Michael Kim, Stanford University
  40. Morgan Klaus Scheuerman, University of Colorado Boulder
  41. Hugo Larochelle, Google/MILA
  42. Erik Learned-Miller, UMass Amherst
  43. Xing Han Lu, McGill University
  44. Kristian Lum, Human Rights Data Analysis Group
  45. Michael Madaio, Carnegie Mellon University
  46. Tegan Maharaj, Mila/École Polytechnique
  47. João Martins, Carnegie Mellon University
  48. Vincent Michalski, MILA/University of Montreal
  49. Margaret Mitchell, Google
  50. Melanie Mitchell, Portland State University and Santa Fe Institute
  51. Ioannis Mitliagkas, MILA/University of Montreal
  52. Bhaskar Mitra, Microsoft and University College London
  53. Jamie Morgenstern, Georgia Institute of Technology
  54. Bikalpa Neupane, Pennsylvania State University, UP
  55. Ifeoma Nwogu, Rochester Institute of Technology
  56. Vicente Ordonez-Roman, University of Virginia
  57. Pedro O. Pinheiro
  58. Vinodkumar Prabhakaran, Google
  59. Parisa Rashidi, University of Florida
  60. Anna Rohrbach, UC Berkeley
  61. Daniel Roy, University of Toronto
  62. Negar Rostamzadeh
  63. Kate Saenko, Boston University
  64. Niloufar Salehi, UC Berkeley
  65. Anirban Santara, IIT Kharagpur (Google PhD Fellow)
  66. Brigit Schroeder, Intel AI Lab
  67. Laura Sevilla-Lara, University of Edinburgh
  68. Shagun Sodhani, MILA/University of Montreal
  69. Biplav Srivastava
  70. Luke Stark, Microsoft Research Montreal
  71. Rachel Thomas, fast.ai; University of San Francisco
  72. Briana Vecchione, Cornell University
  73. Toby Walsh, UNSW Sydney
  74. Serena Yeung, Harvard University
  75. Yassine Yousfi, Binghamton University
  76. Richard Zemel, Vector & University of Toronto


References

[2] https://aws.amazon.com/blogs/machine-learning/some-thoughts-on-facial-recognition-legislation/

[3] http://www.aies-conference.com/wp-content/uploads/2019/01/AIES-19_paper_223.pdf

[4] http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf

[5] https://civic.mit.edu/2019/02/07/the-narrative-of-public-criticism/

[6] https://arxiv.org/pdf/1611.00851.pdf

[7] https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Zhu_Face_Alignment_Across_CVPR_2016_paper.pdf

[8] https://hal.inria.fr/hal-01991654/document

[9] https://www.robots.ox.ac.uk/~vgg/publications/2015/Parkhi15/parkhi15.pdf

[10] https://arxiv.org/pdf/1607.06997.pdf

[11] https://talhassner.github.io/home/projects/cnn_emotions/LeviHassner_ICMI15.pdf

[12] https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Masi_Pose-Aware_Face_Recognition_CVPR_2016_paper.pdf

[13] https://www.nytimes.com/2019/01/31/opinion/ai-bias-healthcare.html

[14] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3265913

[15] https://gizmodo.com/defense-of-amazons-face-recognition-tool-undermined-by-1832238149

[16] https://www.perpetuallineup.org/
