Response: Racial and Gender bias in Amazon Rekognition — Commercial AI System for Analyzing Faces.
Update: “Twenty-six researchers, including Yoshua Bengio, a recent winner of the Turing Award, the industry’s highest honor, called for Amazon to stop selling its Rekognition AI service to police departments in a post on Wednesday. Bengio was joined by Anima Anandkumar, a former principal scientist at Amazon’s cloud division, and staffers from Google, Microsoft Corp., Facebook Inc. and several universities.
The group defended the work of two other AI researchers who found Amazon’s software had much higher error rates when predicting the gender of darker-skinned women in images, compared with lighter-skinned men. Amazon had argued against the results and methodology of that study, authored by the University of Toronto’s Inioluwa Deborah Raji and Joy Buolamwini, a researcher at Massachusetts Institute of Technology.” — Bloomberg
In our recent study of bias in commercial facial analysis systems, Deborah Raji and I show that Amazon Rekognition, an AI service the company sells to law enforcement, exhibits gender and racial bias for gender classification. The New York Times broke the story. Unlike its peers, Amazon did not submit its AI systems to the National Institute of Standards and Technology (NIST) for the latest rounds of facial recognition evaluations. Its claims of being bias free are based on internal evaluations. This is why we did an external evaluation to provide an outside perspective. Despite receiving preliminary reports of gender and racial bias in a June 25, 2018 letter (we did more comprehensive testing in August 2018 after preliminary indication of bias), Amazon’s approach thus far has been one of denial, deflection, and delay. We cannot rely on Amazon to police itself or provide unregulated and unproven technology to police or government agencies.
In this article, I address points raised by Matt Wood, general manager of artificial intelligence at Amazon Web Services, about our study. I also address other criticisms of the study that have been made by those with interest in keeping the use, abuse, and technical immaturity of AI systems in the dark.
1. Regardless of accuracy, AI tools like Amazon Rekognition used to analyze human faces can be abused.
Before diving into the technical points raised by Matt Wood on behalf of Amazon, we have to keep in mind that the AI services the company provides to law enforcement and other customers can be abused regardless of accuracy.
Among the most concerning uses of facial analysis technology are the bolstering of mass surveillance, the weaponization of AI, and harmful discrimination in law enforcement contexts. The technology needs oversight and regulation. Because this powerful technology is being rapidly developed and adopted without oversight, the Algorithmic Justice League and the Center on Privacy & Technology launched the Safe Face Pledge. The pledge prohibits lethal use of any kind of facial analysis technology, including facial recognition, and aims to mitigate abuses. I urge Amazon and all peers to sign on as three companies have already done.
As I write in an earlier post about Amazon’s FML — Fail Machine Learning:
“Both accurate and inaccurate use of facial analysis technology to identify a specific individual (facial recognition) or assess an attribute about a person (gender classification or ethnic classification) can lead to violations of civil liberties.
Inaccuracies in facial recognition technology can result in an innocent person being misidentified as a criminal and subjected to unwarranted police scrutiny. This is not a hypothetical situation. Big Brother Watch UK released the Face-Off report highlighting false positive match rates of over 90% for facial recognition technology deployed by the Metropolitan police. According to the same report, two innocent women were matched with men in Scotland Yard. During the summer [of 2018], UK Press shared the story of a young black man misidentified by facial recognition technology and humiliated in public. The organization is now pursuing legal action against the lawless use of facial recognition in the UK.
Even if these tools reach some accuracy thresholds, they can still be abused and enlisted to create a camera-ready surveillance state. Facial analysis technology can be developed to not only recognize an individual’s unique biometric signature but can also learn soft biometrics like age and gender. Facial analysis technology that can somewhat accurately determine demographic or phenotypic attributes can be used to profile individuals, leaving certain groups more vulnerable for unjustified stops.” — Amazon’s Symptoms of Failed Machine learning.
2. Facial Analysis, Facial Detection, Facial Recognition, Gender Classification, Attribute Classification… what do these terms mean and how are they related?
Matt Wood states the results in our Actionable Auditing study “are based on facial analysis and not facial recognition (analysis can spot faces in videos or images and assign generic attributes such as wearing glasses; recognition is a different technique by which an individual face is matched to faces in videos and images).”
To clarify, let’s first define facial analysis technology. I define facial analysis technology as any kind of technology that analyzes human faces. Automated facial recognition is a type of facial analysis technology. There are different ways to analyze faces which leads to a number of definitions for the various types of analysis.
Facial analysis technology attempts to address some combination of the following questions:
Facial Detection — Is there a face?
Before AI tools are used to determine the identity of a face or the expression of an individual in an image or video, a face must first be detected. Without this basic step you cannot proceed any further. Facial detection systems have repeatedly been shown to fail on the faces of dark-skinned individuals. In a TED featured talk I gave, I share my own experience of coding in a white mask just to have my face detected. You can read more about why facial detection systems have high failure rates on people of color in this post: Algorithms aren’t racist, your skin is just too dark.
The failure to even detect faces of color in the first place has been a major problem for studies around facial analysis technology, because often these studies are based on results on faces that were detected.
Facial Attribute Classification — What kind of face is this?
Assuming a face is detected in the first place, AI tools can then be applied to guess emotional attributes based on the expression of a face or demographic attributes about a face like gender, ethnicity, or age.
Affect recognition is one way of describing facial attribute classification that deals with facial expressions and inferred emotions. The 2018 AI Now Annual Report provides an excellent analysis of the dangers of inferring internal states from external expressions.
Controversial research studies have claimed to assess attributes like sexuality or criminality from just an image of a person’s face. Just because a company makes a claim about being able to determine attributes about an individual based on her face doesn’t make it true, appropriate, or even scientifically valid.
Our studies on facial analysis technology sold by companies like Amazon have focused on binary gender classification to provide just one example of how facial analysis technology can be biased. The main message is to check all systems that analyze human faces for any kind of bias. If you sell one system that has been shown to have bias on human faces, it is doubtful your other face-based products are also completely bias free (more on this later). Again, like its peers, Amazon should submit its models, both the new ones and the legacy ones still being used by customers, to the national benchmarks for evaluating the accuracy of systems that analyze human faces.
Facial Recognition — What is the identity of a face/Has the AI system analyzed this face before?
When we talk about facial recognition, we are not concerned with demographic attributes or expression, but the identity of an individual. Facial recognition comes in two flavors. Some facial recognition is used to perform tasks like unlocking a phone or getting access to a bank account. This is known as facial verification. Does the face being analyzed match the face expected? Facial identification, the other flavor of facial recognition, involves trying to match a face to a person of interest in an existing database of faces. This is the kind of technology that can be used to try and identify a missing person or criminal. Generally when we hear about police using facial recognition technology, the technology is performing the task of facial identification. Unchecked, widely deployed facial identification technology can lead to mass surveillance and unprecedented risks to civil liberties.
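To make the two flavors concrete, here is a minimal sketch of verification (one-to-one) versus identification (one-to-many). It assumes each face has already been turned into a numeric vector (an "embedding") and that similarity is measured with cosine similarity. The function names, the 0.9 threshold, and the tiny three-number vectors are all hypothetical and chosen for illustration; this is not how Rekognition or any particular vendor works internally.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two face vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(probe, enrolled, threshold=0.9):
    """Verification (1:1): does the probe match the one expected face?"""
    return cosine_similarity(probe, enrolled) >= threshold

def identify(probe, gallery, threshold=0.9):
    """Identification (1:N): best match in a database, or None."""
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine_similarity(probe, embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Hypothetical toy vectors; real embeddings have many more dimensions.
gallery = {
    "person_a": [1.0, 0.0, 0.0],
    "person_b": [0.0, 1.0, 0.0],
}
probe = [0.9, 0.1, 0.0]

print(verify(probe, gallery["person_a"]))  # unlock-a-phone style check
print(identify(probe, gallery))            # search-a-database style check
```

In this sketch the only difference between the two tasks is how many faces the probe is compared against. That difference is what makes identification against large databases the higher-stakes setting.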
The terminology used in the field is not always consistent, and you might see terms like “face recognition” or “facial recognition” being used interchangeably. Oftentimes companies like Amazon provide AI services that analyze faces in a number of ways, offering features like labeling the gender or providing identification services. All of these systems, regardless of what you call them, need to be continuously checked for harmful bias.
3. Accuracy is always relative. Existing measures for accuracy can give us a false sense of progress in the face space because many key benchmark datasets are biased.
As I have conducted studies on commercial AI systems, I have found time and time again that the internal accuracy rates reported by companies, when they are reported at all, seem to be at odds with the external accuracy rates reported by independent third parties. What accounts for these differences?
We have to keep in mind that when we talk about accuracy for a facial analysis technology, accuracy is determined by using a set of face images or videos collected for testing these systems. These images form a benchmark. And not all benchmarks are created equal.
I discovered that many benchmarks used to test these systems are composed mainly of male and lighter-skinned faces. With technologies like machine learning, which learn patterns of things like faces from data, data is destiny, and we are destined to fail the rest of the world if we rely on pale male datasets to benchmark our AI systems.
Amazon’s Matt Wood states the company used a large benchmark of over 1 million faces to test its facial recognition capabilities and that it performed well. While its performance on the benchmark might seem laudable, we do not know the detailed demographic or phenotypic (skin type) composition of this benchmark. Without this information we cannot assess for racial, gender, color, or other kinds of bias.
We learned this lesson from Facebook not too long ago. Back in 2014, computer vision researchers gained a false sense of universal progress with facial recognition technology when Facebook announced 97% accuracy on the gold standard benchmark of the day called Labeled Faces in the Wild. When researchers looked at this gold standard dataset, they found that the dataset was about 77% male and contained over 80% white individuals.
Because of the proliferation of pale male datasets, I decided to develop the Pilot Parliaments Benchmark to create a better balanced dataset.
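Checking the makeup of a benchmark is not complicated once each image carries demographic or phenotypic labels. The sketch below is a hypothetical illustration, not code from any study: the `composition` function, the field names, and the ten toy entries are all made up to show the kind of breakdown a vendor could publish alongside an accuracy claim.

```python
from collections import Counter

def composition(benchmark, attribute):
    """Share of the benchmark taken up by each value of one attribute."""
    counts = Counter(item[attribute] for item in benchmark)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical toy benchmark of 10 labeled faces (illustrative only).
toy_benchmark = (
    [{"gender": "male", "skin_type": "lighter"}] * 6
    + [{"gender": "female", "skin_type": "lighter"}] * 2
    + [{"gender": "male", "skin_type": "darker"}] * 1
    + [{"gender": "female", "skin_type": "darker"}] * 1
)

print(composition(toy_benchmark, "gender"))
print(composition(toy_benchmark, "skin_type"))
```

A benchmark skewed like this toy one can report a high overall score while saying very little about performance on the faces it barely contains.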
Inclusive Benchmarks Matter
To develop the benchmark, I got a list of the top 10 countries in the world by their representation of women in parliament. Rwanda leads the world with over 60%. Nordic countries and a handful of other African countries are also well represented. I decided to choose 3 African countries and 3 European countries to help balance on skin type by having faces with some of the lightest individuals and also very dark-skinned members. It is with this better balanced dataset that I evaluated Amazon’s Rekognition tool along with its competitors.
The first study to use this benchmark was my MIT master’s thesis, Gender Shades. I looked at how gender classification systems from leading tech companies performed across a range of skin types and genders. All systems performed better on male faces than female faces overall, and all systems performed better on lighter-skinned faces than darker-skinned faces overall. Error rates were as high as 35% for darker-skinned women, 12% for darker-skinned men, 7% for lighter-skinned women, and no more than 1% for lighter-skinned men.
Our new study uses this same benchmark and applies it to AI services from both Amazon and Kairos while reevaluating Microsoft, IBM, and Face++. We found Amazon and Kairos to have flawless performance on white men in our August 2018 study. In the same study, Amazon had a low accuracy rate of 68.6% on women of color. Keep in mind that our benchmark is not very challenging. We have profile images of people looking straight into a camera. Real-world conditions are much harder. Doing poorly on any aspect of the Pilot Parliaments Benchmark raises red flags. Doing well on the benchmark is akin to not tripping at the beginning of the race.
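The idea of reporting accuracy by subgroup rather than as one overall number can be sketched in a few lines. This is a simplified illustration under assumed field names (`skin_type`, `gender`, `predicted`), not the evaluation code from either study, and the eight toy records are invented purely to show how a respectable overall score can hide a large gap.

```python
from collections import defaultdict

def overall_accuracy(records):
    """Single headline number: fraction of all faces labeled correctly."""
    hits = sum(1 for r in records if r["predicted"] == r["gender"])
    return hits / len(records)

def accuracy_by_subgroup(records):
    """Accuracy for each (skin type, gender) intersection."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = (r["skin_type"], r["gender"])
        totals[group] += 1
        if r["predicted"] == r["gender"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical toy results (not data from the Pilot Parliaments Benchmark).
records = (
    [{"skin_type": "lighter", "gender": "male", "predicted": "male"}] * 4
    + [{"skin_type": "darker", "gender": "female", "predicted": "female"}] * 2
    + [{"skin_type": "darker", "gender": "female", "predicted": "male"}] * 2
)

print(overall_accuracy(records))
print(accuracy_by_subgroup(records))
```

In the toy data the overall accuracy is 75%, yet one subgroup is at 100% and the other at 50%. Only the disaggregated view shows who bears the errors.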
Who chooses the settings?
But even on the same benchmark, accuracy numbers for facial analysis systems can differ. As we know, AI systems are not perfect. When analyzing faces there is a certain level of uncertainty about the predictions that are made. To account for this uncertainty, confidence scores are sometimes used. For example, in the image below, we see that Amazon provides a gender label of male for this image of Oprah Winfrey. In addition to the label, a confidence score is given: 76.5%.
Providing the confidence score is a helpful feature because it gives the user some information about the reliability of the prediction. But do keep in mind high confidence scores can also be wrong.
For our study we decided to use the predicted label for gender that each company provided regardless of the confidence.
This follows the out-of-the-box testing approach advanced by the National Institute of Standards and Technology (NIST) for gender classification. Their work shows those who procure facial analysis systems often just use the default settings. To mirror real-world use where customers are not likely to be machine learning experts, we evaluated each company by the male or female label provided for an image.
This decision makes a difference in the accuracy numbers reported. For example when IBM replicated our study with a similar dataset, they reported a significant improvement on their worst performing subgroup, darker-skinned women. They went from 65.3% accuracy to 96.6% using a .99 threshold. This means a label was only considered female if the confidence score was 99% or more. For our study instead of setting such a high threshold, we took the approach of using the gender label provided by the service. Our approach showed 83.5% accuracy. The discrepancies in thresholds and reporting methodology are exactly why we need stricter standards, detailed user guidance, and external evaluation that is based on real-world use.
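To see how the scoring rule alone can move an accuracy number, consider this minimal sketch. It follows the description above literally: under the threshold rule, a face counts as female only if the service says female with confidence at or above the threshold, and otherwise counts as male. That reading of the threshold rule is my assumption for illustration, and the four predictions are hypothetical, not outputs from IBM, Amazon, or any other service.

```python
def label_as_returned(prediction):
    """Out-of-the-box rule: use whatever label the service gave."""
    return prediction["label"]

def label_with_threshold(prediction, threshold=0.99):
    """Threshold rule (assumed): female only at or above the threshold."""
    if prediction["label"] == "female" and prediction["confidence"] >= threshold:
        return "female"
    return "male"

def accuracy(predictions, labeler):
    """Fraction of predictions whose final label matches the truth."""
    hits = sum(1 for p in predictions if labeler(p) == p["truth"])
    return hits / len(predictions)

# Hypothetical service outputs for four faces (illustrative only).
predictions = [
    {"truth": "female", "label": "female", "confidence": 0.995},
    {"truth": "female", "label": "male", "confidence": 0.60},
    {"truth": "male", "label": "female", "confidence": 0.765},
    {"truth": "male", "label": "male", "confidence": 0.98},
]

print(accuracy(predictions, label_as_returned))
print(accuracy(predictions, label_with_threshold))
```

The same four outputs score 50% under one rule and 75% under the other. Nothing about the underlying system changed, only the way its answers were counted, which is why the reporting methodology needs to be stated alongside any accuracy claim.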
Anytime a company claims to have completely accurate systems, or claims it is unable to replicate a result, but does not provide the detailed demographic breakdown of its benchmarks or its evaluation methods, be skeptical. The methodology for the Gender Shades evaluation approach used in both studies is available in my peer-reviewed academic paper.
Update* It has been reported that one of Amazon’s police clients reported not setting thresholds at all while using Rekognition.
“One Systems Project Analyst working in nearby Clackamas County — whose correspondence was made public via the ACLU-obtained FOIA documents — wrote that Rekognition’s “documentation is very lacking or wrong.”
“I think this demonstrates the ridiculousness of Amazon’s position,” Matt Cagle, a technology and civil liberties attorney with the ACLU of Northern California, told Gizmodo, “on the one hand they talk about guidelines, but then the company’s own customers may not actually be following them, and that is unacceptable when we’re talking about a technology as dangerous as face surveillance that is being used by law enforcement agencies. Amazon should act.” -Read Full Article
Furthermore, Amazon continues to be inconsistent about the recommended threshold for policing contexts. When the ACLU showed Rekognition matched 28 members of Congress to images in mugshot databases, the company said it recommends using a .99 threshold. In other corporate communications, and now as its client undermines Amazon’s claims of providing guidance, the company states a .95 threshold. The inconsistency reveals that the selection of threshold is context specific and supports the National Institute of Standards and Technology’s assertion that clients do not change defaults.
4. Even when companies announce technical improvements, older versions of their AI services may still be in use. Like a car recall, the new models may have new problems and the older models persist.
Amazon states that they have made a new version of their Rekognition system available to customers since our August 2018 audit. This does not mean all customers are using the new system. Legacy use often occurs particularly when adopting a new system can mean having to invest resources into making updates with existing processes. At the time of writing this, there is also no indication that the new version has been externally evaluated in real-world cases with the results submitted for public scrutiny.
My question to Amazon and any company that releases a new system is: what is the adoption rate, and how many customers are still using previous versions? Any publicly funded agencies using or thinking about using Amazon services need to check the different versions and demand external testing of those versions for bias. Given everything that is known about the risk of facial analysis technology, I support moratoriums to halt the police use of these technologies.
5. What does attribute classification or gender classification have to do with facial recognition in law enforcement?
Some critics have said police would not likely find attribute or gender classification relevant in their work. However, policing often relies on looking for a profile of a person of interest based on demographic attributes like gender, race, and age as well as physical characteristics like facial hair. Different facial analysis services provide features for gender and age classification as well as things like facial hair. As I wrote earlier:
An Intercept investigation reported that IBM used secret surveillance footage from NYPD and equipped the law enforcement agency with tools to search for people in video by hair color, skin tone, and facial hair. Such capabilities raise concerns about the automation of racial profiling by police. Calls to halt or regulate facial recognition technology need to contend with a broader set of facial analysis technology capabilities that go beyond identifying unique individuals. — Amazon’s Failed Machine Learning
Even when facial recognition technology is being used, it can rely on gender classification and other types of attribute classifications. To try to see if a face in an image or on a video is a person of interest, the detected face may need to be compared to faces in databases containing more than 100 million faces. In the US alone, Georgetown released a study showing 1 in 2 adults, more than 117 million adults, have their faces in facial recognition databases that can be searched unwarranted using AI systems that haven’t been audited for accuracy.
To reduce the time it takes to search for the face, gender classification can be used. So now instead of looking at every face, if the gender of the detected face is known, you can reduce the number of potential matches that need to be processed. More details about the ways in which gender classification can be used in policing are available from the National Institute of Standards and Technology (NIST).
Furthermore, as I wrote to Jeff Bezos, any bias in Amazon tools that analyze faces should compel thorough external vetting of these systems.
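Here is a minimal sketch of that narrowing step, assuming each database entry carries a stored gender label. The `candidates_for` function, the field names, and the four-entry database are hypothetical, and a real system would be far more involved.

```python
def candidates_for(predicted_gender, database):
    """Keep only entries whose stored gender matches the predicted one."""
    return [entry for entry in database if entry["gender"] == predicted_gender]

# Hypothetical toy database (illustrative only).
database = [
    {"id": "a", "gender": "female"},
    {"id": "b", "gender": "male"},
    {"id": "c", "gender": "female"},
    {"id": "d", "gender": "male"},
]

# Suppose the person in the probe image is entry "a".
correct_filter = candidates_for("female", database)
wrong_filter = candidates_for("male", database)

print([entry["id"] for entry in correct_filter])
print([entry["id"] for entry in wrong_filter])
```

The filter halves the search in this toy case. It also shows why the accuracy of the gender label matters for identification: if the classifier mislabels the probe, entry "a" is never among the candidates, so any match that comes back is necessarily the wrong person.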
To my knowledge Amazon is providing facial identification services for policing which is not the same as gender classification. In the case of gender classification which has been essentially reduced to a binary, the technology has a 1 in 2 chance of getting the answer right simply by guessing. With facial identification, the chance of guessing the correct face by chance is based on the number of potential face matches stored. So for example if there are 50,000 faces to be matched against and a person of interest is identified, the chance of randomly guessing the right individual is 1 in 50,000. And guessing the wrong individual subjects innocents to undue scrutiny as has been reported on by Big Brother Watch.
Given what we know about the biased history and present of policing, the concerning performance metrics of facial analysis technology in real-world pilots, and Rekognition’s gender and skin-type accuracy differences on the easy Pilot Parliaments Benchmark, I join the chorus of dissent in calling Amazon to stop equipping law enforcement with facial analysis technology. — June 25, 2018 Algorithmic Justice League letter to Jeff Bezos
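The guessing comparison in the letter comes down to simple arithmetic, sketched here with a hypothetical helper function. It assumes every option is equally likely and that exactly one option is correct.

```python
def chance_of_correct_guess(num_options):
    """Probability of a uniformly random guess being right."""
    if num_options < 1:
        raise ValueError("need at least one option")
    return 1 / num_options

binary_gender = chance_of_correct_guess(2)        # male or female label
identification = chance_of_correct_guess(50_000)  # database of 50,000 faces

print(binary_gender)
print(identification)
print(binary_gender / identification)  # how much easier the binary task is by chance
```

By chance alone, the binary gender task is 25,000 times easier than picking the right face out of 50,000, which is part of why poor results on the easier task are a warning sign for the harder one.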
I was quite surprised to see that, after a year of scrutiny of these systems, Amazon did not submit to the recent NIST benchmarks and had such high accuracy disparities on our modest test. Other vendors of facial analysis technology took the Gender Shades study seriously and have made substantial changes.
I urge Amazon to consider signing the Safe Face Pledge to show a good faith commitment to the ethical and responsible development of facial analysis technology. In light of this research, it is irresponsible for the company to continue selling this technology to law enforcement or government agencies. As an expert on bias in facial analysis technology, I advise Amazon to
1) immediately halt the use of facial recognition and any other kinds of facial analysis technology in high-stakes contexts like policing and government surveillance
2) submit company models currently in use by customers to the National Institute of Standards and Technology benchmarks
Joy Buolamwini is the founder of the Algorithmic Justice League which uses art and research to illuminate the social implications of artificial intelligence. Her MIT Thesis Gender Shades uncovered the largest gender and phenotypic disparities in commercially sold AI products. She is a Rhodes Scholar, Fulbright Fellow, and a Tech Review 35 under 35 honoree who holds three academic degrees.