Quantifying the Inherent Bias in Mainstream Facial Recognition Technology

Published in XIX.ai, Inc. on Apr 24, 2020

The most widely used FR training set is 50+ percent white and 70+ percent male: no wonder FR solutions fail accuracy tests for women and people of color.

Facial recognition (FR) technology has its friends and foes. One common complaint is that the technology is inherently biased — and a new analysis of the industry’s most commonly used open-source databases of facial images finds that those critics are correct.

Several prior inquiries into the problems of FR technology, conducted by the ACLU and MIT Media Lab, exposed some of these biases, most strikingly poor accuracy on women and people of color. Their analyses were performed on commercial FR products, many of which are still semi-experimental, as tech companies want to test the markets with early implementations without making massive capital investments. Most of these commercial products are built on the same core set of open-source datasets, a common practice for prototypes because building proprietary FR datasets is time-consuming and prohibitively expensive.

The problem is, any systematic bias inherent in those open-source datasets permeates the commercial products.

To explore the root cause of bias in FR, we went right to the source: the most common open-source databases used to train FR systems.

XIX set out to examine top open-source datasets for gender, racial, and age biases. Any of these biases, if found, would reveal potential problems with any commercial systems built on those datasets, and provide reliable insights for addressing these problems in the future.

Among open-source datasets, the largest by far is MS-Celeb-1M, originally published by Microsoft Research in 2016 and since altered by a joint research team from Imperial College London and FR startup InsightFace, who removed some duplicated entities and non-faces from the original MS-Celeb to make the data more consistent. This altered version (referred to below as MS1M) consists of 85,742 identities: larger than any other open-source dataset, yet really too small to be the basis of a commercial product. It was used in most of the FR research projects published in 2019 and was instrumental in achieving current state-of-the-art results. Currently, for any company working on FR, this dataset would be the go-to solution in the absence of a proprietary alternative (proprietary datasets are uncommon because of the huge capital investment required).

Our hypotheses going in

In order to find the root causes of the poor performance of current FR systems in production, XIX formed several hypotheses about major imbalances in the open-source datasets:

Significantly more males than females (for a dataset of this size, a difference of 10% would have noticeable effects), which would explain why commercial systems often fail to accurately recognize women.
Significantly more identities in the white ethnic group, with strong underrepresentation of black and Mideastern people, and of black women in particular, which would explain the especially bad accuracy reported by the ACLU and the MIT Media Lab.
A different distribution of perceived age for males and females: the distribution for females would have a lower median and lower variation, meaning that the females represented in the dataset would be predominantly in their 20s.

Our approach to analyzing the data

The core of XIX’s technology consists of a scalable face search engine connected to a state-of-the-art neural network trained to recognize faces. Training such networks requires hundreds of millions of images that capture all variations in people’s appearances, face angles, lighting and many other factors. Of course, it would be nearly impossible to manually count, for instance, how many female images there are in an enormous dataset. Instead, we use a host of specialized neural networks to analyze the major descriptive attributes of individual faces (gender, age, and ethnic group being the most basic). These networks were trained separately from the FR net to act as the developers’ extra eyes: for every face in the system, they predict and store these attributes for reference, to alert us to imbalances in the training data and prevent these biases from affecting the entire system.

We simply loaded the main academic datasets into our face dataset storage system. The aggregated statistics collected in the process are presented here as the most comprehensive and accessible description of the dataset’s contents.
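The aggregation step described above can be sketched roughly as follows. This is a minimal illustration and not XIX’s actual pipeline: the FaceAttributes record, the share_by helper and the sample values are all hypothetical stand-ins for whatever the attribute networks store for each face.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceAttributes:
    """Hypothetical per-face record of predicted attributes."""
    gender: str
    ethnicity: str
    age: int


def share_by(records, key):
    """Return the percentage of records that fall into each value of `key`."""
    counts = Counter(getattr(r, key) for r in records)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}


# Made-up sample for illustration only (not data from MS1M).
sample = [
    FaceAttributes("male", "white", 41),
    FaceAttributes("male", "white", 35),
    FaceAttributes("male", "asian", 29),
    FaceAttributes("female", "white", 27),
]

print(share_by(sample, "gender"))
print(share_by(sample, "ethnicity"))
```

Grouping on a single attribute keeps the sketch short; grouping on a (gender, ethnicity) pair would give the combined breakdowns reported in the sections below.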

The imbalances were far worse than we expected.

Gender imbalance

Empirical gender distribution in MS1M dataset

In total, 70.7 percent of images were identified as male and 29.3 percent as female, a ratio of roughly 2.4 to 1. Just for reference: for a task as sensitive to data imbalance as face recognition, an acceptable difference between genders would be no more than 10 percent. This result is consistent with our first hypothesis.
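The arithmetic behind these figures can be checked with a few lines of Python. The gender_gap helper is a hypothetical name introduced only for this illustration, and reading the 10 percent threshold as percentage points is an assumption, since the text does not say whether it is absolute or relative.

```python
def gender_gap(male_pct, female_pct):
    """Return (males per female, gap in percentage points)."""
    ratio = male_pct / female_pct
    gap_points = male_pct - female_pct
    return ratio, gap_points


# Shares reported above for the MS1M dataset.
ratio, gap_points = gender_gap(70.7, 29.3)

print(f"male-to-female ratio: {ratio:.2f}")
print(f"gap: {gap_points:.1f} percentage points")

# Assumption: the 10 percent threshold is read as percentage points.
print("within threshold:", gap_points <= 10.0)
```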

Of all the imbalances found, this one is the most worrying, as it places the strongest limits on overall quality.

Ethnic imbalance

Empirical ethnic distribution in MS1M dataset

Regardless of gender, 51.24 percent of images were classified as white, with 35.46 percent being white males, and 15.78 percent white females. This is the first striking imbalance that would render most applications impractical. 19.70 percent of total identities were labeled as Hispanic, which is closer to an acceptable range.

Asians constituted 11.21 percent of the dataset, with the healthiest male-to-female ratio of 1.33.

Indian people constituted 5.15 percent of the database with a male-to-female ratio of 1.65 — far too high.

For black people overall, the share was 9.91 percent, with 8.18 percent black males and 1.73 percent black females. For Mideastern people, the overall share was 2.79 percent, made up of 2.36 percent males and 0.43 percent females. These two groups also showed the greatest internal gender imbalances: 4.48x more Mideastern men than Mideastern women and 3.72x more black men than black women. That renders all the image data on black and Mideastern people essentially useless as a training set for commercial applications in diverse environments.
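The per-group figures can be reproduced from the percentages quoted above. A short sketch follows; the imbalance helper is a hypothetical name, and only groups whose male and female shares are both stated in the text are included. It separates two measures that are easy to confuse: the male-to-female ratio, and how many times more men than women there are, which is the ratio minus one.

```python
def imbalance(male_pct, female_pct):
    """Return (men per woman, how many times more men than women)."""
    ratio = male_pct / female_pct
    times_more = (male_pct - female_pct) / female_pct
    return ratio, times_more


# (male %, female %) of the whole dataset, as reported in the text.
groups = {
    "white": (35.46, 15.78),
    "black": (8.18, 1.73),
    "Mideastern": (2.36, 0.43),
}

for name, (male, female) in groups.items():
    ratio, times_more = imbalance(male, female)
    print(f"{name}: ratio {ratio:.2f}, {times_more:.2f}x more men")
```

Because the inputs are already rounded to two decimals, the recomputed values can differ from the published ones in the last digit (for example 3.73 rather than 3.72 for black men versus black women).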

Age imbalance

Perceived age distribution grouped by gender in MS1M dataset

The distribution of females’ ages is skewed toward the 20s and 30s. The worst thing about this is not the median age but the lack of data for women aged 40 and over: that’s where we’d expect sudden drops in accuracy.
There is no data for anyone younger than 20. This exacerbates one of the biggest problems in FR: no current models work well for teens, and this is primarily a data issue.
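A comparison of this kind could be computed along the following lines. This is only a sketch: the age lists are invented for illustration and are not taken from MS1M, and the age_summary helper is a hypothetical name. It reports the median, the spread, and the share of faces aged 40 or over, the three quantities the observations above rely on.

```python
from statistics import median, pstdev


def age_summary(ages):
    """Summarize perceived ages: median, spread, and share aged 40+."""
    return {
        "median": median(ages),
        "spread": pstdev(ages),  # population standard deviation
        "share_40_plus": 100.0 * sum(a >= 40 for a in ages) / len(ages),
    }


# Invented perceived ages, for illustration only (not MS1M data).
female_ages = [22, 24, 25, 27, 29, 31, 33, 38]
male_ages = [24, 29, 35, 41, 46, 52, 58, 63]

print("female:", age_summary(female_ages))
print("male:  ", age_summary(male_ages))
```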

So even though none of these statistics should be accepted for production, this data source is de facto the most widely used training set.

For academic exercises such as research, the data source may be adequate, but the requirements for production use should be far more rigorous as there are real-world repercussions.

How to fix it

The issue we have examined here is purely technical and is relatively easily rectified with some good engineering and better data diversity.

There are currently no regulatory guidelines in this area, but based on this analysis and XIX’s own expertise in computer vision and FR, we’ve developed a set of guidelines to serve as a starting point for organizations that are either seeking datasets for training their own models or creating their own datasets.

On the data side, we suggest taking the following key steps:

No more than 10% deviation between classes and groups in training sets: Reducing deviation between classes and groups in the training set results in more robust face recognition systems. We suggest keeping the difference in the number of images per class to no more than 10%, with at least 100,000 examples per class for production-ready systems. For example, for two gender classes and six ethnicities, that means at least 1.2 million faces, about 100,000 for each ethnicity-gender combination, which is already roughly 14 times the 85,742 identities in the largest open-source dataset.
Ensuring input quality control: Current FR systems accept pretty much any detected face as a valid input for identification — even if it’s blurry or was taken in poor lighting conditions. We have seen that the lack of filtering there leads to sharp decreases in performance. To prevent that from happening, we implement our own algorithms for evaluating faces from input streams, requiring users to submit only sufficiently well-lit and sharp photos.
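The first guideline above lends itself to an automated check. The sketch below is one possible reading, not XIX’s implementation: check_balance and required_total are hypothetical names, and "no more than 10% difference" is interpreted here as "the largest class may exceed the smallest by at most 10%", which is an assumption.

```python
def required_total(n_genders, n_ethnicities, min_per_class=100_000):
    """Minimum dataset size if every gender-ethnicity pair is one class."""
    return n_genders * n_ethnicities * min_per_class


def check_balance(counts, max_deviation=0.10, min_per_class=100_000):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    for name, n in counts.items():
        if n < min_per_class:
            problems.append(f"{name}: {n} examples, need {min_per_class}")
    smallest, largest = min(counts.values()), max(counts.values())
    # Assumed reading of the 10% rule: largest <= smallest * 1.10.
    if largest > smallest * (1 + max_deviation):
        problems.append(
            f"largest class ({largest}) exceeds smallest ({smallest}) "
            f"by more than {max_deviation:.0%}"
        )
    return problems


print(required_total(2, 6))  # 2 x 6 x 100,000 = 1,200,000

# Made-up class counts, for illustration only.
print(check_balance({"white_male": 105_000, "black_female": 100_000}))
print(check_balance({"white_male": 350_000, "black_female": 17_000}))
```

Returning a list of problems rather than a single pass/fail flag makes it easier to see which classes need more data before training starts.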

In addition, we think the industry should be working towards two goals:

Obtaining explicit consent for all images in the set: Any system that uses face recognition for (re-)identification must obtain explicit consent from the people it identifies. Today, systems already in production have built clusters that store faces and associated personally identifiable information (PII) without the knowledge of the people represented in them. Those people have no control over, or visibility into, their biometric data. This must change: companies must enable people to see their data, delete it, and see how it’s being used.
Complying with CCPA and GDPR: The sensitive nature of biometric data requires a different approach to data handling. CCPA and GDPR are a great start, but concrete policy around specific uses of face recognition is also required, as virtually no FR applications currently meet these regulations. Surveillance, unauthorized use and tracking must be addressed by functioning democratic societies.

Widespread use of biased AI technology will affect those underrepresented in the datasets used to build it, in proportion to the popularity of such software. Setting standards, at least for the accuracy of FR algorithms in cases that can put people at risk (e.g. in airports), is reasonable from both a social and an economic perspective. Well-defined requirements for overall accuracy implicitly enforce higher diversity in datasets, as diversity is one of the main contributing factors to an FR algorithm’s quality.