Trueface Tackles Facial Recognition Gender and Ethnicity Bias

Cyrus Behroozi
Trueface
Mar 30, 2020

It is no secret that bias is often present in machine learning algorithms, and in facial recognition algorithms in particular. The presence of bias means that facial recognition (FR) algorithms are less accurate for certain demographics than for others. Studies have shown that facial recognition algorithms tend to work best on white males and are least accurate for Black females. One look at the NIST Facial Recognition Vendor Test results confirms this; nearly all algorithms have a higher false positive rate for Black individuals and for females (as denoted by the pink squares below).

False Positive Rate by Demographic (Pink Indicates Worse Performance), NIST FRVT 11 Report, 2020-02-27

So what causes this bias? In short, it is predominantly due to the imbalance in many of the public and private datasets used to train the models. A machine learning model performs best on data that resembles the data it was trained on. Since these training datasets contain mostly images of white males, the models perform best for that demographic. The figures below depict the ethnicity and gender distributions for some of the popular FR training datasets. It doesn’t take an expert to see that significant imbalances exist.

Racial Composition of Popular Datasets
Gender Composition of Sample Private Dataset

One way to mitigate this problem is to ensure you have roughly the same number of samples from each group during the pre-processing stage before commencing training (this may involve removing some images from the more dominant groups). This can prove to be a challenge as many of the FR datasets do not have ethnicity or gender labels. Additionally, in the case where one group is significantly underrepresented, a large amount of valuable data must be removed from the more dominant groups to re-establish a balance.
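
As a minimal sketch of what that pre-processing step can look like, the snippet below downsamples every demographic group to the size of the smallest one. It assumes the training images are described by a label file with ethnicity and gender columns; the file names, column names, and grouping choice are illustrative, not our actual pipeline.

```python
# Minimal sketch of demographic re-balancing before training.
# Assumes a (hypothetical) label file with "file", "ethnicity", and "gender" columns.
import pandas as pd

def balance_by_group(df: pd.DataFrame, group_cols=("ethnicity", "gender"),
                     seed: int = 42) -> pd.DataFrame:
    """Downsample every demographic group to the size of the smallest one."""
    group_sizes = df.groupby(list(group_cols)).size()
    target = group_sizes.min()  # everything shrinks to the rarest group
    balanced = (
        df.groupby(list(group_cols), group_keys=False)
          .apply(lambda g: g.sample(n=target, random_state=seed))
    )
    return balanced.reset_index(drop=True)

if __name__ == "__main__":
    labels = pd.read_csv("train_labels.csv")            # hypothetical label file
    balance_by_group(labels).to_csv("train_labels_balanced.csv", index=False)
```

Downsampling to the smallest group is the simplest option; oversampling the underrepresented groups or re-weighting the loss are common alternatives when too much valuable data would otherwise be discarded.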

At Trueface, we hold ourselves to the highest standards of ethics and strive to be fully transparent. As such, we are releasing the results of an evaluation we performed to quantify the bias in our model. To perform this evaluation, we used Fairface, a unique dataset that is labeled with ethnicity, gender, and age groups. Unlike most public datasets, Fairface (as the name cleverly suggests) contains roughly the same number of samples from each ethnicity and gender group.

Fairface Dataset Composition

For this evaluation, we take a subset of the Fairface dataset (64,512 of the 86,744 images), chosen so that there is equal representation from all the ethnicity groups. We then compare every image in the subset against every other image to compute a facial recognition similarity score for each pair. Since all the images in the dataset belong to different identities, every comparison is an impostor match, and the distribution of similarity scores should be centered about 0.
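
For readers who want to run a similar evaluation on their own model, the all-pairs impostor comparison can be sketched roughly as follows. It assumes one embedding has already been extracted per image and that similarity is measured as cosine similarity between L2-normalized embeddings; the file name is a placeholder, and this is not Trueface's internal implementation.

```python
# Rough sketch of the all-pairs impostor comparison described above.
import numpy as np

embeddings = np.load("fairface_subset_embeddings.npy")            # shape (N, D), hypothetical file
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)   # unit-length vectors

# With unit vectors, cosine similarity is a plain dot product. For all 64,512
# images the full N x N matrix is too large to hold in memory, so a real run
# would process it in chunks; it is shown dense here for clarity.
sims = embeddings @ embeddings.T

# Keep each unordered pair once and drop self-comparisons (the diagonal).
iu = np.triu_indices(len(embeddings), k=1)
impostor_scores = sims[iu]

# Every image belongs to a different identity, so all of these are impostor
# scores and should cluster around 0 for a well-behaved model.
print(impostor_scores.mean(), impostor_scores.std())
```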

We then compute the false positive rate at similarity score thresholds ranging from 0 to 1 for comparisons within each of the ethnicity and gender groups. Plotting the false positive rate against the threshold gives us an idea of the model's performance: in an unbiased model, there should be minimal spread between the false positive rate curves of the different groups. We expect each curve to decay, because raising the similarity score threshold produces fewer false positives. Additionally, curves that sit lower on the graph indicate better performance (fewer false positives at a given threshold).
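
A rough sketch of turning those impostor scores into per-group false positive rate curves is shown below; the group labels, threshold grid, and synthetic placeholder scores are illustrative only.

```python
# Sketch: per-group false positive rate as a function of the similarity threshold.
import numpy as np
import matplotlib.pyplot as plt

def fpr_curve(impostor_scores, thresholds):
    """False positive rate at each threshold: the fraction of impostor pairs
    whose similarity score is at or above that threshold."""
    scores = np.sort(np.asarray(impostor_scores))
    below = np.searchsorted(scores, thresholds, side="left")
    return (len(scores) - below) / len(scores)

thresholds = np.linspace(0.0, 1.0, 501)

# `scores_by_group` would hold the impostor scores from the previous step,
# split by demographic label; synthetic placeholder scores are used here so
# the sketch runs on its own.
rng = np.random.default_rng(0)
scores_by_group = {
    "Male": rng.normal(0.0, 0.12, 100_000),
    "Female": rng.normal(0.0, 0.12, 100_000),
}

for group, scores in scores_by_group.items():
    plt.semilogy(thresholds, fpr_curve(scores, thresholds), label=group)

plt.axvline(0.35, linestyle="--", color="grey")  # suggested operating threshold
plt.xlabel("Similarity score threshold")
plt.ylabel("False positive rate")
plt.legend()
plt.show()
```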

Depending upon the user’s application, we generally advise our partners to begin operating at a similarity score threshold of about 0.35 and tune for performance from there. For this reason, we will focus on that region of the plot. Note the scale on the left side of the plot.
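
The numbers quoted below are read off the curves at that operating point; continuing the sketch above (and reusing its fpr_curve helper and scores_by_group dictionary), the readout looks roughly like this.

```python
# Continuing the sketch: false positive rate for each group at the suggested
# operating threshold, and the spread between the best and worst performers.
operating_threshold = np.array([0.35])

fpr_at_op = {
    group: fpr_curve(scores, operating_threshold)[0]
    for group, scores in scores_by_group.items()
}

best = min(fpr_at_op, key=fpr_at_op.get)    # group with the fewest false positives
worst = max(fpr_at_op, key=fpr_at_op.get)   # group with the most false positives
spread = fpr_at_op[worst] - fpr_at_op[best]

print(f"Best performer:  {best} (FPR {fpr_at_op[best]:.7f})")
print(f"Worst performer: {worst} (FPR {fpr_at_op[worst]:.7f})")
print(f"Spread at threshold 0.35: {spread:.7f} ({spread * 100:.4f}%)")
```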

At the 0.35 similarity threshold operating point, the difference in false positive rate between the gender curves is 0.0001223, or about 0.012%. Contrary to most algorithms, our facial recognition model performs better on females than on males at this threshold. Some might even consider the difference so minute that the model can be said to have negligible bias at this operating point.

When comparing performance across ethnicities, the difference between the best performer (White) and the worst performer (East Asian) at the 0.35 similarity threshold is 0.00260429, or about 0.26%. As we can see, there is indeed some bias in the model at this operating point, but it translates into minimal performance loss.

So what does an unbiased facial recognition model mean, and how does its performance translate into real-world deployments? Let us unpack the benefits of an unbiased model for both the business deploying it and the end user. For a business implementing facial recognition into its architecture, a model with minimal bias means a single model can be deployed around the world without needing to be retrained or fine-tuned, steps that would otherwise delay the time to value. Take the example of a multinational pharmaceutical company using facial recognition as a second-factor authenticator to bolster security in its offices in America, Japan, Pakistan, Belarus, and the United Kingdom: all of these offices, which employ people of differing ethnicities, can expect similar results from the single model. The pharmaceutical company thus achieves its goal of bolstering security much more quickly than if it had to roll out a biased model whose performance varied across the ethnicities involved.

For the end user, a deployed model that is proven to have minimal bias delivers similar performance for all ethnicities and genders. Take another example: a regional grocery chain using facial recognition for frictionless, touchless checkout for its loyalty members. The chain has locations in several neighborhoods of the same city, each with its own cultural diversity. At each store, members of every ethnicity benefit equally from the frictionless checkout process, enjoying the same added convenience.

When we set out to make the world a safer and smarter place, we did so with all genders and ethnicities in mind. Computer vision and machine learning provide untold benefits to our society and we are here to make sure those benefits are accessible to all.
