Automatic Facial Recognition: Why do we need a human in the loop?

Qumodo Ltd
8 min read · Mar 26, 2019


Last week, I was fortunate enough to be invited by the University of Alcalá de Henares to talk on why we need humans in the loop when using automated facial recognition systems.

In this blog post I’m going to talk about some of the ideas from that presentation and make the case for why humans are still an integral part of any automated facial recognition system, particularly when facial recognition is used in security-critical and high-risk settings.

Reuben speaking at El Rostro: Nuevo Reto Tecnológico y Forense

When we look at the latest news stories on facial recognition technology, to say there are mixed messages on the accuracy of the technology is quite an understatement. On the one hand, results from the latest tests of commercially available algorithms show massive gains in accuracy, with top-performing systems able to identify matches with only a 0.2 percent error rate. On the other, a news report claims that recent live trials of facial recognition in the UK ‘returned false positives in more than 98 per cent of alerts generated’.

Can these apparently conflicting accuracy rates both be correct? Well, the short answer is yes. But just looking at these numbers is certainly not giving us the full picture. It all depends on how we measure accuracy and whether we are factoring in human review.

Accuracy and human review are critical factors when high-risk decisions are being made, either wholly or in part, from the results of an automated facial recognition system, such as deciding whether or not to arrest someone in a criminal investigation, identifying victims of child abuse, or deciding whether to detain or admit someone at the border.

When the stakes are this high, human review is an essential part of the facial recognition process.

Automated Facial Recognition in Applied Settings

Automated facial recognition (AFR) technology is now commonly used in security-critical and high-risk settings, such as verifying someone’s identity at the border or identifying a suspect in a police investigation. When we look at AFR in these applied settings there are three broad approaches to how it is used:

Scenario 1: Verification (1:1)

Perhaps the simplest of the three scenarios, this is how AFR technology is applied at the border using e-gates. A person presents an identity document, like their passport; the system enrols the facial image from the document and compares it to the live person.

If the person and the image are sufficiently similar and meet a certain threshold then the system admits them through the border. If there is insufficient similarity to meet the threshold then a human reviewer will be required to adjudicate the verification decision. 1:1 verification can also be performed between two images.
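To make this concrete, here is a minimal sketch of the 1:1 verification decision in Python. The embed() function is a stand-in for whatever face-embedding model a system might use, and the 0.6 threshold is purely illustrative; nothing here comes from any particular vendor’s system.

```python
import numpy as np

def embed(face_image: np.ndarray) -> np.ndarray:
    """Placeholder for a face embedding model (e.g. a CNN); returns a unit-length vector."""
    vec = face_image.astype(float).ravel()
    return vec / np.linalg.norm(vec)

def verify(document_image: np.ndarray, live_image: np.ndarray,
           threshold: float = 0.6) -> str:
    """Compare the passport photo to the live capture and decide."""
    # Cosine similarity between the two embeddings (both are unit vectors).
    score = float(np.dot(embed(document_image), embed(live_image)))
    # Above the threshold the gate opens automatically; below it, a human adjudicates.
    return "admit" if score >= threshold else "refer to human reviewer"
```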

Scenario 2: Identification (1:N)

In this scenario we have an image of an unknown person and need to find out who it is. The unknown image is enrolled in the AFR system and compared by the AFR algorithm to a gallery or database of known images. Gallery sizes can range from just a handful of images into the tens of millions.

The system compares the unknown probe against each image in the gallery and calculates some kind of similarity score. Depending on how the system is configured, either a set number of the top scoring images are returned as a candidate list or images that score above a certain threshold are returned.

A human operator has to manually review the candidate list to determine whether a match to the unknown probe image is present. 1:N (one-to-many) identification can also be used in real time to search a live subject against a database, again followed by human review.
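As a rough illustration, here is a sketch of the 1:N search step, assuming each gallery entry already has a precomputed embedding and that a higher dot product means a more similar face. The function name and parameters are hypothetical, not any specific product’s API.

```python
from typing import Dict, List, Optional, Tuple
import numpy as np

def identify(probe: np.ndarray,
             gallery: Dict[str, np.ndarray],
             top_k: int = 10,
             threshold: Optional[float] = None) -> List[Tuple[str, float]]:
    """Score the probe against every gallery image and return a candidate list
    for a human operator to review: either the top-k scores, or every score
    above the threshold if one is supplied."""
    scores = [(identity, float(np.dot(probe, emb))) for identity, emb in gallery.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        return [(identity, s) for identity, s in scores if s >= threshold]
    return scores[:top_k]
```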

Scenario 3: Clustering (N:N)

The third scenario is clustering. If we have a set of unidentified images, AFR technology can be used to group faces into sets or clusters based on how similar they score against each other. A human operator then reviews the clusters to accept any potential matches and reject any non-matches.
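Here is a sketch of one simple way that grouping could be done, again assuming precomputed embeddings and an illustrative similarity threshold. Real systems use more sophisticated clustering, but the shape of the output, clusters for a human to accept or reject, is the same.

```python
import numpy as np

def cluster_faces(embeddings, threshold: float = 0.7):
    """Greedy single-link clustering: add each face to the first cluster that
    contains a sufficiently similar face, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if any(float(np.dot(emb, embeddings[j])) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster was similar enough
    return clusters
```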

You’ll have noticed that in all three scenarios a human is required to confirm whether the automated system got it right and adjudicate on any difficult or borderline ‘matches’. This is because no AFR system is 100 percent accurate. How accurate the system is will depend upon the performance of the underlying algorithm, the quality of the facial imagery and how the system is configured.

Measuring accuracy — It’s a thresholding issue

If we want to know how accurate an AFR system is, we need to dig a bit deeper than just the percentage of matches it gets right. And when we do, the need for human review becomes much more obvious.

There are two statistics that are key to understanding AFR accuracy: the false acceptance rate (FAR) and the false reject rate (FRR). The two are intrinsically linked.

  • The FAR measures how many times the system incorrectly reports that two non-matching faces are a match (also known as a false positive).
  • The FRR measures the opposite, how many times the system misses a matching face (a false negative).

The FAR and FRR of a system are not independent: by decreasing one type of error we may increase the other. This error trade-off is adjusted by setting a threshold.

How effective the system is at determining whether two faces match depends on where we set the threshold for what is considered a match, based on the similarity score generated by the system.
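As a sketch, given a set of similarity scores that have been labelled as genuine (same person) or impostor (different people) pairs, the FAR and FRR at a chosen threshold can simply be counted. The scores below are made up purely for illustration.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Count error rates at a given threshold from labelled similarity scores."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)  # false accepts
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)     # false rejects
    return far, frr

# Made-up scores, purely for illustration.
genuine = [0.92, 0.85, 0.78, 0.66, 0.58]    # same-person pairs
impostor = [0.61, 0.47, 0.40, 0.33, 0.21]   # different-person pairs
print(far_frr(genuine, impostor, threshold=0.6))  # -> (0.2, 0.2)
```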

High Thresholds

If we set a very high threshold then only images with very high similarity scores will be considered potential matches (see the left of Figure 1). This will mean our false acceptance rate will be very low, but we may miss matches, giving us a higher false reject rate (the green squares under the threshold line).

A high threshold system would work well at the border for 1:1 verification because we don’t want to wrongly admit someone. But we will get some instances where a person doesn’t pass the threshold even though they have the correct passport. This may be due to poor lighting, ageing or alterations in appearance. This small number of cases will be sent to a human for review. If the human decides the system made a mistake, the person can be admitted and go on their way. If the human operator agrees with the system and confirms that the passport image does not match the person, they will undergo further investigation.

If the threshold is set too high, the system will end up generating many false rejects that require human review. This would be not only a frustrating experience for people trying to get through the border but also inefficient, wasting the time of the human reviewer. Setting the correct threshold will be specific to the environment that the AFR system is being used in and will require testing and tweaking to get right.

Figure 1 — Example of the effects of a high threshold (left) and a low threshold (right) on matching accuracy. Any data points above the threshold are considered by the system to be a potential match, data points below the threshold are considered to be a non-match

Lower Thresholds

A high threshold model is not so useful in a 1:N identification scenario where we are searching an unknown face against a database of known faces, as we may miss our potential candidate due to the relatively high false reject rate.

Having too high a threshold would be high-risk in a policing scenario where we are trying to identify a suspect from a facial image, particularly if the image is poor quality, such as a still from a CCTV system. In a 1:N identification scenario, if we search with a very high threshold it’s possible we will miss the suspect. In this case it would be preferable to lower the threshold and have a human review the returned candidate list.

By lowering the match threshold we reduce the chance of missing a match in our candidate list (low false reject rate) but if we don’t check the results we increase the chance of a false match (higher false acceptance rate), the red dots over the threshold line on the right of Figure 1.
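Sweeping the threshold over the same kind of made-up scores as before shows this trade-off directly: as the threshold comes down the false reject rate falls and the false acceptance rate rises, which is exactly why the candidate list then needs a human to weed out the false matches.

```python
# Made-up scores, purely for illustration.
genuine = [0.92, 0.85, 0.78, 0.66, 0.58]    # same-person pairs
impostor = [0.61, 0.47, 0.40, 0.33, 0.21]   # different-person pairs

for threshold in (0.8, 0.6, 0.4):
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    print(f"threshold={threshold:.1f}  FAR={far:.0%}  FRR={frr:.0%}")
# threshold=0.8  FAR=0%   FRR=60%   (few false matches, but matches missed)
# threshold=0.6  FAR=20%  FRR=20%
# threshold=0.4  FAR=60%  FRR=0%    (few misses, but more false matches to review)
```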

If we didn’t review the results from an AFR system performing 1:N identification at a low threshold, the risk of errors may be very high. Think back to the ‘98 per cent false positive rate’ mentioned earlier: that number didn’t factor in any kind of human review, and instead counted every image returned above the threshold that wasn’t a match as a false positive. In that trial a human did review the results, and so the overall error rate of the system will have been greatly reduced.

So we need a human to review these results to pick out the true matches and discard the non-matches. Any potential matches will be passed on as a substantive line of enquiry to an investigator. If no match is found the operator can adjust the image or the system configuration and search again or inform the investigator that a viable match has not been found.

Humans and Machines: We work better together

So, back to the 0.2 percent error rate I mentioned at the start of this post. The US National Institute of Standards and Technology (NIST) runs independent evaluations of commercially available and prototype facial recognition technologies. In the latest test from 2018, NIST found massive gains in accuracy compared to the 2013 results, with many algorithms performing more accurately than the top performers from 2013. This is due in large part to advances in the use of deep learning and convolutional neural networks.

The most accurate algorithms from 2018 can match high quality facial images with error rates below 0.2 percent. But these results are achieved in fairly optimised conditions. NIST also notes that for suboptimal conditions (e.g. lower quality images, or faces with a significant age difference) ‘true matches become indistinguishable from false positives and human adjudication becomes necessary’.

Although facial recognition technology is improving rapidly, when it is used in suboptimal conditions human review is still required. Particularly if high-risk decisions are being based on results from the system, such as arresting a suspect or detaining someone at the border.

Both humans and machine algorithms have their respective strengths and weaknesses and often these complement each other. By designing and implementing AI and machine learning systems like AFR from a human-centric perspective we can improve the performance of both the AI and the human operator.

Next time I’ll be looking at the other side of the partnership and talking about the accuracy of human reviewers in facial recognition.

Want to know more? Sign up to our newsletter for articles straight to your inbox.

www.qumo.do | Reuben Moreton

Reuben is the Identity and Biometrics Lead at Qumodo. He brings expertise in facial recognition and identity, which he developed during his time working in forensics for the Met Police. With an MSc in Forensic Science, he has already published multiple papers on the subject and is currently studying for his PhD in Psychology.

Follow Reuben on Twitter


Qumodo Ltd

Advancing human + AI interaction through research, design and development.