Live Facial Recognition: how good is it really? We need clarity about the statistics.

Published in WintonCentre · Feb 10, 2020

David Spiegelhalter and Kevin McConway

Imagine you are in a crowd, and that you are pulled over by police who say your face appears to match someone in a list of people wanted for questioning. And then imagine how you would feel if you heard the system was claimed to be ‘80% accurate’. Possibly a bit disgruntled?

Live facial recognition (LFR) involves a system of cameras and software that scans the faces of individuals in a crowd and, in real time, compares each of their images with a ‘watchlist’ of people who are being sought by the police. If the system declares a ‘match’, a human operator is alerted and rapidly makes a judgement as to whether the match is plausible; if so, the suspected individual is, where feasible, stopped and questioned, and a final confirmation of their identity is made. It is an experimental technology that is contested on legal, ethical and practical grounds, particularly regarding the rules for its use and possible biases in the matching algorithm that mean certain classes of individual are more likely to be wrongly identified.

The issue has had extensive media coverage recently, with headlines such as Facial recognition cameras will put us all in an identity parade, Meadowhall facial recognition scheme troubles watchdog, and Chicago-Area Groups Demand End to Facial Recognition Amid Concerns.

With such a hotly contested issue, it seems essential that any public discussion of the performance of LFR is clear and unambiguous. Unfortunately commentators often appear confused about the issues and use misleadingly ambiguous terms. Three recent examples illustrate the problem.

1. On 27th January, in a debate in the House of Lords about LFR, government representative Baroness Williams claimed “As for inaccuracy, LFR has been shown to be 80% accurate. It has thrown up one false result in 4,500”. But it is unclear what Baroness Williams meant by ‘80% accuracy’ (and we suspect she may also be unclear about what she meant). For example, it could mean any of three very different things:

  • Of 10 alerts, we would expect 8 to be on the watchlist of people being sought
  • Of 10 people on the watchlist who are scanned, we would expect the system to correctly identify 8 of them
  • Of all judgements made by the system, including when it decides that a scanned person is not on the watchlist, 80% are correct.

It is ironic that the phrase ‘80% accurate’ was used as a running joke in the recent Royal Institution Christmas Lectures, with Dr Hannah Fry and Matt Parker holding it up as an example of a misleadingly ambiguous term.

2. On 29th January, the Centre for Data Ethics and Innovation (CDEI), an independent body set up by the UK Government to advise on the governance of data-driven technologies, tweeted “MPS [Metropolitan Police Service] have limited the number of false matches so it is never more than 1 in 1000 passers-by, however this will lead to fewer matches for ‘people of interest’. MPS claim that they still correctly matched 70% of people in their tests, but have not provided more information”. Again, what does ‘correctly matched 70% of people’ mean?

3. On 24th January, Danny Shaw wrote for BBC Online, “The results suggested that 70% of wanted suspects would be identified walking past the cameras, while only one in 1,000 people generated a false alert.” This is admirably clear, but he then suggested this was in conflict with another analysis, saying “But an independent review of six of these deployments, using different methodology, found that only eight out of 42 matches were ‘verifiably correct’.” We shall show below that there is no necessary contradiction between these statements. [The article has since been edited to correct the impression of a contradiction.]

The ambiguous nature of these terms is not unique to LFR screening. Within medicine, people working on systems designed to screen for disease have long recognised the great potential for confusion, and so terms such as ‘accuracy’ are carefully avoided. Indeed in November 2019 the UK Advertising Standards Authority banned some online advertisements put out by three companies who market prenatal tests for genetic abnormalities, which it said were using measures of ‘accuracy’ in a misleading way.

Screening for breast cancer using mammography provides a case study into how communication can be made clearer.

Mammography as a screening test

The performance of a screening test for a disease is characterised by three measures:

  • The proportion of people with the disease, who will get a positive test. This is known as the sensitivity, and for mammography is around 90%, so the test will miss 10% of those with cancer.
  • The proportion of healthy people without the disease, who will get a (correct) negative test. This is known as the specificity, and for mammography this is around 97%, so the test will incorrectly give a positive result for 3% of those without cancer. This 3% is sometimes known as the ‘False Positive Rate’, but this is an ambiguous term that is best avoided.
  • The proportion of people, in the population being screened, who have cancer. This is known as the base rate or prevalence, which we shall take to be 1%.

Research by psychologist Gerd Gigerenzer and others on the communication of screening tests strongly suggests that clarity is achieved by avoiding terms such as sensitivity, specificity, chance, probability, accuracy and even percentages, and instead clearly stating the numbers of various events expected in a group of people, known as ‘natural frequencies’ or ‘expected frequencies’.

For example, using the numbers given above, we can work out what we would expect to happen to 1000 women going for screening. 10 (1% of 1000) have cancer, and 9 of those get detected (90% sensitivity). But 3% of the 990 without cancer also get a positive test (one minus specificity), which is 30 women. So there are 39 positive test results, of which 9 actually have cancer: only around 1 in 4 of the women with positive test results, who are recalled for further investigation, have cancer. This is sometimes known by the clumsy expression ‘predictive value of a positive test’.
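For readers who like to see the arithmetic spelled out, here is a minimal sketch of that expected-frequency calculation (in Python, using the same illustrative figures: 1% prevalence, 90% sensitivity, 97% specificity):

```python
# Expected ("natural") frequencies for 1,000 women screened,
# using the illustrative figures from the text above.
population  = 1000
prevalence  = 0.01   # 1% have cancer
sensitivity = 0.90   # 90% of those with cancer get a positive test
specificity = 0.97   # 97% of those without cancer get a negative test

with_cancer    = population * prevalence              # 10 women
without_cancer = population - with_cancer             # 990 women

true_positives  = with_cancer * sensitivity           # 9 women
false_positives = without_cancer * (1 - specificity)  # about 30 women

total_positives = true_positives + false_positives    # about 39 women
# 'Predictive value of a positive test': of those recalled, how many have cancer?
ppv = true_positives / total_positives                # about 0.23, i.e. roughly 1 in 4

print(f"Of {population} women screened, about {total_positives:.0f} are recalled, "
      f"of whom {true_positives:.0f} actually have cancer ({ppv:.0%}).")
```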

This reasoning is shown in Figure 1 below, using two representations of the natural frequencies: a frequency tree, and a frequency table. Both are correct, and both are now part of the GCSE Mathematics syllabus as a way to teach probability**.

Figure 1. Expected frequency of different outcomes for 1000 women going for breast screening, assuming a baseline rate of 1%, a ‘sensitivity’ of 90%, and a ‘specificity’ of 97%.

Leaflets provided to women being invited to breast screening used to contain claims about its ‘accuracy’ in terms of sensitivity and specificity, but when the leaflets were revised it was realised that the crucial information of interest was: if I get a positive test result and am recalled, what’s the chance I have cancer? Our analysis above shows that, of 100 women being screened, around 4 will get a recall and only one of these will have cancer, and this is the information that is now shown in the leaflets using the infographic in Figure 2.

Figure 2: The infographic used in the current NHS breast screening leaflets to show what we would expect for 100 women going for screening

Perhaps the most vital take-home message is the massive difference, both logically and numerically, between two measures:

  • Of 100 women with breast cancer, 90 will have a positive test.
  • Of 100 women with a positive test, 25 will have breast cancer.

When written clearly in this way, it would seem absurd that anyone could get confused: it would be like mixing up the statements “Most Popes are Catholics” with “Most Catholics are Popes”. But this confusion occurs repeatedly and even has a name: the Prosecutor’s Fallacy. The name comes from a common courtroom error when DNA found at the scene of a crime matches a suspect’s: the reasonable statement

  • If the suspect is innocent and someone else left the DNA, there is only a 1-in-a-million chance of having this degree of match

is then interpreted as

  • Since there is this DNA match, there is only a 1-in-a-million chance of the suspect being innocent.

which is dangerously incorrect.
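To see why it is incorrect, a purely illustrative Bayes-type calculation helps; the numbers below are hypothetical and chosen by us, not taken from any real case, and they assume that, before the DNA evidence, every person in a pool of possible alternative sources is equally plausible.

```python
# Purely illustrative, hypothetical numbers to show the logic of the fallacy.
match_prob_if_innocent = 1e-6        # 1-in-a-million chance an innocent person matches this well
alternative_sources    = 10_000_000  # hypothetical pool of people who could have left the DNA

# Expected number of innocent people in that pool who would match by coincidence:
innocent_matches = alternative_sources * match_prob_if_innocent   # 10 people

# The suspect is then one of roughly (1 guilty + 10 innocent) matching people,
# assuming equal prior plausibility, so the match alone gives nothing like
# a 1-in-a-million chance of innocence.
prob_guilty_given_match = 1 / (1 + innocent_matches)               # about 9%
print(f"Expected innocent matches: {innocent_matches:.0f}; "
      f"chance of guilt given only the match: about {prob_guilty_given_match:.0%}")
```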

What the Metropolitan Police Service say

Scrupulous care is therefore required when discussing the performance of screening tests, and the Metropolitan Police Service (MPS) have specified their own terminology on page 28 of this document:

  • True recognition rate (TRR) is the proportion of those on the watchlist who were scanned for whom an alert was correctly generated. In a medical context this would be known as the ‘sensitivity’.
  • False Alert Rate (FAR) is the number of False Alerts generated as a proportion of the total number of subjects processed by the LFR system. This is essentially the same as one minus specificity.

As mentioned in the CDEI tweet, the MPS is tuning the system to have a False Alert Rate of less than 1 in 1,000, presumably by demanding more ‘confidence’ before a potential match is identified, which will inevitably lead to a smaller number of correct alerts too.

But the MPS do not appear to have terms for the quantities of particular interest that feature in the mammography example -

  • Out of, say, 10,000 people, how many alerts are generated?
  • Out of, say, 20 alerts, how many are false?

These two quantities seem fundamental, in terms of the ‘costs’ of the system in the broadest sense.
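A small hypothetical helper (ours, not anything the MPS provide) makes it clear why: these two quantities cannot be derived from the TRR and FAR alone, because you also have to assume how many watchlist members are actually present in the crowd.

```python
# Hypothetical sketch (ours): the two 'quantities of interest' depend not only on
# the TRR and FAR but also on how many watchlist members are actually in the crowd.
def expected_alerts(crowd_size, watchlist_in_crowd, trr, far):
    """Return (expected total alerts, expected false alerts)."""
    correct_alerts = watchlist_in_crowd * trr
    false_alerts = (crowd_size - watchlist_in_crowd) * far
    return correct_alerts + false_alerts, false_alerts

# Illustrative: 10,000 people scanned, an assumed 10 of them on the watchlist,
# a claimed TRR of 70% and a FAR of 1 in 1,000.
total, false = expected_alerts(10_000, 10, trr=0.70, far=1 / 1000)
print(f"About {total:.0f} alerts expected, of which about {false:.0f} would be false.")
```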

Let’s see whether this background understanding can make our three examples clearer.

1. Baroness Williams’ claims

Later on in the debate, Baroness Williams repeated her numbers using slightly different language: “there is a one in 4,500 chance of triggering a false alert and over an 80% chance of a correct one”. One of us tweeted a request for anyone who knew where these numbers came from, or even what they meant, and Richard Van Noorden kindly replied suggesting the “1 in 4,500” comes from the South Wales Police 2018 claim that there were 10 false matches out of 44,468 faces scanned at an event in Swansea.

That seems to explain the 1 in 4,500 as what is technically ‘one minus specificity’, or the False Alert Rate in the terms of the MPS. But what about the ‘80% accuracy’ or ‘80% chance of a correct one [alert]’?

This South Wales study said that two alerts were issued for matches confirmed by the operator, with one person arrested for an outstanding warrant. So perhaps only 2 out of 12 alerts were correct, and 10 out of 12 were false, so the predictive value of a positive alert was 2/12 = 16%. And, in fact, it’s not even clear that both of these two alerts were really correct. The operator thought they were, but the response to a Freedom of Information request from an activist group indicates that only one of the two was stopped (and was arrested). So it’s possible that the other one disappeared in the crowd, and the police had no opportunity to check whether the identification by the facial recognition system (and its operator) was really correct. If that’s the case, then there were really only 11 alerts that were checked one way or another, and only one of those was verified (by the arrest). That would make the predictive value of a positive alert even lower, at 1/11 = 9%.

Our guess is that the 80% may be the sensitivity or the TRR in the MPS’s terms — of people in the crowd who were on the watchlist, the percentage who were detected. Of course this is essentially impossible to estimate in a real-world setting as it requires knowledge of how many of the people on the watchlist were truly scanned. That’s a problem that arises in some other situations where screening methods are used. With breast mammography, if the screening process does fail to pick up a case of breast cancer, it’s likely that that case will eventually be picked up in some other way. So, roughly speaking, we know eventually how many women had breast cancer when they were screened, and how many of those had positive and negative screening results. But with facial recognition, the people that the system does not match to the watchlist, in the real world, won’t have their identity checked in a more thorough way, so nobody can know whether they were actually on the watchlist or not. So presumably companies claim this TRR performance based on experiments with crowds seeded with actors, where the number of actors who are missed by the cameras is known.

2. The CDEI tweet

We guess that “correctly matched 70% of people in their tests” also refers to the sensitivity, or the TRR in the MPS’s terms: of people in the crowd who were on the watchlist, the percentage who were detected.

3. Danny Shaw’s BBC article

This originally said: “The results suggested that 70% of wanted suspects would be identified walking past the cameras, while only one in 1,000 people generated a false alert. But an independent review of six of these deployments, using different methodology, found that only eight out of 42 matches were ‘verifiably correct’.”

This description is admirable in avoiding jargon and terms such as ‘accuracy’, and in clearly explaining what these percentages and numbers mean. But there is no necessary contradiction in these findings. Suppose there are 11 suspects in a crowd of 34,000. 70% (the sensitivity) of the 11 are identified (i.e. 8). 1 in 1,000 of the innocents are falsely identified (one minus specificity) (i.e. 34). So only 8 out of 42 “matches” are correct! The predictive value of a positive test is 8/42 = 19%, rather similar to the 16% found in the South Wales study. These (fictional) results are shown in Figure 3.

Figure 3 Fictional data that could give rise to all the conclusions in the BBC article
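The fictional numbers behind Figure 3 can also be written out as a short calculation (a sketch using only the figures quoted above):

```python
# Reproducing the fictional frequency data of Figure 3.
crowd_size        = 34_000
suspects_in_crowd = 11          # fictional number of watchlist people in the crowd
sensitivity       = 0.70        # 70% of wanted suspects identified
false_alert_rate  = 1 / 1000    # one in 1,000 passers-by generates a false alert

correct_alerts = round(suspects_in_crowd * sensitivity)                      # 8
false_alerts   = round((crowd_size - suspects_in_crowd) * false_alert_rate)  # 34

total_alerts = correct_alerts + false_alerts       # 42
ppv = correct_alerts / total_alerts                # 8/42, about 19%
print(f"{total_alerts} matches, of which {correct_alerts} ({ppv:.0%}) are correct.")
```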

It is difficult to know whether alerts were correct or not. According to the research study from the University of Essex, “adjudicating officers” first decided that for 16 of the 42 the image recorded by the technology did not match the image on the watchlist. If they had actually stopped these people and done an identity check, maybe in some cases they would have turned out to be correct anyway, but who knows. Then the police tried to stop the remaining 26, but four of them were lost in the crowd (and they may also have been correct, or not of course). So that leaves 22, of which 14 were verified as incorrect after the identity check, and 8 were verified as correct.

So none of these numbers can be taken too literally.

Recommendations for commentators on the performance of LFR

Even if LFR were completely accurate, there would still be many crucial issues about its use. But in the meantime there is a need for real clarity when discussing its performance, and so we would recommend that commentators -

  • Never refer to the ‘accuracy’ of LFR.
  • Never use terms such as chance or probability, and in fact avoid percentages unless it is explicitly specified ‘% of what’.
  • Avoid the technical terms introduced by the MPS, unless fully explaining each time what they mean.
  • Instead, explain everything in terms of what we would expect, or what was observed, in a specified group of people of defined size, just as Danny Shaw did well for the BBC. For example:

Of 10,000 people in the crowd who are not on the watchlist, we would expect around 20 alerts

Of 100 people in the crowd who are on the watchlist, we would expect around 80 correct alerts.

Of 20 alerts, we would expect around 4 to be confirmed as correct matches.

  • All three of these numbers are relevant, although the final one seems the most important to the public, as it determines how many people are falsely identified for each correct alert. Crucially, does this proportion depend on gender, ethnicity and so on?
  • Commentators need to be aware that, even for tests that seem reliable (in terms of sensitivity and specificity, or TRR and FAR), when screening a large population it is almost inevitable that the great majority of alerts are false alarms, as the short sketch after this list illustrates. Perhaps it is best to remember -
  • When looking for a needle in a haystack, even for the sharp-eyed, there’s a lot of bits of straw that look like needles.
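To make that needle-in-a-haystack point concrete, here is a small sketch using the rates from the illustrative statements above (roughly 80% of watchlist members correctly alerted, and about 2 false alerts per 1,000 people not on the watchlist); the crowd sizes and watchlist counts are our own assumptions:

```python
# How the share of correct alerts collapses as watchlist members become rarer in the crowd.
# Rates taken from the illustrative statements above; crowd composition is assumed.
sensitivity      = 0.80          # of 100 people on the watchlist, ~80 correct alerts
false_alert_rate = 20 / 10_000   # of 10,000 people not on the watchlist, ~20 alerts

crowd_size = 100_000
for watchlist_in_crowd in (1_000, 100, 10):
    correct = watchlist_in_crowd * sensitivity
    false   = (crowd_size - watchlist_in_crowd) * false_alert_rate
    share_correct = correct / (correct + false)
    print(f"{watchlist_in_crowd:>5} watchlist members in a crowd of {crowd_size:,}: "
          f"about {share_correct:.0%} of alerts are correct")
```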

** The book Teaching Probability by Gage and Spiegelhalter is a guide for teachers on using expected frequencies to teach this challenging topic. Gage and Spiegelhalter were advisors to the Department for Education and campaigned for these methods to be included in the syllabus, but perhaps were only successful because a special advisor in the Department was a fan of the ideas of Gigerenzer. The advisor was Dominic Cummings.
