The data science that explains why mass surveillance is bad

The British government has been trying to read our email for a while, but its new Investigatory Powers Bill is so alarming it has drawn criticism from everyone from United Nations chiefs to the CEO of Apple.

Designed, amongst other things, to legitimize government snooping and force companies to install backdoors in software, the bill also requires internet service providers to record the internet history of every British citizen. The domains each person has visited will be available to the government without any kind of warrant.

A list of the domains you’ve been to, while intrusive, seems harmless enough. If you’re not visiting anything suspicious, you’ve got nothing to worry about, right?

Unfortunately, the way this data is analyzed means the truth is a little more sinister.

The British security services want to discover who is a terrorist. The image that comes to mind is likely a room of intelligence analysts poring over files of suspects’ evidence.

In reality, huge organizations like GCHQ use automated tools to sift through millions of citizens’ internet histories with minimal human involvement. These tools are based on something known as data science — the same stuff used to target online adverts, or suggest people you might know on Facebook.

These techniques can be powerful and sophisticated. But a fundamental tradeoff at their very heart guarantees they’ll make mistakes.

Take a list of medical data for thousands of patients: age, ethnicity, medical history, and whether or not they have a certain disease. Feed it into a computer. The computer builds a statistical model of which variables suggest disease — in effect, it learns what a diseased patient looks like. Now, when you feed in a new patient’s data, the computer will suggest whether they might be at risk.

This concept is called machine learning, and it’s really useful for solving classification problems — predicting whether something fits into a group, whether that’s diseased vs. healthy or terrorist vs. civilian.
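
As a rough sketch of that idea, here in Python with scikit-learn on invented patient data (the features and numbers below are purely illustrative, not real records):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: each row is [age, number of prior conditions],
# and the label records whether that patient turned out to have the disease.
X_train = rng.normal(loc=[[55, 2]], scale=[[10, 1]], size=(1000, 2))
y_train = (X_train[:, 0] + 15 * X_train[:, 1]
           + rng.normal(0, 10, 1000) > 90).astype(int)

# The model learns what a "diseased" patient looks like from the examples.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# For a new patient, it produces a score: an estimated probability of disease.
new_patient = np.array([[62, 3]])
risk_score = model.predict_proba(new_patient)[0, 1]
print(f"Estimated disease risk: {risk_score:.2f}")
```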

The difficulty comes when the system isn’t sure. Shown a new patient, the system generates a numeric score that suggests which group they belong to. Sometimes the score falls on the borderline — it’s not clear from the evidence whether the patient is healthy or diseased.

To interpret this result, data scientists must calibrate the threshold at which someone is considered at risk of disease. Their choice is a tradeoff between the false positive rate and the false negative rate: push one down and the other goes up. A high false positive rate means the system is more likely to misidentify a healthy person as sick. A high false negative rate means it’s more likely to declare a sick person healthy.
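
To make that tradeoff concrete, here is a small self-contained sketch with invented risk scores: as the threshold rises, false positives fall and false negatives climb.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented risk scores: diseased patients tend to score higher, but the two
# groups overlap, which is exactly where the threshold choice starts to bite.
healthy_scores = rng.beta(2, 5, size=900)   # 900 healthy patients
sick_scores = rng.beta(5, 2, size=100)      # 100 diseased patients

for threshold in (0.2, 0.5, 0.8):
    fpr = np.mean(healthy_scores >= threshold)   # healthy people wrongly flagged
    fnr = np.mean(sick_scores < threshold)       # diseased people missed
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```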

In a medical context, it’s important that any evidence of health problems is dealt with right away. The threshold would be calibrated so that any patient who might have the disease is flagged as a ‘positive’ and referred to a specialist. This means we’ll have a high false positive rate, but no illness will be missed.
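
In code, reusing the invented scores from the sketch above, that calibration amounts to picking the highest threshold that still catches every diseased patient, and accepting whatever false positive rate comes with it:

```python
# Lowest score assigned to any genuinely diseased patient in the toy data.
threshold = sick_scores.min()
fpr = np.mean(healthy_scores >= threshold)   # high, by design
fnr = np.mean(sick_scores < threshold)       # zero, by construction
print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```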

The British government’s surveillance apparatus is tasked with identifying terrorists. If they can catch a terrorist, the reasoning goes, they’ll be preventing an attack. Naturally, this means that if they don’t catch a terrorist, an attack is sure to happen and lives will be at risk.

To train their machine learning system, they feed in the personal histories, social networks and internet histories of dozens of known terrorists — along with those of millions of known civilians. It learns to distinguish one group from the other.

Once trained, the system is configured to run automatically across the internet history of every British user. It does so tirelessly, twenty-four hours a day. Based on who you are and the sites you visit, it generates a numeric score — its level of confidence that you are not a threat.

It’s inconceivable for the authorities to let an attack happen. If they suspect someone is a risk, they’re forced to act. The way machine learning works, there’s always some ambiguity. A false negative could lead to preventable deaths. So their systems will be tuned for a low false negative rate, which means a high rate of false positives: people identified as terrorists who are anything but.
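
A back-of-the-envelope calculation shows what that tuning means at national scale. Every figure below is invented purely for the arithmetic, but even a false positive rate that sounds tiny translates into tens of thousands of innocent people flagged.

```python
# Illustrative, invented figures: not real statistics.
population = 60_000_000        # roughly the UK population
genuine_threats = 100          # hypothetical number of actual threats
false_positive_rate = 0.001    # an optimistic 0.1% after tuning for a low FNR

innocent_people = population - genuine_threats
wrongly_flagged = innocent_people * false_positive_rate
print(f"Innocent people flagged: {wrongly_flagged:,.0f}")   # about 60,000
```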

Imagine you’re at the airport for a hard-earned trip abroad. On your way through security, a Border Force officer pulls you aside. They collect your belongings and take you to a back room. The system, they say, has flagged you as a risk. You’re free to go home, but you’re not allowed to fly today.

False accusation is just one of many issues. Is it wise to build a system where a vulnerable teenager clicking through jihadi blogs will end up being treated as a terrorist? Does the burden of suspicion and assumption of guilt create a self-fulfilling prophecy, pushing those under the spyglass from curiosity to action?

This indiscriminate profiling is jet fuel for radicalization. The idea that your every action is scrutinized, with an automated system arbitrarily deciding your fate — it builds resentment toward authority and a deep sense of unfairness. It does irreparable damage to the relationship between citizen and state.

Given the technical limitations of data science tools, they can only make collective, approximate judgments. It’s not possible to employ these tools without violating our expectations of thorough, individual justice and the right to free thought and speech.

The reality of the situation is inconvenient for our authorities. The very nature of the technologies that enable mass surveillance means they’re not fit for purpose in preserving British life.

If you’d like to share your thoughts, find me on Twitter.