Shining a spotlight on re-identification
The UK government’s announcement earlier this month of an overhaul of data protection laws sparked headlines heralding new rights for citizens to control their data.
To those who have been paying close attention to data laws, this came as no surprise. The new provisions around consent, and the ability to withdraw it, are set out in the EU’s General Data Protection Regulation (GDPR), published in 2016 and due to come into force in May 2018, and the 2017 manifestos of both the Conservative and Labour Parties included such a pledge.
The most eye-catching announcement for me was the news that the government would create a new criminal offence “of intentionally or recklessly re-identifying individuals from anonymised or pseudonymised data”, punishable by an unlimited fine.
Whilst I welcome this unexpected legal protection for anonymised and pseudonymised data, I fear that it signals a shift of the burden of protecting such data away from the original data controller, whose responsibility it is to ensure that the data cannot be re-identified.
Anonymised data is simply data about people from which the individuals are no longer identifiable, usually because the results have been aggregated. At a very basic level, pseudonymisation keeps the data in a non-aggregated form but removes or hashes identifying features, such as a name. Both have the advantage of making the data “non-personal”, taking it outside the scope of data protection laws and making it easier to store, move, share and analyse.
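To make the distinction concrete, here is a minimal Python sketch of naive pseudonymisation of the kind described above; the record and its fields are hypothetical, purely for illustration:

```python
import hashlib

# A hypothetical customer record (the fields are illustrative).
record = {"name": "Jane Doe", "postcode": "SW1A 1AA", "films_rated": 412}

def pseudonymise(rec):
    # Replace the direct identifier with a SHA-256 hash, leaving the
    # rest of the row intact. This is the naive scheme described above.
    rec = dict(rec)  # copy so the original record is not mutated
    rec["name"] = hashlib.sha256(rec["name"].encode("utf-8")).hexdigest()
    return rec

print(pseudonymise(record))
```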
Poorly pseudonymised data is vulnerable to re-identification, sometimes even unwittingly: a person with special knowledge of the individuals involved may simply recognise them from the details supplied. On a mass scale, re-identification can be possible when the information disclosed can be matched against other information in the public domain.
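The naive scheme sketched above shows why: an unsalted hash of a name can be reversed by anyone able to enumerate plausible names. A minimal sketch, assuming the attacker has a candidate list drawn from some public source (the list here is invented for illustration):

```python
import hashlib

# Suppose the released data set contains this pseudonym (an unsalted hash).
pseudonym = hashlib.sha256("Jane Doe".encode("utf-8")).hexdigest()

# A hypothetical candidate list from a public source, e.g. an electoral
# roll or a social media directory.
candidates = ["John Smith", "Jane Doe", "Ada Lovelace"]

# Hash each candidate and compare against the released pseudonym.
for name in candidates:
    if hashlib.sha256(name.encode("utf-8")).hexdigest() == pseudonym:
        print("Re-identified:", name)  # prints "Re-identified: Jane Doe"
```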
One of the most notorious examples involved Netflix, which 10 years ago released “anonymised” ratings data as part of a competition to improve its recommendation algorithm. Researchers were quickly able to match ratings given on Netflix with other sources, such as IMDb, and identify individuals. This led to a US class action lawsuit, which was joined by an “in-the-closet lesbian mom”, who claimed she risked being outed by Netflix.
Earlier this month, it was announced that German researchers had been able to purchase the “anonymous” browsing histories of 3 million Germans and were able to identify most of them, as many of the records included a social media handle, which could then be linked to a real person.
In both cases, the re-identification risk was highlighted by the data equivalent of an ethical hacker, and doing so was undoubtedly a good deed, as it called out the irresponsible actions of others. Even though the government says it will protect whistleblowers, the threat of criminal sanctions may deter many similar ethical projects and give a free pass to poorly anonymised data.
The protection of whistleblowers should go further: the government should implement a notification system with the ICO so that researchers can easily receive advance clearance for re-identification projects and be shielded from the risk of prosecution.
Of course, there are many non-ethical examples of re-identification, at which this law will be aimed. In its Anonymisation Code of Practice, the ICO gives some examples of the reasons someone might want to do this:
- finding out personal data about someone else, for nefarious personal reasons or financial gain;
- the possibility of causing mischief by embarrassing others;
- revealing newsworthy information about public figures;
- political or activist purposes, eg as part of a campaign against a particular organisation or person; or
- curiosity, eg a local person’s desire to find out who has been involved in an incident shown on a crime map.
It is quite hard to envisage a real-life example of how the crime of re-identification would be prosecuted without another crime, such as fraud, being committed at the same time. A more sensible approach would perhaps be to make re-identification an inchoate offence, such as conspiracy, which relies on the ultimate aim being the commission of another offence.
I await with interest how the government intends to define “anonymised” and “pseudonymised”, and whether the definitions will encompass even the simplest forms. For example, would it be a criminal offence to identify the individual behind a Twitter handle, or the person who leaves an abusive comment on a blog post?
More importantly for data sets, there needs to be a minimum standard of pseudonymisation that will be required to earn the protection of this new law. The current guidelines are incredibly vague and loose, with the ICO stating that data owners should consider various factors such as who might have special knowledge allowing them to re-identify data (and whether they are likely to attempt it), what other data is in the public domain, the availability of the required computing power, and the consequences for the data subjects.
Of course, releasing badly anonymised data already falls foul of existing data protection law, but surely this is where the ICO needs to tighten up and lay out clear tests that anonymisation must pass before data can legally be considered non-personal.
As much as new technology can help attackers re-identify data, new technology is also available that prevents it or, as we are developing, allows the data to be analysed without being released in the first place, removing the re-identification risk. The wide understanding of re-identification techniques and the availability of technical solutions mean that poorly anonymised data should become a thing of the past, and its release without appropriate technical protection should also be a criminal act.
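One widely studied example of such a protective technique is differential privacy, in which analysts receive deliberately noised answers to queries rather than the raw records. Below is a minimal sketch of a differentially private count using Laplace noise; the figures and privacy budget are illustrative assumptions, not a description of any particular product:

```python
import random

def laplace_noise(scale):
    # The difference of two independent exponentials with mean `scale`
    # is a sample from a Laplace(0, scale) distribution.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(true_count, epsilon):
    # For a counting query (sensitivity 1), adding Laplace(1/epsilon)
    # noise gives epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon)

true_count = 1234   # e.g. how many users rated a given film (illustrative)
epsilon = 0.5       # privacy budget: smaller epsilon means more noise
print(round(private_count(true_count, epsilon)))
```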
A shorter version of this article first appeared in CityAM.