Will You Attend The Big Data Masquerade Ball?

When people talk up the promise of open data and big data analysis, there’s invariably an elephant in the room: the thorny question of data protection. Many of the more exciting possibilities touted for the technology rely on systems being able to process sensitive data sets such as patient health records, detailed logs of people’s movements, web browsing habits, what appliances they use and when, and so on.

Increasingly, though, people will only permit third parties to process this data if they can be sure it is handled in a safe and secure manner that cannot expose any of their personal information.

Many organizations expect the greatest benefits of big data to arise from analyzing large data sets in aggregate, with a view to uncovering new insights and hitherto undetected patterns of behavior. To gain people’s permission to use their personal information, they offer assurances of data protection and claim the data is “fully anonymized” before being processed. The trouble is, there are serious doubts about whether the techniques currently employed are robust enough for anyone to legitimately make such a claim.

How Anonymous Is Anonymization?

Generally, anonymizing data involves stripping personally identifying information, such as names, addresses, contact details, and IP addresses, out of data sets. But a number of papers have shown in detail how personal information can be reconstituted from supposedly anonymized data sets. With anonymized location data, for example, if you see someone moving regularly between two spots, you can generally infer that those are their home and workplace. This could then be compared against, say, open electoral register data or a list of location-tagged tweets posted from the same place to pinpoint someone’s identity with a high degree of accuracy.
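To see why stripping identifiers isn’t enough, here’s a minimal Python sketch of the attack just described. Everything in it is invented for illustration: the location trace, the auxiliary records, and the crude home/work heuristic.

```python
# Hypothetical re-identification of an "anonymized" location trace by
# linking its inferred home/work pair against auxiliary public data.
# All records below are invented for demonstration purposes.
from collections import Counter

# "Anonymized" trace: (hour_of_day, rounded lat/lon) pings, name stripped.
anon_trace = [
    (2, (51.50, -0.12)), (3, (51.50, -0.12)),    # night-time pings: likely home
    (10, (51.52, -0.10)), (14, (51.52, -0.10)),  # daytime pings: likely workplace
]

def infer_home_work(trace):
    """Guess home as the commonest night location, work as the commonest daytime one."""
    night = Counter(loc for hour, loc in trace if hour < 6 or hour > 22)
    day = Counter(loc for hour, loc in trace if 9 <= hour <= 17)
    return night.most_common(1)[0][0], day.most_common(1)[0][0]

# Auxiliary data an attacker might assemble from electoral rolls,
# geotagged tweets, and other public sources.
public_records = [
    {"name": "Alice Example", "home": (51.50, -0.12), "work": (51.52, -0.10)},
    {"name": "Bob Example",   "home": (51.47, -0.20), "work": (51.51, -0.09)},
]

home, work = infer_home_work(anon_trace)
matches = [r["name"] for r in public_records if (r["home"], r["work"]) == (home, work)]
print(matches)  # ['Alice Example'] -- the "anonymous" trace now has a name
```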

In a much-cited 2009 academic research paper, Colorado Law School professor Paul Ohm claimed “data can be either useful or perfectly anonymous but never both”. In the paper, Ohm cited a number of supposedly anonymized data sets — including AOL search terms and Netflix viewing habits — that had been successfully de-anonymized. Advocates of anonymization, such as Ontario’s Information and Privacy Commissioner Ann Cavoukian, claim that if data sets can be de-anonymized, it is only because they weren’t properly anonymized in the first place. But in 2014, Princeton’s Arvind Narayanan and Edward Felten roundly debunked that claim in their paper “No silver bullet: De-identification still doesn’t work”.

The arguments rage on, but there’s clearly sufficient doubt about the efficacy of anonymization techniques to put many people off the idea of sharing their sensitive data. Another technology that’s beginning to emerge, however, could indeed prove to be the silver bullet for big data protection: homomorphic encryption.

The Hope of Homomorphic Encryption

Essentially, homomorphic encryption employs some nifty mathematical techniques to allow computations to be performed on encrypted data and return encrypted results, without that data being decrypted at any point. It’s not a new concept — indeed, it has long been the ‘holy grail’ for data security — but only in the last few years has the technology begun to mature to a level where it looks like it might become viable for widespread use.
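The core idea is easiest to see in a partially homomorphic scheme such as Paillier, which supports only addition on ciphertexts (fully homomorphic schemes extend this to arbitrary computation). Here’s a toy Python sketch, with keys far too small for any real use, in which encrypted numbers are added without ever being read:

```python
# Toy Paillier cryptosystem: multiplying two ciphertexts yields an
# encryption of the SUM of their plaintexts, so a third party can add
# encrypted numbers without ever decrypting them.
# The primes here are absurdly small -- for illustration only.
import math
import random

p, q = 293, 433                # toy primes (real keys use ~1024-bit primes)
n, n2 = p * q, (p * q) ** 2
g = n + 1                      # standard simple choice of generator
lam = math.lcm(p - 1, q - 1)   # private key
mu = pow(lam, -1, n)           # decryption constant (valid for g = n + 1)

def encrypt(m):
    """Encrypt integer m (0 <= m < n) under the public key (n, g)."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:  # the random blinding factor must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Decrypt ciphertext c with the private key (lam, mu)."""
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

a, b = encrypt(20), encrypt(22)
c_sum = (a * b) % n2           # homomorphic addition: multiply ciphertexts
print(decrypt(c_sum))          # 42, computed without decrypting a or b
```

Fully homomorphic encryption generalizes this trick to support both addition and multiplication on ciphertexts, and therefore arbitrary computation, which is also what makes it so computationally expensive.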

The first fully homomorphic encryption system, developed at IBM by Craig Gentry in 2009, took 100 trillion times longer to analyze encrypted data than if that data had been in plaintext. In the intervening years, the company has managed to speed that up by a factor of around 2 million, but that still puts severe limits on how useful it would be for big data protection. Earlier this year, though, Microsoft released a paper claiming it had dramatically accelerated the speed of homomorphic encryption. Its neural network system made 51,000 predictions per hour while studying a stream of encrypted image data, with an accuracy level of 99%.

Microsoft researcher Professor Kristin Lauter later told The Register that while there was clearly a lot of work still to be done, the results looked promising. And while she stressed that this was still purely a research effort and that Microsoft had no plans for a commercial product based on the technology, she added: “We are definitely going towards making it available to customers and the community”. Some of the code has already been released as an open source project.

Given that homomorphic encryption may be the only way to unleash the full potential of big data without compromising people’s privacy and security, it seems all but inevitable that, once the technology is sufficiently mature, it will quickly become the default means of data protection for big data. We’re not quite there yet, but expect to hear a lot more about it in the next few years.

Let 100TB share your big data burden.


Originally published at blog.100tb.com.