To Reclaim Data Anonymity, Give Up The Proofs
Researcher Paul Francis tells Teb’s Lab why we might have to give up on mathematically certain methods of data anonymization in order to advance the field.
Privacy and data anonymity have been hot topics over the last few years. The adoption of regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), as well as the challenges of getting such legislation right, has spurred some of this interest. Scandals like the Cambridge Analytica kerfuffle that cost Facebook $5 billion in FTC fines, and the seemingly endless stream of data breaches, have also driven public interest in data privacy.
Breaches and leaks are only one part of the conversation around privacy. In the case of a data breach, malicious actors obtain raw data complete with identifying features. Another part of the problem lies in anonymizing data that was always meant to be shared. In these cases the question is not so much how to prevent people from getting the data, but how to protect individual identities in the data sets that people receive. A chorus of voices has declared that data anonymity is almost impossible to achieve. I explored this perspective when I wrote Why ‘Anonymized Data’ Isn’t So Anonymous for OneZero in April, and I have to admit: I came away feeling disheartened about the state of privacy preservation.
On the other hand, while academics and researchers have demonstrated successful attacks against some anonymizing tactics, reports of actual instances of malicious actors breaching strong anonymity protections are relatively uncommon, especially compared to the overwhelming number of reports about leaked and otherwise hacked databases. National census bureaus have been releasing anonymized data covering billions of individuals for decades with little evidence that their anonymity has been broken for malicious purposes.
I spoke with Paul Francis — an anonymity researcher at the Max Planck Institute for Software Systems and co-founder of the data anonymity company Aircloak — about this apparent contradiction and about the new privacy preservation strategies they are employing at the Max Planck Institute and Aircloak. According to Francis, one of the biggest problems facing data privacy advocates is convincing ourselves that anonymity can actually be preserved.
The following interview has been edited and condensed for clarity.
Tyler Bettilyon: Historically, most efforts to anonymize data sets would more accurately be called pseudonymization. This distinction is important both in terms of regulation and in practical applications. Can you describe the difference between these two labels?
Paul Francis: When I say anonymized I tend to use it in the strong GDPR sense. Under the GDPR, data that was personal data but has been anonymized is no longer personal data. When they say anonymous they mean something really quite strong: that a determined and resourceful attacker would not be able to get individually identifying data out of the anonymized data. And that’s actually a pretty big deal, because if your data is “non-personal” it basically exempts you from the GDPR.
So we think that if anybody can get that status confidently, that’s a fairly big win. It’s also natural for me to want to use anonymous in this strong sense because Aircloak targets that strong level of anonymity.
Then there are other terms like pseudonymous, which is also used in the GDPR. Pseudonymized is when you remove identifiers such as names and so on from a data set. A lot of people think anonymizing data is nothing more than removing identifiers. But GDPR explicitly defines this as pseudonymized, and they are clear that pseudonymized data is not anonymous data: It is still subject to the GDPR rules.
Bettilyon: Is that because the regulators recognize how weak that form of protection is?
Francis: Yes, and because of a common misunderstanding that removing identifiers is all you really need to do. But, they do encourage people to do it anyway. They say if you are going to share data that you can’t anonymize for some reason, at least pseudonymize it.
So what makes your tools different? What makes the resulting system anonymous and not pseudonymous?
It’s a fairly complex thing technically, but the first thing is that we are a dynamic anonymization scheme. There’s a database with the raw data and our software sits in front of that database. Our system exposes a SQL interface, and an analyst can make SQL queries to our system which will then interact with the raw data and provide an answer. So, it’s not a static anonymization process where you just remove identifiers to create a new data set.
The fact that the raw data still lives somewhere is what it is. It would be lovely if you could just get rid of all the raw data, of course. So, there are trade-offs with our approach, but this is how we do it.
The second thing is the reason we do it this way: We just don’t see how you can do static anonymization and still end up with useful data. The classic approach for static anonymization is K-anonymity, where you replace specific values with ranges — an age range of 12–20 instead of a specific age like 15, say. But if the data is at all complex, then applying K-anonymity will essentially destroy the data.
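To make the generalization-and-suppression idea concrete, here is a minimal k-anonymity sketch. This is my illustration, not Aircloak’s code; the field names, bucket widths, and sample records are invented for the example. Quasi-identifiers (age, ZIP code) are coarsened into buckets, and any record whose bucket still holds fewer than k people is suppressed:

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a range, e.g. 15 -> '10-19'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymize(records, k=3):
    """Generalize the quasi-identifiers, then suppress any record whose
    (age_range, zip_prefix) group still has fewer than k members."""
    rows = [
        {"age": generalize_age(r["age"]),
         "zip": r["zip"][:3] + "**",       # keep only the ZIP prefix
         "diagnosis": r["diagnosis"]}
        for r in records
    ]
    counts = Counter((r["age"], r["zip"]) for r in rows)
    return [r for r in rows if counts[(r["age"], r["zip"])] >= k]

patients = [
    {"age": 15, "zip": "12345", "diagnosis": "flu"},
    {"age": 17, "zip": "12399", "diagnosis": "asthma"},
    {"age": 12, "zip": "12340", "diagnosis": "flu"},
    {"age": 64, "zip": "99999", "diagnosis": "rare disease"},  # unique -> suppressed
]
print(k_anonymize(patients, k=3))
```

Even in this toy case, the utility loss Francis describes is visible: exact ages and ZIP codes are gone, and the fourth record disappears entirely. With many complex columns, almost every record ends up in a too-small group, so either nearly everything is suppressed or the buckets become uselessly coarse.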
No one really knows how to do static anonymization except on a case-by-case basis. The Census Bureau for example releases statically anonymized data, but to do that they go through a very manual process of deciding how much they can get away with. For example, they have to choose how much to obscure each column. Plus, the census data is typically very simple. Most columns have just a small number of choices such as gender or race.
So, in the census data you’ve got a ton of people choosing from just a few options, and that means K-anonymity can be pretty effective at hiding people’s individual identities?
Yes, and they do some other things too. There’s a long history of these manual techniques that statisticians can use to release statically anonymized data and in some cases still produce relatively useful data. That’s fine and good, but it’s a very manual process that requires a lot of expertise and relatively simple data. It can be done in some cases but it’s easy to get wrong. Plus, most of the companies we deal with have much more complex data so that’s not an option for them.
In fact, census bureaus have a tremendous success record. They have been releasing data for decades, and as near as I can tell there are no reported malicious breaches of that data.
With Aircloak we apply a lot of the same tactics — flattening the outliers in the data, suppressing answers that have too few data-points, adding noise, and so on — but we do this on a query by query basis. We do a lot of the same things that people have been doing manually for decades, but we do it automatically.
It’s not quite that simple because we have to deal with a clever attacker using a series of queries to defeat the anonymity, but it’s useful to think of Aircloak this way. We have a set of anonymity preserving tactics and we apply them on a query by query basis so that you don’t have to go through this painful manual process.
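The query-by-query idea can be sketched in a few lines. This is purely illustrative — the threshold, noise level, and function name are my assumptions, and Aircloak’s real mechanisms (handling repeated queries, flattening outliers, and so on) are considerably more involved. Each counting query is answered by suppressing low-count results and adding noise to the rest:

```python
import random

def anonymized_count(raw_count, threshold=5, noise_sd=2.0, seed=None):
    """Answer a counting query the way a dynamic anonymizer might:
    suppress answers built from too few individuals, otherwise perturb."""
    if raw_count < threshold:
        return None  # too few people behind this answer: suppress it
    rng = random.Random(seed)
    # Gaussian noise blurs the exact count; clamp so we never report negatives
    return max(0, round(raw_count + rng.gauss(0, noise_sd)))

salaries = [31000, 54000, 58000, 61000, 75000, 90000, 250000]
# Analyst's query: "SELECT count(*) FROM staff WHERE salary > 50000"
print(anonymized_count(sum(s > 50000 for s in salaries)))
```

The hard part, as Francis notes, is not any single answer but sequences of queries: an attacker can ask many slightly different questions and average out the noise, which is why a real system has to reason about query histories, not just individual results.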
Researchers have shown that a lot of commonly used anonymization tactics can be broken, but you’re saying that reports of data sets actually being deanonymized outside of the research community are quite rare. How can we reconcile this apparent contradiction? Are data thieves just really good at keeping their breaches out of the news, or do you think there’s more to it?
I think there must be more to it, but neither I nor anybody else really knows for sure. It is hard to reason about why something that perhaps could happen doesn’t. I recently had a conversation with a census bureau technologist and he suggested two possible reasons. First, census data really isn’t that interesting — there is little in there that an attacker could really monetize. Second, there are so many easier ways to obtain data about people that it probably doesn’t make sense to try to deanonymize census data.
I would think, however, that if breaches of census data were common that it would be hard to keep it out of the news forever. After all, at some point, to monetize the data, the attacker would have to reveal to someone somewhere what he or she knows, and how he or she came to know it might naturally be discovered. In any event, I think that a malicious breach of census data would be a big news story.
There is an axiom in security that your security doesn’t need to be that strong, it only needs to be stronger than the next guy’s. Maybe that is what is at play in the case of census data. Of course other types of data, like financial or medical data, are more valuable than census data and so might be a more attractive target for the bad guys. But we just don’t have enough experience with breaches of anonymized data to know.
You mentioned the tension between anonymizing data and maintaining the analytical value of that data. When a data source has been anonymized are there any clear signs that that data has become unreliable or lost its utility?
This is a great question that people often don’t think about: How is bias introduced and how can people deal with that bias? How can people tell that the data is really bad without comparing it to the true source data? And there’s no simple answer to this, unfortunately.
But, I have heard of cases where this matters a lot. I don’t know if this is totally true but I’ve heard a story of a Ph.D. student who performed some analysis on anonymized data from the census bureau and found out later they had misunderstood the anonymization technique in a way that made their results useless.
Even if this specific story isn’t true, stuff like that can definitely happen. It’s hard to do good data analysis under normal circumstances and people make mistakes all the time. It takes a lot of skill just to use raw data properly and anonymization only makes it harder. So, if you don’t really understand what’s going on under the hood you can make a big mistake.
This is a problem many anonymization mechanisms have: How can analysts have confidence that they’re getting good results when they use anonymized data? We’ve made progress with this challenge, but analysts still have to be very careful.
One concept that’s gained some popularity is differential privacy. You’ve expressed some skepticism about it along these same lines — that using differential privacy often renders the data useless. Why is that?
At its core differential privacy is a mathematical formula and it’s essentially a goal. What differential privacy really means is you develop an anonymizing system whereby you can prove mathematically that the data matches one of these differential privacy equations. It’s a model with a lot of different mechanisms that people have built over the years. But, it’s a very pessimistic model. It essentially says you can’t win.
Differential privacy says: We don’t know what the attacker might do. In fact the attacker might do things we’ve never thought of. But if you follow these equations and keep your epsilon — which is roughly your tolerance for risk — low enough, breaking the anonymity is essentially impossible. Differentially private data is anonymous regardless of the attacker’s capabilities: It’s statistically impossible to infer individual information from the data. This is primarily achieved through randomization, meaning it’s impossible to break the anonymization in the same way that it’s impossible to tell what number a die is going to roll.
Differential privacy also requires a query limit in order to keep the guarantee of privacy. Essentially, once you’ve asked some number of questions you can’t ask any more. If you asked another question, we wouldn’t be able to keep our epsilon low enough, so you’re locked out. Obviously, if you limit the number of questions you can ask a data set, it’s much harder for that data to be useful in an analytical sense.
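The two ingredients Francis describes — noise calibrated to epsilon and a finite privacy budget that eventually locks the analyst out — can be sketched as follows. This is a textbook-style illustration with invented class and parameter names, not any particular product’s implementation. Counting queries have sensitivity 1, so Laplace noise with scale 1/epsilon suffices; each query spends part of the total budget:

```python
import random

class DPCounter:
    """Laplace-mechanism counting queries with a finite privacy budget.
    Each query spends epsilon; once the budget is gone, queries are refused."""

    def __init__(self, data, total_epsilon=1.0):
        self.data = data
        self.budget = total_epsilon

    def count(self, predicate, epsilon=0.1):
        if epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted: no more queries")
        self.budget -= epsilon
        true_count = sum(1 for x in self.data if predicate(x))
        # The difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon),
        # which is the right noise scale for a sensitivity-1 counting query.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

ages = [15, 22, 37, 41, 58, 63]
db = DPCounter(ages, total_epsilon=0.3)
print(db.count(lambda a: a > 30, epsilon=0.1))  # a noisy answer
```

The trade-off Francis points to is visible directly: a small epsilon per query means large noise, and a small total budget means very few questions before the analyst is locked out.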
This reminds me of situations in software security: sometimes researchers find vulnerabilities that could have serious consequences if exploited, but are nearly impossible to exploit in any real-world scenarios. Sometimes these kinds of vulnerabilities are left unpatched for a long time.
That’s an interesting point. Many people think of anonymity the same way they think of cryptography, which is kind of an all-or-nothing affair. You have a message and either you can break the encryption and get at the original data or you can’t. That’s not quite true, but let’s stick with the idea for now: We think of cryptography as being secure or insecure and there’s no middle ground. Many people want to think about anonymity the same way.
Differential privacy is a way to satisfy those people. You can say with mathematical confidence that something is differentially private, but the data often loses its utility in the process.
But your approach to anonymity is more empirical, as opposed to differential privacy’s proof-oriented approach.
Right, if we accept that something with mathematical proofs behind it is not going to give us the utility we want, then we have to try something else. This concept is not unheard of, by the way. The AES encryption standard does not have a proof behind it, and it’s one of the most common methods we use to encrypt data and send it over the internet. There’s no proof behind it; we just haven’t found an attack that works.
More and more I think of our work in anonymity in terms of how cryptography worked back in the 60s: We would try new ideas and then people would attack them. Eventually someone would break them, so cryptographers would try new ideas and over time we built up a good understanding of what worked and didn’t work. Ultimately, it’s very important that symmetric key cryptography is fast. There are well known cryptography methods that are stronger than AES, but they are too slow for many use cases.
People gave up on mathematical proofs and certainty in order to go faster, and this is similar to what we have in anonymity. We want high utility in our data and this means we have to give up on the proof, just as we gave up the proof in encryption to gain speed. In most cases we will not get the utility we need from the data using these mathematically certain methods.
We’re trying to figure out how much privacy and how much utility people really need and how to push that boundary. From the get-go we knew that Aircloak is not going to provide mathematical proof of anonymity. We don’t try to make things differentially private. So, the big question that comes up is how do you know this is anonymized, and what does that even mean?
We don’t necessarily have a better answer to this question than anybody else at this point. It’s just like in the ’60s: you couldn’t take the latest algorithm and say, “oh yes, that’s strong encryption, it could never be broken,” because they had given up the proofs.
Is this part of why you’ve launched a bounty program? You’re asking people to try and break your anonymization tactics to get a better idea of how strong your system really is?
So, we also don’t have any way of knowing for sure that some data set or system is anonymous. A big part of the last five years of this project has been finding weaknesses and trashing the design if we find too many. It has been a constant process of trying to find and fix problems with the system. Eventually, we got to a place where we couldn’t find any more problems ourselves. You know, your thinking gets exhausted. You need fresh eyes.
One way to do this is to write an academic paper and have other academics look at the paper. This is kind of hard for us because of our empirical approach. In an academic environment they really expect a formal approach. So, we’ve gone over to a bounty approach in order to get more people working on breaking our system. If Aircloak is ever broadly used I do think academics will start to take an interest, but in the absence of that we’re trying to use the bounty program to get people to pay attention and find problems that we may have missed.
We had about 30 people or teams sign up to attack our system. In most cases we don’t hear anything back from them — so it’s hard to tell how much they worked at breaking it — but we did have two teams that found vulnerabilities and received payouts. In both cases we learned something new, but neither of the attacks was so bad that we had to tell our customers, “don’t use this until we fix it.”
We learned a lot from the attacks and we were able to generate successful fixes for those attacks. We’re hoping to run another round of the program late next year.
You’re also developing something you call the General Data Anonymity Score. How does that score fit into all this?
The GDA Score is an attempt to exploit what we learned in the bounty program, and to come up with a more general way of defining anonymity across technologies. The score measures when an anonymization system fails based on a number of attacks we already know about.
It’s early days. The score has not caught on yet and it’s quite hard to do. But, this is something we’re trying to push into the industry. Most companies that are building anonymity systems are not very forthcoming about how they are doing it. We’re developing the score to try and push them to be a little more open about what they do. Right now there are not many ways to ensure or assert the anonymity of a particular data set. Differential privacy is one way, and we’d like the GDA Score to become another choice.
That makes sense. Anyone can assert their system is anonymity preserving, but how can people really know without access to the system?
Right, unless people release tons of data using some system there’s not much to attack. Most of our use cases are completely internal. That typically means no one with access to the data is really trying to attack it. We’re not going to find any vulnerabilities this way because only trusted people can even access the anonymized data.
Again, this is very different from cryptography where you encrypt a message and you put it out in public where people can try to attack it. Anonymity is usually not like that because you can also control who has access to the data. If you’re unsure about the level of anonymity you can still limit access to the data.
Something that’s frustrating for us is: As we open up our system we’re trying to get people to attack the system in order to make it more secure. So, someone finds a bug and writes about it, and that’s good, we want to know about the bug. But then that information just sits out there. Even when we fix that particular issue people find the article when they search for us and they assume our system must still be broken. Similarly, people read articles saying data anonymity is impossible, so why try?
So we’re struggling with the whole situation.
I can understand that. I mean, imagine if every program that ever had a CVE (Common Vulnerabilities and Exposures) opened against it received extreme scrutiny forever afterwards. We probably wouldn’t use any software at all.
Exactly. But for some reason that’s not how people think about anonymity. I mean, I do think that anonymity is currently kind of broken, but it’s much more subtle than that.
Plus, it’s easier to write a clickbait-y headline like “Why ‘Anonymous Data’ Isn’t So Anonymous.”
Right, nobody wants to write a headline like, “Company Able To Do Anonymity a Little Bit Better.”
Or, “Anonymity Partially Achievable”