Differential Privacy

Bryant Wolf
5 min read · Dec 14, 2018


Differential privacy is an amazing win/win for both privacy advocates and data scientists. Done correctly, it lets companies and governments learn about populations without knowing anything concrete about individuals. Historically, some things have been difficult to survey because there might be personal, professional, or legal consequences for respondents who tell the truth. Other times, consumers have been forced to make huge privacy/utility trade-offs.

We are right to be cautious. What is acceptable and legal today may not be tomorrow. Additionally, there are plenty of examples of data breaches that may make people less likely to give certain information, even to trusted groups, because it might someday be inadvertently exposed.

Differential privacy solves this problem by giving every individual plausible deniability: any answer they provide might not actually be true. It sounds counterintuitive, but collecting less accurate information from individuals can actually produce more accurate results overall.

While some implementations rely on a trusted curator to add noise to a data set, this piece focuses on implementations that allow those submitting answers to add the randomness themselves. They could do so by manually flipping coins, or local software could handle it on their devices before any information is ever sent to anyone else.

The canonical example goes like this:

A surveyor wants to know what percent of the population has ever been asked to do something illegal by their current boss. There are clear instances where this information getting back to a respondent’s superior could have negative consequences down the road. The surveyor asks the respondent to carry out a simple process in secret before answering the question: flip a coin; on heads, answer truthfully; on tails, flip the coin again and answer “Yes” if it lands heads or “No” if it lands tails, regardless of the truth.

Only half of all respondents tell the truth. A surveyor giving this to 1000 people would expect ~250 respondents to say that they had been told to do something illegal by their current boss even if every boss had behaved properly. If, say, 5% of people truly had been taken advantage of, the surveyor would see somewhere around 25 additional people reporting in.
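
To make the process concrete, here is a minimal Python sketch of the respondent-side coin flips, simulating 1000 respondents with an illustrative 5% true rate (the function name and the numbers are mine, not from any real survey):

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """One respondent's secret process.

    First flip: heads -> report the truth.
    Tails -> flip again and report "Yes" on heads, "No" on tails,
    regardless of the truthful answer.
    """
    if random.random() < 0.5:        # first flip came up heads
        return truthful_answer
    return random.random() < 0.5     # second flip decides the report

# Simulate 1000 respondents, 5% of whom truly have been asked to do something illegal.
true_rate = 0.05
reports = [randomized_response(random.random() < true_rate) for _ in range(1000)]
print(sum(reports))  # roughly 275 "Yes" reports: ~250 from coin flips, ~25 true positives
```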

The raw results would look something like this: roughly 275 “Yes” responses and 725 “No” responses.

But we know that, statistically, about half of that data is junk. It is simply the result of people flipping coins. We can subtract that part out, and we’re left with a sample of reality.
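
Here is a sketch of how the surveyor can subtract the coin-flip half back out; it simply inverts the relationship P(report Yes) = 0.25 + 0.5 × true rate implied by the fair-coin process above:

```python
def estimate_true_rate(yes_count: int, n: int) -> float:
    """Invert P(report Yes) = 0.25 + 0.5 * true_rate for the fair-coin process."""
    observed = yes_count / n
    return (observed - 0.25) / 0.5

print(estimate_true_rate(275, 1000))  # 0.05 -> an estimated 5% answered a true "Yes"
```

On any single run the estimate will wobble around the true 5%, because the coin flips themselves are random.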

With only ~25 of the 275 “Yes” responses being true positives, it would be easy for any individual to maintain deniability if their answer were exposed to their superior. They could easily claim that they simply followed the process and flipped tails, then heads, requiring them to provide a “Yes”. They have even more deniability if the process is automated. Furthermore, scorched-earth punishment of everyone who responded affirmatively is impractical because they make up such a large share of the whole group. The herd protects the weak.

There are some clear limitations here. Learning about small groups is still difficult because you need enough responses for statistical significance. If you ask a group of 100 people whether they’ve ever tried heroin and only 5% actually have, you’re going to end up with about 27–28 yeses. It would be impossible to tell whether those 2 or 3 additional yeses are actual evidence of heroin use or just noise because a few more people than expected flipped tails followed by heads.
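
A rough back-of-the-envelope check shows why 100 respondents is too few here; the binomial standard deviation below is an approximation of the coin-flip noise:

```python
import math

n, true_rate = 100, 0.05
p_yes = 0.25 + 0.5 * true_rate                   # chance any one person reports "Yes"
expected_yes = n * p_yes                         # about 27.5
noise_std = math.sqrt(n * p_yes * (1 - p_yes))   # about 4.5

print(expected_yes, noise_std)
# The ~2.5 extra "Yes" reports caused by real heroin use sit well inside one
# standard deviation of the coin-flip noise, so they can't be told apart from it.
```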

This problem can quickly compound if you’re using simple methods like these to ask many questions about people. If the surveyor asks 3 questions, knowing that the respondents can end up providing false information about any number of them, they may need to increase the sample size to improve the signal-to-noise ratio. To lessen the sample size requirement, the coin doesn’t need to be balanced: a coin that tells the truth 75% of the time and lies 25% of the time may still offer sufficient deniability.
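
One way to read that unbalanced coin is that the first flip sends respondents to the truthful branch 75% of the time instead of 50%. A sketch under that assumption, with the matching estimator:

```python
import random

def biased_randomized_response(truthful_answer: bool, truth_prob: float = 0.75) -> bool:
    """Biased first flip: answer truthfully with probability truth_prob,
    otherwise fall back to a fair second flip."""
    if random.random() < truth_prob:
        return truthful_answer
    return random.random() < 0.5

def estimate_true_rate(yes_count: int, n: int, truth_prob: float = 0.75) -> float:
    """Invert P(report Yes) = truth_prob * true_rate + (1 - truth_prob) / 2."""
    observed = yes_count / n
    return (observed - (1 - truth_prob) / 2) / truth_prob
```

Less of each answer is noise, so the same sample size gives a cleaner estimate, at the cost of weaker deniability for each respondent.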

All these examples assume that the surveyor asks a question only once. However, online tools may be asking questions all the time. Those repeated questions fall roughly into three buckets that need to be handled differently in order to protect the privacy of the respondent.

If the surveyor asks questions over time that are totally independent of each other, then no adjustment is necessary. Respondents asked to roll a fair die and report the result in a differentially private manner do not expose themselves more over time.

Some questions, on the other hand, will never change. Asking for someone’s blood type in a weekly report would eventually be deanonymizing. The true answer would show up more than the false answers and eventually a surveyor would be confident they know the respondent’s real blood type. Respondents can resist this by going through the coin tossing process once and simply reporting that answer every time it’s asked in the future.
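
Here is a minimal sketch of that memoization idea, similar in spirit to the “permanent randomized response” used in some deployed systems (the class and question names are illustrative):

```python
import random

class MemoizedResponder:
    """Run the coin-flip process once per question and reuse the noisy answer,
    so repeatedly asking about a constant attribute leaks nothing extra."""

    def __init__(self):
        self._memo = {}

    def respond(self, question: str, truthful_answer: bool) -> bool:
        if question not in self._memo:
            if random.random() < 0.5:          # heads: tell the truth
                noisy = truthful_answer
            else:                              # tails: random answer
                noisy = random.random() < 0.5
            self._memo[question] = noisy
        return self._memo[question]

responder = MemoizedResponder()
# Asked weekly, the reported answer never changes, so the surveyor learns no more
# from fifty-two reports than from one.
print([responder.respond("blood_type_is_O_negative", True) for _ in range(3)])
```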

Some answers are neither constant nor uncorrelated, and these offer more of a challenge. A surveyor asking for the respondent’s location every day would start to accumulate real information over a long enough timeline, even if not all answers were truthful. Research is still ongoing into ways to properly anonymize this class of data. This is timely given the news about just how many location data brokers there are, and just how unaware the public is about what can be done with that data. For example, one well-known study found that 95% of individuals could be uniquely identified from just four time/location points.

Data scientists use a parameter called epsilon (ε) to measure the worst-case privacy loss from the data that is given up. Data with no noise inserted scores infinity, whereas data in which 100% of respondents simply flipped a coin and gave one answer on heads and the other on tails gives up no information, for a score of zero. There is no easy global ε value that we should apply to everyone. Like risk tolerance, it’s something that might differ widely among people. That said, knowing what ε different services operate at could be a great tool to put privacy in the hands of people, allowing them to make safe, rational choices with their data.
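
For a single yes/no question, ε can be computed as the largest log ratio between the probabilities of producing the same report from two different true answers. A small sketch applying that definition to the coins above (the helper function is mine):

```python
import math

def epsilon(p_yes_given_true_yes: float, p_yes_given_true_no: float) -> float:
    """Worst-case log ratio of report probabilities for two respondents
    who differ only in their true answer."""
    ratios = [
        p_yes_given_true_yes / p_yes_given_true_no,
        (1 - p_yes_given_true_no) / (1 - p_yes_given_true_yes),
    ]
    return math.log(max(ratios))

print(epsilon(0.5, 0.5))      # pure coin flip, no information:     ln(1) = 0
print(epsilon(0.75, 0.25))    # fair first flip:                    ln(3) ~ 1.10
print(epsilon(0.875, 0.125))  # 75% truthful branch, less private:  ln(7) ~ 1.95
# Reporting the truth with no noise at all would make these ratios infinite.
```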

Apple and Google, despite their sometimes opposing incentives on privacy, are both actively researching and implementing differential privacy. The US 2020 Census is also on board. While not expecting respondents to answer in a differentially private manner, the Census Bureau is acting as a trusted curator, only releasing data for review after noise has been inserted to make it differentially private. Long term, a census seems like a phenomenal example of a place where respondent-inserted noise could do an incredible amount of good. Controversial questions around citizenship and drug use may finally be able to be asked in a way that is safe for respondents yet informative for society. Governments, with better data, can make better decisions by understanding the actual needs of their constituents, without the tragedy-of-the-commons problem of respondents exposing sensitive data about themselves.
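
For contrast with the respondent-side coin flips above, here is a minimal sketch of the trusted-curator approach: the curator computes a true count and adds Laplace noise before publishing it. This illustrates the general idea only, not the Census Bureau’s actual mechanism, and the example count is made up:

```python
import random

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, sampled as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(true_count: int, epsilon: float) -> float:
    """Release a count with noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    return true_count + laplace_noise(1 / epsilon)

# e.g. publishing a small-area population count with epsilon = 1
print(noisy_count(1432, epsilon=1.0))
```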

