Differential Privacy Basics Series: Conclusion and an Important List of Resources

Summary: This is the sixth and FINAL blog post of the “Differential Privacy Basics Series” — summarizing all of the previous blog posts. For more posts like these on differential privacy, follow Shaistha Fathima on Twitter.

Differential Privacy Basics Series

Before we head towards the conclusion — let’s have a look at some of the properties of Differential Privacy.

Qualitative Properties of Differential Privacy (DP)

  • Automatic neutralization of linkage attacks, including those attempted with all past, present, and future datasets and other forms and sources of auxiliary information.

Linking attacks: A linking attack involves combining auxiliary data with de-identified data to re-identify individuals. In the simplest case, a linking attack can be performed via a join of two tables containing these datasets.

Simple linking attacks are surprisingly effective:

1. Just a single data point is often sufficient to narrow things down to a few records.

2. The narrowed-down set of records suggests additional auxiliary data that might be helpful.

3. Two data points are often enough to re-identify a large fraction of the population in a particular dataset.

4. Three data points (gender, ZIP code, date of birth) uniquely identify 87% of people in the US.
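The join at the heart of a simple linking attack can be sketched in a few lines of Python. All the records, names, and ZIP codes below are fabricated for illustration:

```python
# A toy linking attack: joining a "de-identified" medical table with a
# public voter roll on the quasi-identifiers ZIP code, gender, and
# date of birth. All data here is made up.
medical = [
    {"zip": "13053", "gender": "F", "dob": "1965-07-01", "diagnosis": "heart disease"},
    {"zip": "13068", "gender": "M", "dob": "1972-03-15", "diagnosis": "diabetes"},
    {"zip": "13053", "gender": "M", "dob": "1980-11-30", "diagnosis": "asthma"},
]
voters = [
    {"name": "Alice", "zip": "13053", "gender": "F", "dob": "1965-07-01"},
    {"name": "Bob", "zip": "13068", "gender": "M", "dob": "1972-03-15"},
]

def key(record):
    # The quasi-identifier triple used to join the two tables.
    return (record["zip"], record["gender"], record["dob"])

names_by_key = {key(v): v["name"] for v in voters}

# The "join": any medical record whose quasi-identifiers match a voter
# record is re-identified, attaching a name to a sensitive diagnosis.
reidentified = [
    (names_by_key[key(m)], m["diagnosis"])
    for m in medical
    if key(m) in names_by_key
]
print(reidentified)  # [('Alice', 'heart disease'), ('Bob', 'diabetes')]
```

Note that no single table is "anonymous enough" on its own — it is the combination with auxiliary data that breaks de-identification.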

  • Quantification of privacy loss: Differential privacy is not a binary concept; it comes with a measure of privacy loss. This permits comparisons among different techniques:

(i) For a fixed bound on privacy loss, which technique provides better accuracy?

(ii) For a fixed accuracy, which technique provides better privacy?

  • Composition: The quantification of loss also permits the analysis and control of cumulative privacy loss over multiple computations. Understanding the behavior of differentially private mechanisms under composition enables the design and analysis of complex differentially private algorithms from simpler differentially private building blocks.
  • Group Privacy: DP permits the analysis and control of privacy loss incurred by groups, such as families.
  • Closure Under Post-Processing: DP is immune to post-processing. A data analyst, without additional knowledge about the private database, cannot compute a function of the output of a differentially private algorithm M and make it less differentially private. That is, a data analyst cannot increase privacy loss, either under the formal definition or even in any intuitive sense, simply by sitting in a corner and thinking about the output of the algorithm, no matter what auxiliary information is available.
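The composition, group-privacy, and post-processing properties above can be sketched in a few lines of Python. This is a minimal illustration under basic (sequential) composition, where the epsilons of individual mechanisms simply add up; the Laplace mechanism, the accountant class, and all the numbers are assumptions for the sketch, not a production implementation:

```python
import random

def laplace_noise(scale):
    # A Laplace(0, scale) sample, written as the difference of two
    # exponential samples (a standard identity).
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

class BasicAccountant:
    """Toy privacy accountant using basic sequential composition:
    the epsilons of the individual mechanisms simply add up."""

    def __init__(self, budget):
        self.budget = budget  # total epsilon we are willing to spend
        self.spent = 0.0

    def laplace_query(self, true_value, sensitivity, epsilon):
        # Each call is an epsilon-DP Laplace mechanism; refuse to answer
        # once the cumulative privacy loss would exceed the budget.
        if self.spent + epsilon > self.budget + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return true_value + laplace_noise(sensitivity / epsilon)

def group_epsilon(epsilon, group_size):
    # Group privacy: an epsilon-DP mechanism gives k*epsilon-DP
    # for groups of k individuals (e.g. a family of 4).
    return epsilon * group_size

acct = BasicAccountant(budget=1.0)
a = acct.laplace_query(true_value=120, sensitivity=1, epsilon=0.4)
b = acct.laplace_query(true_value=80, sensitivity=1, epsilon=0.6)
# Closure under post-processing: rounding (or any other function of)
# the noisy answers consumes no additional privacy budget.
print(round(a), round(b), acct.spent)  # spent == 1.0
```

A third query against `acct` would raise, since the budget of ε = 1.0 is already exhausted — exactly the cumulative accounting that composition makes possible.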

Granularity of Privacy: a final remark on the DP definition.

Differential privacy promises that the behavior of an algorithm will be roughly unchanged even if a single entry in the database is modified. But what constitutes a single entry in the database?

Consider, for example, a database that takes the form of a graph. Such a database might encode a social network: each individual i∈[n] is represented by a vertex in the graph, and friendships between individuals are represented by edges.

This brings us to two situations:

(i) DP at a level of granularity corresponding to individuals.

This would require DP algorithms to be insensitive to the addition or removal of any vertex from the graph. This gives a strong privacy guarantee, but it might in fact be stronger than we need.

The addition or removal of a single vertex could, after all, add or remove up to n edges in the graph. Depending on what we hope to learn from the graph, insensitivity to n edge removals might be an impossible constraint to meet.

(ii) DP at a level of granularity corresponding to edges.

This would require DP algorithms to be insensitive only to the addition or removal of single, or small numbers of, edges from the graph. This is of course a weaker guarantee, but might still be sufficient for some purposes.

That is, if we promise ε-differential privacy at the level of a single edge, then no data analyst should be able to conclude anything about the existence of any subset of (1/ε) edges in the graph.

In some circumstances, large groups of social contacts might not be considered sensitive information. For example, an individual might not feel the need to hide the fact that the majority of his contacts are with individuals in his city or workplace, because where he lives and where he works are public information.

Similarly, there might be a small number of social contacts whose existence is highly sensitive: a prospective new employer, say, or an intimate friend. In this case, edge privacy should be sufficient to protect such sensitive information, while still allowing a fuller analysis of the data than vertex privacy.

Edge privacy will protect such an individual’s sensitive information provided that he has fewer than (1/ε) such friends.

As another example, a differentially private movie recommendation system can be designed to protect the training data at the “event” level of single movies, hiding the viewing or rating of any single movie (but not, say, an individual’s enthusiasm for cowboy westerns or gore), or at the “user” level of an individual’s entire viewing and rating history.

Summarizing: answers to the what, why, when, where, and how

What is Differential Privacy?

Differential privacy, informally, guarantees that the output of a query changes very little whether or not any single individual’s data is included in the dataset. That is, if the effect of adding or removing an individual’s data on the output of a query is large, the data has high sensitivity, and the chances of an adversary analyzing the output and recovering auxiliary information are high. In other words, privacy is compromised!

To avoid such data leakage, we add a controlled amount of statistical noise to obscure the contributions of individuals in the dataset.

When training an AI model, noise is added while ensuring that the model still gains insight into the overall population, and thus provides predictions that are accurate enough to be useful, while at the same time making it tough for an adversary to make any sense of the queried data!
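As a small sketch of this idea, here is a toy counting query with Laplace noise calibrated to the query’s sensitivity. The dataset, predicate, and epsilon are made up for illustration:

```python
import random

def laplace_noise(scale):
    # Laplace(0, scale) as the difference of two exponential samples.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def private_count(records, predicate, epsilon):
    """Answer "how many records satisfy the predicate?" with epsilon-DP.
    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1 and the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 61, 38, 45, 33, 27]
noisy = private_count(ages, lambda age: age > 30, epsilon=0.5)
print(noisy)  # close to the true count of 7, but randomized
```

Averaged over many hypothetical runs, the answer stays close to the true count of 7, so the population-level insight survives, yet any single answer is noisy enough that no individual’s presence can be inferred from it.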

Why do we use Differential Privacy?

Sometimes, AI models can memorize details about the data they have been trained on and ‘leak’ these details later on. Differential privacy is a mathematical framework for measuring this leakage and reducing the possibility of it happening.

When and Where can we use Differential Privacy?

How can we use Differential Privacy?

The PATE (Private Aggregation of Teacher Ensembles) approach to providing differential privacy in machine learning is based on a simple intuition: if two different classifiers, trained on two different datasets with no training examples in common, agree on how to classify a new input example, then that decision does not reveal information about any single training example. The decision could have been made with or without any single training example, because both the model trained with that example and the model trained without it reached the same conclusion.
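A minimal sketch of PATE’s noisy vote aggregation, assuming the “teacher” models have already been trained on disjoint partitions of the sensitive data. The vote counts and epsilon below are invented for illustration:

```python
import random
from collections import Counter

def pate_label(teacher_votes, epsilon):
    """Noisy-max aggregation in the spirit of PATE: each teacher
    (trained on its own disjoint data partition) votes for a label;
    Laplace noise is added to each vote count and the label with the
    highest noisy count wins."""
    counts = Counter(teacher_votes)

    def noisy_count(label):
        noise = (1.0 / epsilon) * (random.expovariate(1.0) - random.expovariate(1.0))
        return counts[label] + noise

    return max(counts, key=noisy_count)

# 100 hypothetical teachers classifying one query point, with strong
# consensus: the noise cannot realistically flip a 80-vs-20 vote.
votes = [1] * 80 + [0] * 20
print(pate_label(votes, epsilon=1.0))  # almost certainly 1
```

When the teachers strongly agree, the noise almost never changes the answer, which is exactly the intuition above: a decision shared by models trained on disjoint data reveals essentially nothing about any single training example.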

For better theoretical understanding and explanation: Privacy and machine learning: two unexpected allies?

For practical code example: Detecting an Invisible Enemy With Invisible Data!

Some Great Resources!!




Now, this is not related, but you might find it interesting: the podcast by The Changelog, especially Practical AI.

Overall References for this series:

Thanks for following through till the end of this series. Feel free to post any comments or start a discussion about differential privacy concepts. You may also check out the other series I have written before this:

ML Privacy and Security Enthusiast | Research Scientist @openminedorg | Computer Vision | Twitter @shaistha24