We live in a time where we aggregate vast amounts of user-generated data. We use these data sets to learn new insights and build solutions for all kinds of problems. Google uses location information, tracked through smartphones, to infer how busy certain public places are. We use large-scale patient data to gain insights into disease trajectories, helping to treat future patients more effectively. Data-driven business models are being developed to deliver new products and services in virtually every industry. However, there is a privacy issue at play here. How can we make sure that no personal information leaks into the final output of these algorithms? How do we evaluate how well privacy-preserving measures are implemented? This is where differential privacy comes in. It is a mathematical definition of the privacy of personal information used in big-data algorithms, and it enables us to quantify exactly how anonymous an individual's data remains in a data set.
Explain It Like I’m 5
Your parents know which candy you like. Let’s assume you want nobody else to know exactly which candy you like or dislike. The ability to keep this kind of information secret is called privacy. Now imagine you go to school and the teacher wants to know whether there are children in your class who do not like gummy bears. She does not want to know whether you specifically are one of the kids who don’t like gummy bears; it is sufficient if she can tell that there are maybe 6 or 7 children who don’t. This kind of privacy is called differential privacy.
Mathematical Privacy Guarantees
Consider the following:
- A is an algorithm that is implemented in such a way that it offers differential privacy;
- D and D’ are two data sets that differ in at most one row: the row that contains John’s personal information. It is important to understand that a data set contains both general information and private personal information. The general information is the data of the entire population in the set, as opposed to the private information of any individual or group of individuals;
- O and O’ are respectively the outputs of A(D) and A(D’).
Differential privacy offers the mathematical guarantee that anyone seeing the result O or O’ will learn essentially the same things about John’s private information. In other words, we should not be able to infer whether John’s private information was used in the computation. We should learn essentially the same things from O as we do from O’, namely the general information contained in the data set.
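To make this concrete, here is a minimal Python sketch of one classic way to achieve such a guarantee: answering a counting query with noise drawn from the Laplace distribution (the Laplace mechanism described in the Dwork–Roth reference below). The toy data set, the `private_count` helper, and the choice of epsilon are all illustrative assumptions, not a production implementation.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random()
    while u == 0.0:  # avoid log(0) on the astronomically rare edge case
        u = random.random()
    u -= 0.5  # now u is in (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(rows, predicate, epsilon: float) -> float:
    """A counting query has sensitivity 1 (one row changes the count by
    at most 1), so Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)

# D and D' differ in at most one row: John's.
D = [
    {"name": "John", "likes_gummy_bears": False},
    {"name": "Mary", "likes_gummy_bears": True},
    {"name": "Ann", "likes_gummy_bears": False},
]
D_prime = [row for row in D if row["name"] != "John"]

dislikes = lambda row: not row["likes_gummy_bears"]
O = private_count(D, dislikes, epsilon=0.5)
O_prime = private_count(D_prime, dislikes, epsilon=0.5)
print(O, O_prime)  # noisy counts; the two output distributions differ only slightly
```

The noise masks John’s contribution: an observer seeing a single noisy count cannot reliably tell whether it came from D or D’, which is exactly the guarantee described above.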
Differential privacy does not, however, keep John’s general information private: information that is publicly available is not protected by a differentially private algorithm. Therefore, when designing such an algorithm, it is important to clearly delineate what is general information (publicly known) and what exactly is private information.
Why we want Differential Privacy
Differential privacy has roughly four interesting properties that allow us to reason about and protect the privacy of personal information in large data sets.
- It allows us to quantify the loss of privacy. This enables us to compare different data-processing algorithms and to control exactly how much privacy we are willing to lose. We can choose a trade-off between the accuracy of the general information in the output and the privacy loss incurred.
- Composition of DP building blocks. Because we have a mathematical way of quantifying exactly how much privacy we lose in a single differentially private building block, we can analyze the cumulative privacy loss over several computations or building blocks.
- Understanding group privacy. Differential privacy gives us a tool to understand and control the privacy loss incurred by groups (e.g. families, groups of friends, employees, etc.).
- Insensitivity to post-processing. A malicious actor who has no information on the private data set cannot compute any function of the output that makes it less differentially private.
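The composition property in particular lends itself to a simple sketch: under basic sequential composition (covered in the Dwork–Roth reference below), the epsilons of individual differentially private queries add up, so an analyst can be held to a fixed "privacy budget". The `PrivacyBudget` class here is an illustrative assumption of mine, not an API from any real library.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential composition:
    running an eps1-DP query followed by an eps2-DP query on the same
    data is (eps1 + eps2)-DP, so individual epsilons simply add up."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        """Record one differentially private query; refuse if it would
        push the cumulative loss past the agreed total."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first query
budget.charge(0.4)  # second query; cumulative loss is now 0.8
# budget.charge(0.4) would raise: 1.2 exceeds the total budget of 1.0
```

Once the budget is exhausted, no further queries are answered; this is how the trade-off between accuracy and privacy loss mentioned above is enforced in practice.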
We went from an intuitive understanding of privacy to a more formal definition of differential privacy and why it is useful. Since this is a LinkedIn article, I won’t go into the actual mathematical notation of differential privacy; the references below cover it in detail. Whenever we design algorithms that compute results on large data sets composed of personal information, we need to ensure that no personal information is leaked. Differential privacy is a young and rapidly developing field, and it will be interesting to see how it matures over time.
- Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
- Kobbi Nissim et al. Differential Privacy: A Primer for a Non-technical Audience. February 14, 2018.