Differential Privacy Definition


Summary: Sometimes knowing the basics is important! This third post of the “Differential Privacy Basics Series” gives a quick and easy introduction to the definition of Differential Privacy: epsilon, delta, and so on. For more posts like these on Differential Privacy, follow Shaistha Fathima on Twitter.

Differential Privacy Basics Series

What is Privacy?

Everyone has a different opinion on what privacy means, but in terms of data it could be the right to control how information about you is used, processed, stored, or shared.

In today’s information realm, a vast amount of sensitive personal information is held by the services we use daily, such as search engines, mobile services, online social platforms, and so on. The loss of data privacy is an ever-present risk, since guaranteeing data privacy requires controlling both access to information and the flow or purpose of that information. Thus the need for Differential Privacy (DP) arises, with the hope of better and more robust data privacy.

What is Differential Privacy?

As per the most widely accepted DP definition, given by Cynthia Dwork in her book The Algorithmic Foundations of Differential Privacy:

Differential Privacy describes a promise, made by a data holder, or curator, to a data subject (owner), and the promise is like this: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets or information sources are available.”

Formal Definition of Differential Privacy:

Cynthia Dwork’s Formula:

A randomized algorithm M gives (ε, δ)-differential privacy if, for all databases x and y differing in at most one entry, and for all sets of outputs S:

Pr[M(x) ∈ S] ≤ exp(ε) · Pr[M(y) ∈ S] + δ

Where,

M: Randomized algorithm, i.e., query(db) + noise or query(db + noise).

S: All potential outputs of M that could be predicted.

x: Entries in the database (i.e., N entries).

y: Entries in the parallel database (i.e., N − 1 entries).

ε: The maximum distance between a query on database (x) and the same query on database (y).

δ: Probability of information accidentally being leaked.

NOTE: This definition is valid for ONE QUERY on the database, not for multiple queries.

Key Points to be Noted:

  • This DP definition does not create DP; instead, it is a measure of “how much privacy is afforded by a query?”
  • It compares running a query M on a database (x) and on a parallel database (y) (one entry less than the original database); a small code sketch of this comparison follows below.
  • The amount by which the output distributions of M on the full database (x) and the parallel database (y) can differ is given by epsilon (ε) and delta (δ).
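
To make the idea of a full database versus its parallel databases concrete, here is a minimal Python sketch. It builds a toy database, generates every parallel database by dropping one entry, and measures the maximum distance a simple query can move when a single person is removed (the quantity the framing above compares; it is closely related to the query’s sensitivity). The names here (get_parallel_dbs, query, etc.) are illustrative, not from any particular library.

```python
import random

# Toy database: 1 = has the sensitive attribute, 0 = does not.
db = [random.randint(0, 1) for _ in range(100)]

def get_parallel_dbs(db):
    """All databases obtained by removing exactly one entry from db."""
    return [db[:i] + db[i + 1:] for i in range(len(db))]

def query(db):
    """A simple count query on the database."""
    return sum(db)

# Maximum distance between the query on the full database (x)
# and the same query on any parallel database (y).
full_result = query(db)
max_distance = max(abs(full_result - query(pdb)) for pdb in get_parallel_dbs(db))

print("query(db)    =", full_result)
print("max distance =", max_distance)  # at most 1 for a count query
```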

(1) Epsilon (ε):

It is the maximum distance between a query on database (x) and the same query on database (y). That is, it is a metric of the privacy loss caused by a differential change in the data (i.e., adding or removing one entry). It is also known as the privacy parameter or the privacy budget.

(i) When ε is small

(ε,0)-differential privacy asserts that for all pairs of adjacent databases x, y and all outputs, an adversary cannot distinguish which is the true database on the basis of observing the output. When ε is small, failing to be (ε,0)-differentially private is not necessarily alarming; for example, the mechanism may still be (2ε,0)-differentially private.

All Small Epsilons (ε) Are Alike, i.e., the nature of the privacy guarantees with differing but small epsilons is quite similar.

(ii) When ε is large

Failure to be (15,0)-differentially private merely says that there exist neighboring databases x, y and an output for which the ratio of the probabilities of observing that output, conditioned on the database being x or y respectively, is large.

Such an output might be very unlikely (this is addressed by (ε, δ)-differential privacy); the databases x and y might be terribly contrived and unlikely to occur in the “real world”; the adversary may not have the right auxiliary information to recognize that a revealing output has occurred; or may not know enough about the database(s) to determine the value of their symmetric difference.

A smaller ε will yield better privacy but a less accurate response.

Small values of ε require the mechanism to produce very similar outputs when given similar inputs, and therefore provide higher levels of privacy; large values of ε allow less similarity in the outputs, and therefore provide less privacy.

Simply put, ε is the maximum distance between a query on database (x) and the same query on database (y). When there is not much differential distance between the two queries, i.e., ε is small, and the inputs of the queries are very similar, then the outputs will be very similar too. The adversary may therefore not be able to analyze them and extract the right auxiliary information, which provides a higher level of privacy than when ε is a large value.

Thus, much as a weak cryptosystem may leak anything from only the least significant bit of a message to the complete decryption key, the failure to be (ε,0)- or (ε, δ)-differentially private may range from effectively meaningless privacy breaches to complete revelation of the entire database.
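
To get a feel for why “all small epsilons are alike” while large epsilons are not, it helps to look at the factor e^ε by which the output probabilities on x and y are allowed to differ. A minimal sketch (plain Python, nothing library-specific assumed):

```python
import math

# Under (eps, 0)-differential privacy, e^eps is the factor by which
# Pr[M(x) in S] may exceed Pr[M(y) in S] for neighboring databases x, y.
for eps in [0.01, 0.1, 0.5, 1.0, 5.0, 15.0]:
    print(f"eps = {eps:5.2f}  ->  probability ratio bound e^eps = {math.exp(eps):,.2f}")

# eps = 0.01 or 0.1 keeps the two output distributions nearly indistinguishable,
# while eps = 15 allows an output to be millions of times more likely on x than on y.
```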

Two types of ε are seen between two differentially private databases:

(i) Epsilon Zero (ε = 0): The OUTPUT of a query on every parallel database is the same value as on the full (original) database. Often, if sensitivity = 0, then ε = 0 too.

(ii) Epsilon One (ε = 1): The maximum distance between all the queries is 1, OR the maximum distance between the two random distributions M(x) and M(y) is 1.

It can also be stated that ε, or the privacy budget, is an individual’s limit on how much privacy they are OK with leaking.

The concept arises from the issue that even though no single operation being done on data reveals who the subjects might be, by combining the data from multiple trials, one might be able to figure out details about an individual. As a result, every person has a “privacy budget”.

Each time an individual is included in a differential privacy calculation, it eats away at their privacy budget (by ε). Most calculations in Machine Learning can be done so as to give both a differential privacy guarantee and a utility guarantee, which means that all of these calculations can also be accounted for in the privacy budget.
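
As an illustration of how a budget is consumed across multiple queries (basic sequential composition), here is a minimal, hypothetical budget tracker; the class and its method names are invented for this sketch and not taken from any particular library.

```python
class PrivacyBudget:
    """Tracks the cumulative epsilon spent under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon  # lifetime budget
        self.spent = 0.0

    def charge(self, epsilon):
        """Record one epsilon-DP query; refuse it if the budget would be exceeded."""
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("Privacy budget exhausted: query refused.")
        self.spent += epsilon
        return self.total_epsilon - self.spent  # remaining budget


budget = PrivacyBudget(total_epsilon=1.0)
print(budget.charge(0.3))   # 0.7 remaining
print(budget.charge(0.3))   # 0.4 remaining
try:
    budget.charge(0.5)      # would push the total loss past 1.0
except RuntimeError as err:
    print(err)
```

Real deployments use more refined composition theorems, but the basic idea is the same: each query eats into the budget.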

Real World Examples

1. Apple and privacy budgets:

A recent issue arose with Apple when news came out that they were using differential privacy in order to maintain their users’ privacy. They were running operations on the app data they were receiving from their devices and made sure to use values that would ensure the privacy budget for the users would not be exceeded. However, they broke the standard by resetting the privacy budget every day.

Since the privacy budget is a value that is used cumulatively for life, this became a problem. Users were no longer guaranteed privacy because no one was keeping track of their overall privacy budgets; they just knew that no single day’s operations would result in a breach of privacy for a user.

2. 2020 Census:

The 2020 Census announced that it would be using differential privacy in order to maintain the privacy of the American people. Specifically, they will do so by releasing all statistics in a differentially private way. It has not yet been determined how epsilon will be chosen and how the randomization of the data will occur. This is considered a big step forward because previous methods for maintaining privacy depended on a collection of tricks that caused the data to be convoluted.

This decision faces some backlash from the leaders of academic subfields devoted to interpreting the census. These people believe that the increased privacy provided by the new methods is not worth the trade-off in accuracy. A key point to learn from this example is that society needs to decide which point on the privacy-accuracy spectrum it wants to be at.

(2) Delta (δ):

It is the probability of information accidentally being leaked. If δ = 0, we say that M is ε-differentially private.

Typically we are interested in values of δ that are less than the inverse of any polynomial in the size of the database.

(i) Values of δ on the order of 1/∥x∥₁ are very dangerous: they permit “preserving privacy” by publishing the complete records of a small number of database participants, which is precisely the “just a few” philosophy, where the privacy of just a few participants is sacrificed.

(ii) When δ is negligible or equal to zero: It is called ε-differential privacy or (ε,0)-differential privacy.

Note: ε is independent of the size of the database, whereas in the case of δ, the chance of a privacy leak might increase with the size of the database. Hence, ideally we would want to set the δ value to be less than the inverse of the size of the database.
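
As a rough rule-of-thumb sketch following the note above (the exact choice depends on the application), δ can simply be set strictly below the inverse of the number of records:

```python
n = 100_000          # number of records in the database
delta = 1 / n**1.5   # comfortably below 1/n, per the guideline above
epsilon = 0.5        # chosen independently of the database size

print(f"n = {n}, delta = {delta:.2e}, epsilon = {epsilon}")
```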

This brings us to the question: (ε,0)- or ε-differential privacy vs. (ε, δ)-differential privacy?

(ε,0)-differential privacy ensures that, for every run of the mechanism M(x), the output observed is (almost) equally likely to be observed on every neighboring database, simultaneously.

Whereas,

(ε, δ)-differential privacy says that for every pair of neighboring databases x, y, it is extremely unlikely that, ex post facto, the observed value M(x) will be much more or much less likely to be generated when the database is x than when the database is y. It ensures that for all adjacent x, y, the absolute value of the privacy loss will be bounded by ε with probability at least 1 − δ.

So, what does it mean?

In the case of (ε,0)-differential privacy, or ε-differential privacy, δ = 0, i.e., the probability of a data leak is required to be zero. Thus, a differentially private dataset with a distance of 1 or less will have very similar outputs.

Whereas, in the case of (ε, δ)-differential privacy, data leaks are possible, maybe because of higher sensitivity values or a large database with many rows, etc. Thus, if the value of ε is not small, then the outputs are less likely to be the same.

We usually go for ε-differential privacy based mechanisms, like the Laplace mechanism, not only because of their simplicity but also because, in real life, the chance of delta being high, i.e., of a data leak, is pretty low! (ε, δ)-differential privacy usually works well with large databases where the chance of a privacy leak is high and the sensitivity might not be low.
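
As a small preview of the next post, here is a minimal NumPy sketch of the Laplace mechanism applied to a count query: noise with scale sensitivity/ε is added to the true count to produce an ε-differentially private answer. This is a simplified illustration, not a hardened implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

db = rng.integers(0, 2, size=1000)  # toy database of 0/1 attributes
sensitivity = 1                     # a count moves by at most 1 when one entry is removed

def noisy_count(db, epsilon):
    """An epsilon-differentially private count via the Laplace mechanism."""
    return int(db.sum()) + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print("true count          :", int(db.sum()))
print("noisy count, eps=0.1:", round(noisy_count(db, 0.1), 2))  # more privacy, noisier
print("noisy count, eps=5.0:", round(noisy_count(db, 5.0), 2))  # less privacy, more accurate
```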

That’s all folks! In the next post of this series we will look into the types of noise-adding mechanisms in Differential Privacy and compare them.

Till then, you may also check out my other beginner-friendly series.
