Introduction to Differential Privacy

Brooke Joseph
4 min read · Jan 31, 2023


Pure Differential Privacy

Differential privacy (DP) is a concept that uses math to make formal privacy guarantees. Companies and governments use it to protect their data. For example, the US Census Bureau started using it to protect the privacy of US citizens who participate in surveys.

The more elegant way of defining this with math is the following: a mechanism M is ϵ-differentially private if, for all neighbouring datasets D and D′ and for every set of outcomes S,

Pr[M(D) ∈ S] ≤ e^ϵ · Pr[M(D′) ∈ S]

This says that for any set of outcomes S, the probability that the mechanism produces an outcome in S on one dataset is at most a factor of e^ϵ away from the probability of producing that same outcome on a neighbouring dataset, where a neighbouring dataset is one that differs from the original by only a single record. This essentially tells us that whether or not a person participates, the probability of any outcome is barely affected. Kinda cool! Let’s break this down to really understand the power of this.

So what’s this whole epsilon thing? It’s called the privacy budget, and it essentially tells us how much privacy you'll be getting. When using DP there is always a trade-off between the amount of privacy and the accuracy of the output: the more private the result, the less accurate it is, and vice versa. A smaller epsilon value gives more privacy but less accuracy, and a larger epsilon does the opposite. The math definition above comes from taking the ratio between the probability of an event occurring on a dataset and on its neighbour, and bounding that ratio by e^ϵ.

Equivalently, the logarithmic ratio between the two probabilities will never be greater than the epsilon value:

ln( Pr[M(D) ∈ S] / Pr[M(D′) ∈ S] ) ≤ ϵ

This is where the idea of the privacy budget comes in.
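To make this concrete, here is a minimal sketch in Python (with NumPy, and illustrative parameters of my own choosing) of randomized response, one of the simplest ϵ-differentially private mechanisms: report your true yes/no answer with probability 0.75, and flip it otherwise. The worst-case probability ratio is 0.75/0.25 = 3, so ϵ = ln 3 ≈ 1.1, and the empirical log-ratio stays within that budget.

```python
import numpy as np

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """Report the true answer with probability p, the opposite otherwise."""
    return true_answer if np.random.random() < p else not true_answer

# With p = 0.75 the worst-case probability ratio is 0.75 / 0.25 = 3,
# so this mechanism is (ln 3)-differentially private.
epsilon = np.log(0.75 / 0.25)  # ~1.0986

# Empirical check: the two "neighbouring datasets" here are a true answer
# of True vs. False. Estimate Pr[output = True] under each and compare.
n = 200_000
p_true_given_true = np.mean([randomized_response(True) for _ in range(n)])
p_true_given_false = np.mean([randomized_response(False) for _ in range(n)])
print(np.log(p_true_given_true / p_true_given_false), "<=", epsilon)
```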

Approximate Differential Privacy

All the things above describe (ϵ)-differential privacy, or in other words pure DP. There is also something called approximate DP, which is a little more relaxed and makes room for the privacy guarantee to fail. It can be represented with the following definition: for all neighbouring datasets D and D′ and every set of outcomes S,

Pr[M(D) ∈ S] ≤ e^ϵ · Pr[M(D′) ∈ S] + δ

Here there is a new parameter, delta. It tells us the probability that the privacy guarantee is allowed to fail, which makes it possible to reason about the worst-case scenario. When a mechanism meets this definition we call it (ϵ,δ)-differentially private.
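As a rough rule of thumb (an assumption on my part, though a common one in the DP literature), δ is chosen to be much smaller than 1/n, where n is the number of people in the dataset, so that a guarantee that "fails with probability δ" cannot quietly expose even one record:

```python
# Rule-of-thumb sketch (an assumption, not a hard standard): pick delta
# well below 1/n, so the chance of the guarantee failing is negligible
# compared to the number of people it is meant to protect.
n = 1_000_000        # people in the dataset
delta = 1 / n**2     # 1e-12, far below 1/n = 1e-6
assert delta < 1 / n
```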

How is privacy achieved?

The main way this privacy guarantee is met is by adding carefully calibrated random noise to the results of computations over the data. There are various mechanisms used to guarantee privacy. I will go through the main three mechanisms below.

The Mechanisms

Laplace Mechanism

The Laplace Mechanism is one of the most common ways of adding noise to numerical query results. It's pretty intuitive: as the name suggests, it draws its noise from the Laplace distribution, which is described by two parameters, the location (μ) and the scale (b).

When adding the noise, the mechanism takes two things into account: the L1 sensitivity and the epsilon value. The L1 sensitivity is the maximum absolute difference between the output of a query on a dataset and on any neighbouring dataset. It tells us how much a small change in the input, one person's data, can affect the output of the query.
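For instance, here is a quick sketch (with hypothetical queries of my own) of how L1 sensitivity is reasoned about: a counting query can change by at most 1 when one person is added or removed, and a sum over values clipped into [0, bound] can change by at most bound.

```python
def count_over_18(ages: list[int]) -> int:
    # Adding or removing one person changes this count by at most 1,
    # so the query's L1 sensitivity is 1.
    return sum(1 for a in ages if a >= 18)

def clipped_sum(incomes: list[float], bound: float = 100_000.0) -> float:
    # Every value is clipped into [0, bound], so one person can move the
    # sum by at most `bound`: the L1 sensitivity is `bound`.
    return sum(min(max(x, 0.0), bound) for x in incomes)
```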

The noise is calibrated from the sensitivity and the privacy budget: the mechanism releases

M(x) = f(x) + Lap(Δf / ϵ)

where Δf is the L1 sensitivity and Lap(b) is noise drawn from a Laplace distribution with density proportional to exp(−|z| / b); that exponential term is where the distribution gets its shape.
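A minimal sketch of the Laplace Mechanism in Python, assuming a simple counting query (the data here is made up for illustration):

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value plus Laplace noise with scale sensitivity / epsilon.

    This satisfies pure epsilon-differential privacy.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: a private count (L1 sensitivity 1) with a budget of epsilon = 0.5.
ages = [23, 45, 17, 31, 62, 15, 38]            # made-up data for illustration
true_count = sum(1 for a in ages if a >= 18)   # the counting query
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
```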

Gaussian Mechanism

Similar to the Laplace Mechanism, the Gaussian Mechanism uses the Gaussian (normal) distribution to add noise to the query results. One of the major differences between the Gaussian and Laplace Mechanisms is that the Gaussian Mechanism uses the L2 sensitivity rather than the L1 sensitivity. The two play the same role but are calculated with different norms: L1 uses the sum of absolute differences, while L2 uses the Euclidean distance.

Additionally, this mechanism doesn’t guarantee pure differential privacy; it guarantees (ϵ,δ)-differential privacy instead.
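Here is a sketch of the Gaussian Mechanism using the classical noise calibration σ = Δ₂f · √(2 ln(1.25/δ)) / ϵ from the DP literature, which holds for ϵ between 0 and 1 (the query and numbers below are illustrative):

```python
import numpy as np

def gaussian_mechanism(true_value, l2_sensitivity: float,
                       epsilon: float, delta: float):
    """Release true_value plus Gaussian noise, giving (epsilon, delta)-DP.

    Classical calibration: sigma = Delta2 * sqrt(2 ln(1.25 / delta)) / epsilon,
    valid for 0 < epsilon < 1.
    """
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return true_value + np.random.normal(loc=0.0, scale=sigma, size=np.shape(true_value))

# Example: a 3-dimensional query whose L2 sensitivity is 1.
answer = np.array([10.0, 4.0, 7.0])  # illustrative true answer
print(gaussian_mechanism(answer, l2_sensitivity=1.0, epsilon=0.5, delta=1e-6))
```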

Picking between the Laplace Mechanism and the Gaussian Mechanism depends on what level of privacy-accuracy tradeoff you’re trying to achieve, and on the type of data you’re working with. The Laplace Mechanism is typically the better fit for low-dimensional, discrete-style queries like counts, while the Gaussian Mechanism tends to shine on high-dimensional or continuous outputs, where the L2 sensitivity can be much smaller than the L1 sensitivity, as the sketch below shows.
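One concrete way to see this (a back-of-the-envelope sketch with illustrative numbers of my own): for a query with d coordinates that each change by at most 1 when one person is added or removed, the L1 sensitivity grows like d while the L2 sensitivity grows like √d, so the Gaussian Mechanism can get away with much less per-coordinate noise in high dimensions.

```python
import numpy as np

# A d-dimensional query where each coordinate changes by at most 1
# when one person is added or removed (illustrative numbers).
d, epsilon, delta = 100, 1.0, 1e-6
l1_sens, l2_sens = d, np.sqrt(d)

laplace_scale = l1_sens / epsilon   # per-coordinate Laplace scale b
gaussian_sigma = l2_sens * np.sqrt(2 * np.log(1.25 / delta)) / epsilon

print(f"Laplace scale: {laplace_scale:.1f}")    # 100.0
print(f"Gaussian sigma: {gaussian_sigma:.1f}")  # ~53.0: less noise per coordinate
```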

Exponential Mechanism

The Exponential Mechanism is a little different: rather than returning noisy numerical values, it works with categorical information. It does this using something called a score function together with that function's L1 sensitivity. The score function u takes in two values, the input dataset (x) and a candidate (h) from the set of possible outcomes (H), and the mechanism returns h with probability proportional to exp(ϵ · u(x, h) / (2Δu)).
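A minimal sketch of the Exponential Mechanism in Python, assuming a toy "pick the most common colour" task where the score is a count (so its L1 sensitivity is 1):

```python
import numpy as np

def exponential_mechanism(candidates, score_fn, data, sensitivity, epsilon):
    """Return a candidate h with probability proportional to
    exp(epsilon * score(data, h) / (2 * sensitivity))."""
    scores = np.array([score_fn(data, h) for h in candidates])
    # Subtract the max score before exponentiating for numerical stability.
    weights = np.exp(epsilon * (scores - scores.max()) / (2 * sensitivity))
    probs = weights / weights.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]

# Toy example: privately pick the most common colour.
# The score is a count, so its L1 sensitivity is 1.
data = ["red", "blue", "blue", "green", "blue", "red"]
colours = ["red", "blue", "green"]
print(exponential_mechanism(colours, lambda d, h: d.count(h),
                            data, sensitivity=1.0, epsilon=1.0))
```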

These are the main mechanisms used to meet the privacy guarantee. There are, however, some others that I didn’t touch on. In another article, I’m going to tackle basic and advanced composition: the rules for how privacy budgets add up when multiple mechanisms are combined, so that the overall privacy loss can still be bounded.


Brooke Joseph

My name is Brooke Joseph. I’m an 18-year-old girl who loves math, and her dog :)