We are starting a series of blog posts on DP-SGD that will range from gentle introductions to detailed coverage of the math and of engineering details in making it work.
In this first entry, we will go over the DP-SGD algorithm focusing on introducing the reader to the core concepts, without worrying about mathematical precision or implementation just yet (they will be covered in future episodes). The intended audience for this article is someone who has experience with training ML models, including deep nets via backpropagation using their favorite framework (PyTorch, of course 🙂).
Privacy in ML
We know that training a model is an attempt at induction: we learn something from our data and we plan to use it to predict something else in the future. To state the obvious plainly, this means that there is some information in our dataset, and by training a model we condense at least some of it into an artifact we plan to use later. We learn in Machine Learning 101 that memorization can happen, so it’s perhaps not surprising that memorization can indeed be exploited to extract information about training data from a model (see, e.g., [Carlini et al, 2018], [Feldman 2020]).
What is privacy, anyway? Let’s say we don’t know and let’s start fresh by looking at our problem. We can all agree that if the ML model has never seen your data in the first place, it cannot possibly violate your privacy. Let’s call this our baseline scenario. Intuitively, if you now added your own data to the training set and the resulting model changed a lot compared to the baseline, you would be concerned about your privacy. While this makes intuitive sense, the real world is more complex than this. In particular, there are two problems we should think about:
- We know that any tweak in the training process, no matter how trivial, will change the resulting model significantly. Permuting the training data, rerandomizing initial parameters, or running another task on the same GPU will produce a different model with potentially very different weights. This means that we can’t simply measure how different the weights are in these two scenarios as that will never work.
- If everyone expected that absolutely no change would happen in a model when they added their data, there would be no training data and hence no ML models! We can see that this constraint is a bit too rigid.
Luckily for us, this was figured out by [Dwork et al, 2006] and the resulting concept of differential privacy provides a solution to both problems! For the first, rather than comparing the weights of the two models, we want to consider the probabilities of observing these weights. For the second, instead of insisting that nothing will change, let’s instead promise that while something will change, we guarantee it will never change by more than a specific and predefined amount. This way, we won’t learn too much to be nosy, but we can still learn enough to produce useful models.
These two principles are embodied in the definition of differential privacy, which goes as follows. Imagine that you have two datasets D and D′ that differ in only a single record (e.g., my data) and you interact with the data via a process or mechanism called M (this can be anything, more on this later). We can say that M is ε-differentially private if for every possible output x, the probability that this output is observed never differs by more than a factor of exp(ε) between the two scenarios (with and without my data).
Or, if you prefer a formula:
∀ D and D′ that differ in one person’s data ∀ x: ℙ[M(D) = x] ≤ exp(ε) ⋅ ℙ[M(D′) = x]
One of the amazing things about differential privacy is that it imposes no limitations on the nature of M. It can be anything. It can be a database query, it can be asking a person a set of questions with pen and paper, or even just storing data to disk or sending it over the wire, or anything else you want. As long as M enjoys this property over its outputs, it can claim its DP badge for a specific privacy budget ε. At the same time, you can choose what ε you want to be certified for: the higher it is, the less private you are (look at the formula: it means the probabilities are allowed to diverge more). For this reason, the quantity ε is commonly referred to as the privacy [loss] budget.
If we go back to our case of training a model, we now have a way to formally certify the privacy level of our training algorithm: if, after training two models, one on all data (mine included) and the other on all data except mine, we can prove that all weights of the two models are observed with probabilities that lie within a predefined factor of exp(ε) of each other, then we can claim the cool DP badge for our training procedure (that’s right! It’s the overall process that gets the badge, not the data and certainly not the trained model!).
Notice that this task is harder than it looks: we can’t simply try 1000 examples (or a million, or a billion) and check whether they match. We need to prove this for all values, including never previously observed ones. The only way out of this is math and theorems. The good news is that if we somehow do manage to get this done, then we know that no matter what, the privacy claim is always true. There can never be any future attack that will extract our precious data from a trained model, nor any bug to exploit to circumvent our defense, just as you can’t break Pythagoras’s theorem. This is why it’s worth doing.
Providing a guarantee
So, how do we provide this guarantee then? The definition doesn’t say anything about the how.
It’s helpful to think about this problem on a simpler domain, so for now let us leave machine learning aside and focus on making private counting queries to a database — at the end of the day, we can see ML training as a special case of querying the training data to get numerical results out.
It is easy to see that COUNT(*) WHERE <cond> queries can lead to a complete privacy breakdown against a sufficiently determined attacker. Consider a database that consists of two fields, name and salary, with the latter being kept “private” by mandating that it can only be shown in aggregates. By repeatedly running queries such as COUNT(*) WHERE name="Alice" and salary < X, Alice’s salary can be recovered with binary search. Can we defend against this attack by disallowing queries that target individuals? If only! A pair of queries, COUNT(*) WHERE name<>"Alice" and salary < X together with COUNT(*) WHERE salary < X, gets the job done just as easily as before.
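Both attacks can be sketched in a few lines. The toy database, the `count_where` helper, and the salary range below are all hypothetical and chosen only for illustration; the sketch also assumes that names are unique and salaries are whole numbers.

```python
# Toy in-memory "database"; all names and salaries are made up for illustration.
DB = [
    {"name": "Alice", "salary": 83_500},
    {"name": "Bob", "salary": 61_000},
    {"name": "Carol", "salary": 102_250},
]


def count_where(predicate):
    """Stand-in for COUNT(*) WHERE <cond>: only an aggregate count is returned."""
    return sum(1 for row in DB if predicate(row))


def recover_salary(name, lo=0, hi=1_000_000):
    """Binary search on X using COUNT(*) WHERE name=<name> AND salary < X."""
    # Invariant: lo <= salary < hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if count_where(lambda r: r["name"] == name and r["salary"] < mid) == 1:
            hi = mid  # salary < mid
        else:
            lo = mid  # salary >= mid
    return lo


def recover_salary_by_differencing(name, lo=0, hi=1_000_000):
    """Same attack, using the difference of two queries over many rows."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        everyone = count_where(lambda r: r["salary"] < mid)
        others = count_where(lambda r: r["name"] != name and r["salary"] < mid)
        if everyone - others == 1:
            hi = mid
        else:
            lo = mid
    return lo


print(recover_salary("Alice"))
print(recover_salary_by_differencing("Alice"))
```

Each loop iteration halves the candidate range, so about 20 exact counts are enough to pin down a salary below one million.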
It may seem that these simple attacks can be thwarted by making the server’s answers a bit less precise. For instance, what if the server rounds its responses to the closest multiple of 10? Or, to confuse the attacker even more, chooses the rounding direction randomly?
A seminal result from the early 2000s due to Irit Dinur and Kobbi Nissim states, loosely, that too many accurate answers to too many questions will violate privacy almost surely. This phenomenon is known in the literature as the Fundamental Law of Information Recovery and has been practically demonstrated in a variety of contexts time and time again. It effectively means that not only must the answers not be overly precise, but the error must also grow with the number of answers if we want to avoid nearly total reconstruction of the dataset.
The notion of differential privacy turns these observations into actionable guidance.
The remarkable fact is that we can enforce differential privacy for counting queries by simply computing the precise answer and adding noise randomly sampled from a carefully chosen probability distribution. In its simplest form, a privacy-preserving mechanism can be implemented with a noise drawn from the Laplace distribution.
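A minimal sketch of such a mechanism, assuming a counting query: adding or removing one record changes the true count by at most 1, so Laplace noise with scale 1/ε is the usual choice. The function names are hypothetical, and Python's `random` module is used only for illustration; it is not a cryptographically secure source of randomness.

```python
import math
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a counting query (one record changes the count by at most 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution centred at `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


epsilon = 0.5
# Neighbouring datasets: the true count is 42 with my record and 41 without it.
# For any output x, the two output densities differ by a factor of at most exp(epsilon).
for x in (30.0, 41.0, 41.5, 42.0, 60.0):
    ratio = laplace_pdf(x, 42, 1 / epsilon) / laplace_pdf(x, 41, 1 / epsilon)
    assert math.exp(-epsilon) - 1e-12 <= ratio <= math.exp(epsilon) + 1e-12

print(private_count(42, epsilon))
```

The loop checks the defining inequality at a handful of outputs only; it is a sanity check of the formula, not a proof.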
Of course, by asking the same query multiple times, the additive noise will average out and the true answer will emerge, which is exactly what Dinur-Nissim warned us about. Take that, differential privacy!
Differential privacy allows us to analyze this effect too, and in a very neat way: if you take a measurement from a mechanism with privacy budget ε₁ and a second measurement from another mechanism with privacy budget ε₂, then the total privacy budget will be simply ε₁+ε₂. Sleek, eh? This property is called (simple) composition. This means that if the mechanism guarantees that a single query has ε=1 and you want to issue three queries, the total privacy budget expended will be ε=3.
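Simple composition can be sketched as a tiny budget tracker. The `BudgetedCounter` class below is a hypothetical helper, not part of any library: it answers counting queries with Laplace noise, adds up the ε spent, and refuses to answer once the total budget would be exceeded.

```python
import random


class BudgetedCounter:
    """Answers counting queries with Laplace noise and tracks the budget spent.

    Hypothetical helper: uses simple composition, i.e. the epsilons add up.
    """

    def __init__(self, total_epsilon, rng=None):
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self.rng = rng or random.Random()

    def query(self, true_count, epsilon):
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        scale = 1.0 / epsilon  # one record changes a count by at most 1
        noise = scale * (self.rng.expovariate(1.0) - self.rng.expovariate(1.0))
        return true_count + noise


counter = BudgetedCounter(total_epsilon=3.0, rng=random.Random(0))
answers = [counter.query(true_count=42, epsilon=1.0) for _ in range(3)]
print(answers, "spent:", counter.spent)  # three queries at epsilon=1 cost epsilon=3

try:
    counter.query(true_count=42, epsilon=1.0)  # a fourth query would exceed the budget
except RuntimeError as err:
    print(err)
```

An attacker who averages the three answers does get a better estimate, but the tracker charges for exactly that: the repeated queries are paid for with a larger total ε.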
This “just add some noise” business sounds too good to be true, right? What if the attacker thinks really hard about the output of a differentially private computation, such as feeding it into a custom-made neural network trained to break privacy? Fear not! Differential privacy is preserved by post-processing, which means that results of running arbitrary computations on top of differentially private output won’t roll back the ε. Not only does it protect against clever adversaries, it gives us a lot of flexibility in designing differentially private mechanisms: once differential privacy is enforced anywhere in the data processing pipeline, the final output will satisfy differential privacy.
To recap, we learned that our solution will look like this:
- Our mechanism will be randomized, i.e., it will use noise.
- Our final privacy claim depends on the total number of interactions with the data.
- We can post-process results of a differentially private computation any way we want (as long as we don’t peek into the private dataset again).
Back to machine learning
To apply the concept of differential privacy to the original domain of machine learning, we need to land on two decisions: how we define “one person’s data” that separates D from D′ and what the mechanism M is.
Since in most applications of ML the inputs come without explicit user identifiers, with Federated Learning being one notable exception, we will default to protecting privacy of a single sample in the training dataset. We will discuss other options in future Medium posts.
As for the mechanism M, one possibility is to consider privacy of the model’s outputs only. This is indeed a valid option called private prediction, but it comes with many strings attached: the model can still memorize, so it’s up to your inference system to securely enforce those constraints. Also, this prevents us from ever releasing our ML model: if someone gets to see the weights, our privacy guarantees will be lost. This means, among other things, that deploying on mobile will be considerably less safe.
For this reason, it would be much preferable if we could instead insert the DP mechanism during model training, so that the resulting model could be safe for release. This brings us to the DP-SGD algorithm. (There is evidence that even when you only care about accuracy, private training still beats private prediction. See [van der Maaten, Hannun 2020] for a practical analysis and more discussion on the topic).
DP-SGD (Differentially-Private Stochastic Gradient Descent) modifies the minibatch stochastic optimization process that is so popular with deep learning in order to make it differentially private.
The core idea is that training a model in PyTorch can be done through access to its parameter gradients, i.e., the gradients of the loss with respect to each parameter of your model. If this access preserves differential privacy of the training data, so does the resulting model, per the post-processing property of differential privacy.
There is also an engineering angle here: since the PyTorch optimizer is already made to look at parameter gradients, we could add this noise business directly into it and we can hide away the complexity, allowing anyone to train a differentially private model simply. Profit!
This code sample can show how simple this is:
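As a rough sketch of the idea, assuming a toy linear model, random data, and placeholder hyperparameters (`lr`, `sigma`) that are not taken from any real setup, a manual SGD loop with noise added to the gradients might look as follows. The noise scale here is arbitrary, so this loop alone does not carry any privacy guarantee yet.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy data and model; every name and hyperparameter here is a placeholder.
dataset = TensorDataset(torch.randn(128, 10), torch.randint(0, 2, (128,)))
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
lr, sigma = 0.1, 0.5  # sigma: noise standard deviation, chosen arbitrarily here

for xb, yb in DataLoader(dataset, batch_size=32):
    model.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()  # p.grad is now filled for every parameter

    with torch.no_grad():
        for p in model.parameters():
            # The differential privacy "magic": perturb the gradient with noise
            p.grad += sigma * torch.randn_like(p.grad)
            # This is what a plain SGD optimizer.step() does
            p -= lr * p.grad
```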
We have only one question left: how much noise should we be adding? Too little and we can’t respect privacy, too much and we are left with a private but useless model. This turns out to be more than a minor issue. Our ambition is to guarantee that we respect the privacy of each and every sample, not of every batch (since these aren’t a meaningful unit privacy-wise). We’ll cover the details in a future installment of this series, but the intuition is very straightforward: the right answer depends on the largest norm of the gradient in a minibatch, as that is the sample that is at most risk of exposure.
We need to add just enough noise to hide the largest possible gradient so that we can guarantee that we respect the privacy of each and every sample in that batch. To this end, we use the Gaussian mechanism that takes in two parameters, the noise multiplier and the bound on the gradient norm. But wait… The gradients that arise during training of a deep neural network are potentially unbounded. In fact, for outliers and mislabeled inputs they can be very large indeed. What gives?
If the gradients are not bounded, we’ll make them so ourselves! Let C be the target bound for the maximum gradient norm. For each sample in the batch, we compute its parameter gradient and if its norm is larger than C, we clip the gradient by scaling it down to C. Mission accomplished — all the gradients now are guaranteed to have norm bounded by C, which we naturally call the clipping threshold. Intuitively, this means that we disallow the model from learning more information than a set quantity from any given training sample, no matter how different it is from the rest.
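Clipping a single gradient can be sketched as follows. The `clip_to_norm` helper is hypothetical; in a real model the norm would be taken over the gradients of all parameters for one sample, and a single tensor stands in for that here.

```python
import torch


def clip_to_norm(grad, max_norm):
    """Scale `grad` down so its L2 norm is at most `max_norm`; smaller gradients are untouched."""
    norm = grad.norm(2)
    # The small constant guards against division by zero for an all-zero gradient.
    scale = torch.clamp(max_norm / (norm + 1e-6), max=1.0)
    return grad * scale


C = 1.0
big = torch.tensor([3.0, 4.0])    # norm 5, scaled down to norm (almost exactly) C
small = torch.tensor([0.3, 0.4])  # norm 0.5, left unchanged
print(clip_to_norm(big, C).norm(), clip_to_norm(small, C).norm())
```

Note that clipping rescales the whole vector, so the direction of the gradient is preserved and only its length is capped.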
This requires computing parameter gradients for each sample in a batch. We normally refer to them as per-sample gradients. Let’s spend a little more time here, as this is a quantity that is not normally computed: usually, we process data in batches (in the code snippet above, the batch size is 32). The parameter gradients we have in p.grad are the average of the gradients for each example, which is not what we want: we want 32 different p.grad tensors, not their average in a single one.
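One way to obtain per-sample gradients, sketched here under the assumption of a toy linear model and a single random batch (all names are placeholders), is to run a separate backward pass for every example:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 2)  # toy model, placeholder
criterion = nn.CrossEntropyLoss()
xb, yb = torch.randn(32, 10), torch.randint(0, 2, (32,))  # one batch of 32

all_per_sample_gradients = []  # will have len == batch_size
for x, y in zip(xb, yb):
    model.zero_grad()  # p.grad accumulates across backward calls, so reset it first
    loss = criterion(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()  # p.grad now holds the gradient for this sample only
    all_per_sample_gradients.append(
        [p.grad.detach().clone() for p in model.parameters()]
    )

# Sanity check: averaging the per-sample gradients recovers the usual batch gradient.
model.zero_grad()
criterion(model(xb), yb).backward()
for i, p in enumerate(model.parameters()):
    mean_grad = torch.stack([g[i] for g in all_per_sample_gradients]).mean(dim=0)
    assert torch.allclose(mean_grad, p.grad, atol=1e-5)
```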
Computing per-sample gradients like in the snippet above seems slow, and it is, as it forces us to run backward steps for one example at a time, thus losing the benefit of parallelization. There is no standard way around this, as once we look into p.grad, the per-sample information will already have been lost. It is, however, at least correct: a batch gradient is a per-sample gradient if batch_size=1. This method is called the microbatch method and it offers simplicity and universal compatibility (every possible layer is automatically supported) at the cost of training speed. Our library, Opacus, uses a different method that is much faster, at the cost of doing some extra engineering work. We will cover this method in-depth in a follow-up Medium post. For now, let’s stick to microbatching.
Putting it all together, we want to:
- Compute the per-sample gradients
- Clip them to a fixed maximum norm
- Aggregate them back into a single parameter gradient
- Add noise to it
Here’s some sample code to do just that:
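A minimal sketch of these four steps using microbatching, again assuming a toy model, random data, and arbitrary values for `max_grad_norm` and `noise_multiplier`. It uses PyTorch's default random number generator, which is not cryptographically secure, and it does not track the privacy budget.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
dataset = TensorDataset(torch.randn(128, 10), torch.randint(0, 2, (128,)))
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
lr, max_grad_norm, noise_multiplier = 0.1, 1.0, 1.1  # arbitrary illustrative values
params = list(model.parameters())

for xb, yb in DataLoader(dataset, batch_size=32):
    summed = [torch.zeros_like(p) for p in params]

    # 1. Per-sample gradients, one microbatch of size 1 at a time
    for x, y in zip(xb, yb):
        model.zero_grad()
        criterion(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]

        # 2. Clip: bound the norm of the whole per-sample gradient by max_grad_norm
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(max_grad_norm / (total_norm + 1e-6), max=1.0)

        # 3. Aggregate the clipped gradients into a single sum per parameter
        for s, g in zip(summed, grads):
            s += g * scale

    with torch.no_grad():
        for p, s in zip(params, summed):
            # 4. Add Gaussian noise with std = noise_multiplier * max_grad_norm
            noisy_sum = s + noise_multiplier * max_grad_norm * torch.randn_like(s)
            # Average over the batch and take a plain SGD step
            p -= lr * noisy_sum / len(xb)
```

The noise is added once per batch to the sum of clipped gradients, not once per sample, since after clipping each sample can move that sum by at most `max_grad_norm`.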
This already gives a good idea of how to implement the DP-SGD algorithm, although this is clearly suboptimal and (as we shall see) not fully secure. In future Medium posts, we will cover how we bring back parallelization to DP-SGD, add support for cryptographically secure randomness, analyze the algorithm’s differential privacy, and finally train some models. Stay tuned!