Privacy granted by maths

Giulia Bianchi
Publicis Sapient France
10 min read · Nov 6, 2019

Back in March 2019, the TensorFlow team announced the addition of TensorFlow Privacy, a Python library that implements differentially private optimizers. Wondering what I am talking about? So was I, so I started to dive in. I found a lot going on under the hood, and I am going to present it in this article.

The article is structured as follows (hot peppers indicate how spicy it gets):

  • first I introduce issues of privacy leaks encountered with deep learning 🌶
  • then I’ll talk about differential privacy and how it can help with privacy leaks 🌶🌶🌶
  • finally I’ll show the implementation of the previous concepts in TensorFlow 🌶🌶

tl;dr

I gave a presentation about this topic at DevFest Nantes 2019, check it out!

Privacy leaks 🌶

Some recent research papers have shown that deep learning models might leak sensitive information about data points in the training dataset. Let’s look at two examples to better understand the problem: the first concerns a facial recognition model, the second a generative sequence model.

Facial recognition model

In this paper, the authors designed strategies to attack a model that performs facial recognition. The goal is to demonstrate a privacy violation of individuals present in the training set, either by accessing the trained model itself or by making prediction queries against it. Either way, the adversary does not have access to any of the original training data.

In one of the settings described, which assumes that the adversary knows a label produced by the model, i.e. a person’s name, images from the training set were reconstructed.

Reference
Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS ‘15). ACM, New York, NY, USA, 1322–1333. DOI: https://doi.org/10.1145/2810103.2813677

Generative sequence model

In this paper, the authors show that generative sequence models might unintentionally memorize rare or unique sequences in the training set and leak them at prediction time. Rare or unique sequences may represent sensitive data. If such “secrets” are present in the training set of an auto-completion model, for example, their disclosure may happen either by accident or on purpose when entering certain text prefixes that cause the model to print unexpected text completions.

https://xkcd.com/2169/

References
arXiv:1802.08232v3

Differential privacy 🌶🌶🌶

Differential privacy is a mathematical framework that is helpful in the scenarios described above and that, most of all, provides formal guarantees about the level of privacy achieved by an algorithm.

The way differential privacy addresses the problem of privacy protection is by providing privacy through a process, not by acting on the input data directly. Consider the result of a process applied to two almost identical data sources: if the inputs differ by only one entry and the outputs of the process on these two inputs are almost the same, then we can conclude that the differing entry is not needed to obtain the output. Since nothing is learned about that individual, their privacy is preserved, because nothing about them can leak out of the process. The idea is that the process learns something about a population without learning anything specific about the individuals in that population.

To better understand this, consider an analyst who performs an analysis on two databases X and Y. X and Y differ by only one entry, and the analysis is represented by a generic algorithm M. M(X) and M(Y) are the outputs of M computed on X and on Y respectively.

https://speakerdeck.com/giulbia/privacy-granted-by-maths

This example is based on the formal definition of a differentially private algorithm, originally stated in The Algorithmic Foundations of Differential Privacy by C. Dwork:

A randomized algorithm M: D→Range(M) is (ε, δ)-differentially private if for all S⊆Range(M) and for all x, y ∈ D such that ∥x − y∥₁ ≤ 1:

Pr[M(x) ∈ S] ≤ exp(ε) Pr[M(y) ∈ S] + δ

If δ = 0, we say that M is ε-differentially private.

Reference
Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (August 2014), 211–407. DOI=http://dx.doi.org/10.1561/0400000042

In this definition, M is a randomized algorithm defined on the domain D with image Range(M), the set of possible outputs. Then x and y are two adjacent databases that differ by only one entry (∥x − y∥₁ is the L1 norm of their difference), and Pr[M(x) ∈ S] stands for the probability that the output of M computed on x falls in S. Similarly, Pr[M(y) ∈ S] stands for the probability that the output of M computed on y falls in S.

The definition means that when the output distribution of M(x) is close to that of M(y), M is not significantly affected by the single entry that differs between x and y, and so we can say that the single individual is “protected”.

To reach this conclusion and interpret the definition, let’s simplify it by choosing δ=0:

  • Pr[M(x) ∈ S] / Pr[M(y) ∈ S] ≤ exp(ε)

Next, for the sake of simplicity, if ε≅0 ⇒ exp(ε)≅1:

  • Pr[M(x) ∈ S] / Pr[M(y) ∈ S] ≤ 1

So the ratio of the probabilities is bounded by (approximately) 1. Since the definition also holds with x and y swapped, the reverse ratio is bounded as well, which means the two probabilities are essentially equal: the output distribution of M(x) is close to that of M(y).

Choosing ε close to 0 is not mandatory, but the greater ε is, the greater the allowed distance between the output probabilities. As for δ, it represents the probability of failing ε-differential privacy, so you want it to be very small; as a rule of thumb, you can choose δ≪1/n, where n is the number of entries in the database.
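To make the definition concrete, here is a tiny Python sketch of the classic randomized response mechanism (not discussed elsewhere in this article, used here purely as an illustration): each individual answers a sensitive yes/no question truthfully with probability 3/4 and lies with probability 1/4.

```python
import numpy as np

def randomized_response(true_answer: bool) -> bool:
    """Answer truthfully with probability 3/4, lie with probability 1/4."""
    if np.random.rand() < 0.75:
        return true_answer
    return not true_answer

# One noisy answer for an individual whose true answer is "yes":
print(randomized_response(True))

# For any single individual:
#   Pr[output = yes | truth = yes] = 0.75
#   Pr[output = yes | truth = no]  = 0.25
# The ratio of these probabilities is 3, so the mechanism is
# (ln 3)-differentially private with delta = 0.
epsilon = np.log(0.75 / 0.25)
print(f"epsilon = {epsilon:.3f}")  # ≈ 1.099
```

Here the two “adjacent databases” are simply the two possible true answers of one individual: whatever the truth is, the distribution of the released answer changes by at most a factor exp(ε) = 3.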

The advantages of having a formal definition of privacy are multiple:

  1. Differential privacy has properties such as composability, group privacy and robustness to auxiliary information that can be conveniently exploited for use in machine learning
  2. It’s easy to transform a deterministic function into a differentially private algorithm
  3. The privacy guarantee achieved is quantifiable

I am not going to detail the first point because it would require digging into the math behind the previous definition, which is out of scope for this article. Instead, I will focus on the last two points, which are of interest when applying differential privacy to machine learning.

In order to turn a deterministic function f(x) into a differentially private algorithm, one can add random noise drawn from a Laplace or a Gaussian distribution:

  • M-Lap(x, f, b) = f(x) + Lap(0, b)
  • M-Gauss(x, f, σ) = f(x) + N(0, σ²)

More specifically, when σ² = 2∙ln(1.25/δ)∙(Δf)² / ε², the Gaussian mechanism satisfies (ε, δ)-differential privacy. Δf is the sensitivity of the function f, defined as Δf = max |f(x) − f(y)| over all pairs of adjacent databases x and y. Intuitively, Δf is the maximum change in the output of f between two adjacent databases; see Additive noise mechanisms for more details.
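To make these mechanisms concrete, here is a minimal NumPy sketch. The Laplace scale b = Δf/ε used below is the standard choice for ε-differential privacy; it is not spelled out above, so take it as background knowledge rather than something derived here.

```python
import numpy as np

def laplace_mechanism(f_x, sensitivity, epsilon):
    """epsilon-DP release of f(x): add Laplace noise with scale b = sensitivity / epsilon."""
    b = sensitivity / epsilon
    return f_x + np.random.laplace(loc=0.0, scale=b)

def gaussian_mechanism(f_x, sensitivity, epsilon, delta):
    """(epsilon, delta)-DP release of f(x): add Gaussian noise with
    sigma^2 = 2 * ln(1.25/delta) * sensitivity^2 / epsilon^2."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    return f_x + np.random.normal(loc=0.0, scale=sigma)

# Example: privately release a counting query.
# Adding or removing one person changes a count by at most 1, so the sensitivity is 1.
true_count = 1234
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
print(gaussian_mechanism(true_count, sensitivity=1.0, epsilon=0.5, delta=1e-5))
```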

Both mechanisms are pretty simple but present a few shortcomings, described in this paper, which proposes a generalisation of differential privacy to address them: Rényi differential privacy (RDP). RDP is based on the Rényi divergence of two probability distributions and is parameterized by α>1, the order of the divergence. More specifically, in (α, ε)-Rényi differential privacy, the Rényi divergence of order α between the distributions of M(x) and M(y) is bounded by ε. An RDP guarantee can be converted back into an (ε, δ)-differential privacy guarantee, and RDP shares the useful properties of differential privacy while making the analysis of Gaussian noise particularly simple.

Reference
arXiv:1702.07476v3
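For reference, two standard results from the RDP paper make this concrete (they are taken from that paper, not from the text above):

  • the Gaussian mechanism with sensitivity Δf and noise scale σ satisfies (α, α∙(Δf)²/(2σ²))-RDP for every α > 1
  • an (α, ε)-RDP guarantee implies (ε + ln(1/δ)/(α−1), δ)-differential privacy for any 0 < δ < 1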

In machine learning, the way to make the training of a model differentially private is to make the stochastic gradient descent differentially private. This is basically achieved by adding Gaussian noise to the gradients at each training iteration. One more step is actually necessary: clipping the gradients before adding noise. Gradient clipping is already common practice to prevent exploding gradients even when privacy is not a concern; here, it is required to bound each example's contribution and to prove the differential privacy guarantee.

In this paper, you can find the pseudo-code of the full differentially private stochastic gradient descent (DP-SGD) algorithm.


Reference
arXiv:1607.00133v2
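Without reproducing the paper's figure, the core of one DP-SGD step can be sketched in a few lines of NumPy. This is a simplification of the algorithm described in the paper, and it assumes that the per-example gradients have already been computed and flattened into a matrix.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, learning_rate,
                l2_norm_clip, noise_multiplier):
    """One simplified DP-SGD update.

    per_example_grads: array of shape (batch_size, n_params), one flattened
    gradient per training example (assumed to be computed elsewhere).
    """
    clipped = []
    for g in per_example_grads:
        # Clip each example's gradient so that its L2 norm is at most l2_norm_clip.
        norm = np.linalg.norm(g)
        clipped.append(g / max(1.0, norm / l2_norm_clip))
    clipped = np.stack(clipped)

    # Sum the clipped gradients and add Gaussian noise calibrated to the clip norm.
    noise_std = noise_multiplier * l2_norm_clip
    noisy_sum = clipped.sum(axis=0) + np.random.normal(
        loc=0.0, scale=noise_std, size=clipped.shape[1])

    # Average over the batch and take an ordinary gradient step.
    noisy_mean_grad = noisy_sum / len(per_example_grads)
    return params - learning_rate * noisy_mean_grad
```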

TensorFlow Privacy 🌶🌶

TensorFlow Privacy is the implementation of differentially private stochastic gradient descent for TensorFlow. It is very recent: its latest release, version 0.1.0, came out at the beginning of October. Nonetheless, I could install it in a virtualenv without issues. Here is the requirements.txt that I used:

requirements.txt
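As a rough, purely hypothetical sketch, a minimal requirements.txt matching the versions mentioned in this article might look like the following:

```
# hypothetical minimal requirements, based on the versions mentioned in this article
tensorflow==1.14.0
tensorflow-privacy==0.1.0
```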

On the TensorFlow Privacy GitHub page, there are a number of tutorials available. I ran mnist_dpsgd_tutorial.py with no problem under TensorFlow 1.14, but I couldn’t make it work with TensorFlow 2.0, so I opened an issue.

TensorFlow Privacy provides a module optimizers where you can find the implementation of the differentially private versions of:

  • sgd → DPGradientDescentGaussianOptimizer
  • adam → DPAdamGaussianOptimizer
  • adagrad → DPAdagradGaussianOptimizer

Basically, the only things that change in your code are the instantiation of the optimizer and a small modification to the loss computation; everything else is unchanged, as you can see in this piece of code from mnist_dpsgd_tutorial_keras.py:

Extract from mnist_dpsgd_tutorial_keras.py

This piece of code assumes that a command-line flag is passed at execution time to choose between the differentially private optimizer (FLAGS.dpsgd=True) and the non-private one. In the former case, the optimizer is instantiated from privacy.optimizers.dp_optimizer.DPGradientDescentGaussianOptimizer, in the latter from tf.optimizers.SGD (TF 2.0) or from tf.train.GradientDescentOptimizer (TF < 2.0). As you can see, DPGradientDescentGaussianOptimizer requires three additional hyperparameters:

  • l2_norm_clip
  • noise_multiplier
  • num_microbatches

l2_norm_clip (float): The maximum L2 norm that each individual gradient, computed across all network parameters, is allowed to reach.
The authors suggest that values from 0.5 to 1.0 should work reasonably well. This parameter limits the influence of each training point on the model; clipping is also common practice outside of privacy, as it prevents the exploding gradient problem. Note that the gradient is not averaged over the entire minibatch before clipping, therefore the loss needs to be computed as a vector (one value per example) for this to work properly.

noise_multiplier (float): This parameter controls the amount of noise added during training. Generally, more noise results in better privacy and lower utility.
The authors suggest a minimum value of 0.3 to obtain rigorous privacy guarantees, but smaller values may still be acceptable for practical purposes.

num_microbatches (int): During each training step, the current batch of data is split into this many microbatches. This is done to improve performance, since clipping gradients on a per-example basis reduces parallelism. Basically, rather than clipping gradients per example, they are clipped per microbatch. For example, with a batch size of 256 and 32 microbatches, each microbatch gradient is averaged over 8 training examples, so only 32 gradients need to be clipped.

All these hyperparameters can be tuned to trade off model performance against the privacy achieved, as in the sketch below.
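As a condensed, hypothetical sketch of this pattern (assuming TensorFlow 1.14 and the privacy package imported as in the tutorials; the hyperparameter values and the tiny model are illustrative, not the tutorial's exact ones):

```python
import tensorflow as tf
from privacy.optimizers import dp_optimizer

# Differentially private SGD: clip per-microbatch gradients, then add Gaussian noise.
optimizer = dp_optimizer.DPGradientDescentGaussianOptimizer(
    l2_norm_clip=1.0,        # maximum L2 norm of each microbatch gradient
    noise_multiplier=1.1,    # std of the Gaussian noise, relative to l2_norm_clip
    num_microbatches=32,     # must evenly divide the batch size
    learning_rate=0.15)

# The loss must be returned per example (a vector), not reduced to a scalar,
# so that gradients can be clipped per microbatch.
loss = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)

# A stand-in model; the actual tutorial uses a small CNN on MNIST.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10)])

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
```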

The privacy guarantee can be calculated thanks to another module provided by TensorFlow Privacy: analysis. This module is dedicated to quantifying the achieved privacy guarantee, which is expressed by ε and δ:

  • ε: upper bound on how much the probability of a particular model output can vary by adding or removing a single training point
  • δ: bounds the probability of the privacy guarantee not holding. Choosing δ smaller than the inverse of the training dataset size is generally considered small enough.

In the module analysis.rdp_accountant there are two functions, compute_rdp and get_privacy_spent:

  • compute_rdp takes a list of α (the orders at which Rényi divergence will be computed) and returns the Rényi differential privacy (rdp) achieved for each order α.
  • get_privacy_spent computes ε given the orders α, the rdp values returned by compute_rdp and a target δ, or it computes δ given a target ε, the orders α and the rdp values (see the sketch below)
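As a sketch of how these two functions fit together (the import path follows the privacy package naming used earlier in this article; all numeric values are illustrative):

```python
from privacy.analysis.rdp_accountant import compute_rdp, get_privacy_spent

# Training setup (illustrative values).
n = 60000                       # number of training examples
batch_size = 256
noise_multiplier = 1.1
epochs = 60
delta = 1e-5                    # rule of thumb: delta << 1 / n

q = batch_size / n              # sampling ratio per step
steps = epochs * n // batch_size

# Orders alpha at which the Rényi divergence is computed.
orders = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))

# One RDP value per order alpha, accumulated over all training steps.
rdp = compute_rdp(q=q, noise_multiplier=noise_multiplier,
                  steps=steps, orders=orders)

# Convert the RDP curve into an (epsilon, delta) guarantee at the target delta.
eps, _, opt_order = get_privacy_spent(orders, rdp, target_delta=delta)
print(f"epsilon = {eps:.2f} at delta = {delta} (best order alpha = {opt_order})")
```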

Reference
Documentation on Github
Machine learning with differential privacy in TensorFlow by Nicolas Papernot

Discussion & Conclusion

The main takeaways of this article are:

  1. deep learning algorithms might leak sensitive information about individual data points in the training set through unintended memorisation
  2. differential privacy is the mathematical foundation used to make a randomized process private by design, and it makes it possible to control and quantify the privacy guarantee achieved
  3. TensorFlow Privacy implements differentially private stochastic gradient descent

Although this article focuses on differential privacy and its implementation in TensorFlow, it has to be said that Apple also applies differential privacy, and there is a lot to read about it.

I would also like to suggest reading this series of articles by Damien Desfontaines about privacy and differential privacy; they are really well written and easy to understand.

It is important to underline that differential privacy provides a controlled way of protecting user data in a dataset by ensuring privacy through the process. This is achieved by adding noise to the process itself and not to the input data, a substantial difference that makes the achieved privacy guarantee precisely quantifiable.

Exploring differential privacy has been an interesting journey and I still have a lot of questions to answer like:

  • How can one test whether a model is “private enough”, and what are the tangible residual risks for a given epsilon and delta?
  • Can differential privacy help with biased datasets?
  • How to achieve differential privacy with transfer learning?
  • How to combine federated learning and differential privacy?
  • Is it reasonable to use differentially private stochastic gradient descent to prevent overfitting or for anomaly detection?

Hopefully all these questions will be answered soon!
