Differential Privacy

I listen to podcasts so you don’t have to.

Nicholas Teague
From the Diaries of John Henry
17 min read · May 13, 2018


(I mean I’m not saying don’t listen to them at all, just like, you know, here are some cliff notes among other things. Seriously, you should listen to the podcasts too.)

The Band with Eric Clapton — Further on up the Road

Introduction — Service Oriented Architecture

I remember a ways back when I was younger and a little more naive I decided I was going to explore the realm of computer science. Although I had performed some coding projects in school, most of my undergrad classes were devoted to the realm of engineering theory, so my foundations in the sphere were admittedly a little wobbly. When you don't know where you're going any road is as good as another, and so completely on a whim I found a somewhat random concept / keyword that I had recently been exposed to and decided what the hell, let's see what this means — that keyword was "service oriented architecture". Now I wasn't quite the wikipedia guru then that I am now, so the reasonable starting point was to pick up a (several hundred page) book from good old Barnes and Noble. A chapter or two in found me even more confused than when I started, and chalking it up to poor writing I set it aside and thought ok, well, let's see where else I can learn. I found a (well known) tech company that apparently was an expert in the field, and lo and behold they were even kind enough to send a 30 page or so FAQ guide in the mail, free of charge! I can't even describe how mind numbing the process of reading these works was. It was hard to keep my eyes on the page; they kept glazing over. It finally occurred to me that if I'd read upwards of 50 pages supposedly devoted to introducing a subject and still had no clear idea of what it meant, well, perhaps it wasn't meant to be understood at all. Perhaps this lack of a precise meaning is exactly what was intended. Perhaps the keyword was merely meant to allude to something, such as architecture best practices, that could be peppered into presentations and pitches by "experts" to confuse those outsiders who might naively assume that the speaker was actually trying to communicate something of substance. Perhaps this concept was little more than pure marketing speak (to be clear, "marketing speak" is not a compliment). This process of nonsensical discovery turned me off to the whole field for a good while. In hindsight I'm reminded now of an anecdote told in the book Idea Makers by Stephen Wolfram with a relevant quote by Richard Feynman.

…one day a chap from a well-known computer science department came to speak. I think he was a little tired, and he ended up giving what was admittedly not a good talk. And it degenerated at some point into essentially telling puns about the name of the system they’d built. Well, Feynman got more and more annoyed. And eventually stood up and gave a whole speech about how “If this is what computer science is about, it’s all nonsense….” — Stephen Wolfram, Idea Makers

(The 1981 version of Hitchhiker’s was perhaps a little more true to the quirky spirit of the series!)

Having lived through this SOA experience, it was thus with a touch of trepidation that I first came across the phrase "differential privacy". On its surface it has a ring of substance: obviously privacy is an important consideration in today's world of big data, and differential implies a level of sophistication about how privacy considerations might be addressed. But does differential privacy have a precise meaning? Is it just marketing speak meant to vaguely allude to best practices, or is it actually a discipline that can be understood? Answering these questions will be the goal of this post. To keep this project manageable I'm going to limit my research to a recent series of (excellent) podcasts offered by This Week in Machine Learning and AI (aka TWiMLAI) on the theme — those can be accessed here [part 1, part 2, part 3]. I'm writing this introduction before having listened to the bulk of the podcasts, so truth is I'm not really sure what the answer is going to be, but hopefully in the process I might learn some useful tidbits about privacy considerations in big data and machine learning. If any interested reader comes away from this with an improved understanding of the field I'll consider the journey a success, even if that reader is only myself. Finally, while I get that some people are averse to unnecessary acronyms (looking at you Musk), the truth is I'm a lazy typist, and since the somewhat unwieldy term 'differential privacy' is going to come up all over the place in this essay, I'm going to take a shortcut and refer to it as DP from here on out — so please read DP as 'differential privacy'. And without further ado.

That which comes into the world to disturb nothing deserves neither respect nor patience. — René Char

Buckwheat Zydeco — Hard to Stop

Part 1 — A Precise Meaning

It was with good intent that Netflix offered their $1 million Netflix Prize, a competition open to the public spanning from 2006–2009. An early example of the crowdsourcing of R&D that has continued to good success even today through channels such as Kaggle competitions, the goal was very simple: the team that demonstrated the biggest improvement to the Netflix movie recommendation engine (as evidenced by prediction accuracy of user ratings of subsequent title viewings) would be awarded a lump sum prize. This competition was notable not only for paving the way for competitive outsourcing of research (following in the vein of the perhaps more moonshot oriented X Prize Foundation), but also because of the not insignificant set of Netflix proprietary user data that was released to the public for purposes of algorithmic training of competition entries. In today's world of open source machine learning frameworks such as TensorFlow and PyTorch, one of the key moats left to those building machine learning implementations is access to a proprietary pipeline of training data. In fact I'd bet there's an argument that the value of the released Netflix training data (on the order of 100M user ratings of movie titles) was at least on par with the value of the prize money itself. Releasing user data is even more sensitive when you consider US video privacy laws that date back to the days of Blockbuster. Of course Netflix was aware of these privacy laws, and even took the common sense precaution of stripping the identifying user information (such as name or associated email address) from the data prior to release. It thus came as a surprise when some researchers subsequently announced that they were able to reconstruct the identity of a sizable portion of users in the Netflix Prize training dataset. The underlying weakness turned out to originate from the availability of another public data source of movie ratings from IMDb with identifiers included, which when cross-referenced with the ratings from the Netflix data revealed a plethora of users with similar or identical rating patterns — i.e. a linkage attack. This unintentional exposure of private user data, despite the common sense precautions, exposed Netflix to legal risk, and demonstrated that common sense precautions alone may not be enough to protect user information in today's world of big data and associated machine learning implementations. These types of considerations for what precautions are appropriate to protect user data, balanced with algorithm performance considerations, are exactly what the field of DP is meant to address. (Again, for purposes of this essay DP means 'differential privacy'.)
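
To make the linkage idea concrete, here is a minimal sketch of how an "anonymized" ratings table can be re-identified by joining against a public source on the rating patterns themselves. The column names and toy records are my own invention, not the actual Netflix or IMDb schemas:

```python
import pandas as pd

# "Anonymized" release: user names stripped, but ratings remain.
anon = pd.DataFrame({
    "anon_id": [101, 101, 202],
    "movie":   ["Memento", "Heat", "Clue"],
    "rating":  [5, 4, 2],
})

# Public data with identities attached (e.g. ratings posted under a real name).
public = pd.DataFrame({
    "name":   ["alice", "alice", "carol"],
    "movie":  ["Memento", "Heat", "Clue"],
    "rating": [5, 4, 2],
})

# The linkage attack: join on the quasi-identifiers (movie, rating). Users whose
# rating patterns are rare will match a single anon_id with high confidence.
linked = anon.merge(public, on=["movie", "rating"])
print(linked.groupby(["anon_id", "name"]).size())
```

In the actual attack the matching was approximate rather than an exact join, but the principle is the same: the ratings themselves act as a fingerprint.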

(Cary Grant’s collaborations with Alfred Hitchcock are also very good!)

To release some of the suspense from the introduction, I am happy to report that DP does appear to have a precise meaning rooted in mathematical proofs and concrete implementation considerations. (This determination was a relief to this author.) Before diving further into that definition though I’d like to first expand a little on what kind of data leaks the field is meant to address, as it turns out that the linkage attack weakness illustrated by the Netflix prize example is only one of the channels that can expose user data. To borrow from a convention common to computer and information science, I’ll offer some illustrative parties of Alice, Bob, and Eve to demonstrate. For our purposes Alice will be one of many users of a service whose data has been harvested by Bob for use in training a machine learning implementation, and Eve will be a party trying to recover or reconstruct the data as provided by Alice based on what is shared by Bob. As may be a relief to Alice, for our examples we’ll assume that Eve is only interested in legal channels of interrogation, so she won’t engage in database breaches or have access to anything that Bob doesn’t intentionally release to the public. Now there are a few different types of sharing that Bob may consider:

  1. In the first case Bob may share a dataset to allow others to train their own models.
  2. In a second case Bob may share only the neural network weightings of a model that he himself trained on a set which included Alice’s data.
  3. In the third and most conservative scenario Bob may only allow third parties to query an API that provides an output of a model that he had trained on a set that included Alice’s data.

We must be wary of intuition failing us here, for a lazy assumption would be that only in the first case of Bob sharing a dataset would there be potential for Eve to reconstruct Alice's data. It turns out all three types of sharing have the potential to expose Alice's data without the incorporation of DP.

via XKCD

The vectors of attack are certainly different for these three types of release, and thus the mitigation tactics are correspondingly distinct. I'm going to leave discussions about potential mitigation tactics for the second half of this essay, but I think it would be of value to quickly expand on a few of the ways that each of the three types of sharing can be breached. We've already seen that a dataset, even one stripped of specific identifiers, can be reconstructed by matching parts of the remaining pieces with external data sources. However, if Bob doesn't release a dataset but only a machine learning model derived from it, it was not initially obvious to me how this could expose a single data point in what could be a deep training corpus. To illustrate one way this can occur, let's consider a classifier model trained on some feature set that outputs a prediction with a corresponding percentage of certainty. Now as is true of most models, even if sometimes only to a small degree, the classification network will tend to overfit to specific data points that were included in the training set. Thus Eve will find that when she enters Alice's known features into Bob's API, even though she can't see the internals of the model, the certainty band of the classification output will be higher if Alice's specific feature set was included in the original training corpus. This type of certainty strips Alice of any plausible deniability about whether the classification was actually correct. For other cases where Eve may have access to the full model weightings, the logistics of how she can reconstruct Alice's data aren't directly addressed in the TWiML podcast series, but they are certainly alluded to, presumably taking advantage of similar principles of a model's tendency to overfit to examples found in the training corpus.
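
As a toy illustration of this confidence-gap idea (my own sketch with made-up data, not something from the podcasts), suppose Eve can only call Bob's prediction API and read back a class probability. An overfit model will tend to report noticeably higher confidence on records it memorized during training than on records it has never seen:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
# Noisy labels invite the model to memorize individual training records.
y_train = (X_train[:, 0] + rng.normal(scale=1.5, size=200) > 0).astype(int)

# A deliberately overfit model standing in for Bob's API.
model = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)

def api_confidence(x):
    """All Eve sees: the top-class probability for a single query."""
    return model.predict_proba(x.reshape(1, -1)).max()

alice = X_train[0]              # a record Eve suspects was in the training set
stranger = rng.normal(size=5)   # a record that certainly was not

print("confidence on Alice:   ", api_confidence(alice))
print("confidence on stranger:", api_confidence(stranger))
# A persistent gap between these two numbers is the signal a membership
# inference attack exploits.
```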

Overfitting illustration via wikipedia

The potential for properties of a single data point from the training corpus to be inferred even from merely a series of API queries of a model trained on that set can be abstracted by considering a case of two hypothetical models X and Y, with model X trained on the full training corpus and model Y trained on the same corpus excluding a single point. In a perfect world of complete privacy, a model would be fully dependent on abstract generalized features without undue influence of any single point from the training set, thus these two hypothetical models X and Y would be identical. The larger the influence of a single point from the training set (meaning the two models X and Y are shifted further apart in their weightings), the greater the potential for privacy leakage. This turns out to be one of the key ideas of DP, and in fact I expect it is the source of the word 'differential' in the name. By comparing the influence of that single point on these two hypothetical models X and Y, we can derive a measure of privacy designated as epsilon, basically by evaluating how distinguishable the two models X and Y are. Epsilon ranges from zero upward without bound. A value of zero implies perfect privacy (X = Y), and as epsilon grows the data leakage becomes more significant. This epsilon parameter is a fundamental measure for DP considerations, and my takeaway from the podcasts was that there is a series of mathematical proofs showing that, from this measure epsilon for a given model, certainty bounds for privacy can be derived. Hence a precise meaning.
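
The podcasts steer clear of formalism, but for completeness here is the textbook statement from the DP literature (my addition, not something spelled out in the interviews), which captures exactly this X-versus-Y comparison: a randomized mechanism M is epsilon-differentially private if, for any two training sets D and D′ that differ in a single record, and for any set S of possible outputs,

Pr[ M(D) ∈ S ] ≤ exp(ε) · Pr[ M(D′) ∈ S ]

When ε = 0 the two output distributions are indistinguishable (the X = Y case above), and the guarantee weakens as ε grows.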

It is by teaching that we teach ourselves, by relating that we observe, by affirming that we examine, by showing that we look, by writing that we think, by pumping that we draw water into the well. — Henri Frédéric Amiel

Van Morrison — Real Real Gone

Part 2 — A Prescription

In part 1 of this essay, we explored potential channels for leakage of data from sources ranging from a full data set to merely queries to a model trained on that same set. We developed a demonstrative convention of Alice (user with some data included in a training set), Bob (machine learning architect holding Alice's data), and Eve (party trying to reconstruct Alice's data using Bob's output) — which we will continue using here. We described a measure of privacy epsilon which is a fundamental feature of differential privacy (DP), derived from the two hypothetical models X and Y which were trained on identical training sets save for the inclusion of a single point. We even included some pictures and music videos, which, although I expect some may argue are unprofessional and distracting, I would counter serve as a colorful counterpoint to the narrative that could help draw interest from a more diverse audience.

(I was trying to rent Airplane!)

We have briefly touched on the concept of Alice's plausible deniability for her feature set characteristics as recovered by Eve. This is related to our measure of privacy epsilon in that as epsilon grows, so will the certainty of Eve's reconstruction, and an infinite epsilon would correspond to Eve's complete certainty. This half of our essay is meant to address prescriptive measures that can be taken to manage our epsilon rating of a model, and instead of beating around the bush I'll just go ahead and offer that the simplest way to obfuscate Eve's inference is for Bob to introduce some level of noise or randomness to the equation. As a simple example, consider the Netflix prize data release. If Netflix had incorporated random ratings for a subset of their released user data, then Eve's reconstruction would have to consider that Alice's rating of some particular movie title may just be a random fluke and not truly indicative of her preferences. While this type of randomness would obscure the view of any individual user in the dataset, for the aggregate of users the structured nature of the noise allows the training process to account for the obfuscation and wash out the effect, just like how noise-cancelling headphones wash out background noise to allow a listener to focus on the music.
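
Here is a minimal sketch of what such an obfuscated release could look like, using the classic "randomized response" trick. The probabilities and toy data are my own illustration, not anything Netflix actually did:

```python
import numpy as np

rng = np.random.default_rng(42)

def randomize_ratings(ratings, p_keep=0.8, low=1, high=5):
    """With probability p_keep report the true rating, otherwise a uniformly random one."""
    ratings = np.asarray(ratings)
    keep = rng.random(ratings.shape) < p_keep
    noise = rng.integers(low, high + 1, size=ratings.shape)
    return np.where(keep, ratings, noise)

true_ratings = rng.integers(1, 6, size=100_000)   # toy stand-in for a ratings column
released = randomize_ratings(true_ratings)

# Any single released rating may be a fluke (plausible deniability for Alice),
# but aggregate statistics are easy to de-bias:
# E[released] = p_keep * E[true] + (1 - p_keep) * 3.0   (3.0 = mean of uniform 1..5)
estimated_true_mean = (released.mean() - 0.2 * 3.0) / 0.8
print(true_ratings.mean(), estimated_true_mean)
```

The knob p_keep plays a role analogous to epsilon: the more often a record is randomized, the stronger Alice's deniability and the more samples Bob needs before the aggregate signal emerges from the noise.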

Noise cancelling demonstration via wikipedia

Bob's introduction of noise doesn't have to take place at the feature dataset layer; other options to facilitate DP include noise injected during the model training (such as part of backpropagation's stochastic gradient descent), or alternatively at the activation outputs of the model layers. Of course it's not enough to just inject noise; the process has to be done intelligently to balance considerations for algorithm performance. This calibration of noise properties with an associated injection point so as not to harm performance, balancing privacy against utility, is the art of DP implementation. The good news is that the two considerations of privacy and utility don't necessarily have to be at odds. Any attempt to reduce the influence of a single data point on a model's training should coincide with a regularization effect; after all, we've already pointed out that when epsilon = 0 (when models X = Y) a model would be fully dependent on abstract generalized features of the training set, devoid of overfit. When the partial ambiguation via noise injection to a dataset isn't sufficient, the podcasts suggest another approach (one still primarily in the realm of academic research), in which a dataset is fully anonymized by replacing the actual data points with a synthetic replacement set generated via generative adversarial networks (GANs), while maintaining characteristic features of the underlying data. In theory this synthetic dataset, when applied to train a neural net of your choice, would facilitate a model comparable to what would have been developed with the original data.
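
Before getting to the GAN-generated example pictured below, here is a flavor of the "noise during training" option. This is a hedged sketch along the lines of what the literature calls DP-SGD (per-example gradient clipping plus Gaussian noise); the constants and function names are my own, and real implementations track the resulting epsilon with far more care:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(weights, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One private update: clip each example's gradient, average, then add Gaussian noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise is scaled to the clip norm so it can mask any single example's contribution.
    noise = rng.normal(scale=noise_mult * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return weights - lr * (mean_grad + noise)

# toy usage: 8 per-example gradients for a 3-parameter model
w = np.zeros(3)
grads = rng.normal(size=(8, 3))
w = dp_sgd_step(w, grads)
print(w)
```

Because clipping caps how much any one example can move the averaged gradient, the added noise can be calibrated to mask any single record's influence, which is exactly the X-versus-Y comparison from Part 1.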

Example of GAN generated synthetic data (of movie star faces), via Progressive Growing of GANs for Improved Quality, Stability, and Variation by Karras, Aila, Laine, and Lehtinen

The GAN generation of synthetic data is not the only prescription offered in the TWiML podcast DP series. In the closing interview with Nicolas Papernot, the discussion centers on a recently published paper addressing a novel DP tactic known as Private Aggregation of Teacher Ensembles, or PATE. The PATE system is based on introducing an intermediate layer of predictions to the training in a three-step process.

  1. The training set is first split into a collection of non-overlapping partitions (the author suggests a range of 100–1,000 partitions based on the size of the training set). After developing hyperparameters for a prototype model on one of the partitions, models for the rest are all trained in parallel.
  2. Once this collection of models (known as the “teachers”) is trained, they are treated as an ensemble network. Note that as part of the teacher evaluation it is required that some degree of perturbation be incorporated into the outputs of individual teacher predictions — this serves as the noise injection for purposes of DP. After randomized perturbations are introduced to the individual teachers, the aggregation of the full set is derived.
  3. The third step is to train a single student model based on labels generated from the teachers’ aggregation.

It is then the student model that is our final product, and this is the model that is used going forward for anything visible to the public. Even if the internals of the student model are released to Eve, Alice's data is safe because Bob's student model did not have direct access to the original training set, only an aggregate of the teacher models post perturbation / noise injection, such that the DP was already incorporated. The teachers only answered a fixed number of queries derived from the actual training data in the process of training the student, so no matter how much Eve subsequently solicits predictions from the student model, the amount she can learn is fixed.
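
Here is a stripped-down sketch of the aggregation step as I understand it from the interview. The teacher count, noise scale, and voting details are illustrative placeholders, not the exact recipe from the PATE paper:

```python
import numpy as np

rng = np.random.default_rng(7)

def noisy_teacher_label(teacher_votes, num_classes, noise_scale=2.0):
    """Aggregate teacher votes with Laplace noise; only this noisy label reaches the student."""
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=noise_scale, size=num_classes)
    return int(np.argmax(counts))

# toy usage: 100 teachers vote on a single unlabeled public example across 10 classes
votes = rng.integers(0, 10, size=100)
student_label = noisy_teacher_label(votes, num_classes=10)
print(student_label)
```

The student then trains on examples labeled this way, so nothing it memorizes can be traced back to any single record inside the teachers' private partitions.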

Don’t learn to do, but learn in doing. Let your falls not be on a prepared ground, but let them be on a small scale in the first instance till you feel your feet safe under you. Act more and rehearse less. — Samuel Butler

The Band — I Shall Be Released featuring Bob Dylan, Ringo Starr, Ronnie Wood, Joni Mitchell, Neil Young, Neil Diamond, Ronnie Hawkins, and Van Morrison

Conclusion — The C Programming Language

I started this essay by reminiscing on some regrettable circumstances around a tech firm's treatment of someone with the simple goal of trying to find an opportunity to work with people he looked up to. This was by no means the first time I'd been fed nonsense and I'm sure it won't be the last, but it had the unique condition of actually having a specific known source, so perhaps it sticks out a little more than it should. These concluding thoughts aren't really directed at any one firm though, as a job search or two over the years has exposed me to some commonalities of the types of practices that are sometimes directed at the throngs of job seekers and aspiring professionals who for whatever reason just might not be hirable. I look at the current state of DP and see an emerging field with societal scale implications for an economy that is growing ever more reliant on the pipeline of big data, with fundamental questions about citizen or consumer rights to privacy at play. Given the importance of the matter, I think those of us who wish to participate in this field have a certain responsibility to the public to communicate in clear and precise language that doesn't distort, misrepresent, or intentionally confuse. Yes, you can pepper your writings with emojis or a soundtrack or whatever other creative knick knacks you think might help draw interest, but when it comes to the core of your message, those of us who have been blessed with a megaphone and misuse it deserve to have it taken away. There is too much at stake.

“If it’s not right, don’t do it. If it’s not true, don’t say it.” … “Nature designed rational beings for each other’s sake: to help — not harm — one another, as they deserve.” — Marcus Aurelius, Meditations

To those firms who field countless inquiries from job seeker applicants, I have the simple request that you reevaluate your means of response. From my experience these tactics, often amounting to convincing applicants that "hey, while we're still your friend, you probably wouldn't want this job anyway," mean that you miss the opportunity to have a positive influence on their lives. These applicants look up to your teams as role models, for through your successes you've demonstrated that much is possible and can be accomplished. Of course you've earned the right to be selective, but when you reward those who look up to you with silence and indifference these applicants slowly become disheartened, dreams get set aside, ambitions chipped away. Here's what I suggest. Show them what it would take to succeed. You don't have to give out blueprints to your products and strategy, but think about offering a few from 30–40 years ago, so even if they're seeing outdated tech at least they will see how to read and draw blueprints. It is one thing to decline avenues for collaboration with an applicant who has expressed interest in your organization (trust me, I've had my share of this), but to actively introduce material that sabotages their interests and curiosity, well, I think it is a troublesome way to treat people. If you want to distract people and steer them down other roads, offer them a copy of The C Programming Language, the gold standard of writing in the field of computer science, one of the best examples I've seen of the precision and clarity of thought that has potential to seed a revolution. If you are not interested in working with someone, tell them in clear and precise language. Explain to them that you prefer to collaborate with people who are change agents, who have built something of importance or contributed new ideas to their field. Give them something to aspire to. Try to live up to deserving the good fortune with which you may have been blessed.

A time of true technological revolution isn’t a time for exultation, or for despair either. It is a time for work and responsibility. — Peter Drucker

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations.

Books that were referenced here or otherwise inspired this post:

The C Programming Language — Brian Kernighan and Dennis Ritchie


(As an Amazon Associate I earn from qualifying purchases.)

Albums that were referenced here or otherwise inspired this post:

The Last Waltz — The Band


Buckwheat’s Zydeco Party (CD) — Buckwheat Zydeco


His Band and the Street Choir — Van Morrison


The Last Waltz (Blu-ray)


(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m an amateur blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on linkedin for professional inquiries or twitter for personal.

