Deep Learning with Differential Privacy

Shubhangi Jena
Secure and Private AI Writing Challenge
5 min read · Jul 22, 2019

Simplifying the jargon for the adults.

When Albert Einstein said, and I quote, “If you can’t explain it to a six-year-old, you don’t understand it yourself,” I decided to join the cause, albeit I took it up for the adults.

Because data and privacy have entered your usual chat-in-the-local-pub jargon, and let’s be honest about adulting — that thing needs superpowers!💪

So here’s the deal: let’s break down at least differential privacy and its related jargon for the folks out there!

But first, the fundamentals —

What is Deep Learning?

Machine learning techniques based on neural networks (here, deep learning is a subfield of machine learning) require large datasets to train models. These datasets are often crowdsourced and, undoubtedly, they come with users’ private information or patterns that can trail back to key identifying information about a user or a group of users.

The goal, therefore, is to preserve the users’ sensitive data, which is achievable within the framework of ‘differential privacy.’

What is Differential Privacy?

Differential privacy can simply be defined as a constraint on algorithms that publish aggregate information about a statistical database, limiting the amount of information that is exposed about any individual.

Here, a statistical database refers to data collected under the agreement that the users’ private information shall not be disclosed.

In effect, differential privacy wraps your private information in some amount of noise, so the resulting output carries only a limited, quantifiable privacy loss.

Source: Secure and Private AI ND by Udacity | In-picture: Andrew Trask

Let’s understand with the help of an example —

Adult you loves to shower your chats with emojis (✔✌❤🌹🍟🍔🍕). What if you choose to allow your Grammarly keyboard to collect ‘information’ on the emojis you use? So let’s say every time you use 🍕 it’s marked as 1, and every time you don’t, it’s marked as 0. This data from you and many other adults will be stored in the database as simple 0’s and 1’s.

Now, say Grammarly decides to count which emoji is preferred the most. Although you might think mere 0’s and 1’s can do no harm, a database that also contains your other information can show potential signs of ‘data leakage’.

Here, Differential Privacy, or DP for short, prevents your private data from leaking and oozing out by adding ‘random noise.’
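To make ‘random noise’ concrete, here is a minimal sketch of one classic technique, randomized response, which is a form of local differential privacy. The pizza-emoji data and the usage rate below are hypothetical, and this is an illustration rather than how any real keyboard works.

```python
import random

def randomized_response(truth: bool) -> int:
    """Report whether the pizza emoji was used (1) or not (0),
    but flip a coin first so any single answer stays deniable."""
    if random.random() < 0.5:
        return int(truth)          # heads: answer honestly
    return random.randint(0, 1)    # tails: answer at random

# Hypothetical raw data: did each of 10,000 adults use the pizza emoji?
true_usage = [random.random() < 0.3 for _ in range(10_000)]
noisy_reports = [randomized_response(u) for u in true_usage]

# The aggregator can still estimate the overall rate, because
# E[report] = 0.5 * true_rate + 0.25, which we invert here.
estimated_rate = 2 * (sum(noisy_reports) / len(noisy_reports)) - 0.5
print(f"true rate: {sum(true_usage) / len(true_usage):.3f}, "
      f"estimate from noisy reports: {estimated_rate:.3f}")
```

No single report can be trusted to reflect the truth, yet the aggregate count is still useful — that is the trade-off differential privacy formalizes.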

Now, let’s talk nerdy!

Understanding the math behind differential privacy and a breakdown of more jargon —

The formal definition of differential privacy, given by Dwork, Nissim, McSherry and Smith, with major contributions by many others over the years, is as follows —

A randomized mechanism M : D → R with domain D and range R satisfies (ε, δ)-differential privacy if for any two adjacent inputs x, y ∈ D and for any subset of outputs S ⊆ R it holds that

Pr[M(x) ∈ S] ≤ e^ε * Pr[M(y) ∈ S] + δ.

It states that for two adjacent databases, say x and y, differing in only one row (a row here could be the addition of a new person’s data or the removal of one person’s data), the probability that the mechanism produces any particular set of outputs is nearly the same: within a multiplicative factor of e^ε, plus a small additive slack δ.
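As a rough illustration of that guarantee, here is a minimal sketch of the Laplace mechanism applied to a counting query over 0/1 data like the emoji example. A count has sensitivity 1, so Laplace noise with scale 1/ε gives ε-differential privacy (δ = 0); the databases below are made up.

```python
import numpy as np

def private_count(database, epsilon):
    """Release a differentially private count of 1's.
    A counting query has sensitivity 1: adding or removing one
    person's row changes the true count by at most 1."""
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return sum(database) + noise

# Two adjacent databases: y is x with one person's row removed.
x = [1, 0, 1, 1, 0, 1]
y = [1, 0, 1, 1, 0]

# With a small epsilon the two noisy answers are hard to tell apart,
# which is what the (ε, δ)-definition above formalises (here δ = 0).
print(private_count(x, epsilon=0.5), private_count(y, epsilon=0.5))
```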

Understanding with the help of an actual dataset — tried and tested!

For this I picked Goodfellow and Papernot’s example from their blog here.

While the post you are reading now is based on the paper published by Goodfellow and his peers, I found their explanation of PATE, along with its illustrations, a perfect start for beginners.

Take a look at the PATE differential privacy analysis and how introducing ‘randomness’ protects the user’s privacy.

PATE Framework:

The PATE framework is based on breaking the dataset into smaller subsets and introducing a teacher-student setup: the teachers are models trained on disjoint subsets of the ‘private data’, and the student is a model that learns only from the teachers’ aggregated predictions.

Making it simpler here: we split one large dataset, say a medical records database, into disjoint subsets, and each individual teacher trains on its own subset. Now we want to find out whether Jane here is healthy or unhealthy. Without introducing noise, the teachers’ combined output can be a clear giveaway of the private information in the medical records, see the image below.

PATE training and aggregation | Source: cleverhans-blog
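A minimal sketch of the ‘many teachers on disjoint subsets’ idea follows. It uses scikit-learn logistic regression on made-up records purely for illustration; the PATE paper itself trains deep neural networks as teachers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical private medical records: 4 features per patient and a
# binary label (0 = healthy, 1 = unhealthy).
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Split the private data into disjoint subsets, one per teacher,
# and train a separate model on each subset.
num_teachers = 10
teachers = [
    LogisticRegression().fit(X_part, y_part)
    for X_part, y_part in zip(np.array_split(X, num_teachers),
                              np.array_split(y, num_teachers))
]

# Each teacher votes on a new, unlabeled record (say, Jane's).
jane = rng.normal(size=(1, 4))
votes = np.array([teacher.predict(jane)[0] for teacher in teachers])
print("votes per class (healthy, unhealthy):", np.bincount(votes, minlength=2))
```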

Ensuring Privacy:

By adding noise sampled from a Laplacian or Gaussian distribution to the teachers’ votes, we preserve the privacy of the data. Now let’s take an example where our subject Jane Smith has cancer: if two teachers indicate she is healthy and two indicate otherwise, the noisy aggregate can output either class, so the prediction does not disclose any private detail from the records used for training.

Since there are two classes here — ‘healthy’ and ‘unhealthy’ — adding noise means that the class with the larger noisy vote count becomes the outcome.

In the previous case, since three teachers voted ‘healthy’, the outcome was essentially decided before any noise was added, so adding noise alone could not preserve the privacy of the users’ data used for training.
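Continuing the hypothetical sketch from above, the aggregation step can be written as a ‘noisy max’: add Laplace noise to each class’s vote count and return the noisy winner. With a 2-versus-2 split like Jane’s case, the output is close to a coin flip, so it reveals very little about any single training record. The noise scale below is an illustrative choice, not the exact calibration from the PATE paper.

```python
import numpy as np

def noisy_max(vote_counts, epsilon):
    """PATE-style aggregation: add Laplace noise to each class's vote
    count and return the index of the largest noisy count."""
    noise = np.random.laplace(0.0, 1.0 / epsilon, size=len(vote_counts))
    return int(np.argmax(vote_counts + noise))

# Jane's case from the text: 2 teachers say 'healthy', 2 say 'unhealthy'.
labels = ["healthy", "unhealthy"]
votes = np.array([2.0, 2.0])
print(labels[noisy_max(votes, epsilon=0.1)])  # roughly a 50/50 outcome
```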

PATE Aggregation | Source: cleverhans-blog

Up and Close!

A glossary to understand things better —

  1. Privacy budget/Privacy Loss/Data Leakage — Interrelated terms indicating how much information we allow or agree to expose before privacy is considered compromised. When privacy is compromised, we say the privacy budget has been exceeded, the privacy loss is higher, and data leakage means key information has oozed out.
  2. Privacy accounting — Calculating the privacy cost incurred each time the training set is accessed (see the sketch after this list).
  3. Sensitivity — The maximum amount by which a query’s output can change when a single individual is added to or removed from the database.
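To make privacy accounting a little more concrete, here is a minimal sketch using basic sequential composition, where the ε costs of repeated accesses simply add up. Real systems, such as the moments accountant in the paper this post is based on, track the cost much more tightly; the step count and per-step ε below are made up.

```python
def basic_composition(per_step_epsilon, num_steps):
    """Basic sequential composition: the total privacy cost of accessing
    the training data num_steps times is at most the sum of the per-step
    costs. Tighter accountants (e.g. the moments accountant) exist, but
    this is the simplest upper bound."""
    return per_step_epsilon * num_steps

# Hypothetical training run: 1,000 noisy gradient steps at ε = 0.01 each.
print("total privacy budget spent: ε =", basic_composition(0.01, 1000))
```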

The Bottomline

In a nutshell, here’s what differential privacy aims to achieve —

Source: Secure and Private AI ND by Udacity | In-picture: Andrew Trask

Hey adults, Go on and brag all that you want now! 🎉🎊

About the author:

Shubhangi Jena is a #FacebookUdacityScholar currently undertaking the Secure and Private AI Nanodegree by Udacity (which also includes Udacity India) and @FacebookAI. This blog was a part of the Challenge Course; it applies the learnings to help other enthusiasts out there and can be leveraged in your own learning quest. Happy Learning & Stay Udacious 😉

Questions/Suggestions? Ping me on LinkedIn: www.linkedin.com/in/shubhangijena
