My notes from Lesson 3 Introducing Differential Privacy

calincan mircea
Secure and Private AI Writing Challenge
7 min read · Jun 19, 2019

Part of:

Secure and Private AI Scholarship Challenge from Facebook

Secure and Private AI by Facebook Artificial Intelligence

#Secure & Private AI Program

Scientists like us are often extremely constrained in the amount of data we have access to for solving our problems. This challenge is quietly holding back research across society and nearly every person-facing industry, making it more challenging to cure disease or understand complex societal trends.

The biggest tragedy of the situation is that the more personal the data, and the more personal the potential uses of that data in society, the more restricted it is from scientists. This means that some of the most important and personal issues in society simply cannot be addressed with machine learning, because we do not have access to the proper training data.

But by learning how to do machine learning that protects privacy:

  • You can make a huge difference in humanity’s ability to make progress: curing disease and truly understanding who we are through our data.
  • I will be teaching you state-of-the-art techniques for privacy-preserving artificial intelligence, such as Federated Learning and Differential Privacy.

#Lesson 1 Introduction

Differential privacy in the context of deep learning

= ensuring that when our neural networks are learning from sensitive data, they learn only what they’re supposed to learn from the data, without accidentally learning what they’re not supposed to learn.

  • we’re going to walk through differential privacy from the early fundamentals to one of the state-of-the-art methods for training differentially private deep learning models.

The goal is to understand the foundational principles of differential privacy, such as how noise is applied and how we define privacy.

What for?

  • to be able to speak intelligently about the subject to engineers, scientists, and stakeholders;
  • to be able to continue to consume differential privacy research as it’s produced.

#What Is Differential Privacy?

Differential privacy

  • is actually quite a new field;
  • it started with statistical database queries around 2003 and, even more recently, has been applied to contexts such as machine learning;
  • the general goal is to ensure that different kinds of statistical analysis don’t compromise privacy;
  • statistical analysis = analysis of training data, a database, or simply a dataset about individuals; we want to make sure that our analysis of that dataset does not compromise the privacy of any particular individual contained within it.

Robust definition of privacy =

  • if, after the analysis, the analyzer doesn’t know anything about the people in the dataset; they remain “unobserved”.
  • In 1977, Tore Dalenius put it this way: “Anything that can be learned about a participant in a statistical database can be learned without access to the database.”
  • The modern definition comes from Cynthia Dwork, the “godmother” of differential privacy: “‘Differential privacy’ describes a promise, made by a data holder, or curator, to a data subject: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.’” (“The Algorithmic Foundations of Differential Privacy”)

#Can We Just Anonymize Data?

“You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets, or information sources are available” => this “no matter what” clause is one of the most challenging and novel aspects of differential privacy.

Example:

X The Netflix competition/Netflix Prize, which gave away a million dollars to the team that could build the best movie recommendation system.

Netflix published an anonymized dataset of over 100 million movie ratings from half a million users, so that the teams in the competition could train their machine learning models.

Specifically, the names of the movies and the usernames of the individuals had been replaced with unique integer IDs.

=> In theory, the dataset didn’t disclose private information about the individuals doing the rating; it didn’t even disclose the names of the movies being rated.

=> Two researchers at the University of Texas were able to de-anonymize both the names of the movies and the names of the individuals using a clever statistical technique. They scraped the IMDb movie review site and used statistical analysis to find individuals who were rating movies on both IMDb and Netflix => de-anonymizing a large percentage of Netflix users, as well as the titles of the movies they were watching.

X In 1997, a similar technique was used to de-anonymize health records by cross-referencing multiple, separate anonymized datasets with online voter registration records, which ultimately led to the re-identification of Massachusetts Governor William Weld’s medical records.

X In 2013, a Harvard professor re-identified the names of more than 40 percent of a sample of anonymous participants in a high-profile DNA study by cross-referencing them with other publicly available datasets.

This notion of “no matter what other studies, datasets, or information sources are available” is a huge deal.

#Introducing The Canonical Database

We will be playing around with PyTorch, NumPy, and other statistical toolsets and datasets in the context of deep learning.

  • Get a first look at privacy in the context of a simple database.

Example:

X If this database were about cancer, a one might indicate that a person has cancer and a zero might indicate that an individual does not have cancer.

  • What are the tools and techniques, and how should we think about them, in order to protect people’s information while still performing our statistical study?
  • How do we define privacy in the context of this example database?
  • Well, given that we’re performing some query against the database => if we remove a person from the database and the result of the query doesn’t change, then that person’s privacy is fully protected.
  • In other words, if the query’s result doesn’t change even when we remove someone from the database, then that person wasn’t leaking any statistical information into the output of the query (see the sketch after this list).
  • Now the big question: can we construct a query which doesn’t change no matter who we remove from the database?
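
As a minimal sketch of this idea (the toy database, the sum query, and the variable names below are my own illustration, not code from the lesson), note that a plain sum query fails this test: removing person i changes the result by exactly db[i].

```python
import torch

# Toy canonical database: one 0/1 entry per person
# (1 = has the condition, 0 = does not).
db = torch.tensor([1, 0, 1, 1, 0])

def query(db):
    # A simple, non-private query: how many people have the condition?
    return db.sum()

# Remove person 0 (whose value is 1) and re-run the query.
db_minus_person_0 = torch.cat((db[:0], db[1:]))

print(query(db))                 # tensor(3)
print(query(db_minus_person_0))  # tensor(2) -- the output changed,
                                 # so this query leaks information
```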

#Project Demo Build A Private Database In Python

  • So in this project, what we want to do is take an existing database and automatically generate every possible parallel database to that original database.
  • In other words, we want to iterate through this database and, at each point, remove one entry and copy the rest of the database into what we call a parallel database.
  • So what we should end up with is a list of many databases, each of which has one entry removed.
  1. First, we’re going to write a function that creates a parallel database, where we specify which value in the original database we want to remove.
  2. Then we’re going to create a second function which iterates over all values in the database, generating one copy per index, each with the entry at that index removed.

Our database is simply a large tensor of binary values; its shape is (5000,), so it’s a vector. A parallel database is that same tensor with a single value missing.

What we want to do is first copy the database and then remove an entry => we slice off the entries before the one we want to remove and the entries after it, and concatenate the two slices together with torch.cat().

We’ll call this function get_parallel_db and pass in two parameters: db, and remove_index, which specifies which index we want to remove.
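
A minimal sketch of this function, assuming the database is a 1-D torch tensor as described above:

```python
import torch

# A database of 5,000 binary entries (one per person).
num_entries = 5000
db = (torch.rand(num_entries) > 0.5).float()

def get_parallel_db(db, remove_index):
    # Slice off the entries before remove_index and the entries after it,
    # then concatenate the two slices back together.
    return torch.cat((db[:remove_index], db[remove_index + 1:]))
```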

Whereas previously the length of the entire database was 5,000, once you remove an entry the shape of course drops to 4,999.

Now, an interesting quirk: if we actually pass in an index that does not exist (some large number), the function does not throw an error; it simply returns the entire database, because both slices stop at the end of the tensor.
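
Continuing the sketch above, a quick check of both behaviors (the index 52 below is an arbitrary choice of mine):

```python
pdb = get_parallel_db(db, 52)
print(len(db), len(pdb))   # 5000 4999

# An out-of-range index: both slices together cover the whole tensor,
# so we get the full database back instead of an error.
pdb = get_parallel_db(db, 500000)
print(len(pdb))            # 5000
```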

The next thing we want to do is write a second function which will iterate over every single value in our first database and create one of these parallel databases with that value removed.
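
A sketch of that second function, under the same assumptions:

```python
def get_parallel_dbs(db):
    parallel_dbs = []
    # Build one parallel database per index, each missing a different entry.
    for i in range(len(db)):
        parallel_dbs.append(get_parallel_db(db, i))
    return parallel_dbs

pdbs = get_parallel_dbs(db)
print(len(pdbs))     # 5000 parallel databases...
print(len(pdbs[0]))  # ...each of length 4999
```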
