Introducing Differential Privacy

Anirban Sen
Secure and Private AI Writing Challenge
5 min read · Aug 15, 2019
Privacy in today’s world

Most datasets are siloed within large enterprises for two reasons:

  1. Enterprises face legal risk, which prevents them from sharing their datasets outside the organisation.
  2. Enterprises have a competitive incentive to hold on to the large datasets they collect from or about their customers.

This has one challenging consequence: scientists are extremely constrained in the amount of data they can access to solve their problems.

What if privacy could be preserved? What if enterprises did not have to compromise on privacy, and scientists could still solve their problems? This is where Differential Privacy comes in.

Now, first let's see when we can say that privacy is preserved:

In general, privacy is preserved if, after the analysis, the analyser knows nothing more about the people in the dataset than they did before. The individuals remain “unobserved”.

For example, suppose we have a dataset of CT scans used to predict cancer. We say privacy is preserved if an analysis of what cancer cells look like in general doesn't tell us which particular person has cancer.

Cynthia Dwork, a distinguished scientist at Microsoft Research and one of the inventors of differential privacy, gives the definition that has become the accepted industry standard:

Differential privacy describes a promise, made by a data holder or curator to a data subject: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.”

The initial idea for preserving privacy was to anonymise the data, and that approach turned out to be badly broken.

For instance, Netflix published 10 million movie rankings by 500,000 customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using. The data was anonymized by removing personal details and replacing names with random numbers, to protect the privacy of the recommenders. Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, de-anonymized some of the Netflix data by comparing rankings and timestamps with public information in the Internet Movie Database, or IMDb.

Another instance was when William Weld, then Governor of Massachusetts, assured the public that the Massachusetts Group Insurance Commission (GIC) had protected patient privacy by deleting identifiers. In response, then-graduate student Latanya Sweeney started hunting for the Governor's hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them were men, and of those, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor's health records (which included diagnoses and prescriptions) to his office.

This brings us to a new approach, built around a simple canonical database and its parallel databases:

Key to the definition of differential privacy is the ability to ask the question “When querying a database, if I removed someone from the database, would the output of the query be any different?”. Thus, in order to check this, we must construct what we term “parallel databases” which are simply databases with one entry removed.

Let's see an example:

In this section we're going to play around with Differential Privacy in the context of a database query. The database is going to be a VERY simple database with only one boolean column. Each row corresponds to a person, and each value corresponds to whether or not that person has a certain private attribute (such as whether they have a certain disease, or whether they are above/below a certain age). We are then going to learn how to determine whether a query over such a small database is differentially private or not, and, more importantly, what techniques are at our disposal to ensure various levels of privacy.

In [1]:

import torch

# This function removes the remove_index entry from the main db and returns
# the result as a parallel db with one entry removed
def get_parallel_db(db, remove_index):
    return torch.cat((db[0:remove_index], db[remove_index+1:]))

In[2]:

# This function takes a db as a parameter and, for each entry in the db, calls
# get_parallel_db with that entry removed, creating a list of the parallel dbs
def get_parallel_dbs(db):
    parallel_dbs = list()
    for i in range(len(db)):
        pdb = get_parallel_db(db, i)
        parallel_dbs.append(pdb)
    return parallel_dbs

In[3]:

# This function creates a randomly generated db, with the number of entries
# given as a parameter, and calls get_parallel_dbs to create the parallel dbs

def create_db_and_parallels(num_entries):
    db = torch.rand(num_entries) > 0.5
    pdbs = get_parallel_dbs(db)
    return db, pdbs

In[4]:

# Creating a db with 20 entries along with its parallel dbs, i.e. with one entry removed
db, pdbs = create_db_and_parallels(20)
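
As a quick sanity check (this snippet is my own addition, not part of the original notebook), we can verify that each parallel db really has exactly one entry removed:

# Illustrative check, not part of the original notebook: the full db has 20
# entries, there are 20 parallel dbs, and each parallel db has 19 entries.
print(db.shape)       # torch.Size([20])
print(len(pdbs))      # 20
print(pdbs[0].shape)  # torch.Size([19])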

Intuitively, we want to be able to query our database and evaluate whether or not the result of the query leaks “private” information. As mentioned previously, this comes down to whether the output of a query changes when we remove someone from the database. Specifically, we want to evaluate the *maximum* amount by which the query can change when any single person is removed (this maximum is what we will call the sensitivity of the query). So, in order to evaluate how much privacy is leaked, we're going to iterate over each person in the database and measure the difference in the output of the query relative to when we query the entire database.

Just for the sake of argument, let's make our first “database query” a simple sum. That is, we're going to count the number of 1s in the database.

In[5]:

# Creating a query over the database, i.e. a sum
def query(db):
    return db.sum()

In[6]:

# Running the query over the entire database db
full_db_result = query(db)

In[7]:

# Running the query over each parallel database pdb and taking the maximum
# difference from the result of the query on the entire db, which we call the sensitivity
sensitivity = 0
for pdb in pdbs:
    pdb_result = query(pdb)

    db_distance = torch.abs(pdb_result - full_db_result)

    if db_distance > sensitivity:
        sensitivity = db_distance
sensitivity

Out[7]:

tensor(1)
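
Since we will want to measure this for other queries as well, the loop above can be wrapped into a small reusable helper. This is only a sketch of my own (the name compute_sensitivity is not from the original notebook), but it repeats exactly the same steps for any query function:

# A reusable sketch (the name compute_sensitivity is my own): build a fresh db
# and its parallel dbs, then take the maximum change in the query's output
# caused by removing any single person.
def compute_sensitivity(query_fn, num_entries=1000):
    db, pdbs = create_db_and_parallels(num_entries)
    full_db_result = query_fn(db)

    max_distance = 0
    for pdb in pdbs:
        db_distance = torch.abs(query_fn(pdb) - full_db_result)
        if db_distance > max_distance:
            max_distance = db_distance
    return max_distance

compute_sensitivity(query)  # expected: tensor(1), matching the result above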

As you can see, the basic sum query is not differentially private at all! In truth, differential privacy always requires a form of randomness added to the query. In the next blog we will see how to add randomness to make a query differentially private. Till then, keep your differentially private glasses on :)

I have written this blog as a part of the Udacity Secure and Private AI Challenge.
