Idea of differential privacy and why is it important ?

Sourav Kumar
Secure and Private AI Writing Challenge
9 min readJun 30, 2019

Okay , so you have come here reading the article’s headline , then it means you are concerned regarding the topic but what if you don’t want anyone to see your reading lists on medium or on which articles you clapped ?

Let’s first understand ‘privacy’. What do you say , in layman terms it is related to simply isolating a information from everyone else except you.
But the concept goes more beyond this , and not limited to , includes following -

broader concepts of privacy [source]

‘Differential’ comes from difference , maybe you have earlier read about differential equations or something like that , but we will what differential means here in a moment.

Real life examples

# 1

Before we move now on technicality of the terms let’s visualize the premise of the situation , for that let’s take our first real life scenario-

Suppose there is a voting in a election and the votes as always are confidential and not to be seen by anyone even the election authority who is responsible for conducting the elections.

Now , there is a survey done after election for predicting results of the elections called opinion polls , so you decided to participate in one of these polls but some questions come in your mind which are one or more from -

  • Does your opinion will be confidential ?
  • Does your votes will be exactly as you casted or will be there some randomness added ? etc..

# 2

Let’s now consider another example but now more technical one -

So, Suppose you have a database table as follows -

Now here , the database administrator has cleverly suppressed one of the dept. info named ‘sales’ but do you think can we anonymize the data like this while doing same statistical results over the database ? We will see the answer in a moment.

# 3

Let’s see Final example -

Let’s say there is a hospital which has a database (or dataset) which has a list of patients having different kinds of diseases in different tables , so , let’s focus on one of the table which contain data about patients having cancer and has been simply anonymized by replacing the original info with ‘1’ for yes and ‘0’ for no.

original_db = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

Now we have seen what type of data and database we are talking about here and now we are ready to dive in.
So, we want to maintain the privacy of data that is there should be no leak of information pertaining to one specific individual and maintain same kind of statistical results though not as much accurate as earlier.
So , to understand data leakage through database , we must understand and confirm that output of a database query is not dependent on any individual information of one person or instance from the database.

To make sense of this , let’s take above example of database containing info of cancer patients being labelled as ‘1’ and ‘0’.
We are interested in finding that if we run any query ( a function to fetch information from the database) , and then we remove a particular instance from database and then again run the same query and check if the results differ.

If the results differ then , then there is data leak of that particular instance from the database.

Sensitivity

So, suppose we have a database filled with 1’s and 0’s like in the above example :

Now it’s time for coding , stay alert !

Say , we have to generate randomly filled 1’s and 0’s in a database (for our example) and run a query maybe a sum function , maybe a mean function or any other function and for our purpose of checking if data is leaking or not we make parallel databases (the database with any one instance from original database removed).

so , if we have original database of length say 'n' , then all the n possible parallel databases will have length 'n-1'.
Let's see that in code.
[We will be using python and PyTorch framework]

So, let’s go one by one.

The first function 'create' contains torch.rand() which creates a randomly filled tensor picked from uniform distribution.
Then, it compares every value to the '0.5' and then returns '0' for false and '1' for true.
Finally we have a randomly filled 1's and 0's tensor.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

The function 'get_parallel_db' contains torch.cat() which concatenates two tensors containing [:index] which says take all the elements from 0 to given index - 1 and then [index + 1:] which says take all the elements from index+1 till last which says that it will remove the element at 'index'.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

The final function 'get_parallel_dbs' simply creates parallel dbs for every index that is it makes all the possible parallel dbs with instances removed one by one differently in every other parallel dbs and return a pythonic lists of all the parallel dbs.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

'create_db_and_paralleldbs' combines all the above.
Let's say:
create_db_and_paralleldbs(10)
db.size = 10
length(pdbs) = 10
pdbs[0].size = 9

So, let’s say:

db = tensor([1., 0., 0., 1., 1., 0., 1., 1., 1., 0.])pdbs =   [tensor([0., 0., 1., 1., 0., 1., 1., 1., 0.]),  
tensor([1., 0., 1., 1., 0., 1., 1., 1., 0.]),
tensor([1., 0., 1., 1., 0., 1., 1., 1., 0.]),
tensor([1., 0., 0., 1., 0., 1., 1., 1., 0.]),
tensor([1., 0., 0., 1., 0., 1., 1., 1., 0.]),
tensor([1., 0., 0., 1., 1., 1., 1., 1., 0.]),
tensor([1., 0., 0., 1., 1., 0., 1., 1., 0.]),
tensor([1., 0., 0., 1., 1., 0., 1., 1., 0.]),
tensor([1., 0., 0., 1., 1., 0., 1., 1., 0.]),
tensor([1., 0., 0., 1., 1., 0., 1., 1., 1.])]

Now , let’s run our statistical queries and analyse that if there is any difference in our outputs.

Now, What we do is first we define our query like sum or mean in a separate function.
Then, we create a function called 'L1_senstivity'.
What do we here needs some maths. So, let's see that first.

# some maths -

L1 norm : a norm is a function that assigns a strictly positive length or size to each vector in a vector space — except for the zero vector, which is assigned a length of zero.
For a vector it is simply taking difference of its components.

Now, we run our query over original db and then run over every parallel database and take l₁ norm , then take all that difference and return the maximum of them.

Formally this is defined as sensitivity which is said to be maximum change in the output of query of a function when we remove one of the instances from the database (that is all the information related to one specific person).

Mathematically,

where delta f is sensitivity and D is collection of datasets. [Source]
for sum :   tensor(1)
for mean: tensor(0.0005)

and sometimes the sensitivity may be dependent on how the instances are arranged in the database like for threshold function

def query(db, threshold = 5):
return (db.sum() > threshold).float()

here , if we have database whose sum was earlier more than five for one of paralle dbs whereas after removal of one instance its sum now becomes less than 5 and so , sensitivity here varies as the order of elements is arranged in the database itself.

Differential attack

Now , we have understood that sensitivity can be a measure of how much data is leaking through the query when run over a database.
So, we always prefer sensitivity to be as low as possible.
But now , we will move on and see why simple data anonymization or suppressing of information is not enough to secure data privacy and how can we attack on a database , and since here we will use our query , so its a differential attack.

remember our examples above , like # 2 example where we had database of business type and its budget , so now suppose you wanna know what was the sales budget of the department even if that data is not in the database.
What shall we do now ?
okay let’s first do this-
say , you have total of 10 apples , and then there are two people who took away your apples without asking , but one of them returned all that he took (4) but the other one didn’t. So , how will you figure out how much did other one took ?
simple , right ?
You simply subtract the known values from the total to get the unknown value, similar approach we do above.

REQUIRED_SALES BUDGET = TOTAL - SUM(RECRUITMENT + HR + RECREATION + LOGISTICS)

Hurrah !
We calculated the suppressed information.
Similarly if we have our query ready and apply our query approach to the # 3 example :

say you wanna , know the value of 10th row in the original database-
* sum(db) - sum(pdbs[10])
Above query will result in exact value of the 10th row and can reveal about the person if he has cancer or not.

We saw how we can attack deferentially on the database and find out the values suppressed in the database and dug out private and sensitive information from the database and so , it’s very necessary to add something to the data so that no body could reveal exactly what was inside.

This is called adding noise to the data , any random noise could be added to the data so that database curator cannot reveal the sensitive info.

It’s all based on the fact that who trusts whom in any system or process.

We will talk about adding noise in yet another amazing post !
Stay tuned !
Till then , think about how to add noise in the above database examples.

— — — — — — — — — — — — — — — — — — — — — — — -

images : copyright reserved with their respective owners.

For more such awesome stories , you can subscribe or follow me.

Feel free to share your insights on the differential privacy in the comments section below.

Clap it! Share it! Follow Me !

— — — — — — — — — — — — — — — — — — — — — — — — — — — — —

[THANKS FOR THE CLAPS]

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

--

--

Sourav Kumar
Secure and Private AI Writing Challenge

Deep Learning 💻| Machine Learning 📊| Full stack Web development 🌐| cosmos lover 👨‍🚀