My notes from Lesson 4
#Evaluating The Privacy Of A Function
part of:
#60daysofudacity
Secure and Private AI Scholarship Challenge from Facebook
Secure and Private AI by Facebook Artificial Intelligence
First, we change the variable name sensitivity back to max_distance.
What we want to do here is consolidate this code into a single function that accepts the query function as an input parameter and returns the sensitivity of that function. This is empirically measuring sensitivity, which gives us an intuition for how much the query would change if we actually removed someone from the database. (In the real world, we can typically learn the sensitivity through other means.) We'll call the function sensitivity, and it accepts two parameters:
- the query, which is a function object
- the number of entries in the database we want to query with.
So here we replace the hard-coded 5,000 with the n_entries variable. Then we query the database we created, calculate the maximum distance between the full-database query and the query over every parallel database, and return this max_distance.
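The lecture's code itself isn't in my notes, so here is a minimal sketch of what this could look like, using plain Python lists in place of the course's torch tensors (this version of create_db_and_parallels is my own reconstruction of the course helper):

```python
import random

def create_db_and_parallels(num_entries):
    # Reconstruction of the course helper: a random 0/1 database plus
    # every parallel database with one entry removed.
    db = [random.randint(0, 1) for _ in range(num_entries)]
    pdbs = [db[:i] + db[i + 1:] for i in range(num_entries)]
    return db, pdbs

def sensitivity(query, n_entries=1000):
    # Empirically measure sensitivity: the maximum distance between the
    # full-database query result and the query over every parallel database.
    db, pdbs = create_db_and_parallels(n_entries)
    full_db_result = query(db)
    max_distance = 0
    for pdb in pdbs:
        distance = abs(query(pdb) - full_db_result)
        if distance > max_distance:
            max_distance = distance
    return max_distance
```

For a sum over a binary database like this one, the function should return 1, matching the theoretical sensitivity.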
Let’s calculate the sensitivity of the sum function. As we saw in the previous video, the sensitivity is one.
Now let’s change the query function so that it casts the database to a float and calculates a mean. As you know, the mean is just the sum divided by the number of entries. So whereas previously our sensitivity was one, dividing by a large number should decrease the sensitivity quite significantly. Indeed, the empirical sensitivity here is 0.0005, and it corresponds to the average value in the database: the sum typically comes out near 500, so the mean is around 0.5, and removing one entry changes it by roughly 0.5 divided by the number of entries.
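A quick sketch of the mean query (again plain Python standing in for torch; the empirical sensitivity helper is redefined compactly here so the snippet runs on its own):

```python
import random

def sensitivity(query, n_entries=1000):
    # Empirical sensitivity over a fresh random binary database
    # (compact version of the helper sketched earlier).
    db = [random.randint(0, 1) for _ in range(n_entries)]
    full = query(db)
    return max(abs(query(db[:i] + db[i + 1:]) - full)
               for i in range(n_entries))

def mean_query(db):
    # Cast to float and take the mean: sum divided by number of entries.
    return float(sum(db)) / len(db)

# With values averaging ~0.5 over 1000 entries, removing one person
# shifts the mean by only about 0.5 / 1000 = 0.0005.
s = sensitivity(mean_query, n_entries=1000)
```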
Since this database is randomly generated, if we ran this sensitivity function a bunch of times, we’d actually get numbers just above and just below; the sampling approximates the true sensitivity, but as it happens here it tends to line up with the exact sensitivity of this function. The nice thing about this convenience function is that we can pass in arbitrary functions and empirically get a sense for how much the output changes when we remove a person from the database. Let’s create a database of size 20. In this database, every single one of these values corresponds to a different person, or at least that’s what we’re assuming.
When we remove someone from the database while calculating every parallel database, the assumption is that none of these values accidentally refer to the same person, because the notion of sensitivity is not about how much the function changes when we remove a value from the database; it’s about how much the function changes when we remove all values corresponding to a person. We care about sensitivity to people, not necessarily sensitivity to individual values. It so happens that those are the same thing in the databases we’re using right now, but I really want to call out this notion so that you get a feeling for what we’re trying to calculate.

What we’re really trying to calculate is whether the output of this function uses information from each individual person in the database, or only an aggregation of information that multiple people contribute. We’ll find that it’s a lot easier to protect privacy if the information being returned from our query is information that multiple people contribute to. So a more private query might be a threshold, or various other functions with a much smaller sensitivity profile.
#Calculate L1 Sensitivity For Threshold
We will calculate the sensitivity of a new kind of function: the threshold function. We’re first going to define a new query that computes this threshold, then create 10 databases of size 10, compute the function over them with a threshold of five, and calculate the sensitivity each time.
So, step number one is to create this new query type: def query, which, as before, takes in a database and returns whether or not the sum over the database is greater than a certain threshold, whose default we’ll set to five. Note that this returns a binary value, because the sum is either greater than the threshold or it isn’t; we cast it to a float so that our sensitivity function is actually equipped to use it.
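As a sketch (my reconstruction of the lecture's code, with a plain-Python list in place of a torch tensor):

```python
def query(db, threshold=5):
    # Returns 1.0 if the sum over the database exceeds the threshold,
    # else 0.0. Cast to float so the sensitivity function can subtract
    # one query result from another.
    return float(sum(db) > threshold)
```

Note that a sum of exactly five is not greater than five, so it still maps to 0.0.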
The interesting thing is what happens when we run this query on a database. Let’s grab our database-creation function from earlier and create databases of size 10. If we compute a sum over a database, notice that it changes from run to run. There is some deviation, so sometimes the sum is going to be greater than the threshold and sometimes less. If we actually query the database, we see the same thing: sometimes it’s one, sometimes it’s zero. Let’s look at db.sum() and find a database whose sum is six. When db.sum() is greater than five, the query returns one, meaning true; when it isn’t, it returns zero, meaning false.
When we query this database, we get one. The interesting thing is that this means there are parallel databases whose sum could be five. This is the nature of how the output of this query can change if we remove an individual from the dataset: removing someone could cause the sum to go from six to five, and thus the threshold result to go from true to false.
However, if we have a database whose sum is only four, for example, then the query is false, and no matter who we remove, it will continue to be false, because removing someone from the dataset only makes the sum smaller. So for some databases we have positive sensitivity, and for other databases it seems as if we have no sensitivity at all: no matter how many people we remove, or whom, the output of the threshold query is no different. Let’s set sens_f = sensitivity(query, n_entries=10), run this 10 times, and print out the sensitivity each time. At first glance, the theory doesn’t seem to hold up: sometimes the sum is greater than five, sometimes it’s less than five, yet the sensitivity varies.
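A sketch of that experiment, with the helpers redefined so it runs on its own (plain Python instead of torch, as elsewhere in these notes):

```python
import random

def sensitivity(query, n_entries=10):
    # Empirical sensitivity over a fresh random binary database.
    db = [random.randint(0, 1) for _ in range(n_entries)]
    full = query(db)
    return max(abs(query(db[:i] + db[i + 1:]) - full)
               for i in range(n_entries))

def threshold_query(db, threshold=5):
    return float(sum(db) > threshold)

# Repeat the measurement 10 times: the sensitivity comes out 1.0 only
# when the database's sum lands exactly one above the threshold (six),
# and 0.0 otherwise.
results = [sensitivity(threshold_query, n_entries=10) for _ in range(10)]
```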
However, the sensitivity itself changes, and what these results actually correspond to is that a non-zero sensitivity occurs when the sum of the database is exactly six.
When the sum of the database is exactly six, it’s sitting right above the threshold, so removing someone can actually drop it down to five, causing the output of the query to change and the sensitivity to be one. The rest of the time, when the sum of the database is not six but significantly below or above it, removing someone doesn’t change the output of the query.

This means the sensitivity is variable. In previous examples we’ve seen a constant sensitivity: the sum over a binary database always has a sensitivity of one, and our mean had a very consistent sensitivity as well. In the context of a sum function, we can simply say that the maximum sensitivity is the maximum range of values that any one item you could remove would take on. In this particular case, however, the sensitivity is somewhat dataset-specific. Theoretically, the maximum sensitivity of a threshold is also one, because the most that removing someone can change the output of a threshold function is one: it can go from a one to a zero. But if we actually take a peek at the data, we can tell that sometimes the sensitivity is going to be one and sometimes it isn’t.

When implementing differential privacy for the first time, the safe way to compute sensitivity is the theoretical one: “Hey, I’m doing a threshold, which means the sensitivity is one, and that’s what I’m going to use.” However, there are some more advanced techniques, one of which we’ll look at later, that allow you to take a peek at the data and compute what’s called a data-conditioned sensitivity.
#Intro: Perform A Differencing Attack
In this concept, we’re going to explore how to compromise or attack differential privacy. In particular, we’ll look at the simplest attack, which is called a differencing attack. Sadly, none of the functions we’ve looked at so far are differentially private, despite their varying levels of sensitivity, and we’re going to demonstrate this using the differencing attack. Let’s say we want to figure out a specific person’s value in the database. If we’re able to perform a query using a sum, all we have to do is query the sum of the entire database and then the sum of the entire database without that person. In SQL, this might look something like this:
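The actual SQL wasn’t captured in my notes, so here is an illustrative sqlite3 sketch; the patients table, the cancer column, and John Doe are hypothetical names for the lecture’s example, not real course code:

```python
import sqlite3

# Tiny made-up patients table for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (name TEXT, cancer INTEGER)")
con.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("John Doe", 1), ("Jane Roe", 0), ("Sam Poe", 1)],
)

# Query 1: sum over the entire database.
with_john = con.execute("SELECT SUM(cancer) FROM patients").fetchone()[0]

# Query 2: the same sum with John Doe excluded.
without_john = con.execute(
    "SELECT SUM(cancer) FROM patients WHERE name != 'John Doe'"
).fetchone()[0]

# The difference between the two sums is exactly John Doe's value.
johns_value = with_john - without_john
```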
As you can see, by comparing these two SQL queries, we can determine whether John Doe actually has cancer. So what I’d like you to do in the next project is perform this differencing attack using the sum, mean, and threshold query functions we already created. The purpose of this exercise is to give you an intuition for how privacy can fail in these environments.
#Perform A Differencing Attack
In this project, we want to perform a basic differencing attack to divulge the value of a specific row in the database, specifically the value in row 10. The way we’re going to do this is by performing two different queries against the database:
- one query that includes row 10
- one query against the database excluding row 10.
The idea is that by comparing these two queries, we can determine what the exact value of row 10 actually is. So the first thing we want to do is initialize a database; we’ll put 100 values in it:
db, _ = create_db_and_parallels(100)
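A sketch of all three attacks from the project, using a plain-Python database in place of the course’s torch tensor (the threshold value of 49 is my own choice for a 100-entry database, so that a flip is plausible):

```python
import random

# Stand-in for create_db_and_parallels(100): 100 random 0/1 entries.
db = [random.randint(0, 1) for _ in range(100)]
pdb = db[:10] + db[11:]  # parallel database with row 10 removed

# Sum-based attack: the difference is exactly row 10's value.
sum_attack = sum(db) - sum(pdb)

# Mean-based attack: nonzero whenever row 10's value differs from
# what the remaining rows average to.
mean_attack = sum(db) / len(db) - sum(pdb) / len(pdb)

# Threshold-based attack: checks whether removing row 10 flips the
# threshold result from true to false.
threshold_attack = float(sum(db) > 49) - float(sum(pdb) > 49)
```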
Then we perform each query with that value missing and compare. As you should be familiar with at this point, given the intuitions around differential privacy that we’ve been formulating, this differencing attack is very close to the heart of the intuition behind differential privacy. So as we build differentially private techniques, we want them to be specifically immune to these kinds of attacks. As we’ll see in the formal definition of differential privacy, this is very, very close to the constraint we must satisfy in order to achieve differential privacy at a certain threshold. See you in the next lesson.