This article sets up the stage for mathematics to play it’s game in the world of privacy.
Introduction & Definition
So like in all our mathematical subjects that we deal we create a formal mathematical structure to study this subject. Like in all structures we need to have entities and define some properties on them to start playing around with them and see the mysteries unfold. The structure that we will establish here is called Differential Privacy. The term Differential Privacy is not exactly a structure but a definition and it is defined to capture a promise that a data curator (the person who collects data) has to make to subject (person who participates in data collection process). The informal definition or to state better the statement of the promise goes as follows :
It is the statement of the promise made by a data holder (aka curator) to the subject participating in the study.
“You will not be affected, adversely or otherwise,
by allowing your data to be used in any study or analysis,
no matter what other studies, data sets or information sources (in short any auxiliary information) are available”
Some subtle points
You see that this is a powerful definition as such, why powerful because this captures somethings that we otherwise may not have thought of to be possible. The Curator’s promise states that not only the current participation will not make any difference but also if anyone has participated before or will participate in future, in collection of data, it will contribute nothing to the understanding of the curator regarding the individual.
Another important note on the definition is this :
Here we have that “the being affected” clause covers only the being affected by participation , not by the conclusion of the study itself. To understand better let’s take an example. Consider this scenario :
If during a data collection the subjects were asked about smoking and the study concludes that smoking causes cancer. This increases the premium of the insurance for a smoker. In such a case from differential privacy point of view though the smoker was affected the privacy was not compromised even if the smoker had taken part in the study. This is because the study’s conclusion was what affected the individual and not the participation itself the aggregate statistics would have remained the same irrespective of the participation of the smoker.
The elegance of such a definition is in the point of view that though the privacy constraints are strict it still keeps up the utility that data has.
Jumping into the game
Though the above definition is not so rigorous mathematically the above definition will exactly be the thing basing on which we will establish our rigor. Before we dive deeper into this, a better motivation would be to show an example of how privacy unfolds. We will go through a simple algorithm, known as Randomized Response Technique, that was proposed long before privacy became a serious concern and this was used by sociologist to conduct survey on incriminating and taboo behavior (drug usage, sex lives, etc.)
The algorithm goes as follows :
You (the curator) ask a question `Q` which will be responded with either an answer of ‘yes’ or ‘no’ (example : Do you smoke ? ). To answer this question the subject performs the following procedure :
1. Toss a coin, if heads answer truthfully
2.If the toss results in tails flip another coin
3.Based on the second flip and irrespective of whether you committed such action or not if it is heads on the second flip answer “no”, if tails answer “yes”.
let us calculate the probability of someone saying yes (y):
i.e. probability a person will say ‘yes’ is equal to probability that the person got a heads and he responded to the question with a ‘yes’ additionally the probability that he got tails twice and thus ended up answering ‘yes’.
We have this to be :
Thus we obtain :
We notice here the minimum probability here is 1/4, this is because of the noise introduced by the algorithm. Now anyone who says “yes” has a plausible deniability of 25%, that is he can always claim that he just said yes because the tails fell twice (even though the truth itself may have been told).
The Algorithm’s magic
You might be wondering what so great about this algorithm, we might have added privacy but adding randomness (from the second coin flip) we have added noise and we will not be able to make use of this data well; but this where you might be wrong. Our original aim was never to find out regarding the individual behavior but the collective behavior this means we will need to know only something like what is the average number of people who answered ‘yes’ and the algorithm helps us get the exact average without much fretting, The advantage of this algorithm is that we can still de-skew the result to obtain what might have been the actual population statistic. This can be done as follows suppose we have skewed_result to be (sr) and true result to be (tr), then :
in general we can provide weightage to how much noise needs to be added (the weight is in general between 0 and 1), here we have provided equal weightage to both data and noise which is 0.5 as the coin is fair. But if we had an unfair coin the weightage need not be 1:1. And from above we can calculate what is the true result as follows :
Now Let’s see a practical illustration. Assume you want to do a survey on sensitive issue like what percentage of people have counterfeit software. Naturally people will not answer you directly and may suspect you. Knowing the above RRT algorithm you may use it for your survey here.
Assume that 40% of the people used counterfeit software (you don’t know this) and since you used RRT you will have the following result :
1. In 50% case people have to tell truth from first coin toss (i.e. if it falls heads) so we get = 50% out of 40% = 40*0.5 = 20%
2. In the other 50% case people are going to answer randomly so we will get = 50% out of 50 = 50*0.5 = 25%
so the skewed proportion we will get is = 25% + 20% = 45%
Now, by de-skewing this result we get = 45*2-50 = 40% which is the actual proportion!!!
So in this article we dealt with two important things one is the promise of differential privacy and the other is an algorithm known as RRT used to create a private survey, these were just to get a feel for the subject with the coming up articles we will go full throttle on the math associated with the subject.