Privacy lets us keep our value as a functional member of society. It gives us the power to choose our thoughts and feelings and who we share them with.
Data is a significant resource for any company. Analyzing it helps organizations understand their users and their behaviours, and train machine learning (ML) models. The right use of data can have an immensely positive impact on a company’s success.
Organizations have often collected and shared user data without adequate attention to user privacy. The Cambridge Analytica controversy is one of many instances where users’ right to privacy was compromised. Currently, there’s a lot of news about privacy misuse by FaceApp.
General Data Protection Regulation (GDPR):
The GDPR applies directly in all the countries of the European Union (EU). It outlines a set of rules for handling users’ personal data, and every organization that collects or uses such data in any form must comply with it. Some of its requirements:
- A service requires affirmative user consent before saving any personal data.
- If a service has more than 250 employees or deals with sensitive data, it needs a privacy officer.
- The right to be forgotten (a service must remove personal data upon request).
Although the GDPR is an EU regulation, other countries are working on similar laws.
In India, we have:
Why Privacy in Data Science?
We are moving towards a digital future. Almost every organization uses Artificial Intelligence (AI) or Machine Learning (ML) to optimize resource utilization and enhance productivity.
ML models need humongous amounts of data to train and achieve optimal performance. We generally obtain this training data from real-world examples, i.e. from users. Multiple users send data to a central server for training, exposing their private information to the server. This is known as Centralized Learning. The organization has full control over the data.
As an ML model learns from the data, it also tends to memorize certain details that it shouldn’t. This can be detrimental when dealing with sensitive data, because the model ends up learning private information about the user. For instance, consider a service that predicts whether a patient has cancer. A user (Bob) uploads an image to the service. The model will also learn facts (e.g. whether Bob has cancer) which the user might not be comfortable sharing. By inspecting the model, an attacker may be able to recover confidential information about Bob or other users whose data was used for training, thus violating their privacy.
Protection of the input, the prediction, and the ML model itself from theft is of utmost importance.
How to Protect User Privacy?
If you are a data scientist working with sensitive data, implementing the right steps to secure data can be challenging. Let’s discuss three commonly used privacy techniques:
1. Federated Learning
2. Secure Multi-Party Computation
3. Differential Privacy
In Federated Learning, training is implemented on client devices instead of the central server. Thus, client data is not exposed to the central server. We deliver the server model to each client for training on their data. After training, we then send the updated model back to the server where we aggregate all the client models and build the new improved model. This ensures that user data remains on the client device only.
For instance, to predict the next word while typing, we need not send user data (sentences) to the cloud. The model can be locally trained on the device, and the updated model can then be moved to the cloud, keeping data on the local device.
Although Federated Learning helps keep user data private, it doesn’t solve the model memorization issue. Additionally, model aggregation is performed at the server, which receives the updated model from each user. We can learn a lot about a user by looking at their uploaded model. Sometimes their training data can even be reconstructed exactly, which is itself a privacy leak.
To overcome this issue, we can first aggregate all the updated models into a single model and only then send it to the server. The model sent to the server now contains updates from all the users, which makes it hard to trace recovered data back to any individual user and thus better protects user privacy. A trusted worker (the Trusted Aggregator) performs this aggregation.
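The training round described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production protocol: the model is a single weight for the fit y = w * x, and the names `local_update` and `federated_average`, the learning rate, and the client datasets are all made up for the example. Only updated weights leave the clients; the raw data never does. The averaging step could run on the server or on a trusted aggregator.

```python
# Hypothetical sketch of federated averaging (all names and data are illustrative).

def local_update(w, data, lr=0.01, epochs=20):
    """Train on one client's private data and return only the updated weight."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Aggregate client models, weighting each by its number of examples."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

# Each client's data stays on its own device (here: points close to y = 3x).
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(1.0, 3.1), (3.0, 9.0), (4.0, 12.0)],
    [(2.0, 5.9), (5.0, 15.0)],
]

global_w = 0.0
for round_number in range(10):
    # The current global model is sent to every client for local training...
    updates = [local_update(global_w, data) for data in clients]
    # ...and only the resulting weights are combined into the new global model.
    global_w = federated_average(updates, [len(d) for d in clients])

print(round(global_w, 2))
```

Weighting by dataset size is one common choice; a plain unweighted mean would also fit the description in the text.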
Secure Multi-Party Computation (SMPC)
In SMPC, the user’s data is split into N shares, which are encrypted and distributed among different workers. As a result, no single worker can retrieve the secret data without the consent of the others: all workers must agree and contribute their shares to decrypt the content. Simply put, the workers hold the key jointly.
Suppose Marc’s secret value is 8 and there are 3 workers (Marc, Bob, Alice). We split the secret value into three shares.
import random

Q = 1234567891011  # modulus; should be large enough to cover the whole value space
x = 8  # secret data

def encrypt(x):  # split x into three additive shares
    share_a = random.randint(0, Q - 1)
    share_b = random.randint(0, Q - 1)
    share_c = (x - share_a - share_b) % Q
    return (share_a, share_b, share_c)

Marc, Bob, Alice = encrypt(x)  # three random-looking shares, different on every run
As you can see, Marc’s data is split into 3 parts.
To decrypt the secret value, a user needs all 3 shares of the data.
def decrypt(*shares):  # recombine the shares
    return sum(shares) % Q

decrypt(Marc, Bob, Alice)  # => 8 (the secret data)
This makes it impossible for Bob and Alice to know the secret value of Marc. The best part about SMPC is that we can do computation on each share of the data without disclosing the actual data of Marc.
x = encrypt(5)
y = encrypt(5)

def add(x, y):  # each worker adds its own two shares
    z = list()
    z.append((x[0] + y[0]) % Q)
    z.append((x[1] + y[1]) % Q)
    z.append((x[2] + y[2]) % Q)
    return z

decrypt(*add(x, y))  # => 10
Amazing, isn’t it?! SMPC protocols allow the following operations:
Thus, SMPC encrypts the model and its weights and stops workers from accessing data from other workers. However, the model memorization issue persists.
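The three-share example generalizes to N shares, as the description of SMPC suggests. The sketch below is a hedged illustration rather than a real SMPC protocol: the names `share_secret`, `reconstruct`, `add_shared` and `scale_shared` are hypothetical, Python’s `secrets` module is swapped in for `random` as a source of stronger randomness, and multiplication by a public constant is added as one more operation that additive shares are known to support.

```python
import secrets

Q = 1234567891011  # same modulus as in the three-share example

def share_secret(x, n):
    """Split x into n additive shares that sum to x modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Recombine all shares; every share is needed to get the secret back."""
    return sum(shares) % Q

def add_shared(xs, ys):
    """Each worker adds its own two shares locally."""
    return [(a + b) % Q for a, b in zip(xs, ys)]

def scale_shared(xs, k):
    """Each worker multiplies its own share by a public constant k."""
    return [(a * k) % Q for a in xs]

x = share_secret(8, 5)   # five workers instead of three
y = share_secret(5, 5)
print(reconstruct(add_shared(x, y)))    # 13
print(reconstruct(scale_shared(x, 3)))  # 24
```

No worker ever sees 8 or 5 directly; each one only computes on its own share.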
Differential Privacy is used to ensure that the model doesn’t learn facts or remember certain details about the users and only learns what it should learn. This is done by randomizing user IDs, making slight changes in numeric values and adding noise.
Simply randomizing user IDs is not enough, as identities can still be revealed by reverse-engineering. For instance, Netflix released its user rating dataset for a competition in 2006. The dataset had no personally identifying information, but researchers could still retrieve users’ details.
Let’s see what happens when random noise is added to the query by curators (who need to release statistics).
To add noise, a differentially private algorithm is used. It is designed to resist the reverse-engineering attacks discussed above. Thus, when the dataset is released, it will be noisy and imprecise, making it difficult to breach privacy.
How much noise should we add to minimize privacy leakage?
The more information you query from the database, the more noise has to be added to minimize privacy leakage. The total permitted leakage is known as the privacy budget. If it’s too high, we will leak privacy; if it’s too low, the results will have such low accuracy that they become unusable.
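One common way to calibrate this noise is the Laplace mechanism, which this article does not name; it is swapped in here purely as an illustration. In the sketch, `epsilon` plays the role of the privacy budget, and the dataset, the function names and the epsilon values are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Counting query with Laplace noise; one user changes a count by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)                   # seeded only to make runs repeatable
ages = [23, 35, 41, 52, 29, 63, 47, 38]  # hypothetical user records

for epsilon in (0.1, 1.0, 10.0):
    answer = private_count(ages, lambda age: age > 40, epsilon, rng)
    print(f"epsilon={epsilon}: noisy count = {answer:.2f} (true count is 4)")
```

A small epsilon adds a lot of noise (strong privacy, low accuracy), while a large epsilon adds very little (high accuracy, more leakage), which matches the trade-off described above.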
The limitation of adding noise to the query is that information can be revealed by “estimation from repeated questions”. After multiple attempts (requests), we can estimate the ground truth. With an increasing number of questions, there is a possibility of privacy infringement.
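A small simulation makes the effect of repeated questions visible. It assumes that every answer is the true value plus fresh zero-mean noise (Gaussian here, chosen only for illustration); the true value of 42 and the name `noisy_answer` are made up for the example.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable
TRUE_VALUE = 42         # hypothetical sensitive statistic

def noisy_answer():
    """Answer one query with fresh zero-mean noise."""
    return TRUE_VALUE + rng.gauss(0, 10)

for n in (1, 100, 10000):
    # Averaging n noisy answers cancels out more and more of the noise.
    estimate = statistics.mean(noisy_answer() for _ in range(n))
    print(n, round(estimate, 2))
```

As the number of queries grows, the average drifts towards the true value, which is why each additional query has to be charged against the privacy budget.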
A better approach is to add noise while collecting raw data. Now privacy leakage will be independent of the number of queries. Adding noise at the local level makes this a very powerful technique.
An application of Differential privacy is Google’s Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR). The main goal of RAPPOR is to offer anonymity to those taking part.
A simple algorithm:
STEP ONE
flip coin
if heads, respond truthfully; else go to step 2
STEP TWO
flip coin
apply the same condition as in step one
In this algorithm, the user flips a coin. If it lands heads, they respond truthfully; otherwise they go to step 2. In step 2, they flip the coin again and apply the same condition as before.
In step 1, there’s a 50% probability that the response is truthful. Step 2 is reached with 50% probability and gives a truthful response with 50% probability, which contributes another 25%. The algorithm therefore has a truthful response rate of 75%, which amounts to adding 25% noise to the database.
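The coin-flip scheme can be simulated to show that useful statistics survive the noise. The sketch assumes a yes/no question and assumes that a second tails means the user gives the opposite of the true answer, which the description implies but does not spell out. The function names and the 30% population rate are hypothetical.

```python
import random

def randomized_response(truth, rng):
    """Two-coin scheme: truthful with probability 0.75, flipped otherwise."""
    if rng.random() < 0.5:  # step 1: heads
        return truth
    if rng.random() < 0.5:  # step 2: heads
        return truth
    return not truth        # assumption: second tails gives the opposite answer

def estimate_true_rate(responses):
    """Invert P(yes) = 0.25 + 0.5 * p to recover the population rate p."""
    observed = sum(responses) / len(responses)
    return (observed - 0.25) / 0.5

rng = random.Random(1)  # seeded only to make runs repeatable
population = [rng.random() < 0.3 for _ in range(100000)]  # about 30% true "yes"
responses = [randomized_response(t, rng) for t in population]

print(round(estimate_true_rate(responses), 2))
```

No individual response can be trusted, so each participant keeps plausible deniability, yet the aggregate rate can still be estimated because the amount of noise is known.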
This is a very concise introduction to Differential Privacy. Maybe I can write a separate in-depth article about it at a later point! Let me know in the comments!!!
We have reviewed some techniques in data science that can help preserve user privacy. We can use a combination of these techniques to keep both the data and the model private. PySyft is a nice library that helps you implement these techniques.
This was a brief introduction and there are numerous fascinating aspects to explore in this field! I will publish more blog posts on privacy and AI in the next few weeks. Follow me to get notified.
Thanks to Vimaladevi Paruchuru, Apurva Mhamal, Akshatha Sonal, Trishna Kouthankar, Avila Naik, Hanish Naik, Sarfraz Kokatnur.