A Brief Survey on Privacy Preserving Machine Learning Techniques

AI Network
Aug 29, 2018

Dear AI Network Community

What if we told you that you could train your next AlphaGo-style machine learning model with your private data on a peer-to-peer network, where workers compete with each other to process your job so you get to pay the lowest price on the market, without ever exposing your data or your model? You’d say it’s too good to be true, right? That’s because it is. At least for now.

AI Network is working toward building a global P2P platform on which machine owners can make better use of their idle computing power and ML researchers can develop their models at reasonable execution costs. However, with the amount of data that will be transferred back and forth between remote, unknown, and untrusted machines, it’s not difficult to see that privacy is going to be one big hairy issue. And as far as privacy goes, it’s the responsibility of both the service provider and the data provider to keep the data protected. So, the AI Network dev team thought it’d be good to share some of the common threats to data privacy, as well as some countermeasures (plus their limitations).

In a recent publication, “Privacy Preserving Machine Learning: Threats and Solutions,” Al-Rubaie et al. (henceforth “the authors”) categorize the possible privacy threats in ML into four types and propose techniques to achieve privacy-preserving ML (PPML). Although the paper doesn’t specifically deal with ML on the cloud or on a P2P network, the problems it discusses are among the ones that we as the dev team, and the future participants of AI Network, will have to tackle. The four categories of privacy attacks discussed in the paper are: reconstruction attacks, model inversion attacks, membership inference attacks, and de-anonymization. Let’s take a closer look at each of them.

Reconstruction attack

When the adversary reconstructs the raw private data from your ML feature vectors, it’s called a reconstruction attack. Here the feature vectors are formed from the raw data and are used with associated labels to train and test ML models. The authors claim that the adversary can get their hands on the raw data if they have access to the feature vectors. In AI Network, a client gives workers the data and the code to execute, so if the client wants to train on a private dataset, she/he will have to make sure that the data is cryptographically secure and the feature vectors are not explicitly stored in the model.

Not storing feature vectors in your model simply means choosing ML algorithms that don’t retain them, e.g. avoiding SVM (whose model keeps a subset of the training vectors as support vectors) and kNN (which keeps the entire training set around for prediction). Cryptographically protecting data, however, is a more nuanced task with theoretically brilliant but practically limited solutions that researchers have developed over many years. We’ll focus on the four most common techniques, namely homomorphic encryption, garbled circuits, secret sharing, and secure processors.
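To make the first point concrete, here’s a toy scikit-learn sketch (the synthetic data and the model choice are ours, not from the paper) showing how a fitted SVM literally keeps some of the training feature vectors inside the model object:

```python
# Toy illustration: an SVM retains training feature vectors inside the model.
# Assumes numpy and scikit-learn are installed; the data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # pretend these are private feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(kernel="rbf").fit(X, y)

# The fitted model stores a subset of the training vectors verbatim,
# so anyone who obtains the model object recovers those rows directly.
print(clf.support_vectors_.shape)        # (n_support_vectors, 2)
print(clf.support_vectors_[:3])          # raw private feature vectors
```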

Homomorphic Encryption (HE) lets you encrypt your data and share it with a stranger; the stranger performs operations on the data in its encrypted form and hands the output back to you, still encrypted! During the entire process, that stranger never touches your decrypted data. This looks like the perfect encryption technique that will solve all our problems. Except there’s a catch. Fully homomorphic schemes encrypt the input bit by bit and evaluate the function as an enormous boolean circuit, one logic gate at a time. What’s more, in order to execute an arbitrary program on encrypted data, you need an even more computationally expensive bootstrapping step to reduce the noise that builds up as operations are performed on the ciphertexts.

The original Fully Homomorphic Encryption (FHE) scheme took more than 900 seconds to add two 32-bit numbers and more than 18 hours to multiply them [1]. Considering that our regular computers operate in nanoseconds (1 s = 10⁹ ns) these days, those numbers sound absurd. Since then, many improvements to HE and its variants have been proposed that brought the cost down to a couple of milliseconds [2,3,4], but the memory consumption and computation overheads are still not acceptable for many applications compared to the non-HE methods used in practice.*
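If you want to get a feel for the idea without the FHE machinery, here’s a minimal sketch using the third-party `phe` (python-paillier) package. Paillier is only additively homomorphic, so it is far more limited than the schemes discussed above, but it shows the “compute on ciphertexts, decrypt only at the end” workflow:

```python
# Minimal sketch of additively homomorphic encryption with the `phe`
# (python-paillier) package -- a partially homomorphic scheme, not FHE.
# pip install phe
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# The data owner encrypts the values and ships only ciphertexts.
a, b = 17, 25
enc_a = public_key.encrypt(a)
enc_b = public_key.encrypt(b)

# An untrusted worker can add ciphertexts and scale them by plaintext
# constants without ever seeing 17 or 25.
enc_sum = enc_a + enc_b
enc_scaled = enc_a * 3

# Only the key holder can decrypt the results.
print(private_key.decrypt(enc_sum))      # 42
print(private_key.decrypt(enc_scaled))   # 51
```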

When two parties want to collaborate on evaluating a model with their datasets but don’t necessarily want to share their data with each other, they can garble their inputs and build a Garbled Circuit (GC) out of the function they want to execute. They each obtain garbled outputs without either of them learning anything about the other party’s data or the intermediate values of the computation.[5] GC enables secure two-party computation, whereas secret sharing protocols take care of secure multi-party computation. GC and secret sharing schemes also suffer from high computational cost; even a single garbled-circuit evaluation of AES takes about 0.2 s.[6]
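Here’s a toy sketch of just the sharing step in additive secret sharing (an illustration only; a real MPC protocol also defines how the parties jointly compute on the shares, which we omit):

```python
# Toy additive secret sharing over a prime field: each party holds one
# random-looking share, and any single share reveals nothing about the secret.
import secrets

PRIME = 2**61 - 1  # field modulus (an arbitrary large Mersenne prime)

def share(secret, n_parties):
    """Split `secret` into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares):
    """Recombine the shares; requires all of them."""
    return sum(shares) % PRIME

salary = 95_000
shares = share(salary, n_parties=3)
print(shares)                # three numbers that individually look random
print(reconstruct(shares))   # 95000
```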

And then there are secure processors such as Intel’s SGX, which add hardware-level security to help mitigate privacy and security breaches. SGX does so by creating an enclave, a region of memory that “cannot be read or written from outside the enclave regardless of current privilege level and CPU mode [7].” An SGX application comprises a trusted component (the enclave) and an untrusted component (the rest of the application), and only the trusted component has access to your private data. Since one of Intel’s stated objectives is to protect data both in its encrypted form and in its decrypted form at run time, SGX could be highly beneficial to the AI Network ecosystem. For example, if an admin set up a cluster of SGX-enabled machines, it could protect workers from potentially malicious code and enhance the privacy of clients, thereby strengthening clients’ trust in the service and increasing their use of AI Network. Nevertheless, SGX isn’t a silver bullet that resolves every privacy and security concern. Several security limitations of SGX have been documented [8,9], and in applications that make frequent system calls its performance degradation can be significant.[10]

Model inversion attack

Even when the adversary doesn’t have access to the data or the feature vectors, she/he can carry out a model inversion attack: using the responses an ML model returns, she/he creates feature vectors that resemble the ones used to train the model. If the adversary knows the model’s predictions as well as their confidence values, which indicate how confident the model is in its results, she/he can try to solve the “inverse” of the original ML problem and retrieve the sensitive data. For instance, Fredrikson et al. were able to reconstruct a face from a face recognition model’s training data that people could pick out of a line-up with more than 80% accuracy on average, given only the person’s name and API access to the ML model.[11]

Fortunately, the authors note that this type of attack can be prevented, or the attacker’s success rate significantly reduced, by reporting rounded confidence values or by reporting only the predicted class labels without revealing the confidence values at all.
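A minimal sketch of what such a defensive prediction endpoint might look like (the names and the rounding granularity are ours; `model` stands for any fitted classifier exposing a scikit-learn style `predict_proba`):

```python
# Sketch of the countermeasure above: a prediction wrapper that either
# rounds confidence scores or returns only the predicted label.
import numpy as np

def safe_predict(model, x, mode="rounded", decimals=1):
    """Return a deliberately coarse prediction for a single sample `x`."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    label = int(np.argmax(proba))
    if mode == "label_only":
        return {"label": label}                       # no confidences at all
    # Coarsened scores carry much less information for an inversion attack.
    return {"label": label,
            "confidence": float(np.round(proba[label], decimals))}
```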

Membership inference attack

Even if you have thoroughly encrypted your input data, the adversary can still use the results your model returns to “infer” whether a particular record was part of the data used to train the model. This is called a membership inference attack. The authors introduce perturbation approaches as effective measures against such attacks. Perturbation approaches are privacy-preserving techniques that “perturb” the data, i.e. make it noisy or incomprehensible.
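To give a feel for the attack, here’s a naive confidence-thresholding heuristic (a well-known trick, not something taken from the surveyed paper): models tend to be more confident on records they were trained on, so an attacker can simply threshold the confidence returned for a candidate record.

```python
# Toy membership-inference heuristic. `model` is any fitted classifier with
# predict_proba(); classes are assumed to be labeled 0..k-1; the threshold
# value is made up for illustration.
def looks_like_member(model, x, y, threshold=0.95):
    """Guess that (x, y) was in the training set if the model is
    unusually confident about the true label y."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    return proba[y] >= threshold
```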

Differential Privacy (DP) techniques are famous for being used by Apple since 2016 to “help discover the usage patterns of a large number of users without compromising individual privacy [12].” In order to obscure the link from the data to the person, DP adds random noise to the data and makes a trade-off between accuracy and privacy. In DP there’s a notion of a “privacy budget,” which is essentially the amount of data leakage that’s allowed. Fredrikson et al. showed that depending on how you set your privacy budget, your trained model’s predictions can vary considerably. In their example, they trained a model that makes dosing decisions for clinical patients and discovered that “for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality [13].”
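Here’s a minimal sketch of the textbook Laplace mechanism, the basic building block behind many DP deployments (the dataset, clipping range, and epsilon values below are made up for illustration):

```python
# Laplace mechanism: add noise scaled to sensitivity / epsilon before
# releasing an aggregate. Smaller epsilon = stricter privacy budget = noisier.
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value with epsilon-differentially-private noise."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

ages = np.array([34, 29, 41, 52, 38])
true_mean = ages.mean()
# With ages clipped to [0, 100], one record can shift the mean by at most 100/n.
sensitivity = 100 / len(ages)
print(laplace_mechanism(true_mean, sensitivity, epsilon=0.5))   # very noisy mean
print(laplace_mechanism(true_mean, sensitivity, epsilon=10.0))  # closer to the truth
```

The two print statements show the trade-off described above: a tight privacy budget (small epsilon) protects individuals but can distort the released statistic considerably.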

Dimensionality Reduction (DR) is another perturbation technique that trades accuracy for privacy by projecting the data onto a lower-dimensional subspace. However, the authors note that an approximation of the original data can still be recovered from the reduced representation and that DR should therefore be used in conjunction with other privacy-enhancing techniques.
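A quick PCA-based sketch of the idea, assuming scikit-learn and synthetic data (it also demonstrates the authors’ caveat: an approximation of the original data can be recovered from the reduced representation):

```python
# Dimensionality reduction as a perturbation step: share only a low-dimensional
# projection of the sensitive features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_private = rng.normal(size=(200, 10))     # stand-in for sensitive features

pca = PCA(n_components=3)
X_shared = pca.fit_transform(X_private)    # what you would hand to a worker

# The caveat from the text: an approximation is still recoverable.
X_approx = pca.inverse_transform(X_shared)
print(np.mean((X_private - X_approx) ** 2))  # reconstruction error
```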

De-anonymization

De-anonymization, or re-identification, can also compromise privacy by exploiting auxiliary information. Even without direct access to the data, or after personal identifiers have been removed from it, the adversary can gather other knowledge to deduce individuals’ personal information. An infamous example is the study by Narayanan et al. on the Netflix Prize dataset.[14] Without any personally identifiable information about the Netflix subscribers, they were able to “demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset,” and they also “successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.”

Various privacy-preserving measures exist and more are being developed, but most of them have drawbacks and are still not practical enough for real-world applications like AI Network. The situation can, and probably will, shift as security researchers make progress on more efficient and effective mechanisms, but until then, anyone who wants to run their code on another machine will have to be wary of these threats to privacy and understand the trade-offs they make when they choose preventive measures.

* There are faster but restricted variants of HE, such as Somewhat Homomorphic Encryption, which supports only a limited number of homomorphic operations, and Leveled Homomorphic Encryption, which gets rid of the expensive bootstrapping steps but limits the depth of the circuit. See link1 and link2 for more information.

[1] Xiao, Liangliang, Osbert Bastani, and I-Ling Yen. “An Efficient Homomorphic Encryption Protocol for Multi-User Systems.” IACR Cryptology ePrint Archive 2012 (2012): 193.

[2] Halevi, Shai, and Victor Shoup. “Algorithms in helib.” International Cryptology Conference. Springer, Berlin, Heidelberg, 2014.

[3] Chillotti, Ilaria, et al. Improving TFHE: faster packed homomorphic operations and efficient circuit bootstrapping. Cryptology ePrint Archive, Report 2017/430, 2017.

[4] Hesamifard, Ehsan, Hassan Takabi, and Mehdi Ghasemi. “CryptoDL: Deep Neural Networks over Encrypted Data.” arXiv preprint arXiv:1711.05189 (2017).

[5] http://web.mit.edu/sonka89/www/papers/2017ygc.pdf (Yakoubov, Sophia. “A Gentle Introduction to Yao’s Garbled Circuits.”)

[6] Huang, Yan, et al. “Faster secure two-party computation using garbled circuits.” USENIX Security Symposium. Vol. 201, no. 1. 2011.

[7] https://software.intel.com/en-us/sgx-sdk/details

[8] Schwarz, Michael, et al. “Malware guard extension: Using SGX to conceal cache attacks.” International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, Cham, 2017.

[9] https://github.com/lsds/spectre-attack-sgx

[10] Weisse, Ofir, Valeria Bertacco, and Todd Austin. “Regaining lost cycles with HotCalls: A fast interface for SGX secure enclaves.” ACM SIGARCH Computer Architecture News. Vol. 45, no. 2. ACM, 2017.

[11] Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit confidence information and basic countermeasures.” Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015.

[12] https://developer.apple.com/library/archive/releasenotes/General/WhatsNewIniOS/Articles/iOS10.html

[13] Fredrikson, Matthew, et al. “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.” USENIX Security Symposium. 2014.

[14] Narayanan, Arvind, and Vitaly Shmatikov. “Robust de-anonymization of large sparse datasets.” Proceedings of the 2008 IEEE Symposium on Security and Privacy, pp. 111–125.

