A Brief Survey on Privacy Preserving Machine Learning Techniques

AI Network · Aug 29, 2018

Dear AI Network Community

What if we told you that you can train your next AlphaGo machine learning model with your private data on a peer-to-peer network, where workers compete with each other to process your job and consequently you get to pay the lowest price possible in the market, without ever exposing your data or your model? You’d say it’s too good to be true, right? That’s because it is. At least for now.

AI Network is working toward building a global P2P platform on which machine owners can make better use of their idle computing power and ML researchers can develop their models at reasonable execution costs. However, with the amount of data that will be transferred back and forth between remote, unknown and untrusted machines, it's not difficult to see that privacy is going to be one big hairy issue. And as far as privacy goes, it's both the service provider's and the data provider's responsibility to keep the data protected. So, the AI Network dev team thought it'd be good to share some of the common threats to data privacy, as well as some countermeasures (plus their limitations).

In a recent publication, "Privacy Preserving Machine Learning: Threats and Solutions", Al-Rubaie et al. (henceforth "the authors") categorize the possible privacy threats in ML into four types and propose techniques to achieve privacy-preserving ML (PPML). Although the paper doesn't specifically deal with ML on the cloud or on a P2P network, the problems it discusses are among the problems that we as the dev team, and the future participants of AI Network, will have to tackle. The four categories of privacy attacks in ML discussed in the paper are: reconstruction attacks, model inversion attacks, membership inference attacks, and de-anonymization. Let's take a closer look at each of them.

Reconstruction attack

When an adversary reconstructs the raw private data from your ML feature vectors, it’s called a reconstruction attack. Here the feature vectors are formed from the raw data and are used with associated labels to train and test ML models. The authors claim that adversaries can get their hands on the raw data if they have access to the feature vectors. In AI Network, a client gives workers the data and the code to execute. As such, if the client wants to train on a private dataset, she/he will have to make sure that the data is cryptographically secure and the feature vectors are not explicitly stored in the model.

Not storing feature vectors in your model simply means choosing ML algorithms that don't store them, e.g. avoiding SVM and kNN (see the sketch below). Cryptographically protecting data, however, is a more nuanced task. While theoretically brilliant solutions have been proposed for the problem of cryptographic data protection, these solutions generally suffer from various practical limitations. We'll focus on the four most common techniques for cryptographic data protection, namely homomorphic encryption, garbled circuits, secret sharing, and secure processors.
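To make the first point concrete, here is a minimal sketch (assuming scikit-learn) of why the choice of algorithm matters: a fitted SVM or kNN model carries raw training rows inside the model object, while a linear model only exposes aggregate weights.

```python
# Sketch: which fitted models retain raw feature vectors (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

svm = SVC(kernel="rbf").fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# The fitted SVM keeps verbatim training rows (its support vectors), so anyone
# holding the model object also holds part of the training data. A kNN model
# is even worse: it must keep the entire training set in order to predict.
print(svm.support_vectors_.shape)   # (n_support, 10): raw feature vectors

# Logistic regression only exposes learned coefficients, not individual rows.
print(logreg.coef_.shape)           # (1, 10): aggregate weights only
```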

Homomorphic Encryption

Homomorphic Encryption (HE) allows you to encrypt your data and share it with a stranger, while also allowing the stranger to perform operations on the data in its encrypted form and give the output back to you, still encrypted! During the entire process, that stranger will never be able to touch your decrypted data. This looks like the perfect encryption technique that will solve all our problems. Except, there's a catch. HE encrypts each bit of the input data and runs it through an enormous boolean circuit that represents the function; you essentially evaluate the result one logic gate at a time. What's more, in order to execute an arbitrary program on encrypted data, you need even more computationally expensive bootstrapping to reduce the noise that accumulates as operations are performed on the ciphertext.
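To make "computing on encrypted data" concrete, here is a toy sketch of the textbook Paillier cryptosystem. Note that it is only additively homomorphic (not one of the fully homomorphic schemes discussed here) and uses insecurely small, hard-coded primes purely for illustration.

```python
# Toy additive homomorphic encryption (textbook Paillier): illustration only,
# not secure and not representative of FHE performance.
import math
import random

p, q = 293, 433                 # toy primes; real keys use primes of ~1024+ bits
n = p * q
n2 = n * n
g = n + 1                       # standard simple choice of generator
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse used during decryption

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(17), encrypt(25)
# Multiplying ciphertexts corresponds to adding the underlying plaintexts,
# so the sum is computed without ever decrypting c1 or c2.
print(decrypt((c1 * c2) % n2))   # -> 42
```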

The original Fully Homomorphic Encryption (FHE) scheme took more than 900 seconds to add two 32-bit numbers and more than 18 hours to multiply them [1]. Considering that our regular computers execute instructions in nanoseconds (10⁹ ns = 1 s) these days, those numbers sound absurd. Since then, many improvements to HE and its variants have been proposed that brought the figure down to a couple of milliseconds [2,3,4]. However, HE's memory consumption and computation overheads are still not acceptable for many applications compared to the non-HE methods that are used in practice.*

Garbled Circuits and Secret Sharing

When two parties want to collaborate on evaluating a model with their datasets but don't necessarily want to share their data with each other, they can garble their inputs as well as create a Garbled Circuit (GC) out of the function they want to execute. They will be able to get garbled outputs without either of them learning about the other party's data or computational procedure.[5] GC enables secure two-party computation, whereas secret sharing protocols take care of secure multi-party computation. GC and secret sharing schemes also suffer from high computational cost: for example, even an optimized garbled-circuit evaluation of AES takes about 0.2 seconds.[6]
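A faithful garbled-circuit example would be lengthy, but the multi-party side can be illustrated with a toy sketch of additive secret sharing, in which three honest-but-curious parties learn only the sum of their private inputs. This is a bare-bones illustration of the idea, not a complete MPC protocol.

```python
# Toy additive secret sharing over a prime field (illustration only).
import random

PRIME = 2**61 - 1   # field modulus; individual shares are uniform in [0, PRIME)

def share(secret, n_parties=3):
    """Split `secret` into additive shares; any subset of fewer than
    n_parties shares reveals nothing about the secret."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

inputs = [12, 30, 58]                      # private values held by parties A, B, C
all_shares = [share(x) for x in inputs]    # all_shares[i][j]: party i's share for party j

# Each party j locally adds the shares it received; no party ever sees a raw input.
local_sums = [sum(all_shares[i][j] for i in range(3)) % PRIME for j in range(3)]

print(reconstruct(local_sums))  # -> 100, the sum of the private inputs
```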

Secure Processors

And then there are secure processors such as Intel's SGX that incorporate hardware-level security to help mitigate privacy and security breaches. SGX does so by creating an enclave memory that "cannot be read or written from outside the enclave regardless of current privilege level and CPU mode [7]." An SGX application is composed of a trusted component (an enclave) and an untrusted component (the rest of the application), with only the trusted component having access to your private data. As one of the main objectives set out by Intel is protecting data both in its encrypted form and in its decrypted form at run time, SGX could be highly beneficial to the AI Network ecosystem. For example, if an admin creates a cluster of SGX-enabled machines, it could protect workers from potentially malicious code and enhance client privacy, thereby strengthening clients' trust in the service and increasing their use of AI Network. Nevertheless, SGX isn't a silver bullet that can resolve every privacy and security concern. There have been several documented security limitations of SGX[8,9], and in applications with a high frequency of system calls, its performance degradation can be significant.[10]

Model inversion attack

Even when the adversary doesn't have access to the data or the feature vectors, she/he can carry out a model inversion attack, using the model's outputs to create feature vectors similar to the ones used to train the ML model. If the adversary has knowledge of the model's predictions as well as their confidence values, which indicate how confident the model is in its results, she/he could try to find the "inverse" of the original ML problem and retrieve sensitive client data. For instance, given only a person's name and API access to a face-recognition model, Fredrikson et al. were able to reconstruct a recognizable image of that person's face from the training data, which other people could then match to the right person with more than 80% accuracy on average.[11]

Fortunately, the authors assure us that this type of attack can be prevented, or the attacker's success rate significantly reduced, by reporting only rounded confidence values or by reporting just the predicted class labels without revealing the confidence values at all.
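Here is a minimal sketch (assuming scikit-learn and numpy) of those two countermeasures in the form of a hypothetical prediction endpoint; the wrapper below is purely illustrative, not any particular library's API.

```python
# Sketch: limiting what a prediction API reveals to reduce model inversion risk.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

def predict_api(x, mode="rounded", decimals=1):
    """Hypothetical endpoint that limits what an attacker learns per query."""
    probs = model.predict_proba(x.reshape(1, -1))[0]
    if mode == "label_only":
        return {"label": int(np.argmax(probs))}            # no confidence at all
    # Coarsely rounded confidences leak much less fine-grained information.
    return {"label": int(np.argmax(probs)),
            "confidence": float(np.round(probs.max(), decimals))}

print(predict_api(X[0]))                     # e.g. {'label': 1, 'confidence': 0.9}
print(predict_api(X[0], mode="label_only"))  # e.g. {'label': 1}
```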

Membership inference attack

Even if you thoroughly encrypted your input data, potential adversaries can still use your model's results to "infer" whether a particular subset of data was part of the data used to train the model. This is called a membership inference attack. The authors introduce perturbation approaches as effective measures to prevent such attacks. Perturbation approaches are privacy-preserving techniques that "perturb" data, meaning they make the data noisy or incomprehensible.

Differential Privacy (DP) techniques are famous for being used by Apple since 2016 to "help discover the usage patterns of a large number of users without compromising individual privacy [12]." In order to obscure the link from the data to the person, DP adds random noise to the data, making a trade-off between accuracy and privacy. In DP, there's a concept called the "privacy budget", which is essentially the amount of data leakage that's allowed. Fredrikson et al. showed that depending on how you set your privacy budget, your trained model's predictions can vary considerably. In their example, they trained a model that makes dosing decisions for clinical patients and discovered that "for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality [13]."
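To show what a privacy budget means in practice, here is a minimal sketch of the classic Laplace mechanism: noise scaled to sensitivity divided by epsilon is added to a query answer, so a smaller budget (epsilon) means more noise and less accuracy. It uses only numpy; the function name and dataset are illustrative and not taken from any particular DP library.

```python
# Sketch: Laplace mechanism for a differentially private mean (illustration only).
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.normal(50_000, 15_000, size=10_000)    # toy "private" dataset

def dp_mean(data, lower, upper, epsilon):
    """Differentially private mean of values clipped to [lower, upper]."""
    clipped = np.clip(data, lower, upper)
    sensitivity = (upper - lower) / len(clipped)      # max change one record can cause
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Smaller epsilon = stricter privacy = noisier (less accurate) answers.
for eps in (0.01, 0.1, 1.0):
    print(eps, round(dp_mean(incomes, 0, 200_000, eps), 2))
```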

Dimensionality Reduction (DR) is another perturbation technique that trades accuracy for privacy by projecting the data onto a lower-dimensional subspace. However, the authors note that an approximation of the data can still be retrieved from the reduced dimensions. Therefore, DR should only be used in conjunction with other privacy-enhancing techniques.
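The limitation the authors point out is easy to demonstrate with a minimal sketch (assuming scikit-learn and numpy): after projecting data onto a lower-dimensional subspace with PCA, a close approximation of the original records can be recovered from the reduced representation (here assuming the projection itself is known to the attacker).

```python
# Sketch: approximate reconstruction of data released after PCA-based reduction.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                         # (150, 4) original feature vectors
pca = PCA(n_components=2).fit(X)

X_reduced = pca.transform(X)                 # what a "perturbed" release would share
X_approx = pca.inverse_transform(X_reduced)  # an attacker's reconstruction attempt

err = np.mean(np.abs(X - X_approx))
print(X_reduced.shape, round(err, 3))        # the mean reconstruction error is small,
                                             # so DR alone is weak protection
```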

De-anonymization

De-anonymization, or re-identification, can also compromise privacy by exploiting auxiliary information. Even after personal identifiers have been removed from a dataset, adversaries can combine it with other knowledge to deduce an individual's personal information. An infamous example is a study by Narayanan et al. on the Netflix Prize dataset.[14] Without any personally identifiable Netflix subscriber information, they were able to "demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset", and they also "successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information."

Various privacy preserving measures are out there and are being developed, but most of them have drawbacks and are still not practical enough to be used in real-world applications like AI Network. The situation can and probably will shift as security researchers make progress on more efficient and effective mechanisms. However until then, anyone who wants to run their code on another machine will have to be wary of these threats to privacy and understand the trade-offs they make when choosing preventive measures.

* There are slightly faster but restricted variants of HE, such as Somewhat Homomorphic Encryption, which supports only a limited set of operations, and Leveled Homomorphic Encryption, which gets rid of the expensive bootstrapping step but in return limits the depth of the circuit. See link1 and link2 for more information.

[1] Xiao, Liangliang, Osbert Bastani, and I-Ling Yen. “An Efficient Homomorphic Encryption Protocol for Multi-User Systems.” IACR Cryptology ePrint Archive 2012 (2012): 193.

[2] Halevi, Shai, and Victor Shoup. “Algorithms in helib.” International Cryptology Conference. Springer, Berlin, Heidelberg, 2014.

[3] Chillotti, Ilaria, et al. Improving TFHE: faster packed homomorphic operations and efficient circuit bootstrapping. Cryptology ePrint Archive, Report 2017/430, 2017.

[4] Hesamifard, Ehsan, Hassan Takabi, and Mehdi Ghasemi. “CryptoDL: Deep Neural Networks over Encrypted Data.” arXiv preprint arXiv:1711.05189 (2017).

[5] Yakoubov, Sophia. "A Gentle Introduction to Yao's Garbled Circuits." http://web.mit.edu/sonka89/www/papers/2017ygc.pdf

[6] Huang, Yan, et al. "Faster secure two-party computation using garbled circuits." USENIX Security Symposium. Vol. 201, No. 1. 2011.

[7] https://software.intel.com/en-us/sgx-sdk/details

[8] Schwarz, Michael, et al. “Malware guard extension: Using SGX to conceal cache attacks.” International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, Cham, 2017.

[9] https://github.com/lsds/spectre-attack-sgx

[10] Weisse, Ofir, Valeria Bertacco, and Todd Austin. "Regaining lost cycles with HotCalls: A fast interface for SGX secure enclaves." ACM SIGARCH Computer Architecture News. Vol. 45, No. 2. ACM, 2017.

[11] Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit confidence information and basic countermeasures.” Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015.

[12] https://developer.apple.com/library/archive/releasenotes/General/WhatsNewIniOS/Articles/iOS10.html

[13] Fredrikson, Matthew, et al. “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.” USENIX Security Symposium. 2014.

[14] Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." Proceedings of the IEEE Symposium on Security and Privacy, 2008, pp. 111–125.

AI Network is a decentralized AI development ecosystem built on its own blockchain that seeks to become the "Internet for AI" in the Web3 era.