With Ben DeCoste.
In the last year, we’ve seen an explosion of interest in privacy-preserving technologies. With the continuing adoption of artificial intelligence, we have a growing need to maintain control over how data is used and distributed. Historically, we’ve had to choose between the use of technology and privacy. However, secure computation offers us a third choice: the ability to take advantage of artificial intelligence while retaining control over our data. We can now build products that were previously impossible due to competition, compliance, or lack of trust.
Kaggle recently hosted a competition to predict the likelihood that a person would default on their mortgage. The datasets given to participants were similar to those one would find at a large company: there were multiple sources, and the data was not uniformly formatted. After some feature engineering on the Kaggle-provided data, we were able to come up with a model that produced reasonable training results. Our model scores 92% accuracy and 0.768 area under the ROC curve; the top model in this competition scored 0.80570.
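The two metrics above can be computed with scikit-learn. Here is a minimal sketch on toy data (the labels and scores below are illustrative, not the competition data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Illustrative values: y_true are ground-truth default labels,
# y_prob are model scores in [0, 1].
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

auc = roc_auc_score(y_true, y_prob)          # area under the ROC curve
acc = accuracy_score(y_true, y_prob >= 0.5)  # accuracy at a 0.5 threshold
```

Note that accuracy depends on the chosen decision threshold, while AUC is threshold-free, which is why Kaggle ranked submissions by AUC.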
The data used in training contained information such as credit card balance, credit score, repayment history, and more. Much of this information would likely be considered sensitive by the applying consumer. In an ideal world, the consumer would not need to share their private information unless the lending agency was going to grant them a mortgage. Offering a risk-free mortgage assessment could be a competitive advantage for the lending agency.
We have trained and deployed this model in a few different scenarios of interest, comparing each against a plaintext deployment (i.e., no privacy). We evaluate what this model might look like in a Trusted Execution Environment (TEE), as well as when using an encryption protocol (MPC in tf-encrypted).
To create a model that we can deploy with tf-encrypted and a Trusted Execution Environment we must first train it. All code for training and deploying the model lives here.
We’re creating a basic logistic regression model, so we can make use of Tensorflow’s Keras API to train it. We achieved the aforementioned model performance using a Dense layer with no bias, followed by a sigmoid activation. Adam performed best out of all the optimizers we tried. The model predicts a yes-or-no answer, so we use binary cross-entropy as our loss function.
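A minimal sketch of such a model in Keras might look as follows (the feature count comes from the preprocessing described below; everything else is the default setup, not our exact training script):

```python
import tensorflow as tf

NUM_FEATURES = 528  # size of the engineered feature vector

# Logistic regression: a single Dense layer with no bias,
# squashed by a sigmoid into a default probability.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(1, use_bias=False, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

Keeping the architecture this small is deliberate: a single matrix multiplication and a sigmoid are cheap to run under both MPC and a TEE.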
Before training, we spent some time on preprocessing and feature engineering. The Kaggle competition provides a total of 8 datasets, which include detailed information about the loan, repayment history, credit cards, and previous credits provided by other financial institutions. Each dataset was aggregated before we merged them into a single dataset. During the aggregation process, we calculated some key statistics, such as the credit-to-income ratio, average debt with other financial institutions, average instalment amount, count of credit card lines, etc. For categorical variables, we used one-hot encoding. The final dataset contains 528 features.
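The aggregate-then-merge pattern can be sketched in pandas on two toy stand-ins for the Kaggle tables (the tiny frames and the column selection here are illustrative, not the full pipeline):

```python
import pandas as pd

# Toy stand-ins for the main application table and an auxiliary table.
applications = pd.DataFrame({
    'SK_ID_CURR': [1, 2],
    'AMT_INCOME_TOTAL': [50_000, 80_000],
    'AMT_CREDIT': [100_000, 120_000],
    'CONTRACT_TYPE': ['cash', 'revolving'],
})
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_CREDIT_SUM_DEBT': [5_000, 15_000, 2_000],
})

# Aggregate the auxiliary table down to one row per applicant.
debt = (bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM_DEBT']
              .mean()
              .rename('AVG_EXTERNAL_DEBT')
              .reset_index())

features = applications.merge(debt, on='SK_ID_CURR', how='left')

# Ratio features and one-hot encoding of categoricals.
features['CREDIT_TO_INCOME'] = features['AMT_CREDIT'] / features['AMT_INCOME_TOTAL']
features = pd.get_dummies(features, columns=['CONTRACT_TYPE'])
```

Repeating this for each of the 8 tables, with several statistics per table, is how the feature count grows to 528.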
Once trained, we can make predictions directly from Tensorflow, and private predictions from both tf-encrypted and Trusted Execution Environments!
tf-encrypted (TFE) provides a software solution for giving a model privacy. The goal of TFE is to provide a framework for experimenting with and deploying secure algorithms. Multi-party computation (MPC) is the most advanced of these protocols in TFE, and it is what we use to secure our model.
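To build some intuition for MPC, here is a toy illustration of additive secret sharing, the basic trick underlying these protocols (this is a simplified sketch, not the actual protocol TFE runs): a value is split into random shares that individually reveal nothing, yet the parties can compute on the shares and only the final result is reconstructed.

```python
import random

Q = 2**31 - 1  # all arithmetic is done modulo a large number

def share(x, n=3):
    """Split x into n additive shares; any n-1 of them look uniformly random."""
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Recombine shares to recover the secret."""
    return sum(shares) % Q

# Each party adds its two shares locally, never seeing x or y;
# reconstructing the result shares yields x + y.
x_shares = share(25)
y_shares = share(17)
z_shares = [(a + b) % Q for a, b in zip(x_shares, y_shares)]
assert reconstruct(z_shares) == 42
```

Multiplication requires extra machinery (e.g., precomputed triples), which is where real protocols like the ones in TFE earn their complexity, but addition really is this cheap.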
In our example, the lending agency has trained its model on in-house data; it can now provide prospective customers with privacy by using TFE. Depending on the agency’s and customers’ needs, they can choose to protect the model, the user’s input, or both. We are going to evaluate the most complete example: securing both the model and the input.
We take the same model explained in the previous section and pass the exported graph to TFE, and that is all we need to do! The model now supports privacy.
Trusted Execution Environment
Trusted Execution Environments (TEEs) provide a hardware solution for computing algorithms securely. They are an isolated part of the CPU that other hardware cannot access, and the integrity of the computation can be attested to. A few different technologies provide TEEs: Intel has Software Guard Extensions (SGX), while Arm provides TrustZone.
To get the Home Credit Default model running inside a TEE, we use a few different frameworks:
- Asylo
- Intel SGX
- Tensorflow Lite
Asylo is an abstraction layer that allows us to easily compile programs to run on Intel SGX and other TEEs. It includes an easy-to-use simulator, so programs can be tested before being deployed to Intel SGX machines (which can be hard to find). Asylo also provides a Docker container, so we don’t need to install anything when first running inside the simulator.
Tensorflow is used to delegate operations to the TEE. The motivation is that you could be processing a larger computation graph in which some operations run in the TEE and some run on the untrusted CPU. For our use case, we run the full computation graph inside the TEE.
A Tensorflow custom operation connects to the Asylo stack over gRPC, which in turn makes calls to the Tensorflow Lite library in C++. Tensorflow Lite loads the model and inputs, evaluates the model, and returns the output. We had to patch Tensorflow Lite slightly so that the Asylo toolchain was able to compile it and execute it inside both the Intel SGX simulator and an Intel SGX device.
All of this is wrapped up in a framework we’ve developed called tf-trusted. It provides an easy way to use all of these components to run most Tensorflow-based models privately. To get started running this model on the Intel SGX simulator, follow the instructions in the tf-encrypted repository’s examples directory located here.
We have benchmarked Tensorflow and tf-encrypted on Google Cloud Platform. Both ran on a single CPU with 30 GB of memory. Due to the limited availability of Intel SGX machines, we ran tf-trusted on Microsoft Azure.
Running in plaintext is our baseline, but we can see that both tf-trusted and TFE scale similarly, with about an order of magnitude slowdown. Interestingly, as the batch size increases, TFE becomes more efficient than SGX. We attribute this switchover to the significant amount of work the Tensorflow team has put into the framework (TFE runs as plain Tensorflow graphs), while SGX doesn’t yet scale as well. Though we have not included it here, we can improve TFE’s computational performance by using more hardware when either the batch size or the model complexity increases.
As we have learned, deploying a model in a secure, privacy-preserving manner does not incur a prohibitive amount of overhead. There are several known optimizations for both TEEs and TFE that are left for future work. Frameworks like Slalom can increase model performance in a TEE by outsourcing expensive math operations to a GPU while maintaining privacy. In TFE, there are other protocols to be leveraged and experimented with. As we saw in our previous post, secure computation came a long way in 2018, and we expect improvements to keep coming in 2019 and beyond.
Private computation has a reputation for being too slow to be practical. As we have seen here, many use cases today can both benefit from privacy and be deployed privately with acceptable overhead. Many of the applications we use today have stages that take well beyond a few milliseconds to do their work. We are excited about a future where technologies like MPC, HE, and TEEs are fast enough to have feature parity with their plaintext counterparts, but for a broad spectrum of applications, that future is already here.
About Dropout Labs
We are a team of machine learning engineers, software engineers, and cryptographers spread across the United States, France, and Canada. We’re working on secure computation to enable training, validation, and prediction over encrypted data. We see a near future where individuals and organizations will maintain control over their data, while still benefiting from cloud-based machine intelligence.
If you’re passionate about data privacy and AI, we’d love to hear from you.