Machine Learning on Encrypted Data: No Longer a Fantasy

Yaron Sheffer
Intuit Engineering
Feb 19, 2020

At Intuit, the proud maker of TurboTax, QuickBooks, and Mint, we’re the trusted stewards of our customers’ data, a responsibility we take seriously. As part of this commitment, we innovate in our security practices just as we innovate in our product development, which includes exploring advanced privacy technologies. For someone who cares about privacy and encryption, as I do, this makes Intuit a truly exciting place to work.

Lately, my team has been researching “operations on encrypted data.” Through this research, we aim to provide a secure environment for data scientists.

Now, what exactly do we mean by operations on encrypted data? People tend to think of data encryption as a black box: unencrypted (plaintext) data and a secret key go in one side, and out the other side comes gibberish, also known as ciphertext. A second black box, called decryption, reverses the process and reveals the plaintext again. It’s important to note that if you change the ciphertext even slightly, by flipping a single bit, the decrypted plaintext will be corrupted and come out looking completely random.
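
To make this concrete, here’s a minimal sketch in Python (using the open-source cryptography package, my own choice of tooling rather than anything mentioned in this post) that encrypts a single AES block and then flips one ciphertext bit before decrypting:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)                # 256-bit AES key
plaintext = b"attack at dawn!!"     # exactly one 16-byte AES block

# Encrypt a single raw AES block (ECB over one block, for illustration only)
encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
ciphertext = encryptor.update(plaintext) + encryptor.finalize()

# Flip a single bit of the ciphertext
tampered = bytes([ciphertext[0] ^ 0x01]) + ciphertext[1:]

decryptor = Cipher(algorithms.AES(key), modes.ECB()).decryptor()
print(decryptor.update(ciphertext) + decryptor.finalize())  # b'attack at dawn!!'

decryptor = Cipher(algorithms.AES(key), modes.ECB()).decryptor()
print(decryptor.update(tampered) + decryptor.finalize())    # 16 bytes of gibberish
```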

This view of encryption is still generally true, at least for the most common block ciphers like AES (Advanced Encryption Standard), but in the last few years an important exception has emerged in the case of homomorphic encryption. Homomorphic encryption is, first of all, encryption. As with AES, it can be represented this way:

ciphertext = Enc(key, plaintext) and

plaintext = Dec(key, ciphertext)

Homomorphic encryption also has more interesting properties. One of them is this: you can operate on the ciphertext and still get back useful plaintext; specifically, you can compute basic arithmetic on encrypted values. To simplify a bit:

Enc(a + b) = Enc(a) + Enc(b) and

Enc(a * b) = Enc(a) * Enc(b)

In other words, while performing any operation on data encrypted with a traditional cipher would result in gibberish, homomorphic encryption lets you compute on the data without corrupting it. This goes further than basic operations: being able to add and multiply means you can compute polynomials, and with polynomials you can approximate essentially any continuous function. We’ll discuss what this means in a minute.
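
As a quick illustration (my own, using the open-source TenSEAL library rather than anything from our research), here is roughly what addition and multiplication on encrypted values look like in practice; the scheme parameters are typical tutorial defaults:

```python
import tenseal as ts

# Create a CKKS context: a homomorphic encryption scheme for approximate
# arithmetic on real numbers. Parameters are typical tutorial defaults.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40

a = ts.ckks_vector(context, [1.0, 2.0, 3.0])   # Enc(a)
b = ts.ckks_vector(context, [4.0, 5.0, 6.0])   # Enc(b)

enc_sum = a + b      # addition performed on ciphertexts
enc_prod = a * b     # multiplication performed on ciphertexts

print(enc_sum.decrypt())    # approximately [5.0, 7.0, 9.0]
print(enc_prod.decrypt())   # approximately [4.0, 10.0, 18.0]
```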

Encryption and big data

Homomorphic encryption is fascinating for a cryptography buff like me, but is it at all useful? For a long time, people have suggested that a version called fully homomorphic encryption, or FHE, might offer a way for cloud providers to run computations on their tenants’ data without having cleartext access to this data. That might improve security and privacy, but I personally doubt it will ever become viable: the performance degradation associated with this technique makes it unsuitable for the kind of general-purpose computing a cloud provider is expected to offer. On the other hand, FHE does show considerable promise as a way for enterprises to significantly harden their resistance to cyber attacks in specific use cases. To see why, let’s look at today’s artificial intelligence (AI) practices.

Many large enterprises concentrate huge amounts of data into a so-called data lake. In the middle of this lake sits a cloud application server that runs a large number of AI applications, namely machine learning (ML) models. Such a centralized architecture often means that many people end up with partial or full access to the enterprise’s data lake.

My team’s goal is to allow as many ML models as possible to run using only encrypted data, so that in the event of an attempted attack anywhere in the data lake, the attacker wouldn’t be able to access sensitive data. How? Remember that homomorphic encryption allows you to perform mathematical functions on encrypted data. In this way, sensitive data is stored in encrypted form, and ML models are re-implemented using homomorphic operations.

Here’s what such a solution would look like:

(As a side note: homomorphic encryption is not a panacea, and it should be deployed along with more conventional security solutions such as fine-grained access control.)

Working at the leading edge of cryptography

Neural networks currently represent the most advanced type of ML model, and there has in fact been some academic work on homomorphic computation in this context. For now, though, the workhorses of many ML applications, across the industry and at Intuit, remain decision trees and their cousins, random forests and boosted trees. Recently, my team collaborated with researchers from Haifa University on a paper showing that decision trees can be evaluated homomorphically in real time for realistically sized data sets. The paper also shows that such models can be trained in practical time, again for real-life data sets. We do this by approximating the threshold function:

y = 1 if x > T, otherwise y = 0

The threshold function is approximated by a polynomial, which in turn can be computed homomorphically. This becomes a building block in the homomorphic evaluation of the decision tree, since evaluating the tree amounts to a cascade of conditional (“if”) decisions. In this way, the paper shows that homomorphic encryption can be a practical method for safeguarding data in real-world scenarios.
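
To give a feel for the general idea (this is my own toy illustration, not the construction used in the paper), one can fit a low-degree polynomial to a smoothed version of the threshold function and then evaluate it using only additions and multiplications, exactly the operations a homomorphic scheme provides:

```python
import numpy as np

T = 0.0        # hypothetical threshold
degree = 7     # low-degree polynomial, cheap to evaluate homomorphically

# Fit a polynomial to a smoothed step function (a steep sigmoid around T)
xs = np.linspace(-1.0, 1.0, 1000)
smooth_step = 1.0 / (1.0 + np.exp(-10.0 * (xs - T)))
coeffs = np.polyfit(xs, smooth_step, degree)   # highest-degree coefficient first

def poly_eval(coeffs, x):
    """Horner's rule: only additions and multiplications, so the same
    sequence of operations could be applied to ciphertexts."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

for x in (-0.5, -0.3, 0.3, 0.5):
    print(x, round(poly_eval(coeffs, x), 3))   # near 0 below T, near 1 above it
```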

Which kinds of scenarios, exactly? In the public cloud use case, we have to assume very rigid separation of cryptographic keys: the cloud provider never gets access to the encryption keys for the data it holds. That’s one of the things that makes me skeptical about the viability of homomorphic encryption for this type of application.

The enterprise use case, on the other hand, allows more architectural flexibility, which we can use to speed up homomorphic computations. To do this, the proposed Intuit architecture includes a dedicated server called “the oracle” (no relation to Oracle the company): a stateless server that has access to the encryption key. Security folks can think of it as analogous to a hardware security module (HSM). Of course, we do not run the actual ML models on the oracle; this would get us back to the original architecture with the oracle serving as the application server, which in turn would pose serious security risks. Instead, we use the oracle only for specific calculations. The oracle is capable of a very limited set of operations, and only has access to aggregates of the data, rather than to individual data points. Despite these limitations and the overhead of calling a remote server, the oracle provides homomorphic operations with a significant speed-up.
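
To make the division of labor concrete, here is a hypothetical sketch (my own illustration, again using TenSEAL, and not Intuit’s actual implementation) of the kind of exchange described above. In a real deployment only the oracle would hold the secret key; this toy example keeps everything in one process for readability:

```python
import tenseal as ts

# Shared CKKS context. In a real deployment, only the oracle would hold
# the secret key; the ML server would work with a public copy.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40


def oracle_threshold(enc_aggregate, threshold):
    """Hypothetical oracle operation: decrypt an encrypted *aggregate*,
    apply an exact comparison (expensive to compute homomorphically),
    and return the result re-encrypted. Individual data points never
    reach the oracle."""
    value = enc_aggregate.decrypt()[0]            # requires the secret key
    result = 1.0 if value > threshold else 0.0
    return ts.ckks_vector(context, [result])      # re-encrypt the answer


# ML server side: operates only on ciphertexts.
enc_x1 = ts.ckks_vector(context, [0.2])
enc_x2 = ts.ckks_vector(context, [0.7])
enc_x3 = ts.ckks_vector(context, [1.4])
enc_aggregate = enc_x1 + enc_x2 + enc_x3          # homomorphic aggregate (a sum)

enc_decision = oracle_threshold(enc_aggregate, threshold=2.0)
print(enc_decision.decrypt())   # approximately [1.0], since 0.2 + 0.7 + 1.4 > 2.0
```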

Architecturally, the oracle is a separate component that sits alongside the ML server:

The same approach yields another major benefit compared to other fully homomorphic storage solutions: it prevents what’s known as ciphertext blow-up. With standard FHE techniques, the size in bytes of the ciphertext is about 10,000 times the size of the original plaintext (for comparison, standard symmetric data encryption adds only a few bytes to the plaintext size). This is clearly unacceptable for bulk storage of big data of the kind we usually see in AI applications. The oracle approach keeps the size of the stored data manageable and realistic for real-time operations. We recently published a note describing how the combination of the oracle and blinding makes it possible to store bulk data with standard symmetric encryption and re-encrypt it on the fly into FHE, with good security properties and practical performance. In particular, in this solution absolutely no information is leaked to the oracle, not even aggregate values.

We’re excited about the results of our research to date, but we’re not done yet. For example, the existing techniques still require a lot of custom work to adapt each ML model to the FHE environment. Streamlining this process will go a long way toward making the approach practical for widespread use.

The technological challenges around homomorphic encryption, and the opportunities it presents, are too big for any one company to take on alone. The tremendous progress made since Craig Gentry published the first FHE scheme in 2009 has been made possible by active collaboration across organizations. Nowadays, much of this work is centered around homomorphicencryption.org, an industry consortium in which Intuit is an active participant.

My team is incorporating recent cryptographic research into Intuit’s multi-tiered security strategy to resolve real-life security and privacy problems. If you’d like to become part of these efforts, please reach out. We welcome your contributions!


Yaron Sheffer
Intuit Engineering

I am a Fellow Engineer at Intuit Security R&D, focused on encryption and data security. I am also active in the IETF and have published numerous RFCs.