The world needs privacy-preserving computations

Anton Dimitrov
18 min read · Jan 15, 2019


This article offers a high-level overview for anyone getting interested in the techniques used to address data privacy when training deep learning models or just working with data in general. It can hopefully serve as a starting point to this exciting field.

If you have been following the growing popularity of deep learning, you will know how valuable data is to everyone training models. The field is evolving rapidly and researchers are working hard on improving existing techniques. Every now and then someone also comes up with a completely new idea that disrupts an area of AI. All of this is done to train better models faster, more cheaply and, potentially, with less data.

Still, data remains the irreplaceable fuel that makes all these algorithms run. And speaking of data, let’s not limit ourselves to deep learning. Many companies need a lot of data simply to improve their business through analytics and other fairly simple computations. Think about all the use cases one can come up with: airlines want to optimise their pricing and flight schedules, banks want to become more accurate in evaluating credit applications, medical researchers want to identify diseases earlier and more accurately. Some of these can be solved well with a deep neural network, others require some statistics computed over a bunch of data points, and so on. The common thread is the need for data, and, if possible, more of it.

Here we reach the main topic of this article. Getting more data often means facing privacy concerns. “Getting the data” here means having access to it in plaintext in order to compute various things with it. If you want to train a classification model that tells you something about the health of patients in a hospital, you will likely need a corpus of training data to achieve good results. Naturally, health data is usually very private and must be handled with great care. Similar concerns apply in many other cases. Just getting access to a corpus of data to perform computations on it can be a daunting task, and in many cases it rightly should be. Leaks of personal data can have terrible consequences for many people’s lives, so it seems right to be extremely careful about who can see and use it.

Often data needed to train a better model is private and cannot be accessed in plaintext or moved around

So, how do we make use of private data without sacrificing its privacy? In recent years several clever ideas have emerged and have been developed further to make privacy-preserving computations possible and scalable in different use cases.

But wait, what are “privacy-preserving computations”?

By this term we mean computations performed by one or more parties on one or more separate datasets without having to reveal the underlying data to the computing parties. Basically, only the final results of the computations get revealed to an external party that does not own the raw input data. This result could be the weights of a trained neural network, logistic regression parameters or some statistic such as a correlation, an average value, etc. Being able to do this without revealing the underlying data may sound like magic; at least it did to me at first. But it is possible, given some trade-offs.

And, yes, it’s true that often the results of a computation can reveal too much about the raw data. This important concern will also be addressed below.

The rest of this article will go through several different approaches that make privacy-preserving computations possible. The goal is to lay out briefly the most popular ideas in use today and to give you a quick intro to this field. We will also look at some use cases in which each technique is useful.

The following popular topics are covered further:

  • Secure multi-party computation (SMPC)
  • Fully homomorphic encryption (FHE)
  • Differential privacy (DP)
  • Federated learning (FL)

Secure Multi-Party Computation (SMPC)

Let’s start with a use case. A company “ABC” has offices in different countries and all these offices collect and store data about their local customers. The data scientists in one of these offices, using their local data, have devised a model that improves the services offered to the customers. In addition, they are pretty sure that if they were able to use the data stored by the company’s other offices, they would improve the model further. However, due to various regulations they do not have the required access. Moreover, in some countries the data may not be allowed to leave the country at all. This prevents the model from being improved. Thanks to advancements in the field of secure multi-party computation (SMPC), it may be possible to train the model on all the datasets spread across countries without needing access to the actual plaintext data.

In short, SMPC allows several players to perform computations on several separate private datasets as if these datasets were available in one place in plaintext, but without actually moving the data anywhere or revealing it to anyone besides its owner.

Let’s use an example to clarify this. Say there are datasets describing the customers of company “ABC” in three different locations and these datasets cannot be gathered together to train a model on them. Let’s say the model is a logistic regression. With the help of SMPC the three offices could compute the parameters of the logistic regression together, each looking only at its own data and exchanging a bunch of random numbers with the other offices during the computation. The key is in these random numbers. By themselves they reveal nothing about the data each office holds, but they allow the offices to perform computations that depend on all three datasets at the same time.

A very simplified diagram of how SMPC could help train a model on three private data sources without revealing or moving the private data outside of its location

In this case the SMPC protocol used would rely on additive secret sharing. This technique allows a private value to be shared among several players without any of them learning the actual value, while still letting them perform certain operations on such secret-shared values and produce a result in the end. The shares themselves are generated at random. A famous protocol using this idea is SPDZ, which has evolved over the years; one of its successors is SCALE-MAMBA. Another protocol using secret sharing has been developed by the team at Inpher.io, the company where I currently work, and has already been used in several real-world applications.
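To make this more concrete, here is a minimal sketch of additive secret sharing in Python (a toy illustration, not the full SPDZ protocol): a private value is split into random shares modulo a fixed prime, each player keeps one share, and sums of shared values can be computed share by share without anyone seeing the others’ data.

import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a fixed prime

def share(secret, n_players=3):
    # split a private value into n random shares that sum to it mod PRIME
    shares = [random.randrange(PRIME) for _ in range(n_players - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

def add_shared(shares_x, shares_y):
    # each player adds its own two shares locally; no communication needed
    return [(sx + sy) % PRIME for sx, sy in zip(shares_x, shares_y)]

x_shares = share(25)   # office A's private value, split into three shares
y_shares = share(17)   # office B's private value, split into three shares
assert reconstruct(add_shared(x_shares, y_shares)) == 42

Each share on its own is a uniformly random number, so an individual player learns nothing about the value it helps to hide.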

Generally, these protocols rely on two phases. The first one, usually called the offline phase, is used to generate a set of random values that make operations like additions and multiplications possible. These random values are generated and distributed to the computing players by a separate trusted party. After that, the second phase, usually called the online phase, is where the actual computations are carried out. It involves network communication between the computing players: they exchange random values used for secret sharing in order to produce the final results. Of course, this setup has its specifics, which make it suitable only for some use cases.
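Continuing the sketch from above (reusing share, reconstruct and PRIME), here is roughly how multiplication of two secret-shared values can work with a so-called Beaver triple, the kind of correlated randomness a trusted dealer produces in the offline phase. This glosses over many protocol details and is only meant to show the role of the two phases.

def offline_triple(n_players=3):
    # offline phase: a trusted dealer picks random a, b and c = a*b,
    # then secret-shares all three among the players
    a = random.randrange(PRIME)
    b = random.randrange(PRIME)
    c = (a * b) % PRIME
    return share(a, n_players), share(b, n_players), share(c, n_players)

def mul_shared(x_shares, y_shares):
    a_sh, b_sh, c_sh = offline_triple(len(x_shares))
    # online phase: the players open the masked values d = x - a and e = y - b;
    # since a and b are uniformly random, d and e reveal nothing about x and y
    d = reconstruct([(x - a) % PRIME for x, a in zip(x_shares, a_sh)])
    e = reconstruct([(y - b) % PRIME for y, b in zip(y_shares, b_sh)])
    # each player locally computes its share of x*y = c + d*b + e*a + d*e
    z_shares = [(c + d * b + e * a) % PRIME for a, b, c in zip(a_sh, b_sh, c_sh)]
    z_shares[0] = (z_shares[0] + d * e) % PRIME  # the public term is added once
    return z_shares

assert reconstruct(mul_shared(share(6), share(7))) == 42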

There are also certain limitations to this approach as of now. First of all, the computation takes significantly more time than it would if performed on a public plaintext dataset. Significant progress has been made in recent years to improve the speed of SMPC protocols, but network latency will probably always be a factor, given that the datasets are by definition in different locations, sometimes on opposite sides of the globe.

It is also not trivial to perform arbitrary computations. The main operations possible with additive secret sharing are addition and multiplication. This means that if you need to perform such operations on data spread across multiple locations, they are supported out of the box. However, there are issues with various non-linear functions used in deep learning, like sigmoid or tanh. Luckily, there are some smart ways to approximate them with polynomials, which consist of the more SMPC-friendly additions and multiplications.
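As a toy illustration of the approximation idea (not tied to any particular protocol), one can fit a low-degree polynomial to the sigmoid over a bounded interval in the clear, and later evaluate only additions and multiplications inside the secure computation:

import numpy as np

xs = np.linspace(-5, 5, 1000)
sigmoid = 1 / (1 + np.exp(-xs))

# least-squares fit of a degree-3 polynomial over [-5, 5]
coeffs = np.polyfit(xs, sigmoid, deg=3)
approx = np.polyval(coeffs, xs)

print("max absolute error on [-5, 5]:", np.max(np.abs(approx - sigmoid)))

Only the resulting polynomial evaluation, which consists of additions and multiplications, would be run on secret-shared data; the fit itself happens beforehand on public values.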

Another thing that is hard to do is comparisons. A popular example in the deep learning world is the ReLU activation function. To evaluate it or to compute the derivative one needs to perform a comparison:

ReLU(x) = max(0, x)
ReLU'(x) = (x ≤ 0) ? 0 : 1

To do comparisons with SMPC one could use garbled circuits, but this generally slows down computations a lot (explaining why is out of the scope of this article). So people look for workarounds such as an approximated sigmoid, the softplus function, the simple square function, or other polynomials that approximate ReLU well enough. Another example where comparisons are needed is the max pooling layer in a CNN. Some people use average pooling layers instead, which do not require comparisons and serve as a rough approximation.
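Here is a small sketch, in the clear, of the kind of substitution this implies: softplus and the square function are smooth, so evaluating them needs no comparison, at the cost of some approximation error relative to the true ReLU.

import numpy as np

x = np.linspace(-3, 3, 7)
relu = np.maximum(0, x)          # requires a comparison
softplus = np.log1p(np.exp(x))   # smooth stand-in, no comparison needed
square = x ** 2                  # an even simpler stand-in used in some works

for name, vals in [("relu", relu), ("softplus", softplus), ("square", square)]:
    print(name.ljust(8), np.round(vals, 2))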

However, some recent research advancements have made it possible to perform comparisons without any approximation, and faster than before. SecureNN, developed by researchers from Microsoft as part of the EzPC project, suggests a protocol for faster SMPC with operations like comparisons in 3- and 4-party settings. It focuses on making deep learning with multi-party computation more scalable, so the authors have optimised things for the functions used in modern neural networks.

One implementation of this idea comes from the team at Dropout Labs in their tf-encrypted project, which is open source and welcomes contributors. Using the ideas from SecureNN, they can compute things like ReLU and max pooling without approximating them with other functions, and execution speeds are becoming quite practical, for example for neural network training.

Another interesting open-source community focusing on SMPC is OpenMined.org. Their project PySyft provides tools for performing secure computations on distributed datasets and is definitely worth checking out. If you find this area interesting, consider joining their Slack channel, where you can learn how to contribute to the code and talk to other people from the field.

This is a very high-level and probably incomplete practical overview of the field of multi-party computation. Hopefully in a further post it will be possible to go into more details about it. For the moment here are a few useful references for further reading:

Fully Homomorphic Encryption (FHE)

Imagine that someone has trained a very useful model and would like to offer it as a service where customers provide input data and get back results. However, the owner of the model does not want to share it in plaintext with the customers because it is their intellectual property. Customers could send the model owner their data so that it could run the model on this data and return the results, but let’s assume that customers are not happy to send their input data to anyone. In this situation the model cannot be used. Secure multi-party computation could help here if we treat it as a two-party setting, where the first party’s input is the model and the second party is the customer with the input data. We would need a trusted party for the offline phase, and the two computing players would have to communicate and perform the inference during the online phase. It may be possible to avoid all that using a different technique called fully homomorphic encryption (FHE).

FHE is a kind of encryption that allows certain operations to be performed directly on encrypted data, with the result of the computation also encrypted under the same key. When decrypted, it equals the result that would have been obtained by performing the same operations on the plaintext data. More magic!
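Full FHE libraries such as SEAL or TFHE are listed below; as a simpler illustration of the homomorphic property itself, here is a sketch using the additively homomorphic Paillier scheme via the open-source python-paillier (phe) package. Paillier is only partially homomorphic, so it supports additions of ciphertexts and multiplications by plaintext constants, which is already enough for something like a linear model with weights known only to the model owner.

from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# the customer encrypts its private inputs and sends only ciphertexts
enc_a = public_key.encrypt(3.5)
enc_b = public_key.encrypt(2.0)

# the model owner computes on the ciphertexts without ever seeing the inputs,
# e.g. a linear model w1*a + w2*b + bias with its own plaintext weights
enc_result = enc_a * 0.4 + enc_b * 0.6 + 1.0

# only the customer, who holds the private key, can decrypt the result
print(private_key.decrypt(enc_result))  # 0.4*3.5 + 0.6*2.0 + 1.0 = 3.6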

A model owner and a customer willing to do inference on their own private data. Using one encryption key they could provide encrypted data to the model owner and get back the encrypted result.

The first such cryptosystem was suggested by Craig Gentry in his PhD thesis. The main problem with this approach is execution speed: it is dramatically slower than performing the same operations on plaintext data. However, the initial ideas have been developed further in recent years and these days it is feasible to perform inference on encrypted data using an existing model. This has become even easier recently with Intel’s release of he-transformer, a backend for their nGraph compiler, which makes it possible to perform inference with an already trained TensorFlow model.

There are also a few popular open source libs implementing different FHE cryptosystems that can be of interest: SEAL, TFHE, HElib, PALISADE.

Training any reasonably sized machine learning model with FHE does not yet seem practical, but it may become so if further research breakthroughs are achieved.

Some more useful references:

Differential Privacy

SMPC would not be very suitable when there are a lot of private data sources, say millions. The main problem would be the tremendous amount of network communication, because with SMPC every computing player is expected to communicate with every other. Let’s consider an application that has millions of users and collects certain statistics from them. In many cases collecting the plaintext data from each user is not an option because it violates the user’s privacy. And even if the data can be collected, we must be very careful about how it is used in order to avoid leaking information about individual users.

A technique called differential privacy (DP) can be quite helpful in such scenarios. Popular real-world applications that resemble this situation can be found in Apple’s and Google’s products. Apple collects information about the websites people visit on their devices or the emojis that are used in certain situations. Google has applied a similar approach in the Chrome browser with its RAPPOR technique.

Differential privacy is a statistical technique used when answering queries on a dataset. It was first formalised by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. The goal is to produce accurate results without revealing private information about individual contributors to the dataset. In short, the idea is to add noise to the data, drawn from some distribution, usually Laplace or Gaussian. The original values are hidden this way, but aggregating enough of them is expected to cancel out the added noise and still produce meaningful results.

An important term behind this approach is the privacy budget, a value that quantifies how much information leakage about the plaintext data can be allowed. Information leakage occurs when queries are made against the underlying dataset, because each query result reveals something about the data on which it was computed. When enough query results are accumulated, it may become possible to infer meaningful information about the plaintext data. The goal of differential privacy is to limit the allowed queries and to apply enough noise to stay below some preset privacy budget, which bounds how much the underlying private data can leak to a potential adversary.

As you can see, there is a trade-off between privacy and accuracy of the results. Applying more noise generally means better privacy but less accurate results. Unfortunately, DP does not come for free, but with enough data it can be applied without sacrificing people’s privacy while still producing useful results.
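Here is a minimal sketch of perhaps the simplest DP mechanism, the Laplace mechanism, applied to a counting query. The epsilon parameter corresponds to the privacy budget spent by this single query: a smaller epsilon means more noise and stronger privacy.

import numpy as np

def private_count(values, epsilon):
    # the sensitivity of a counting query is 1: adding or removing one person
    # changes the true count by at most 1, so the noise scale is 1 / epsilon
    true_count = np.sum(values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

has_disease = np.random.binomial(1, 0.1, size=10_000)  # synthetic 0/1 data
print("true count:   ", has_disease.sum())
print("private count:", round(private_count(has_disease, epsilon=0.5), 1))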

Why is all that needed? One could ask whether anonymisation of users’ data is good enough to hide the private bits of information. The past has shown that this is not always the case. Probably the most famous example is the Netflix Prize contest, for which an anonymised dataset of movie ratings from Netflix users was provided as input. Researchers later showed that, using additional external data from IMDb, they could identify which records in the contest dataset belonged to particular people who also had IMDb accounts. This information could be used to infer political views and other personal details about actual individuals.

A good and simplified way to describe the privacy guarantees required by DP is that the results for a query on a given dataset should not differ significantly regardless of whether the dataset contains a particular record or not. And below is an oversimplified example of what this could mean:

A dataset contains records for separate people and for each person it has a boolean value — does the person have a certain disease or not. Let’s assume an adversary could query for the sum of values for any subset of people. It would be possible to send two queries then — 1) the sum for all people in some set U and 2) the sum for all people in U without X, where X is some particular person. If the two sums are the same then this means that X didn’t have the disease and vice versa.

Here we would want the answers to such queries to be noisy enough that nothing can be inferred about person X by comparing the results with and without their record. A simple and useful technique to read about is randomized response: each person perturbs their own answer before reporting it, for example by answering honestly only when a coin flip comes up heads and randomly otherwise. Modifying every single answer like that makes it impossible to simply read off what each person answered. Even if an adversary learns person X’s “randomized answer”, they cannot be sure that it is X’s real value. And this is what we need in order to guarantee the privacy of the people who have contributed their personal data to a dataset.
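A minimal sketch of randomized response for the yes/no disease question from above: each person answers honestly only half of the time and otherwise reports a random bit, and the aggregator debiases the noisy reports to recover the overall fraction.

import numpy as np

def randomized_response(true_answers):
    # with probability 0.5 report the truth, otherwise report a random bit
    honest = np.random.random(len(true_answers)) < 0.5
    random_bits = np.random.randint(0, 2, len(true_answers))
    return np.where(honest, true_answers, random_bits)

def estimate_true_fraction(reported):
    # E[reported] = 0.5 * p + 0.25, so invert that to recover p
    return 2 * reported.mean() - 0.5

truth = np.random.binomial(1, 0.3, size=100_000)   # synthetic private answers
reported = randomized_response(truth)
print("true fraction:     ", truth.mean())
print("estimated fraction:", round(estimate_true_fraction(reported), 3))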

At this point it’s good to distinguish between the two use cases for differential privacy covered above. One is the case in which individual users contribute their data to some aggregator so that it can be used in some way. In this scenario the actual plaintext data does not have to be sent to a central location: each user can perturb their data before sending it over. This is usually called local DP, and it is what the Apple and Google examples above use. An alternative approach involves a trusted party, which receives the plaintext data from each separate contributor and applies DP to the aggregated data. This is usually called global DP.

Local vs global differential privacy. Mobile devices interact with a central server. Red arrows indicate the steps when noise is added in the two cases.
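A tiny sketch contrasting the two settings on a sum query (the epsilon value is just for illustration): with local DP every user perturbs their own value before reporting it, while with global DP a trusted aggregator sums the true values and adds a single draw of noise, which is why the global variant typically gives more accurate answers at a comparable privacy level.

import numpy as np

values = np.random.binomial(1, 0.3, size=100_000)  # each user holds one bit
epsilon = 0.5

# local DP: every user adds their own noise before reporting anything
local_reports = values + np.random.laplace(0, 1 / epsilon, size=len(values))
local_estimate = local_reports.sum()

# global DP: a trusted aggregator sums the plaintext values, adds noise once
global_estimate = values.sum() + np.random.laplace(0, 1 / epsilon)

print("true sum:     ", values.sum())
print("local DP sum: ", round(local_estimate))
print("global DP sum:", round(global_estimate))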

Above we mentioned the notion of a privacy budget and how each additional query against a dataset consumes part of it, which generally limits the number of queries that can be answered while preserving differential privacy. Recently there has been some great research on answering an unlimited number of queries while still guaranteeing DP, namely the PATE framework. The suggested approach makes it possible to maintain a constant privacy budget regardless of the number of queries made. A curious side effect is that using this technique to achieve DP also regularises the trained model, which is described as a “pleasant synergy between privacy and learning” by the authors of PATE.

And one final thing to check out: recently a new library for differential privacy was released by the TensorFlow team. It has a tutorial in the repo and is generally a good start for playing around with DP.

More useful references:

  • Great article talking about DP in general, local vs global, some possible uses of SMPC and FHE to the global case, and more
  • Quick intro to DP with some real-world examples
  • In case you are interested in reading some of the first publications shaping up the term differential privacy here are a couple
  • A couple of great articles for beginners that go into some of the math and have code samples

Federated Learning

A related approach to differential privacy is federated learning (FL). Instead of having separate players send noisy data to a central location, you could have a system in which the model is sent to the users for local training on their local data. Each player gets a copy of the model’s parameters, trains it on some private data it owns and sends back the updates it has computed. The updates received from multiple players are then averaged and applied to the existing model.

The technique was initially developed by Google. It has been used in the Gboard application to train a better model for next word predictions. But it could also be used to personalise the suggestions made on each separate phone.

An important advantage of this approach is that no private data needs to leave its original location. Compare this to the differential privacy approach where data would generally have to be collected in one central place in one form or another. However, things are not so simple and additional care needs to be taken to achieve the necessary levels of privacy. But let’s first give some more details about how FL works.

At each step a subset of all players is selected and receives the latest version of the model. Using their local data, these players train the model for a number of iterations, usually with some gradient-based approach. The computed gradients are sent back to a server, whose job is to average them into one unified update.
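A minimal sketch of such a round with plain NumPy and a simple linear model; here the players return locally trained weights and the server averages them, which is one common variant of the idea (all names and numbers are illustrative, not the actual production implementation).

import numpy as np

def local_update(weights, X, y, lr=0.1, steps=10):
    # one player: a few gradient descent steps on its own private data
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def federated_round(global_weights, player_datasets):
    # each selected player trains locally; the server only sees weight updates
    local_weights = [local_update(global_weights, X, y) for X, y in player_datasets]
    return np.mean(local_weights, axis=0)        # the server averages the updates

true_w = np.array([2.0, -1.0])
players = []
for _ in range(5):                               # five players with private data
    X = np.random.randn(100, 2)
    players.append((X, X @ true_w + 0.1 * np.random.randn(100)))

w = np.zeros(2)
for _ in range(20):                              # twenty federated rounds
    w = federated_round(w, players)
print("learned weights:", np.round(w, 2))        # should be close to [2, -1]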

Another simplified but hopefully useful diagram: 1) the server sends the current model to a set of players (e.g. mobile devices), 2) each player trains the model on local data and computes gradients, 3) players send the computed gradients back to the server. The server then averages the received data, updates the central version of the model and repeats the procedure with another set of players.

Another nice benefit of the approach is that a lot of the computation is moved to the players, which could be mobile devices, various gadgets connected to the Internet, or laptops and desktop machines. The server is no longer loaded with all the heavy training of deep learning models, and no huge amounts of data need to be collected and stored securely. Let’s not forget one more feature that we get for free: because the model is sent to the separate players, they can use it even when they are offline. There is no need to send a request to a central server every time inference is needed; the model is available locally and the latency of getting predictions from it is negligible compared to the standard client-server setting. So, in addition to the main goal, which is to achieve data privacy while training and using a model, we get a bunch of other useful benefits along the way.

Now, let’s look at a few issues that need special attention. First of all, a model like a deep neural network can contain quite a lot of parameters. Unlike distributed deep learning training in a datacenter, with FL we don’t have the guarantee of high-speed network connections. This can present a challenge, so FL usually involves compressing the gradients that get sent back from the players to the server. There are several approaches to compression that have been shown to work well in production.

Next, even if no actual private data is sent over to a server, it could be possible to infer too much about the training data by looking at the trained model or the gradient updates themselves. To avoid this kind of privacy issue one could turn to the previously discussed differential privacy technique. In short, noise could be added to the reported gradient updates in order to hide the actual data that is private.

An alternative idea called secure aggregation, developed by a group of researchers primarily from Google, allows a server to aggregate encrypted values from multiple players only once a certain number of these values has been received. This means that no individual values can be decrypted, so an adversary cannot look at individual gradient updates, for example.
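The core trick can be sketched with pairwise random masks: every pair of players agrees on a random value, one adds it to its update and the other subtracts it, so each individual report looks random to the server but the masks cancel out in the sum. The real protocol adds key agreement, handling of dropped-out players and more; this is only a toy version with integer updates.

import random

PRIME = 2**31 - 1

def masked_updates(updates):
    n = len(updates)
    masked = [u % PRIME for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # players i and j agree on a shared random mask;
            # i adds it and j subtracts it, so it cancels in the sum
            mask = random.randrange(PRIME)
            masked[i] = (masked[i] + mask) % PRIME
            masked[j] = (masked[j] - mask) % PRIME
    return masked

updates = [5, 11, 26]                  # the players' private (integer) updates
reports = masked_updates(updates)      # what the server actually receives
print("individual reports:", reports)  # look like random numbers
print("aggregated sum:    ", sum(reports) % PRIME)  # equals 5 + 11 + 26 = 42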

Finally, the nature of FL makes it possible for individual players to tweak a model to work for their particular case. For example a user of a mobile phone could start seeing word suggestions that are specifically tailored towards the word usage they have on their phone. There are certain challenges in making this work properly but it certainly is one more advantage that comes along with FL.

More useful references:

Overview

Nowadays there are plenty of real-world use cases that require privacy-preserving computations. Training deep neural nets and using them for inference may be the main source of this need, but there are many others as well. Data is generated by people and various devices at an accelerating pace, and much of it is private. This increases the demand for techniques that allow useful computations on such data without exposing it to anyone besides its owner. Depending on the use case, we can apply one or more different techniques, each with its own pros and cons; there is no universally best approach. That is why this article covers the most active areas of research and tries to show some of their practical applications at the moment. This is an exciting field to work in and it promises to become even more relevant in the near future. New research gets published often and has led to great progress in recent years. Luckily, there are also open-source projects actively looking for new contributors, as well as companies working on related products.



Anton Dimitrov

Software engineer passionate about algorithms, machine learning, privacy-preserving computation. https://www.linkedin.com/in/antondimitrov