Privacy-Preserving Training/Inference of Neural Networks, Part 1

This is Part 1 of a series of 3 posts. Part 2 is here, and Part 3 is here.

Daniel Escudero
The Sugar Beet: Applied MPC
8 min read · Jan 21, 2020


In recent years, machine learning (ML) has enabled the recognition of sophisticated patterns in large amounts of high-dimensional data. As computational power increases, more patterns can be unveiled, yielding applications that may not have been conceivable before.

There is little doubt that ML is an invaluable achievement. However, as more applications are discovered, obstacles are encountered as well. Some are of a technical nature, like the lack of computational resources, training data, or results from the research community. Others are of a more social nature. For example, there are plenty of scenarios where machine learning can provide unprecedented benefits but where the training data is highly sensitive.

As an example, consider DNA analysis. This task requires multiple samples from many patients treated by different medical institutions. The data owners may not be willing to share their data with a central service in charge of training the final model, due to the sensitive nature of medical data.

Observe that this trust issue does not only arise at training time but also at inference time: some users may not be willing to provide their DNA information for analysis. One way to solve this (in the case of inference over a single record) could be to make the prediction model public, so that the data owner can perform inference on her own. Unfortunately, even the model can be considered sensitive in many settings: it may constitute intellectual property, it may be the result of expensive training and data analysis, and it may represent a valuable economic asset for its owner. If no party is willing to disclose information, how can this obstacle be overcome?

This “paradox” can be seen as a particular case of the problem of computing a given function on “hidden” data. Fortunately, the relevance of such a task is well known, and it has long been studied by the cryptographic research community. This fruitful line of research has produced extremely positive results over the last few decades, with more solutions crossing the boundary of practicality every year.

In this series of posts we will discuss some of these techniques in the context of privacy-preserving machine learning (PPML), which is concerned with applying machine learning techniques (e.g. training and prediction) to sensitive data with much stronger privacy guarantees than in the scenario where both the model and the data are known to a single party. More specifically, we tackle Neural Networks (including Convolutional Neural Networks), leaving the discussion of other types of ML models to a later post.

This first post gives a high-level description of the tasks considered here and of what existing solutions can offer in general. The other parts will discuss existing research works in more detail.

Securely Training a Neural Network

The task of training a neural network while preserving the privacy of the training data has many applications, especially today, when individuals and organizations are increasingly concerned with how their data is handled by companies and institutions. However, what “security” means here is not completely obvious. In general, one would like to ensure that the data used for training remains known only to the party (or parties) that provided it in the first place.

We may think of the data used for training as a matrix, with the rows being the different records and the columns corresponding to the different attributes/features. This data may be horizontally partitioned, meaning that each data owner has some rows of the matrix, or vertically partitioned, meaning that each data owner has some columns of it. One can also consider a mixture of the two.
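
To make this concrete, here is a small illustrative sketch (the dataset, its dimensions, and the two-owner split are made up for this example) of how a horizontally and a vertically partitioned dataset might look:

```python
import numpy as np

# Toy dataset: 6 records (rows), 4 features (columns). Purely illustrative.
data = np.arange(24).reshape(6, 4)

# Horizontal partitioning: each owner holds complete records (rows).
owner_a = data[:3, :]   # records 0-2, all features
owner_b = data[3:, :]   # records 3-5, all features
assert np.array_equal(np.vstack([owner_a, owner_b]), data)

# Vertical partitioning: each owner holds some features (columns) of every record.
owner_a = data[:, :2]   # all records, features 0-1
owner_b = data[:, 2:]   # all records, features 2-3
assert np.array_equal(np.hstack([owner_a, owner_b]), data)
```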

In theory, Multiparty Computation (MPC) can help in this direction by letting the different data owners run the training algorithm on this data using a general-purpose MPC protocol, which results in the final model being output without leaking the original data or the intermediate steps (we are ignoring a real issue here, namely the information that the final model itself may leak about the training data).

Unfortunately, some problems arise with this approach. First of all, a general-purpose MPC protocol may not be suitable for the specific type of computations arising in NN training. Furthermore, even if a more optimized protocol is put in place, the complexity of this solution tends to grow as the number of data owners increases. Since it is easy to think of scenarios where many data owners contribute to training the model (e.g. building a text prediction app from data held by thousands of customers), the scalability of such a generic approach is questionable. Finally, in many settings the data owners cannot be assumed to have the computational and networking resources needed to execute MPC protocols.

A more common approach, followed by many MPC-based secure training works, is the client-server model (also known as the delegation model). In this model, data owners secret-share their information to a set of computation servers, who then perform training on this data while revealing only the final model. This model assumes that no more than a certain threshold of the servers is passively/actively corrupt (among other things). One of the main advantages of this method is scalability: an arbitrary number of clients can provide input to a fixed number of servers, and the clients may be computationally weak since they only communicate with the computation servers once (or twice, if the clients are to receive output as well). Also, the initial distribution of the data is irrelevant for the computation, since the data is ultimately secret-shared among the computation servers.
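
As a rough sketch of the first step in this model, here is how a client could additively secret-share a single value among three computation servers. The modulus, the number of servers, and the function names are assumptions made for illustration; actual protocols share (fixed-point encodings of) every entry of the training data in a similar fashion.

```python
import secrets

P = 2**61 - 1  # a public prime modulus (illustrative choice)

def share(x, n_servers=3):
    """Split x into n_servers additive shares that sum to x modulo P."""
    shares = [secrets.randbelow(P) for _ in range(n_servers - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recover the secret by adding all shares modulo P."""
    return sum(shares) % P

secret = 42
shares = share(secret)
# Any strict subset of the shares is uniformly random and reveals nothing;
# only all shares together reconstruct the secret.
assert reconstruct(shares) == secret
```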

The main downside of the client-server setting is its trust assumption: clients must be willing to believe that the security assumption holds for the servers, that is, that no more than a certain threshold of servers collude; otherwise the protocol is insecure (either passively or actively, depending on the underlying guarantees). This can be a worthwhile trade-off, as the servers used for the computation can be more powerful, have good connections among them, and be set up so that the trust assumptions are as weak as the application allows (this increases efficiency since, the lower the corruption threshold, the lighter MPC protocols tend to be). Finally, another advantage of the delegation setting is that the computation parties obtain a secret-shared version of the trained model, which can later be used to perform secure prediction, as we discuss in the next section.

MPC is not the only technology that can be used for training a neural network without leaking the training data. Another widely studied technique that is well suited to this setting is Homomorphic Encryption, which allows computation directly on encrypted data. We will not discuss this approach extensively here; in the next post we will present specific research works that deal with this task and touch upon settings where homomorphic encryption is used.

Finally, another popular technique is so-called secure aggregation (see for example the work of Bonawitz et al.), in which the training algorithm is not executed securely as a whole. Instead, only the phases that involve the training data are computed with the help of the data owners, thus reducing the communication and computation complexity of the algorithm. A bit more concretely, training does not use the individual records directly, but rather a certain weighted sum of some function of all the records. With this observation in mind, one can use a small secure computation protocol for this specific step that guarantees that only this weighted sum is revealed (hence the term secure aggregation).
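
The pairwise-masking idea at the heart of this approach can be sketched as follows. This is a heavily simplified toy version of the technique in Bonawitz et al.: each client's update is a single integer, there is no handling of dropouts, and the pairwise masks are sampled directly here instead of being derived from a key agreement between the clients.

```python
import secrets

MOD = 2**32  # public modulus for the masked updates (illustrative)

def mask_updates(updates):
    """Each pair of clients (i, j) agrees on a random mask; client i adds it
    and client j subtracts it, so all masks cancel in the aggregate sum."""
    n = len(updates)
    pair_masks = {(i, j): secrets.randbelow(MOD)
                  for i in range(n) for j in range(i + 1, n)}
    masked = []
    for i, u in enumerate(updates):
        m = u
        for j in range(n):
            if i < j:
                m = (m + pair_masks[(i, j)]) % MOD
            elif j < i:
                m = (m - pair_masks[(j, i)]) % MOD
        masked.append(m)
    return masked

client_updates = [10, 20, 30, 40]           # each client's local contribution
server_view = mask_updates(client_updates)  # individually these look random...
aggregate = sum(server_view) % MOD          # ...but their sum is exact
assert aggregate == sum(client_updates) % MOD
```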

Securely Evaluating a Neural Network

Another natural task is to evaluate an existing, pre-trained neural network while preserving the privacy of both the input and the network itself.¹ This setting arises very often in practice. For example, imagine a company that has made a considerable investment in training a model with the goal of profiting from it by offering, say, Machine Learning as a Service (MLaaS). The company does not want its model to be leaked, since it constitutes a valuable asset, so clients send their inputs to the company for inference. However, depending on the application, these inputs may be sensitive and clients may not be willing to give away their data.

Just like in the case of secure training, MPC techniques can be leveraged to develop secure inference tools. A direct way of achieving this is to let the data owner and the model owner execute a secure two-party protocol that evaluates the model on the given data. Alternatively, both parties may secret-share their data to some computation servers, who run the evaluation procedure and return the result, as in the client-server model discussed previously. This is effectively what many MPC-based secure prediction works do, and it has the advantage that the computation parties may be in the honest-majority setting (where fewer than half of the parties are assumed to collude), which is typically more efficient than the two-party setting. For example, it is becoming increasingly popular to consider three parties with at most one corruption, since specific protocols for this setting can be heavily optimized.
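
To give a flavour of the kind of building block such protocols rely on, below is a toy two-party multiplication of two secret-shared values using a precomputed multiplication triple (Beaver's technique). This is not the protocol of any specific work mentioned in this post: the dealer is simulated locally, and the inputs are plain integers rather than fixed-point encodings of actual weights and activations.

```python
import secrets

P = 2**61 - 1  # public prime modulus (illustrative)

def share2(x):
    """Additively share x between two parties modulo P."""
    s0 = secrets.randbelow(P)
    return s0, (x - s0) % P

def open2(s0, s1):
    """Reconstruct a value from its two additive shares."""
    return (s0 + s1) % P

# Preprocessing (here a simulated dealer): a random triple a*b = c, shared.
a, b = secrets.randbelow(P), secrets.randbelow(P)
c = (a * b) % P
a0, a1 = share2(a)
b0, b1 = share2(b)
c0, c1 = share2(c)

# Secret inputs: say, a model weight w = 7 and a client feature x = 12.
w0, w1 = share2(7)
x0, x1 = share2(12)

# The parties open d = w - a and e = x - b; this leaks nothing about w or x
# because a and b are uniformly random one-time pads.
d = open2((w0 - a0) % P, (w1 - a1) % P)
e = open2((x0 - b0) % P, (x1 - b1) % P)

# Local computation of shares of w*x (the public constant d*e goes to party 0).
z0 = (d * e + d * b0 + e * a0 + c0) % P
z1 = (d * b1 + e * a1 + c1) % P
assert open2(z0, z1) == (7 * 12) % P
```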

Finally, it is important to notice that many of the techniques we will discuss would become substantially more efficient if the model could be made public. This setting may seem unnatural at first: if the model is public, why not let the clients perform the evaluation on their own? One may argue that, if the model is too complex to run on their own devices, clients may want to outsource the evaluation; however, if a model is too complex to run in the clear on a (possibly resource-constrained) client device, it is natural to expect that evaluating it securely would be impractical as well. Another natural scenario is one where the data on which the prediction is performed is already secret-shared, or “encrypted”, and therefore no single party could run the evaluation in the clear. A good example of this is a service that keeps e-mails in secret-shared or encrypted form, but still uses an ML model for spam detection.
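
As a small sketch of why a public model helps, note that a linear layer with public weights can be applied to additively secret-shared data without any interaction: each server simply applies the layer to its own share. The modulus, sizes, and two-server setup below are illustrative assumptions, and non-linear layers such as ReLU would of course still require an interactive protocol.

```python
import numpy as np

MOD = 2**32                    # public modulus (illustrative)
rng = np.random.default_rng(0)  # for the sketch only; not a secure RNG

W = rng.integers(0, 10, size=(3, 4), dtype=np.int64)  # public weight matrix
b = rng.integers(0, 10, size=3, dtype=np.int64)       # public bias
x = rng.integers(0, 10, size=4, dtype=np.int64)       # private client input

# Additively secret-share the input between two servers.
x0 = rng.integers(0, MOD, size=4, dtype=np.int64)
x1 = (x - x0) % MOD

# Each server applies the public linear layer to its own share locally;
# the public bias is added by one server only.
y0 = (W @ x0 + b) % MOD
y1 = (W @ x1) % MOD

# The shares of the result still reconstruct to the layer's true output.
assert np.array_equal((y0 + y1) % MOD, (W @ x + b) % MOD)
```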

In this first post we discussed the tasks of training and evaluating a neural network securely, as well as some general templates for using MPC techniques. A careful reader will notice that this write-up could use many more references and more thorough explanations. Not to worry, though! The next two parts will deal with specific works that approach these tasks, and detailed references and explanations will be included in those posts.

Footnotes

  • ¹Most existing works actually leak the architecture of the model (e.g. the number and type of layers, as well as their dimensions), given that this information is usually needed to carry out the computation. Only purely Homomorphic-Encryption-based approaches avoid this leakage, since in that case the evaluation of the model is done at the model owner's end.
