Learn Privacy-Preserving Machine Learning with SecretFlow

Published in SecretFlow · 10 min read · Feb 23, 2024

Artificial intelligence (AI) driven by machine learning technologies has reshaped almost every aspect of our daily lives. From widely used language translation and image recognition services to the emerging ChatGPT¹ and Sora², AI brings a great deal of convenience to our society. At the same time, these everyday AI services constantly collect, store, and process our data, and the resulting data security and user privacy issues are raising growing public concern. In this article, we introduce a cutting-edge technology, privacy-preserving machine learning (PPML), which aims to achieve privacy protection and AI utility at the same time. In the remainder of this article, we cover the following topics for readers who are interested in PPML.

What is PPML? What is the difference between PPML and ML?
Why can PPML protect users' privacy and address data security concerns?
How can we use a popular open-source PPML framework, SecretFlow, to carry out PPML tasks?

Privacy-Preserving Machine Learning

Before delving deeper into PPML, we first briefly revisit the traditional machine learning (ML) pipeline. ML typically involves two main phases: training and prediction (also known as inference). Training refers to the process of generating a model from a large volume of pre-processed data using some pre-defined algorithms. Prediction, on the other hand, is the process of feeding inputs to the trained model and then producing some meaningful outputs.

More specifically, let's take OpenAI's ChatGPT as an example. To build ChatGPT, OpenAI's experts first train a large language model on vast amounts of textual data (such as Wikipedia). This training process is computation-intensive and extremely time-consuming, and it requires copious training data. Once training is done, OpenAI's developers encapsulate the model into the ChatGPT service, which is then opened to public users. Serving ChatGPT corresponds to the prediction phase: user prompts are sent to ChatGPT and act as inputs to the backend model, which then generates outputs and returns them to users.
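
The two phases can also be seen in miniature. The following is a minimal sketch, not related to how ChatGPT is actually built: it fits a one-variable linear model by gradient descent (training) and then applies it to a new input (prediction). The data, the learning rate, and the function names `train` and `predict` are assumptions chosen purely for illustration.

```python
# Toy illustration of the two ML phases: fit y = w * x + b, then predict.
# The data and hyperparameters below are made up for illustration.


def train(xs, ys, lr=0.05, steps=2000):
    """Training: derive model parameters (w, b) from data by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(params, x):
    """Prediction (inference): feed a new input to the trained model."""
    w, b = params
    return w * x + b


# Training data generated from y = 2x + 1.
train_xs = [0.0, 1.0, 2.0, 3.0, 4.0]
train_ys = [1.0, 3.0, 5.0, 7.0, 9.0]

params = train(train_xs, train_ys)
prediction = predict(params, 10.0)
print(f"w={params[0]:.3f}, b={params[1]:.3f}, prediction for x=10: {prediction:.3f}")
```

Note that `train` needs the raw training data and `predict` needs the raw input; both are points where private data is exposed to whoever runs the code.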

As we can see, both the training and prediction phases process data, which raises many concerns regarding data security. For instance, training a powerful model requires a vast amount of high-quality data. As illustrated in Figure 1, if these data come from various institutions, companies, and organizations, the data owners may not wish to expose their raw data to model trainers. Is there a way for trainers to produce a model without ever seeing the original training data? A similar scenario arises in the prediction phase. If a user's prompts or queries contain private information, is it still possible for the model to produce results without the user leaking this sensitive information to the service provider?

Fortunately, the answer to both questions is yes. The technology that addresses these issues is Privacy-Preserving Machine Learning (PPML). Currently, PPML comprises a series of approaches, including Secure Multi-Party Computation (MPC), Homomorphic Encryption (HE), Federated Learning (FL), Trusted Execution Environments (TEE), and others. We do not have enough space in this article to cover all PPML approaches, so our focus is MPC-based PPML.

Secure Multi-Party Computation

MPC is a cryptographic technique, first introduced by Andrew Yao in the 1980s through his famous Millionaires' problem, where two rich people want to compare their wealth without giving away the exact value³. More generally, MPC allows multiple parties to jointly evaluate a function without revealing any information except the final result.

Figure 2. Three parties jointly compute an addition function

To better understand how MPC works, let's consider a simple example. As illustrated in Figure 2, suppose Alice, Bob, and Charlie want to calculate the sum of their salaries without disclosing their salaries to each other. Each party starts by splitting their salary into three random values that add up to the salary, sending two of them to the other two parties, and keeping the remaining one. This process is normally called secret sharing⁴. Next, each party computes the sum of the value it kept and the two shares it received. These partial sums can then be exchanged among the parties and added together to obtain the final result, without revealing any individual salary. Through this example, we can see how MPC ensures privacy and enables multiple parties to jointly evaluate a function without compromising the confidentiality of sensitive data.
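
The salary example can be sketched in a few lines of Python. This is a simplified illustration that runs all three parties in one process, not a secure implementation. The salary figures, the modulus, and the helper name `split_into_shares` are assumptions made for the sketch; working modulo a fixed number is one common way to make each individual share look uniformly random.

```python
import secrets

MODULUS = 2**64  # all arithmetic is done modulo this value (an illustrative choice)


def split_into_shares(value, num_shares=3):
    """Split `value` into `num_shares` random-looking values that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_shares - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


# Hypothetical salaries, known only to their owners.
salaries = {"alice": 5200, "bob": 6100, "charlie": 4700}
parties = list(salaries)

# Step 1: each party splits its salary; share j goes to party j (one share is kept).
shares_from = {p: split_into_shares(s) for p, s in salaries.items()}

# Step 2: each party adds up the one share it kept plus the two it received.
partial_sums = [
    sum(shares_from[sender][j] for sender in parties) % MODULUS
    for j in range(len(parties))
]

# Step 3: the partial sums are published and combined into the final result.
total = sum(partial_sums) % MODULUS
print(total)  # 16000
```

Each party only ever sees one share of every other party's salary, and a single share on its own is just a random number.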

This is just a simple example of three parties jointly computing a sum; MPC is not limited to addition and is far more powerful (and also more complex). Cryptographers have developed many MPC protocols that support the evaluation of arbitrary functions among an arbitrary number of parties. Whether it’s four parties jointly training a model or two parties making a model prediction, all of these computations can be carried out privately with MPC. In general, MPC provides a powerful way to preserve privacy while still allowing collaboration and data sharing.

PPML hands-on with SecretFlow

With these preliminaries in place, let's move on to developing our own PPML programs. At first glance, writing PPML programs seems difficult. The good news is that many PPML frameworks are already available, offering high-level APIs that free developers from tedious low-level cryptographic protocol implementations. Using the building blocks provided by these frameworks, we can build PPML applications almost as easily as traditional ML applications.

One of the representative PPML frameworks is SecretFlow⁵, which is open-source and supports a range of privacy-preserving techniques, including MPC, HE, FL, and TEE. Additionally, SecretFlow has better compatibility with traditional ML programs and models compared with other PPML frameworks. In the following content, we will show how to develop a private text generation application based on the GPT-2 model⁶ using SecretFlow with a step-by-step tutorial.

GPT-2 is a transformer model developed by OpenAI. Before developing the PPML version, we first show how to develop a normal text generation application with GPT-2. The machine learning community has developed many frameworks and libraries for this purpose. Here, we choose jax⁷, developed by Google, and the transformers⁸ library, developed by Hugging Face. First, we install the required dependencies. As transformers is dependent on jax, we only need to install transformers here.

pip install transformers[flax]

The following code snippet demonstrates how to use the pre-trained GPT-2 model to generate text. This is the basic, non-private version of the text generation application we want. In brief, the code first loads the pre-trained GPT-2 model and tokenizer. We use the tokenizer to encode the query “I enjoy walking with my cute dog” and feed the encodings to the GPT-2 model. We then use a greedy algorithm to generate 10 tokens from the model and decode the output, getting the response “I enjoy walking with my cute dog, but I'm not sure if I'll ever”. We do not need to know every detail of this code, as our purpose is to convert it to a privacy-preserving version with SecretFlow. If you are interested, we recommend reading the Hugging Face documentation directly for more information.

from transformers import AutoTokenizer, FlaxGPT2LMHeadModel, GPT2Config
import jax.numpy as jnp


tokenizer = AutoTokenizer.from_pretrained("gpt2")
pretrained_model = FlaxGPT2LMHeadModel.from_pretrained("gpt2")


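def text_generation(input_ids, params):
    # Build the GPT-2 architecture; the weights are passed in separately via `params`.
    config = GPT2Config()
    model = FlaxGPT2LMHeadModel(config=config)

    # Greedy decoding: generate 10 tokens, one per iteration.
    for _ in range(10):
        outputs = model(input_ids=input_ids, params=params)
        # Logits for the last position of the first (and only) sequence in the batch.
        next_token_logits = outputs[0][0, -1, :]
        # Pick the most likely next token and append it to the sequence.
        next_token = jnp.argmax(next_token_logits)
        input_ids = jnp.concatenate([input_ids, jnp.array([[next_token]])], axis=1)
    return input_ids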


inputs_ids = tokenizer.encode("I enjoy walking with my cute dog", return_tensors="jax")
outputs_ids = text_generation(inputs_ids, pretrained_model.params)

print("-" * 65 + "\nRun on CPU:\n" + "-" * 65)
print(tokenizer.decode(outputs_ids[0], skip_special_tokens=True))
print("-" * 65)
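
The greedy loop inside text_generation can be looked at in isolation. The sketch below replaces GPT-2 with a made-up scoring function so that it runs without downloading any model; the vocabulary, `toy_logits`, and `greedy_generate` are hypothetical names invented for this illustration. At every step it scores all candidate tokens, takes the argmax, and appends it to the sequence, just as the loop above does with the model's logits.

```python
# Toy vocabulary and "model"; both are invented for illustration only.
VOCAB = ["I", "enjoy", "walking", "with", "my", "dog", "."]


def toy_logits(token_ids):
    """Return one score per vocabulary entry; favours the token after the last one."""
    last = token_ids[-1]
    favourite = (last + 1) % len(VOCAB)
    return [1.0 if i == favourite else 0.0 for i in range(len(VOCAB))]


def greedy_generate(token_ids, num_new_tokens):
    """Append the highest-scoring token, one step at a time."""
    token_ids = list(token_ids)
    for _ in range(num_new_tokens):
        logits = toy_logits(token_ids)
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        token_ids.append(next_token)
    return token_ids


prompt = [0, 1]  # "I enjoy"
generated = greedy_generate(prompt, 4)
print(" ".join(VOCAB[i] for i in generated))  # I enjoy walking with my dog
```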

Next, let's get started implementing our privacy-preserving version of text generation with SecretFlow. Of course, the first step is to install SecretFlow.

pip install secretflow

Then we begin to write our first SecretFlow program. We import the secretflow Python module and then initialize a SecretFlow cluster. Since SecretFlow is specifically designed for PPML scenarios, which normally involve multiple parties, the program inherently runs in a distributed manner. You can consider cluster initialization the first step of every SecretFlow program. Here, we initialize a cluster with three parties: Alice, Bob, and Charlie.

In this tutorial, we are setting up a local cluster that runs on a single machine for demonstration purposes. But more commonly, these clusters are not local and are composed of multiple computing nodes distributed across different institutions. The program we are writing here is essentially a driver program. The SecretFlow framework takes over the orchestration of job scheduling and distributes the tasks defined in the driver program to different backend nodes to execute.

import secretflow as sf

sf.init(["alice", "bob", "charlie"], address="local")

Next, we set up two PYU devices and one SPU device. In SecretFlow, a device is an abstraction of a computational unit. PYU is designated for plaintext computation. Here we create two PYU devices, alice and bob, for processing data at the Alice party and the Bob party respectively. SPU, on the other hand, is a special device that executes MPC computations. SPU allows for collaborative data processing in an "encrypted" form across multiple units. This encrypted data is distributed across a cluster of nodes as secret shares. Therefore, we need to specify the involved parties when we instantiate an SPU (line 2 of the code below). Lines 3-4 are some advanced configurations for SPU. We can ignore them here, as they do not affect how we understand SPU.

Beyond PYU and SPU, SecretFlow also supports other devices, like HEU which is based on Homomorphic Encryption, and TEEU which is based on Trusted Execution Environment. However, these devices are beyond the scope of this get-started tutorial. We will introduce them in the following articles.

alice, bob = sf.PYU("alice"), sf.PYU("bob")
conf = sf.utils.testing.cluster_def(["alice", "bob", "charlie"])
conf["runtime_config"]["fxp_exp_mode"] = 1
conf["runtime_config"]["experimental_disable_mmul_split"] = True
spu = sf.SPU(conf)

After setting up the necessary devices, we move on to writing the core logic code. As you can see, this code is almost the same as the non-privacy-preserving version mentioned above. The only difference is that we define two extra functions explicitly. The get_model_params function is used to obtain the parameters of the GPT-2 model (i.e., model weights), and the get_token_ids function is used to obtain the encodings of the user prompt “I enjoy walking with my cute dog”. The text_generation function remains unmodified.

from transformers import AutoTokenizer, FlaxGPT2LMHeadModel, GPT2Config
import jax.numpy as jnp


def get_model_params():
    pretrained_model = FlaxGPT2LMHeadModel.from_pretrained("gpt2")
    return pretrained_model.params


def get_token_ids():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    return tokenizer.encode("I enjoy walking with my cute dog", return_tensors="jax")


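def text_generation(input_ids, params):
    # Build the GPT-2 architecture; the weights are passed in separately via `params`.
    config = GPT2Config()
    model = FlaxGPT2LMHeadModel(config=config)

    # Greedy decoding: generate 10 tokens, one per iteration.
    for _ in range(10):
        outputs = model(input_ids=input_ids, params=params)
        # Logits for the last position of the first (and only) sequence in the batch.
        next_token_logits = outputs[0][0, -1, :]
        # Pick the most likely next token and append it to the sequence.
        next_token = jnp.argmax(next_token_logits)
        input_ids = jnp.concatenate([input_ids, jnp.array([[next_token]])], axis=1)
    return input_ids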

The following code explains why we need to define these two functions in SecretFlow. First, we run the functions get_model_params and get_token_ids on the two PYU devices, i.e., Alice and Bob, to obtain model_params and input_token_ids, respectively. If we use the Python print function to print these two variables, we would see that they are just two PYU Objects, and we cannot see their actual values. This is because the program is a driver program: the SecretFlow framework executes these two functions remotely on the Alice and Bob parties. Although the model parameters and user inputs are public in this demonstration program, in real-world scenarios Alice and Bob could read data from local files, which would not be visible to the other participating parties.

Then we use the to API to move model_params and input_token_ids to SPU, call the text_generation function on SPU, and get output_token_ids as the result. The code is straightforward, but the process behind it is not simple. model_params and input_token_ids are split into secret shares and distributed to the SPU nodes, where the text_generation function is executed using MPC techniques. SecretFlow conceals all these complexities from us. We can see that Alice holds the model weights and Bob holds the input prompt. They are unaware of each other's data, yet they can still execute the text generation function together.

Here is another question: what role does Charlie play, as he does not seem to hold any data? Indeed, Charlie is merely a participant in the MPC computation within the SPU and does not hold any original data. We construct an SPU with three parties here because the three-party MPC protocol is much faster than the two-party MPC protocol. We could also define an SPU with only two parties involved if necessary.

model_params = alice(get_model_params)()
input_token_ids = bob(get_token_ids)()

device = spu
model_params_, input_token_ids_ = model_params.to(device), input_token_ids.to(device)

output_token_ids = spu(text_generation)(input_token_ids_, model_params_)
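
To build some intuition for what "split into secret shares" means for floating-point model weights, here is a conceptual sketch in plain Python. It is not SecretFlow's or SPU's actual implementation: the ring size, the number of fractional bits, the sample weights, and the helper names `encode`, `decode`, `share`, and `reconstruct` are all assumptions made for illustration. The idea is to first map each float to an integer with fixed-point encoding and then split that integer into additive shares, one per node.

```python
import secrets

RING = 2**64          # shares live in the integers modulo 2**64 (assumed ring size)
FRACTION_BITS = 18    # assumed number of fractional bits for fixed-point encoding


def encode(x):
    """Map a float to a ring element using fixed-point encoding."""
    return round(x * 2**FRACTION_BITS) % RING


def decode(v):
    """Map a ring element back to a float, treating the upper half as negative."""
    if v >= RING // 2:
        v -= RING
    return v / 2**FRACTION_BITS


def share(v, num_parties=3):
    """Split a ring element into additive shares, one per node."""
    parts = [secrets.randbelow(RING) for _ in range(num_parties - 1)]
    return parts + [(v - sum(parts)) % RING]


def reconstruct(parts):
    """Recombine all shares of a value; only possible when every share is available."""
    return sum(parts) % RING


weights = [0.75, -1.5, 2.25]  # stand-in for model parameters
shared_weights = [share(encode(w)) for w in weights]

# Each node j would hold only column j: [s[j] for s in shared_weights].
recovered = [decode(reconstruct(s)) for s in shared_weights]
print(recovered)  # [0.75, -1.5, 2.25]
```

No single node's column reveals anything about the weights; the values only reappear when all shares are brought back together.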

If we directly print output_token_ids, we would find that it is also a kind of device Object without any actual values. To obtain the plaintext value of output_token_ids, we need to call the reveal API. Typically, in a PPML application, we only reveal data when all participating parties have agreed upon its disclosure. Running the last code snippet, we would observe that the text generation result computed through MPC is consistent with that of the non-privacy-preserving version, i.e. “I enjoy walking with my cute dog, but I’m not sure if I’ll ever”.

outputs_ids = sf.reveal(output_token_ids)
print("-" * 65 + "\nRun on SPU:\n" + "-" * 65)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.decode(outputs_ids[0], skip_special_tokens=True))
print("-" * 65)

Conclusion

Through this simple tutorial, we have learned some basics of PPML and how to use the SecretFlow framework to write our own PPML programs. This is just the beginning. The details of PPML and SecretFlow are much more fascinating and complex than what we have introduced here. If you are interested in PPML and SecretFlow, you can refer to the official documentation for more information, or join the SecretFlow Discord for further discussion.

References

1. https://chat.openai.com/
2. https://openai.com/sora
3. Yao, Andrew C. "Protocols for secure computations." 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982). IEEE, 1982.
4. Shamir, Adi. "How to share a secret." Communications of the ACM 22.11 (1979): 612-613.
5. https://github.com/secretflow/secretflow
6. Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
7. https://github.com/google/jax
8. https://github.com/huggingface/transformers