Building Secure Aggregation into TensorFlow Federated

Jason Mancuso
Aug 20, 2020

With Morten Dahl and Yann Dupis.

Summary. We introduce a secure aggregation protocol into TensorFlow Federated based on encryption primitives from TF Encrypted. This post describes our motivation, presents the protocol, and highlights interesting details of our implementation. We assume some technical familiarity with TensorFlow Federated, secure aggregation, Paillier homomorphic encryption, and public key cryptography. We’ve open-sourced our code here.

The post is structured as follows:

  • Overview
  • Protocols & subprotocols
  • Implementation details: Integration strategies
  • Implementation details: Implementing the secure sum


The TensorFlow ecosystem currently has three core pillars for privacy-preserving machine learning (PPML):

  • TensorFlow Federated for federated computation (TFF)
  • TensorFlow Privacy for differentially private learning
  • TF Encrypted for encrypted computation (TFE)

TensorFlow Federated (TFF) provides the Federated Core language for federated computation, as well as a set of higher-level Federated Learning APIs. These could easily benefit from the primitives for encrypted computation built into TF Encrypted (TFE). For example, one may want to protect client data from the server in a federated platform like TFF, in which case some kind of secure aggregation protocol is necessary. Ideally, using TFE to build secure aggregation into TFF would be seamless. However, the libraries were developed separately, so there are no obvious integration points.

This situation led Morten to author an RFC for integrating TFE with TFF. As a first attempt, our team decided to build a specific secure aggregation protocol in TFF using primitives from TFE. We implement a passively secure version of the tff.federated_secure_sum intrinsic, which is currently left unimplemented in TFF’s native backend. Our work is aimed at users of the Federated Core, but could eventually be exposed to the higher-level tff.learning API as a custom tff.templates.IterativeProcess (e.g. for use with Keras). For those interested in using it to train models, the recently added tff.utils.secure_quantized_sum would be a great place to start.

We use Paillier homomorphic encryption to perform the secure sum, and we use standard public key cryptography to ensure no party can learn values that they shouldn’t have access to in transit. We differentiate between two central parties, the Server (holding the Paillier decryption key) and the Aggregator (aggregating Paillier encrypted values). We assume that these two parties will not collude. Our implementation is designed with cross-silo federated learning between organizations in mind. Cross-silo is distinct from cross-device learning in that organizations can typically provision highly available servers of varying compute resources, which is generally not true of cross-device learning on mobile phones or embedded devices. In particular, some of the dependencies we rely on may not be suitable for mobile/embedded execution.

Protocols & subprotocols

Our protocol is a composition of two well-known primitives: Paillier homomorphic encryption and authenticated public-key encryption. We assume the existence of three distinct sets of parties: several Clients, one Server, and one Aggregator. These correspond to placements in the TFF language (note that the Aggregator is a new placement we’ve introduced). The goal of the protocol is for the sum of the Clients’ input values to be realized on the Server, without the Server learning any of the input or intermediate values. Informally, the Clients encrypt their values under the Paillier cryptosystem and send the resulting ciphertexts to the Aggregator, which performs the encrypted addition. The Server maintains the only copy of the decryption key, so no single party ever holds both the Clients’ ciphertexts and the decryption key, thereby preserving the privacy of the Clients’ inputs. The animation below illustrates this secure sum.
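
To make the idealized Paillier aggregation concrete, here is a minimal pure-Python sketch of the three roles. It is not our TFF implementation: it uses textbook Paillier with g = n + 1, two small Mersenne primes chosen only so the example runs instantly, and hypothetical function names (keygen, encrypt, add, decrypt). A real deployment would use primes of 1024 bits or more.

```python
import math
import secrets

# Toy parameters (assumption for illustration): two small Mersenne primes.
# These are far too small to be secure.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """Server: return (public_key, secret_key) for textbook Paillier, g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return n, (lam, mu)


def encrypt(n, m):
    """Client: encrypt an integer 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    # (1 + n)^m = 1 + m * n (mod n^2), so no exponentiation is needed for g^m.
    return ((1 + m * n) * pow(r, n, n2)) % n2


def add(n, c1, c2):
    """Aggregator: multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, secret_key, c):
    """Server: recover the plaintext from a ciphertext."""
    lam, mu = secret_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


# Server generates keys and publishes only the public key.
public_key, secret_key = keygen()

# Each Client encrypts its own value.
client_values = [3, 14, 15, 92]
ciphertexts = [encrypt(public_key, v) for v in client_values]

# The Aggregator combines ciphertexts without ever seeing the secret key.
encrypted_sum = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_sum = add(public_key, encrypted_sum, c)

# The Server decrypts only the aggregate.
total = decrypt(public_key, secret_key, encrypted_sum)
print(total)
```

In this sketch the Aggregator only ever calls add, and the Server only ever decrypts encrypted_sum, which mirrors the non-collusion assumption between the two parties.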

However, since the Aggregator placement is something we’ve introduced, TFF does not support direct communication between the Aggregator and Clients. In particular, TFF requires all communication to pass through the driving Python script (aka driver program), and this program is usually collocated with the Server role. Thus, in order to pass values from the Clients to the Aggregator, we must route the Clients’ ciphertexts through the Server. However, since the Server holds the Paillier decryption key, it would have a temporary opportunity to decrypt the Clients’ ciphertexts, which would be a security violation.

We use a second protocol for securely communicating values from the Clients to the Aggregator, with the Server acting as an honest-but-curious router. This subprotocol uses a standard encryption implementation from libsodium for authenticated encryption. The graphic below illustrates the details of this protocol.
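
The routing idea can be sketched with the Python standard library alone. The code below is a stand-in, not libsodium and not our implementation: it assumes a toy Diffie-Hellman exchange over a small prime field, a SHA-256 counter-mode keystream, and an HMAC tag (encrypt-then-MAC). All function names are hypothetical, and none of this should be used as real cryptography. It only shows that the Server can forward bytes it can neither read nor modify undetected.

```python
import hashlib
import hmac
import secrets

# Toy Diffie-Hellman group (assumption for illustration only, not secure).
P = 2**127 - 1
G = 3


def keygen():
    """Return (secret_key, public_key) for the toy key exchange."""
    sk = secrets.randbelow(P - 2) + 1
    return sk, pow(G, sk, P)


def derive_keys(my_sk, their_pk):
    """Derive separate encryption and MAC keys from the shared secret."""
    secret = pow(their_pk, my_sk, P).to_bytes(16, "big")
    enc_key = hashlib.sha256(b"enc" + secret).digest()
    mac_key = hashlib.sha256(b"mac" + secret).digest()
    return enc_key, mac_key


def _keystream(key, nonce, length):
    """Expand key and nonce into `length` pseudo-random bytes."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def seal(keys, plaintext):
    """Encrypt, then authenticate: returns nonce || ciphertext || tag."""
    enc_key, mac_key = keys
    nonce = secrets.token_bytes(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def open_sealed(keys, message):
    """Verify the tag, then decrypt; raises ValueError if the message was altered."""
    enc_key, mac_key = keys
    nonce, body, tag = message[:16], message[16:-32], message[-32:]
    expected = hmac.new(mac_key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    stream = _keystream(enc_key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


# Client and Aggregator each hold a key pair; public keys are exchanged up front.
client_sk, client_pk = keygen()
aggregator_sk, aggregator_pk = keygen()

# Client seals its (already Paillier-encrypted) payload for the Aggregator.
payload = b"paillier-ciphertext-bytes"
sealed = seal(derive_keys(client_sk, aggregator_pk), payload)

# Server acts purely as a router: it forwards bytes it holds no key for.
routed = sealed

# Aggregator authenticates and opens the message.
received = open_sealed(derive_keys(aggregator_sk, client_pk), routed)
```

Because the Server never holds either secret key, it cannot derive the shared keys, and any modification it makes in transit causes open_sealed to fail.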

Composing these two subprotocols together yields our complete secure aggregation. They compose cleanly, since the secure communication protocol is only replacing the communication of ciphertexts from the Clients to the Aggregator in our idealized Paillier aggregation. In our implementation, this corresponds to a double-encryption of the raw client inputs, where the inner encryption layer is based on Paillier and the outer encryption layer uses libsodium. The entire process is illustrated below.

This concludes the presentation of our protocols. If you’re interested in the code, it has been open-sourced here. Please try it out, and let us know what you think!

Integration Strategies

TF Federated is a complex framework, and even a simple aggregation like tff.federated_secure_sum is non-trivial to re-engineer. In the rest of this post, we describe our integration strategy and implementation in more engineering detail. While we learned a lot throughout the process, much of our rationale comes from our original RFC. We describe three potential integration strategies for adding new execution patterns to TFF:

  1. Non-native backend
  2. Native backend: custom remote execution via gRPC endpoint
  3. Native backend: custom Executor

We then analyze why none of these are immediately suitable for our needs, and detail our chosen integration strategy. Note that this is not an exhaustive list of the different development APIs that TF Federated supports.

We are hoping to implement a specific functionality in TFF’s Federated Core (FC). As mentioned in the official documentation, the FC is a functional language for federated computations. Thus, in TFF, how users express federated computations is cleanly separated from how those computations are executed. This presents the first potential integration strategy, through what’s known as a non-native backend. In this strategy, we would only use TFF for the FC and its compiler; the native TFF backend, including the entire Executor stack framework, would be left out.

However, the native TFF runtime has a number of nice features that we would like to benefit from, including automatic thread management, GPU support through the TensorFlow runtime, and a gRPC service for remote execution. We would also rather not reimplement federated intrinsics that the native backend provides, for example tff.federated_map, tff.federated_broadcast, and the rest.

That said, the native backend can also be limiting for certain use cases. For example, clients’ processes can only ever inspect the local computations that they are responsible for, since the native backend’s usual Executor stacks implicitly erase information about the global federated computation during compilation. Additionally, the native backend seems to assume that the driver program is collocated on the Server’s host, which introduces a communication bottleneck with the security implications mentioned earlier. Both of these considerations lead us to believe that the native backend would not be sufficient for a production-ready system. However, since production-readiness is beyond the scope of our prototype, we decided to work with the native backend for its other benefits.

One way to integrate with the native backend would be to implement a gRPC endpoint for the client-side of TFF’s RemoteExecutor service. This is most useful for those who want to replace the Python-based client runtime without throwing away the entire server-side of the stack. For example, if we only wanted to add a client runtime for training on mobile phones, this integration strategy would be perfect. However, we don’t want to replace the entire client runtime here; we just want to add a few operations to enable client-side encryption. We also need to introduce functionality not available in the native client runtime for TFF, including server-side decryption and a new Aggregator placement, so this would not even be sufficient for our needs.

Finally, we can implement a custom Executor for use in the native backend. These executors are designed to be completely modular with each other, so handling of federated computations is kept separate from features like remote execution, caching, thread management, and execution on the TensorFlow runtime. For our purposes, this immediately offers some benefits over the prior two strategies. If we concern ourselves exclusively with the executors involved in federated orchestration, we can retain most features of the native backend, and we also won’t have to implement an entire client runtime.

After some investigation, we found that there were multiple executors involved in federated orchestration, including (at the time) the FederatedExecutor and the ComposingExecutor. Introducing an entirely new Executor class to replace these seemed excessive for our goal of adding a single intrinsic. Our initial stab at doing so coincided with some refactoring the TF team was planning for these Executors. We began collaborating with them on these changes, resulting in the tff.framework.FederatingStrategy released in v0.15.0. We use this FederatingStrategy abstraction to implement our own PaillierAggregatingStrategy, which modifies the behavior of FederatingExecutor to include our implementation of tff.federated_secure_sum.
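
The design idea here is essentially the strategy pattern: the executor keeps its orchestration logic, while the handling of intrinsics is delegated to a pluggable object. The sketch below illustrates that pattern only. The class and method names are hypothetical and are not the tff.framework API, and the "secure" strategy is a placeholder that sums in the clear.

```python
from abc import ABC, abstractmethod


class AggregationStrategy(ABC):
    """Hypothetical strategy interface: decides how an intrinsic is computed."""

    @abstractmethod
    def secure_sum(self, client_values):
        ...


class DefaultStrategy(AggregationStrategy):
    """Mimics a backend that leaves the secure sum unimplemented."""

    def secure_sum(self, client_values):
        raise NotImplementedError("secure_sum is not provided by this strategy")


class PlaceholderSecureStrategy(AggregationStrategy):
    """Placeholder for a real protocol; it simply sums in the clear."""

    def secure_sum(self, client_values):
        return sum(client_values)


class Executor:
    """Hypothetical executor: owns orchestration, delegates intrinsics."""

    def __init__(self, strategy):
        self._strategy = strategy

    def run_intrinsic(self, name, client_values):
        handler = getattr(self._strategy, name, None)
        if handler is None:
            raise ValueError(f"unknown intrinsic: {name}")
        return handler(client_values)


# Swapping the strategy changes the behavior without touching the executor.
executor = Executor(PlaceholderSecureStrategy())
result = executor.run_intrinsic("secure_sum", [1, 2, 3])
```

The appeal of this shape is that adding one intrinsic means writing one strategy, rather than replacing an executor class and everything layered on top of it.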

Implementing the secure sum

Our tff.federated_secure_sum implementation is itself noteworthy, because it requires functionality not included in TensorFlow’s default runtime. Since we’re relying on TFF’s native backend, the TFF compiler is responsible for transforming tff.federated_secure_sum into a series of TFF Computation protobufs. These Computations often represent TensorFlow graphs, which are eventually executed by the TensorFlow runtime on the Server, the Clients, or the Aggregator. According to the protocol above, here is a list of operations we need to make available to the runtime:

Paillier ops:

  • Keygen (primitives for RSA), or key import
  • Encryption
  • Addition
  • Decryption

Ops for secure communication:

  • Keygen
  • Encryption
  • Decryption

While the Paillier ops are relatively simple to define mathematically, they require high-precision integer data types that TensorFlow doesn’t support natively. We use the GMP library for its multiple-precision arithmetic primitives, and expose it to TensorFlow through the runtime’s ABI for custom ops. These GMP ops are made available to Python in the TF Big library, which we depend on for our TensorFlow implementation of Paillier primitives. TFF ingests the Paillier primitives as TF computations, i.e. Python functions decorated with tff.tf_computation. The PaillierAggregatingStrategy is then responsible for applying the TF computations to federated types to realize the Paillier Aggregation protocol presented above.
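
To give a sense of scale, the following sketch uses Python's built-in arbitrary-precision integers to show how large Paillier values are compared with a 64-bit integer. The 2048-bit key size and the random stand-in modulus are assumptions for illustration (it is not a valid Paillier modulus), and the fixed-width byte encoding is just one possible way to move such values through a byte-oriented container, not a description of how TF Big represents them.

```python
import secrets

KEY_BITS = 2048  # assumed key size for illustration

# Random odd integer with the top bit set: right size, but NOT a real modulus.
n = secrets.randbits(KEY_BITS) | (1 << (KEY_BITS - 1)) | 1
n_squared = n * n

# Paillier ciphertexts live modulo n^2, so they need roughly 2 * KEY_BITS bits.
ciphertext_bits = n_squared.bit_length()
print(ciphertext_bits, "bits per ciphertext versus 64 bits in an int64")

# One possible transport encoding: fixed-width big-endian bytes.
width = (ciphertext_bits + 7) // 8
fake_ciphertext = secrets.randbelow(n_squared)
encoded = fake_ciphertext.to_bytes(width, "big")
decoded = int.from_bytes(encoded, "big")
```

Every modular multiplication and exponentiation in the protocol operates on values of this size, which is why a multiple-precision library is needed underneath the TensorFlow ops.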

A similar process works for the Secure Communication protocol. We use libsodium to define custom TensorFlow ops that provide the primitives needed to encrypt/decrypt TF tensors for an authenticated channel. We then expose these primitives in Python and decorate them as TF computations to be able to use them with federated inputs. Instead of using these computations directly in PaillierAggregatingStrategy again, we abstract the secure communication protocol out into a separate Channel class. The two abstractions are only responsible for implementing their own subprotocols, which allows us to test and verify them separately before composing them together into a single protocol.

Finally, to allow users of the Federated Core to experiment with our work, we expose a custom executor factory function. This factory function is responsible for constructing an execution context that can support our new PaillierAggregatingStrategy. In order to pin computations to a new Aggregator party, we add it as a PlacementLiteral backed by its own stack of executors identical to those used by the Server placement. In reality, this is a shallow modification, since this placement is unknown to the compiler. Replacing the default local_executor_factory function with our own local_paillier_executor_factory adds the ability to use tff.federated_secure_sum in simulations, as illustrated in the code snippet at the beginning of this post.

This work accomplishes a number of goals we had for prototyping an integration with TFE and TFF. As a first step, it showed how one can embed custom, crypto-friendly primitives into TFF using TensorFlow’s custom op interface, and also use those primitives to build secure aggregation protocols.

We thought of several interesting extensions to our protocol that we leave as potential future work. First, exposing this protocol to the tff.learning API would make it more easily consumable; this should be fairly easy to do by writing a builder function that produces a tff.templates.IterativeProcess. One might also wish to use this secure aggregation protocol with remote execution, which can be done by implementing a new executor factory similar to local_paillier_executor_factory that uses RemoteExecutors instead of EagerExecutors (perhaps similar to worker_pool_executor_factory). Another interesting direction would be to experiment with different encodings of client data, including more efficient packing of ciphertexts and using quantized data instead of standard integer data types. Finally, building general-purpose secure aggregation functionality into TFF is still an open problem that we hope to one day solve, per our RFC. If you are interested in any of these, please do reach out!
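
As one illustration of the packing idea, several small integers can share a single large plaintext if each gets a fixed-width slot with enough headroom for the sum. The sketch below is not part of our implementation: the slot sizes and function names are assumptions, and plain integer addition stands in for the homomorphic addition of ciphertexts.

```python
VALUE_BITS = 16      # assumed width of each quantized client value
HEADROOM_BITS = 16   # room for carries: supports up to 2**16 clients
SLOT_BITS = VALUE_BITS + HEADROOM_BITS


def pack(values):
    """Pack a list of small non-negative integers into one big integer."""
    packed = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << VALUE_BITS):
            raise ValueError(f"value {v} does not fit in {VALUE_BITS} bits")
        packed |= v << (i * SLOT_BITS)
    return packed


def unpack(packed, count):
    """Split a packed integer back into `count` slot values."""
    mask = (1 << SLOT_BITS) - 1
    return [(packed >> (i * SLOT_BITS)) & mask for i in range(count)]


# Three clients, each holding a vector of three quantized values.
client_vectors = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]

# Adding the packed integers adds every slot at once. Under Paillier, this
# single addition would be one ciphertext operation instead of three.
packed_sum = sum(pack(v) for v in client_vectors)
slot_sums = unpack(packed_sum, 3)
```

The trade-off is between the number of slots per plaintext and the headroom each slot needs: more clients or wider values mean fewer slots fit below the Paillier modulus.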

TensorFlow Federated is an amazing library. The functional Federated Core API is clean and modern, and the Federated Learning APIs are constantly being updated to support Google’s own research plans for FL. While it’s not a feature-complete framework for production-level federated learning, it is extremely thoughtful in the way it models federated computations in the Federated Core language. The team has been quite clear about their goals and non-goals, and has always gone the extra mile to help us whenever we had a question or an issue.

Finally, we’d like to thank Michael Reneer for his continued support of our development with TFF, as well as Katharine Jarmul, Dragos Rotaru, and Keely Chamberlain for helping review this post.

About Cape Privacy

Cape Privacy is an enterprise SaaS privacy platform for collaborative machine learning and data science. It helps companies maximize the value of their data by providing an easy-to-use collaboration layer on top of advanced privacy and security technology, helping enterprises increase the breadth of data included in machine learning models. Cape Privacy’s platform is flexible, adaptable, and open source. It helps build trust over time, providing for seamless collaboration and compliance across an organization or multiple businesses. The company is based in New York City and is backed by boldstart ventures and Version One with participation from Haystack, Radical, and Faktory Ventures.
