Federated learning: From platform independent libraries to open ecosystems

M M Hassan Mahmud
Digital Catapult
Sep 10, 2020

(**Update 2020/01/11: A much improved version of the library described in this article, called dc_federated, has now been made open source. It is currently a beta version and has been designed for consortium-scale deployment.**)

Introduction

Building machine learning (ML) models typically requires a lot of data, but in many important, practical applications the data resides in silos that cannot be breached for privacy or confidentiality reasons. A canonical and very relevant example is the problem of building machine learning models to predict COVID-19 from medical images, which reside in many different hospitals located across many different jurisdictions. Federated learning (FL) is a recently developed, distributed, privacy-preserving machine learning technique that gets around this potential showstopper. Please see [1] for an excellent and comprehensive technical survey of the field.

But briefly, FL tries to solve the problem of building a machine learning model that is trained on data distributed across many different workers/clients without collecting or moving the data to any central location. This model is located at a central server, and this server is in charge of guiding the federated learning process. The workers are potentially located in different networks and physical locations and have potentially different processing capabilities. During the training of the model at the server, the data never leaves the worker at which it originates. The data distribution is potentially different across the workers, but the data attributes/fields and the model architecture are typically assumed to be the same (though these constraints are being relaxed in more recent work on federated learning). See Figure 1 below for a schematic illustration of the federated learning process and the sections below for more details. Please see this companion blog post for a general introduction to federated learning, including examples of deployed systems, future applications and implications of FL in terms of business models.

In this article we will focus purely on how one may go about building an open, extensible and general library for federated learning that lends itself easily to research, experimentation and subsequent live deployment. The discussion is rooted in our own experience in building a platform-independent, proof of concept library at Digital Catapult. We will also look at existing libraries and frameworks, and discuss how and why our approach differs from them. We will conclude with a look at the next step: how we may go about building an open ecosystem for R&D and application development in federated learning. This would help prevent a fragmented marketplace for federated learning solutions, help keep these solutions interoperable and also encourage innovation in this space.

A layered architecture for federated learning libraries

To motivate the design of our proof of concept (POC) library, it will be useful to understand at a high level the typical steps in a federated learning iteration (the central model at the server is trained using many such iterations). This will reveal the dependencies in a typical federated learning system, which in turn suggests a platform-independent, layered architecture for federated learning libraries.

The steps are illustrated in Figure 1. In the canonical scenario, a federated learning system consists of a set of workers and a central server. Each worker has access to some data, and can train a local machine learning model on the data it has access to. The data at the workers is assumed to have the same fields, but the data distribution may be quite different. In more recent work on heterogeneous federated learning, these constraints are being relaxed, and, as we will see, this assumption is immaterial to our design. The server is responsible for creating the global machine learning model that implicitly uses the data at the workers without ever seeing the data or transferring the data to the central location. The actual details of how this is accomplished depend on the underlying algorithm, but in general a typical federated learning iteration consists of the steps discussed below.

Figure 1. Sequence of operations in federated learning.

Step 1. The local models at the workers are updated using the private local data.

This step consists of one or more iterations of a normal machine learning algorithm. For instance, if deep neural networks are being used, this step may consist of one or more batches or epochs of training the local model on the local data using SGD or a similar optimiser. Hence this step depends on the application (medical image recognition, NLP, fraud detection, risk assessment etc.) and the platform chosen to implement the local model.
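The sketch below illustrates what this local update might look like on a PyTorch worker; the model, data loader and hyperparameters are placeholders for whatever the application supplies, and none of this is tied to a specific federated learning library.

```python
# A minimal sketch of Step 1, assuming a PyTorch classification worker.
# The model, loader and hyperparameters are illustrative placeholders.
import torch.nn as nn
import torch.optim as optim


def local_update(model: nn.Module, loader, epochs: int = 1, lr: float = 0.01):
    """Run a few epochs of ordinary SGD on the worker's private data."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model.state_dict()  # the "local update" sent to the server in Step 2
```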

Step 2. The workers (or a subset of workers) send the “local updates” to the central server.

This step consists of the workers sending a properly encoded, and possibly encrypted, binary message describing the updates that worker has made in Step 1, along with any meta-data such as worker identifying information. Hence, the content of the message depends on the federated learning algorithm and the platform chosen to implement the local model, but the message type (binary strings + meta-data) can be considered common across federated learning systems.
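For illustration, a worker might package its update as an opaque binary payload plus meta-data along the following lines; the message structure and field names here are hypothetical, not the format of any particular library.

```python
# A sketch of Step 2: serialise the local update into a binary message.
# Assumes PyTorch state dicts; the message fields are purely illustrative.
import io

import torch


def encode_local_update(state_dict, worker_id: str) -> dict:
    """Serialise the local model update plus minimal meta-data."""
    buffer = io.BytesIO()
    torch.save(state_dict, buffer)      # binary payload
    return {
        "worker_id": worker_id,         # meta-data identifying the worker
        "payload": buffer.getvalue(),   # opaque bytes from the backend's point of view
    }
```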

Step 3. The server aggregates the local updates into the global model.

In this step the server incorporates the updates received from the workers into the global model located at the server according to the logic of the federated learning algorithm that is being used. So this step depends on the federated learning algorithm and the platform chosen to implement the global model.
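As a concrete example, for the FedAvg algorithm the aggregation step is (roughly) a weighted average of the workers' parameters; the sketch below shows an unweighted version over PyTorch state dicts, ignoring weighting by dataset size for brevity.

```python
# A minimal, unweighted FedAvg-style aggregation over PyTorch state dicts.
import torch


def fedavg_aggregate(worker_state_dicts):
    """Average the parameters received from the workers into a global state dict."""
    global_state = {}
    for key in worker_state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in worker_state_dicts])
        # Average across workers, then restore the original dtype (e.g. for
        # integer buffers such as batch-norm counters).
        global_state[key] = stacked.mean(dim=0).to(worker_state_dicts[0][key].dtype)
    return global_state
```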

Step 4. The server sends the global model to all the workers.

This step consists of the server sending a properly encoded, and possibly encrypted, binary message describing the updated global model to the workers, along with any meta-data. Hence, the content of the message depends on the federated learning algorithm and the platform chosen to implement the global model, but the message type (binary strings + meta-data) can be considered common across federated learning systems.

Step 5. The workers integrate the global model into their local model.

Finally, each worker incorporates the update received from the server in Step 4 into its own local model, and thus receives the updates the other workers have made to the model using their own private data. So this step depends on the federated learning algorithm and the platform chosen to implement the local model.
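From the worker's side, Steps 4 and 5 might look like the following: decode the binary global-model message and load it into the local model. This again assumes PyTorch and mirrors the illustrative encoder sketched earlier; it is not the API of any specific library.

```python
# A sketch of Steps 4 and 5 on the worker: decode the global model and adopt it locally.
import io

import torch
import torch.nn as nn


def integrate_global_model(model: nn.Module, payload: bytes) -> nn.Module:
    """Replace the local parameters with the freshly aggregated global ones."""
    global_state = torch.load(io.BytesIO(payload))
    model.load_state_dict(global_state)
    return model
```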

From the above discussion we can identify the following dependencies for the components of a federated learning system.

  1. Application + implementation platform (local or global).
  2. Federated learning algorithm + implementation platform.
  3. Independent of both the federated learning algorithm and the platform, i.e. common to all federated learning systems.

Hence, we can break down a federated learning system into a classic layered architecture with three layers, one for each of the dependencies above. Each layer provides a well defined API but is otherwise independent of the other layers, with all the details abstracted away. The most well known of these kinds of layered architectures is likely the TCP/IP stack that runs the internet. More precisely, we define the layers as follows:

Application layer

Specific domains or applications are implemented at the application layer which uses the API provided by the algorithm layer to implement application/domain specific logic for training, validating and testing the machine learning models. The core library can provide reference implementations for specific platforms, and suggest APIs — but users of the library are free to design their own versions of this. The implementation in this case would depend on the corresponding algorithm layer.

Algorithm layer

Specific federated learning algorithms are implemented in the algorithm layer, which uses the backend layer API to implement the communications necessary for specific algorithms. The core library can provide reference implementations for specific platforms, and suggest APIs — but users of the library are free to design their own versions of this.

Communication backend layer

The backend layer provides a platform and application independent API for workers and the central server to exchange messages regarding worker/server status and model updates. This layer would also guarantee scalability of the library, handle worker authentication and ensure the communication is secure. This layer would remain fixed and will not be changed by users of the library.
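To make the separation of concerns concrete, the interfaces of the three layers might be sketched as follows. The class and method names here are hypothetical, chosen only to show where the boundaries sit; they are not the dc_federated API.

```python
# An illustrative sketch of the three-layer split; all names are hypothetical.
from abc import ABC, abstractmethod


class CommunicationBackend(ABC):
    """Platform and application independent: moves opaque binary messages around."""

    @abstractmethod
    def send_update_to_server(self, payload: bytes, metadata: dict): ...

    @abstractmethod
    def broadcast_global_model(self, payload: bytes, metadata: dict): ...


class FLAlgorithm(ABC):
    """Algorithm layer: decides what goes into a message and how updates are aggregated."""

    def __init__(self, backend: CommunicationBackend):
        self.backend = backend

    @abstractmethod
    def aggregate(self, worker_payloads: list) -> bytes: ...


class FLApplication(ABC):
    """Application layer: domain-specific training, validation and testing logic."""

    def __init__(self, algorithm: FLAlgorithm):
        self.algorithm = algorithm

    @abstractmethod
    def train_locally(self) -> bytes: ...
```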

Our federated learning library, dc_federated, created in Python at Digital Catapult, implements this architecture. We provide a reference API definition and implementation in PyTorch for the application and algorithm layers, and a platform-independent implementation of the communication backend layer. The library was initially developed as a tool to demonstrate federated learning to our clients, and has since been developed into a library that can be deployed for real applications. The communication backend provides worker authentication and management services, supports encrypted and compressed communication, and scales to hundreds of workers, making it suitable for consortium-level federated learning. The diagram below shows the architecture in a bit more detail for a specific algorithm and application. The repo currently has reference implementations for two application domains in the application layer (MNIST, PlantVillage) and one algorithm (FedAvg) in the algorithm layer.

Figure 2. Layered architecture of the POC federated learning library created at Digital Catapult, shown for a specific algorithm (FedAvg) and the MNIST domain.

The Future: Open ecosystems for federated learning

We motivate open ecosystems for FL by looking at why we created our library in the first place. In particular, given that there are well known libraries like tf-federated (for TensorFlow) and pysyft (for PyTorch), and frameworks such as FATE from WeBank and the Clara SDK from NVIDIA, perhaps it would have made more sense to use one of those?

The main reason we created this library was that none of these existing frameworks fit our use-case. We wanted a library that would let us demonstrate federated learning to our clients in a distributed multi-device setting easily. At the time of creating this library, neither tf-federated nor pysyft supported distributed multi-device deployment. FATE was complicated and seemingly required commitment to its overall API, and the framework itself removed a fair bit of flexibility in terms of which models could be trained and how (particularly in cases requiring specialized or unique training approaches). Clara is not available as an open source library, and seems targeted toward medical applications. In the end we decided it would be much easier and quicker to build a library of our own, and we could build it according to an open, flexible, platform-independent philosophy and design.

The experience made it clear that (1) there are parts of FL systems that are likely universal, particularly at the lower levels, and (2) the federated learning solution space is becoming fragmented, with the big players in the market each creating their own frameworks. The latter is leading toward a future with many FL systems that cannot interoperate, which would mean fewer choices for users of FL and would stifle innovation in this space. As an example of the kind of innovation that would be stifled, consider a marketplace for federated learning solutions in which a worker has the option to participate in one or more federated learning networks depending on the incentives. Something like this is clearly beneficial both for users and for fostering innovation, and it can be achieved by:

A. collectively agreeing on a common architecture for FL systems (e.g. a much more developed version of the architecture presented above), with a common, core API.

B. implementing the common aspects in a scalable, efficient, secure, open and transparent way using an open source solution.

C. leaving the bespoke parts to be implemented and developed according to the needs of individual developers or solution spaces.

Indeed, the library we described above has been built along these lines and serves both as an illustration and implementation of the principles laid down above. We are currently using the library in our industrial research and commercial projects to further establish the efficacy of our approach. We are also taking the necessary steps to open source the library for general use and hope to do so very soon.

Conclusion

In this blog post, we took a look at federated learning and how we may go about building open and platform independent libraries for it. As an example, we discussed the proof of concept library developed at Digital Catapult and how it embodies some principles that help us arrive at a common core for federated learning systems. We finished by showing how this naturally gives us guidelines for creating open ecosystems for federated learning and why that is a highly desirable thing to do.

References

[1] Kairouz et al. Advances and Open Problems in Federated Learning. arXiv:1912.04977, 2019. https://arxiv.org/abs/1912.04977


M M Hassan Mahmud
Digital Catapult

Hassan is a Senior AI and Machine Learning Technologist at Digital Catapult, with a background in machine learning within academia and industry.