Setting Your Own Federated Learning Test Case

Musketeer
Published in The Startup
5 min read · Nov 4, 2020
Image by Barbara A Lane from Pixabay

How do you properly train a Machine Learning model? That is not an easy question, and the answer depends on many different aspects. A complete analysis of all of them would take too long, so we will concentrate on just one: the amount of training data.

The quality of a Machine Learning model depends on the volume of training data used during the training process. A small amount of data can produce low-accuracy models that are not really usable. In this case, we can consider two options: (i) produce more training data yourself or (ii) increase the training data volume by adding more data sources of the same kind. If the first option is not feasible (e.g. for technical or economic reasons), you can explore the second one by looking for other parties with the same need. This is where the concept of federation comes into play. In short:

The model that can be produced thanks to the collaboration of the federation participants is better than the one produced by each participant on their own.

This paradigm was initially introduced by Google and refers to different devices that collaboratively learn a shared prediction model while keeping all the training data on each device, decoupling the ability to do machine learning from the need to transfer data.

The collaboration among the federation participants can be implemented with different levels of complexity and has to take into account other, non-technical aspects. The simplest case concentrates all data in a single place and trains the model on that single data repository. This only works when confidentiality and privacy are not strong requirements. When the training data cannot be disclosed (e.g. for business and/or legal reasons), a more sophisticated configuration has to be adopted. Every federation participant trains an ML model locally (on their own premises, so that no data is sent outside) and shares only the model parameters. The models produced by the participants are collected by a single party that aggregates them and produces a new model incorporating all contributions.
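The local-training-plus-aggregation scheme described above can be sketched in a few lines of plain Python. This is only an illustration, not the MUSKETEER implementation: the linear model, the function names (local_train, aggregate, federated_round), the learning rate and the toy datasets are all assumptions, and the aggregation rule shown is a simple sample-weighted average of parameters in the spirit of federated averaging.

```python
from typing import List, Tuple

Weights = List[float]              # [slope, intercept] of a toy linear model
Sample = Tuple[float, float]       # (x, y)


def local_train(weights: Weights, data: List[Sample],
                lr: float = 0.05, epochs: int = 20) -> Weights:
    """Participant side: fit y = w*x + b on private data by gradient descent."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def aggregate(updates: List[Tuple[Weights, int]]) -> Weights:
    """Aggregator side: average parameters, weighted by local sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(wts[i] * n for wts, n in updates) / total for i in range(dim)]


def federated_round(global_weights: Weights,
                    datasets: List[List[Sample]]) -> Weights:
    """One round: only parameters leave each participant, never raw samples."""
    updates = [(local_train(list(global_weights), d), len(d)) for d in datasets]
    return aggregate(updates)


# Hypothetical private datasets, all drawn from y = 2x + 1.
datasets = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(-1.0, -1.0), (0.5, 2.0)],
]

weights = [0.0, 0.0]
for _ in range(50):
    weights = federated_round(weights, datasets)

print(weights)  # approaches [2.0, 1.0]
```

Note that the aggregator only ever sees the lists returned by local_train; the (x, y) samples stay inside each participant's call.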

The MUSKETEER Platform

The main result of the MUSKETEER project is the implementation of an industrial data platform with scalable algorithms for federated and privacy-preserving machine learning techniques. The solution is based on a federation of stakeholders who work together to build a Machine Learning model in a collaborative (or coopetitive) way. Two roles are assigned: (i) the aggregator, which starts the process and is in charge of computing the final ML model; (ii) the participants, which take part in the training of a single Machine Learning model using their own (local) training datasets.

From the architecture point of view, the MUSKETEER platform enables interoperability between a number of distributed big data systems (the federation participants) by providing a mechanism to send and retrieve Machine Learning models. That interoperability mechanism is based on the principles defined by the International Data Spaces Association and formalized in its Reference Architecture Model [1]. The MUSKETEER platform architecture consists of a server side and a client side. The server part is hosted in the cloud and uses message queues for the asynchronous exchange of information among the federation participants, which are usually geographically distributed. One of the main activities of the server component is to coordinate the exchange of machine learning models between participants and aggregators. Besides exchanging information for the execution of the actual federated learning tasks, the server side also provides services to manage tasks throughout their lifecycle, such as creating new tasks, browsing created tasks, aggregating tasks, joining tasks as a participant or deleting tasks. The meta-information required for task management is stored in a cloud database.
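To make the server-side responsibilities more concrete, here is a minimal in-memory sketch of task management combined with per-user message queues. It is not the MUSKETEER server or its API: the TaskServer class, its method names, the rule that only the aggregator may delete a task, and the use of Python's queue.Queue in place of a cloud message broker and database are all assumptions made for illustration.

```python
import queue
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Task:
    name: str
    aggregator: str
    participants: List[str] = field(default_factory=list)


class TaskServer:
    """Toy stand-in for the cloud side: task metadata plus one inbox per user."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}              # stands in for the cloud database
        self.inboxes: Dict[str, queue.Queue] = {}     # stands in for the message queues

    def _inbox(self, user: str) -> queue.Queue:
        return self.inboxes.setdefault(user, queue.Queue())

    def create_task(self, name: str, aggregator: str) -> Task:
        if name in self.tasks:
            raise ValueError(f"task {name!r} already exists")
        self.tasks[name] = Task(name, aggregator)
        return self.tasks[name]

    def list_tasks(self) -> List[str]:
        return sorted(self.tasks)

    def join_task(self, name: str, user: str) -> None:
        task = self.tasks[name]
        if user not in task.participants:
            task.participants.append(user)

    def delete_task(self, name: str, user: str) -> None:
        # Assumption: only the aggregator that created the task may delete it.
        if self.tasks[name].aggregator != user:
            raise PermissionError("only the aggregator may delete a task")
        del self.tasks[name]

    def send(self, recipient: str, message: Any) -> None:
        """Asynchronous hand-off: the sender does not wait for the recipient."""
        self._inbox(recipient).put(message)

    def receive(self, user: str) -> Optional[Any]:
        """Return the next pending message for user, or None if there is none."""
        try:
            return self._inbox(user).get_nowait()
        except queue.Empty:
            return None


server = TaskServer()
server.create_task("demo-task", aggregator="alice")
server.join_task("demo-task", "bob")
server.join_task("demo-task", "carol")

# The aggregator publishes an initial model to every participant.
for participant in server.tasks["demo-task"].participants:
    server.send(participant, {"task": "demo-task", "weights": [0.0, 0.0]})

# A participant picks it up whenever it is ready and replies with an update.
initial = server.receive("bob")
server.send("alice", {"from": "bob", "weights": [0.1, 0.2]})
reply = server.receive("alice")
```

Because each user has a separate inbox, participants in different locations can pick up and return models at their own pace, which is the point of routing the exchange through queues.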

MUSKETEER Client Connector components

The client side is represented by the MUSKETEER Client Connector, a self-contained component that each user has to deploy on-premises in order to federate with the MUSKETEER platform.

Consider the case where the training data is stored locally (e.g. on hard drives, NAS or removable devices attached to a single computer) and we want to use it to create predictive models without explicitly transferring datasets outside of our system. In this case, the MUSKETEER Client Connector can be deployed in any environment by using Docker to containerize the Client Connector application itself. Docker containers provide a lightweight, standalone, executable package of the software that includes everything needed to run the MUSKETEER Client Connector: operating system, code, runtime, system tools, libraries and settings. In this way the whole application can easily be made available in a sandbox that runs on the user's host operating system.

The MUSKETEER Client Connector consists of five core components and two additional ones that are loaded (as external plug-ins) after the application is up and running: the communication messenger and the federated machine learning library.

  1. User Interface is a local web application whose main functions are: (i) to access the target server platform; (ii) to connect the local data used for federated ML model training; (iii) to manage the different tasks for taking part in the federation.
  2. Client Back-End acts as a RESTful Web Service that handles all user requests, ranging from local operations (e.g. connecting user data to the Client Connector) to server operations (e.g. task and user management); these operations use a Communication Messenger library to communicate with a target external server.
  3. Data Connector connects user data, which may come from different sources or storage layers, to the Client Connector. In addition to connecting data from different source types, the component can manage and support different kinds of data: for example, a user can load CSV tabular data from the file system, image files, binary data, a table from a database and so on.
  4. Abstract Communication Interface makes it possible to import and use an implementation of the communication library. In the MUSKETEER project the Communication Messenger library used is the pycloudmessenger library developed by IBM (available at https://github.com/IBM/pycloudmessenger). Once this library is configured and installed, the MUSKETEER Client Connector can use its APIs to communicate with the MUSKETEER cloud server provided by IBM.
  5. Execution component instantiates and runs federated machine learning algorithms according to the interfaces defined by the Federated Machine Learning (FML) library imported into the MUSKETEER Client Connector. In the MUSKETEER project, the FML library is provided by the partners Treelogic and Carlos III de Madrid University.
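The plug-in idea behind the Abstract Communication Interface can be illustrated with a small sketch. Everything here is hypothetical: the AbstractCommunicator class, its connect/send/receive methods, the LoopbackCommunicator test double and the ClientBackend wrapper are invented for this example and do not reflect the actual Client Connector code or the pycloudmessenger API. The point is only that the back-end depends on an interface, so a concrete messenger library can be swapped in behind it.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class AbstractCommunicator(ABC):
    """Hypothetical contract that any communication plug-in would implement."""

    @abstractmethod
    def connect(self, user: str, password: str) -> None: ...

    @abstractmethod
    def send(self, task_name: str, message: Dict[str, Any]) -> None: ...

    @abstractmethod
    def receive(self, task_name: str) -> Optional[Dict[str, Any]]: ...


class LoopbackCommunicator(AbstractCommunicator):
    """Test double that keeps messages in memory instead of calling a server."""

    def __init__(self) -> None:
        self.user: Optional[str] = None
        self._messages: Dict[str, List[Dict[str, Any]]] = {}

    def connect(self, user: str, password: str) -> None:
        self.user = user  # a real plug-in would authenticate against the server

    def send(self, task_name: str, message: Dict[str, Any]) -> None:
        if self.user is None:
            raise RuntimeError("connect() must be called before send()")
        self._messages.setdefault(task_name, []).append(message)

    def receive(self, task_name: str) -> Optional[Dict[str, Any]]:
        pending = self._messages.get(task_name, [])
        return pending.pop(0) if pending else None


class ClientBackend:
    """Back-end logic written against the interface, not a concrete library."""

    def __init__(self, communicator: AbstractCommunicator) -> None:
        self.comm = communicator

    def share_parameters(self, task_name: str, weights: List[float]) -> None:
        # Only model parameters are sent; the local dataset is never included.
        self.comm.send(task_name, {"type": "model_update", "weights": weights})


backend = ClientBackend(LoopbackCommunicator())
backend.comm.connect("alice", "secret")
backend.share_parameters("demo-task", [0.5, -1.2])
echoed = backend.comm.receive("demo-task")
```

Replacing LoopbackCommunicator with a class that wraps a real messenger library would leave ClientBackend unchanged, which is the benefit of loading the communication layer as an external plug-in.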

Give it a try

The first prototype of the MUSKETEER Client Connector is available as open source software from GitHub repositories:

Client Connector Backend: https://github.com/Engineering-Research-and-Development/musketeer-client-connector-backend

Client Connector Frontend: https://github.com/Engineering-Research-and-Development/musketeer-client-connector-frontend

Together with the source code you will also find the installation guide. We kindly invite you to download the Client Connector components and try them out by setting up your own Federated Machine Learning test case. We would be happy to receive your comments and feedback! For that you can use GitHub features or send your message directly to: musketeer-team [at] eng [dot] it

Susanna Bonura

Industry and Security Technologies, Research and Innovation (IS3) Lab

Engineering Ingegneria Informatica spa

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 824988.

[1] https://www.internationaldataspaces.org/wp-content/uploads/2019/03/IDS-Reference-Architecture-Model-3.0.pdf

MUSKETEER is an H2020 project developing an industrial data platform enabling privacy-preserving data sharing: musketeer.eu