Introduction to Federated Learning

Fabio Buchignani
Data Reply IT | DataTech

--

We are witnessing an era of huge advances in digitalization and artificial intelligence. While many of these topics were once confined to research, a recent trend is the exploration of AI applications across diverse industries and their integration with other emerging technologies, such as the Internet of Things paradigm. The effort companies put into collecting more and more data, together with algorithmic improvements, has made AI in production a reality, and nowadays many non-technological companies use AI in some of their processes. Requirements vary across industries, and new applications are continuously investigated. With such dynamism, techniques in the field of Artificial Intelligence keep evolving and adapting to the ever-changing requirements of the market and of public sentiment.

Furthermore, technology and computer systems have become so pervasive that there is almost no border between the physical and digital worlds, and data is continuously collected from users. This has led to an increasing need for data confidentiality to safeguard individuals’ privacy, and to the development of laws that regulate data protection and the movement of personal data. These laws, such as the GDPR enforced by the EU since 2018, limit the control companies have over the data they collect. So, while data is of course a basic requirement for training Machine Learning or Deep Learning models, there are restrictions that rightfully limit companies in their usage and movement of the collected data.

What is Federated Learning?

A branch of AI that addresses the issues mentioned above is Federated Learning. Federated Learning was born as a technique to train ML and DL models in a distributed context, exploiting data coming from diversified sources without breaking the confidentiality requirements of the data owners. In other words, data owners keep the collected data where they are legally allowed to store it, yet they can train a global AI model on the various datasets they own around the world, or even cooperate with other companies and institutions to obtain a more performant model. In this way, the participants benefit from a model that has effectively been trained on a larger and more comprehensive dataset, while still being able to guarantee their end users the adherence to the privacy and data governance agreements required by law.

Federated Learning is conceptually simple: a federation is composed of a variable number of actors, and a server (also called coordinator or aggregator) may or may not be present. Each participant in the federation that has a dataset available is called a data owner and acts as a client. The coordinator, when present, orchestrates the training process. This process starts from a global, shared model, is often iterative and involves the following steps:

  • The server sends the model to the clients and assigns them a task (for example, a round or more of local training).
  • The clients perform the task on their local data, update the parameters of the global model to obtain a local model, and send this model back to the server.
  • The server applies an aggregation strategy, which is often algorithm-specific, to merge the local models into a new global model, and then starts a new iteration.

Of course, this is just a general view, as there are countless ways to shape a round. The training process may involve much more complex paradigms depending on the algorithm used, including validation, client selection, aggregation strategies, efficiency-related improvements and more. We will cover some of the key topics and research areas in these fields in the next sections, but first let me show a basic algorithm to train a neural network in a federated context.

FedAvg

FedAvg¹ is perhaps the first federated algorithm to have been presented, and probably the best-known FL algorithm. In FL research, FedAvg often serves as a baseline for benchmarking new algorithms, and it relies on a very simple aggregation strategy. The idea of FedAvg is, at each round, to propagate the neural network weights to the clients of the federated learning system, let them update the weights by running a classical training algorithm on their local data, and average the resulting weights to obtain the updated global neural network.

We can dive a bit deeper and take a closer look by walking through some code. Let’s assume we are defining the behaviour of clients and server through Python classes. Assume also that we are using an external Python module available across the whole federation, for example the federation_utils_common module. This module contains the definition of the neural network as well as utility functions that streamline the presentation of the code.

Let’s break down each step of a round of FedAvg. Keep in mind that some initialization is needed both server-side and client-side to bootstrap the neural network and configure the hyperparameters of the FL training algorithm (in the original FedAvg implementation, the hyperparameters were the fraction of clients selected at each round, the number of local epochs and the batch size). Let’s suppose for simplicity that the entire configuration is also available in the external module.

# Server-side module
from federation_utils_common import get_model, get_config, get_participants

class Server:
    def __init__(self):
        self.config = get_config()          # FL hyperparameters
        self.model = get_model()            # shared neural network architecture
        self.clients = get_participants()   # handles to the federation clients
        self.selected_clients = None        # clients chosen for the current round

# Client-side module
from federation_utils_common import get_model, get_config, get_server

class Client:
    def __init__(self):
        self.config = get_config()          # same shared configuration
        self.model = get_model()            # local copy of the shared architecture
        self.server = get_server()          # handle to the coordinator

At the beginning of each round, the server selects a subset of the clients and sends them the current weights of the global neural network.

class Server:
    def __init__(self):
        # ...

    def start_round(self):
        # Select a fraction of the available clients for this round
        self.selected_clients = client_selection(self.clients, self.config.fraction_of_clients)
        for client in self.selected_clients:
            send(client, self.model.get_weights())
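
The client_selection and send helpers are also assumed to come from the shared module; as an illustration, a random selection similar in spirit to the one used in the original FedAvg paper could look like this hypothetical sketch:

import random

def client_selection(clients, fraction):
    # Hypothetical helper: sample a random fraction of the available clients,
    # keeping at least one of them.
    n_selected = max(1, int(len(clients) * fraction))
    return random.sample(clients, n_selected)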

Each client loads the weights received from the server into the local neural network. Then it trains the local neural network using the local dataset and the common configuration available in the external module.

Then, each client sends the weights of the trained local neural network to the server, along with the number of samples in the local dataset.

class Client:
    def __init__(self):
        # ...

    def process(self):
        # Receive the current global weights and load them into the local model
        global_weights = receive(self.server)
        self.model.set_weights(global_weights)

        # Local training on the private dataset (x_train, y_train)
        self.model.fit(
            x_train, y_train,
            epochs=self.config.local_epochs,
            batch_size=self.config.local_batch_size
        )

        # Send back the updated weights together with the local dataset size
        local_weights = self.model.get_weights()
        send(self.server, (len(y_train), local_weights))

The server collects the information from all the selected clients and performs a weighted average of the weights they sent. The new weights replace the old ones in the global neural network. Then a new round begins.

class Server:
    def __init__(self):
        # ...

    def start_round(self):
        # ...

    def aggregate(self):
        # Collect (n_local_samples, local_weights) pairs from the selected clients
        updates = [receive(client) for client in self.selected_clients]
        n_local_samples_list = [n for n, _ in updates]
        local_weights_list = [w for _, w in updates]

        # Weight each client proportionally to the size of its local dataset
        n_tot_samples = sum(n_local_samples_list)
        client_weights = [n / n_tot_samples for n in n_local_samples_list]

        # The weighted average of the local models becomes the new global model
        new_global_weights = sum_weighted_weights(client_weights, local_weights_list)
        self.model.set_weights(new_global_weights)
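
The sum_weighted_weights helper is assumed to come from the shared module; as a minimal sketch, assuming each local model is a list of per-layer numpy arrays (as returned, for instance, by Keras’ get_weights()), it could be implemented as:

def sum_weighted_weights(client_weights, local_weights_list):
    # Hypothetical helper: each element of local_weights_list is a list of
    # numpy arrays (one per layer); compute their per-layer weighted sum.
    n_layers = len(local_weights_list[0])
    return [
        sum(cw * weights[layer] for cw, weights in zip(client_weights, local_weights_list))
        for layer in range(n_layers)
    ]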

The original algorithm terminates after a fixed number of rounds (another hyperparameter), but more complex termination conditions are possible.

FL vs Distributed Machine Learning

At first glance, FL may look similar to some kind of Distributed Machine Learning, the branch of Machine Learning that deals with the efficient processing of large datasets in a multi-worker environment, parallelizing the training algorithm to speed up the training process. Existing ML algorithms must be reshaped for a distributed context, and in some sense the same holds for FL algorithms. However, there are at least a few key differences between distributed ML and FL, with FL having some peculiarities:

  • In Federated Learning there is no movement of input data: this is a strict requirement that is usually not present in a classical distributed Machine Learning paradigm, where a master generally splits the data and distributes the partitions to the workers.
  • Each client is not just a worker. It is a fully independent, fully-fledged device or group of devices. Clients decide on their own to join the federation process and they have full governance of their own data. They can also enter or leave the federation at any time, and the federation must be robust to disconnections.
  • Data distributions can be skewed, and datasets are heterogeneous. This stems from the fact that companies collect data on their own and have their own population of users with unique characteristics (country, age or wealth, for instance). Moreover, data can be unbalanced in size: there may be clients with huge datasets and clients with significantly smaller ones.
  • Techniques are put in place to hide the private data during the training process. Some models embed characteristics of the dataset they are trained on, and adversarial attacks can be mounted against these models to infer the training set. You must ensure nobody gains access to your data, and you must protect it both from “man in the middle” attacks (attackers listening on communication channels) and from the server and the other clients, which are seen as independent parties. To do so, simply sending model parameters in plain text is often not enough.

Up to now we have assumed that every data owner has a dataset similar to those owned by the others. Actually, this is just one branch of Federated Learning, called Horizontal Federated Learning. A whole different branch is Vertical Federated Learning, where clients in the federation own different datasets that share a number of entities (people, for example). The idea of vertical federated learning is to exploit the information that each dataset holds about the same entity to obtain a larger number of features, and in turn a more performant model. In other words, while horizontal federated learning conceptually adds rows to the global dataset (additional entities, same features), vertical federated learning adds columns rather than rows (additional features, same entities).
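
A toy sketch with made-up data may help visualize the difference (pandas is used only for illustration; in practice neither the concatenation nor the join is ever materialized, since the raw rows and columns stay with their owners):

import pandas as pd

# Hypothetical data: two banks and their customer records.
bank_a = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51], "balance": [1200, 800]})
bank_b = pd.DataFrame({"customer_id": [3, 4], "age": [29, 62], "balance": [400, 2300]})

# Horizontal FL: same features, different entities -> the virtual global
# dataset grows by rows.
horizontal_view = pd.concat([bank_a, bank_b], axis=0)

# Vertical FL: same entities, different features -> a bank and an insurer
# both know customers 1 and 2, and the virtual dataset grows by columns.
insurer = pd.DataFrame({"customer_id": [1, 2], "n_claims": [0, 3]})
vertical_view = bank_a.merge(insurer, on="customer_id")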

Use cases and FL architectures

[Figure: MRI data]

When to use Federated Learning? Well, basically everywhere you need private data to remain private.

  • Financial sector: banks or insurance companies may benefit from FL. They could federate to train an ML model on a huge number of customers, possibly coming from different countries, while still keeping customer data private and within national borders.
  • Healthcare sector: basically the same as above. Hospitals and research centers could exploit private and sensitive patient data (such as MRI data) to collaboratively produce a global model that takes more data into account and is therefore more effective than the local models available to each institution. Patient data remains within the perimeter of the institution, in accordance with the law.
  • User data: you would like to collect data from users’ smartphones or tablets, for example text typed on the keyboard, but this is highly personal data that should never leave the device or be available outside of it. Using Federated Learning you can train ML models on such data without compromising the user’s privacy.
  • IoT/IoMT data: as before, data collected from IoT or IoMT (Internet of Medical Things) devices may be legally accessible only on the device itself. You could federate these devices to train a global model without the data ever physically leaving the device.

From the use cases above, you can basically distinguish two categories of Federated Learning:

  • Cross-silo Federated Learning: clients are companies or institutions. They are fully-fledged computing units, with computing resources available and high, reliable bandwidth. The number of clients ranges from a handful up to a few dozen.
  • Cross-device Federated Learning: clients are small, often constrained personal devices. Limited energy, limited computing power and unstable internet connectivity are the hallmarks of this type of federation. The federation is generally much larger and can reach thousands of simultaneously connected clients or more; connections and disconnections may be frequent and there can be a lot of churn.

Both the FL architecture and the algorithms can be adjusted to fit one category or the other. For instance, in a cross-silo setting clients may opt for a peer-to-peer architecture and avoid the unwanted presence of a third-party coordinator. And while a highly selective client selection technique is beneficial in a cross-device setting, where it reduces the computing burden on the server during aggregation and the stress on the constrained devices, the same technique applied to a small federation may harm the final performance of the global model.

Efficiency-related aspects

[Figure: Neural networks are characterized by a huge number of parameters]

Especially in a cross-device setting, it is important to address the limitations in terms of computing power and bandwidth. As you may know, many AI models are complex and require a lot of parameters, especially in the field of Deep Learning: the parameters of a neural network, its weights, may number in the hundreds of thousands, millions or even more.

A basic approach to the federated training of neural networks, as we have seen with the FedAvg example, is simply to send the global and local model parameters back and forth: clients instantiate a neural network with a previously agreed architecture, load the weights received from the server, perform a round of training and send the updated weights back to the server. The server merges the updates from the clients into a new global model. With such an algorithm and so many weights, it is clear that sending the model parameters over the internet can be too slow, too energy-hungry, or simply unaffordable for small, constrained devices. Adaptations are therefore required to reduce the stress on the clients and improve the efficiency of the Federated Learning training process.

Which kinds of techniques can be used to address this issue? Well, one well-established approach I mentioned before is client selection: simply put, each client takes part only in some rounds of training and saves resources in the rounds it is not selected (how to select clients is a whole topic of its own, which I will leave aside in this article). But client selection is not the only possibility.

One widely used technique is gradient sparsification: the key idea is to drop the gradient updates with low magnitude (think of the gradient as the difference between the previous weight and the updated one), since they are not essential, so that the number of parameters to send decreases. If the server doesn’t receive an update for a weight from a client, it considers that weight unchanged.
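
As a rough illustration, a top-k sparsifier for a single layer’s update could look like the following sketch (plain numpy, with an arbitrary keep_ratio; not the exact scheme of any specific paper):

import numpy as np

def sparsify_update(update, keep_ratio=0.1):
    # Keep only the top-k entries of the update by absolute magnitude and
    # send them as (index, value) pairs; the server treats missing indices
    # as "unchanged".
    flat = update.ravel()
    k = max(1, int(len(flat) * keep_ratio))
    top_idx = np.argsort(np.abs(flat))[-k:]
    return top_idx, flat[top_idx]

# Example: a layer's update of 1,000 values shrinks to 100 (index, value) pairs.
update = np.random.randn(1000)
indices, values = sparsify_update(update, keep_ratio=0.1)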

Another technique that tackles the same problem is quantization: parameters are generally represented as floating-point numbers on a certain number of bits. If you are able to reduce this number, you also reduce the communication burden on the clients. Quantization means converting the representation of a number on a given number of bits into a representation on a lower number of bits.
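
For example, a simple symmetric 8-bit quantizer reduces each 32-bit parameter to a single byte at the cost of some precision; a minimal numpy sketch (not a production scheme) could be:

import numpy as np

def quantize_int8(values):
    # Map float32 values to 8-bit integers using a single per-tensor scale;
    # the receiver multiplies back by the scale to recover an approximation.
    scale = np.max(np.abs(values)) / 127.0
    quantized = np.round(values / scale).astype(np.int8)
    return quantized, scale

def dequantize_int8(quantized, scale):
    return quantized.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)   # 4 bytes per value
q, scale = quantize_int8(weights)                     # 1 byte per value + one scale
approx = dequantize_int8(q, scale)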

An extreme example, which can be coupled with gradient sparsification, is the following: suppose only a fixed amount of change to each weight is allowed. Each client could then send just one bit per weight: when the bit is set (equals 1), the local model says the value of the weight should be increased; when the bit is reset (equals 0), the value should be decreased. If the bit is absent (how the server recognizes that the bit for a specific weight is missing is part of the gradient sparsification mechanism), the value stays the same. Of course, these techniques work only under specific circumstances, as the overall accuracy may suffer.
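
Continuing the toy numpy sketches above, a possible encoding of this idea (with an arbitrary fixed step size) could be:

import numpy as np

def encode_sign_bits(update, keep_ratio=0.1):
    # Combine sparsification and 1-bit quantization: keep only the largest
    # updates and transmit, for each of them, its index plus a single sign bit.
    flat = update.ravel()
    k = max(1, int(len(flat) * keep_ratio))
    top_idx = np.argsort(np.abs(flat))[-k:]
    sign_bits = (flat[top_idx] > 0).astype(np.uint8)   # 1 = increase, 0 = decrease
    return top_idx, sign_bits

def decode_sign_bits(indices, sign_bits, n_params, step=0.01):
    # The server applies a fixed step in the indicated direction and leaves
    # every weight it received no bit for unchanged.
    delta = np.zeros(n_params)
    delta[indices] = np.where(sign_bits == 1, step, -step)
    return delta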

In the end, trading off efficiency against performance is part of fine-tuning a Federated Learning training algorithm for a specific situation, and different configurations may suit different use cases.

The importance of protecting data in transit

As mentioned before, since sensitive data is involved and adversarial attacks on model parameters can still infer personal information from the private datasets, Federated Learning employs techniques to encrypt or mask the parameters, reducing the risk of data being exposed both to malicious users external to the federation and to malicious users internal to it. You may object that the server must know the parameters in order to aggregate them into a global model, but there are actually methods that mask the sensitive information while preserving the statistical value of the parameters. Thus, the server is able to produce a global model from the encrypted parameters and properties of the local models sent by the clients without ever learning the parameters themselves. These techniques can be grouped into three categories²: cryptographic techniques, perturbation techniques and anonymization techniques.

  • Cryptographic techniques: encrypting your data in transit, which is a very common approach in networked systems. In the case of Federated Learning, you would use a category of encryption called homomorphic encryption, whose mapping satisfies particular algebraic properties. For example, you could set up a homomorphic encryption scheme that, given two messages m1 and m2 and a specific operator ⋆ on them, maps the messages in such a way that E(m1 ⋆ m2) = E(m1) ⋆ E(m2). Other cryptographic techniques in a distributed environment are secret sharing and SMC (Secure Multi-party Computation), but as they get a little mathematical, I will skip their definitions (a toy sketch of the masking idea follows this list).
  • Perturbation techniques add noise to the original data in such a way that, while the perturbed data does not resemble the original data, the statistical information carried by the two datasets is indistinguishable. These techniques are simpler and more efficient than cryptographic techniques, but they can degrade the quality of your data. Look up differential privacy if you want to know more about them.
  • Anonymization techniques aim at achieving group-based anonymization, which means removing personally identifiable information by masking an individual’s sensitive attributes and making them indistinguishable from other individuals. An example is k-anonymity.
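
As a flavor of how the server can aggregate what it cannot read, here is a toy sketch of secure aggregation based on pairwise random masks (a simplified version of the secret-sharing idea above, with made-up numbers):

import numpy as np

rng = np.random.default_rng(0)
n_clients, n_params = 3, 5

# Hypothetical local updates that the clients do NOT want to reveal.
true_updates = [rng.normal(size=n_params) for _ in range(n_clients)]

# Each pair of clients agrees on a shared random mask; one adds it, the other
# subtracts it, so all masks cancel out in the aggregate.
pair_masks = {(i, j): rng.normal(size=n_params)
              for i in range(n_clients) for j in range(i + 1, n_clients)}

masked_updates = []
for i in range(n_clients):
    masked = true_updates[i].copy()
    for (a, b), mask in pair_masks.items():
        if a == i:
            masked += mask
        elif b == i:
            masked -= mask
    masked_updates.append(masked)

# The server only ever sees the masked updates, yet their sum equals the
# sum of the true updates.
assert np.allclose(sum(masked_updates), sum(true_updates))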

While these techniques may require additional effort from the parties in the federation, they are paramount for the trusted execution of the Federated Learning algorithm. Again, there is no a priori correct combination of techniques; it depends on the requirements of the specific federation.

FL frameworks

Given the growth of Artificial Intelligence and the increasing sensitivity to confidentiality and data privacy, I expect Federated Learning to gain additional interest in the coming years. Even today, some FL platforms have already been developed and used in practice, such as CAFEIN. Many companies, including NVIDIA (with NVIDIA FLARE), IBM and Apple, are also experimenting in the field and have their own platforms at different stages of development.

As I come to the conclusion, let me also invite you to take a look for yourself. Two of the most common open-source FL frameworks, Flower and OpenFL, are available online. They can be used either to simulate federated experiments or to set up real-world environments.

Flower targets both simulated and real-world environments and is agnostic to the ML or DL framework used, as well as communication-agnostic, privacy-agnostic and generally flexible. You specify the configuration of the server and of the clients in Python files that implement the interfaces provided by the framework, and then start the modules. Several algorithms are already implemented by default for you to choose from, but you can also customize them to create and use your own.
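
As a flavor of what a Flower client looks like, here is a minimal sketch in the spirit of Flower’s NumPyClient interface (exact entry points and signatures vary across Flower versions, and model, x_train, y_train, x_test, y_test are assumed to be defined elsewhere, for example a Keras model and a local dataset):

import flwr as fl

class FedAvgClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        # Return the current local weights as a list of numpy arrays
        return model.get_weights()

    def fit(self, parameters, config):
        # Load the global weights, train locally, send the update back
        model.set_weights(parameters)
        model.fit(x_train, y_train, epochs=1, batch_size=32)
        return model.get_weights(), len(x_train), {}

    def evaluate(self, parameters, config):
        # Evaluate the global model on the local test set
        model.set_weights(parameters)
        loss, accuracy = model.evaluate(x_test, y_test)
        return loss, len(x_test), {"accuracy": accuracy}

fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=FedAvgClient())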

[Figure: OpenFL reference architecture]

OpenFL is also written in Python and works with TensorFlow and PyTorch by default, but the set of supported frameworks can be extended if needed. It is designed for real-world scalability, trusted execution and the easy migration of centralized ML models into a Federated Learning pipeline. The reference topology is the classical star topology, with a server (called the aggregator) and multiple clients (called collaborators), but mesh topologies are also allowed.

As technology moves on, and because there are still many open problems, research is advancing fast and these frameworks are constantly evolving to keep up with the pace.

I hope you’ve enjoyed this introductory journey to Federated Learning, and thanks for reading.

References

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” 2016. [Online]. Available: https://arxiv.org/abs/1602.05629

[2] X. Yin, Y. Zhu, and J. Hu, “A comprehensive survey of privacy-preserving federated learning: A taxonomy, review, and future directions,” ACM Comput. Surv., vol. 54, no. 6, Jul. 2021. [Online]. Available: https://doi.org/10.1145/3460427
