What Is Federated Learning (FL)? Plot Twist: A Fairy Tale

About Machine Learning (ML), Big Data, Distributed ML and FL

Yasmine Djebrouni
6 min read · Jan 14, 2022

Once upon a time, there was only machine learning.

From Machine Learning to Distributed Machine Learning

Machine Learning (ML) is a set of methods used to extract knowledge from data and assist organizations in their decision-making processes, such as recommending products to customers, diagnosing diseases in patients, and recognizing objects for mobile device users. In its beginnings, machine learning was applied to small datasets that fit on single-node systems. Over the last seven years, however, we have witnessed the rise of Big Data: huge amounts of data generated and stored everywhere and in various forms, such as computer logs, social media accounts, and health records. These huge datasets, usually distributed across different locations and devices, hold great potential that can be unlocked through ML techniques.

Machine learning is a data-driven technique: the more high-quality data an ML system is fed, the richer its knowledge source becomes and the more it can learn. Machine learning is therefore ideal for uncovering the hidden patterns of Big Data. However, the size of Big Data has made the ML training phase impractical on traditional single-node platforms. Training times became very long, and in some cases a centralized solution was not even an option because the data was too large to be stored on a single machine. Examples of such Big Data include the transactional data of large companies stored in different locations [1] and astronomical data too large to be centralized on single-node systems [2]. Even accelerators such as GPUs and FPGAs could not help.

Till one day…

To ensure high predictive quality and performance, and to keep ML viable, developers turned to running ML algorithms on distributed systems, taking advantage of parallelization and high storage bandwidth. This led to the emergence of Distributed Machine Learning (DML).

From Distributed to Federated Learning

In DML systems, the training process is distributed among several nodes. To facilitate the development of DML solutions, several DML libraries have been proposed, including MLlib, BigDL, and distributed TensorFlow. These libraries allow users to easily run their ML algorithms on Big Data platforms such as Apache Spark and Hadoop.
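To give a flavor of this, here is a minimal sketch of what DML training can look like with Spark's MLlib in Python. The input file, column names, and label are hypothetical, and the cluster configuration is omitted; treat it as an illustration, not a production recipe:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Connect to the Spark cluster; an existing cluster setup is assumed.
spark = SparkSession.builder.appName("dml-sketch").getOrCreate()

# Hypothetical dataset: a CSV of transactions with a binary "label" column.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Pack the (hypothetical) feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["amount", "customer_age"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# Spark distributes the training computation across the nodes of the cluster.
model = LogisticRegression(maxIter=10).fit(train)
```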

DML systems have brought many benefits to the ML community, including faster executions and the ability to analyze more data. However, despite their decentralization, traditional DML approaches still rely on some centralization. To learn from datasets distributed across different locations, DML first aggregates all the data in a central location (a data center in the cloud) and then distributes the aggregated data to nodes in that central location. For example, a bank collects data from branches in different locations across a country and then distributes both the collected data and the training computations to nodes designated for data analysis and located in the main bank office. Not only do data transfers to the central location incur unacceptably high costs, especially when the data is geographically dispersed, but the confidentiality and security of the data are also at great risk: all data transferred over the network is exposed to disclosure, attacks, and cyber risks.

Server, server, on the cloud… show me the data of all the crowd

The most serious data breaches of the 21st century include Equifax (143 million customers affected in 2017) [3] and Marriott (500 million customers affected in 2018) [4]. To address such privacy issues, Federated Learning (FL) has been proposed.

Federated Learning (FL) & Some Applications

Federated Learning (FL) is a distributed machine learning paradigm that first appeared in 2017 in the work of McMahan et al. [5]. It allows a set of data owners connected through a network to collaboratively learn a shared global model without sharing their data.

In the FL paradigm, the learning process is usually orchestrated by a central server that is accessible to the different data owners. The terminology differs across federated learning systems: the central server may be referred to as the coordinator, while the data owners may be referred to as parties or clients.

In FL, parties can be mobile devices that collaborate to build a user-centric recommendation model, geo-distributed bank branches that aim to learn a shared predictor of which clients are likely to repay their loans, or hospitals that want to learn a global image classifier to apply to their patients' medical images.

Similar to classical machine learning, federated learning consists of several phases, namely a training phase, a validation phase, and a testing phase. During the training phase, parties train local models on their local data and send them to the server, which aggregates the received models to compute a global model. The validation phase takes place during training to validate the architecture of the trained global model. Finally, the testing phase verifies the performance of the global model before its final deployment to the parties.
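As an illustration, here is a minimal, self-contained sketch of this training loop in the spirit of Federated Averaging (FedAvg) from [5], using plain NumPy and a toy logistic-regression model. The parties, their synthetic datasets, and all hyperparameters are made up for the example:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One party: a few epochs of gradient descent on its local data."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)      # logistic-loss gradient
        w -= lr * grad
    return w

def federated_averaging(parties, rounds=20, dim=10):
    """Server: broadcast the global model, collect local models, average them."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        local_models = [local_update(global_w, X, y) for X, y in parties]
        sizes = np.array([len(y) for _, y in parties], dtype=float)
        # Weighted average by local dataset size, as in FedAvg [5].
        global_w = np.average(local_models, axis=0, weights=sizes)
    return global_w

# Toy federation: three parties with synthetic local datasets of unequal size.
rng = np.random.default_rng(0)
true_w = rng.normal(size=10)
parties = []
for n in (100, 200, 50):
    X = rng.normal(size=(n, 10))
    y = (X @ true_w > 0).astype(float)
    parties.append((X, y))

global_model = federated_averaging(parties)
```

Note that only model parameters cross the network; each party's raw data never leaves its machine.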

Compared to DML, FL is characterized by being:

(1) Communication-efficient, as models rather than large amounts of raw data are transmitted over the network (see the back-of-envelope sketch after this list);

(2) Privacy-preserving, as data owners do not send their raw data, which may contain private information, to other data owners or to the cloud.
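To make point (1) concrete, here is a rough back-of-envelope comparison; the model size, number of rounds, and dataset size are all assumed values chosen for illustration:

```python
# Assumed: a 1M-parameter model stored as float32, trained for 100 FL rounds,
# versus centralizing a party's hypothetical 10 GB of raw data once.
model_bytes = 1_000_000 * 4            # ~4 MB per model upload
rounds = 100
fl_traffic = rounds * model_bytes      # ~400 MB of model updates overall
dml_traffic = 10 * 1024**3             # ~10.7 GB to ship the raw data

print(f"FL traffic: {fl_traffic / 1e6:.0f} MB")
print(f"Centralized DML traffic: {dml_traffic / 1e9:.1f} GB")
```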

At the end of the FL process, when the FL developers are satisfied with their global model, the server sends the global model to the various parties, which may or may not have participated in the training. There it is used to make predictions, such as predicting the next word you will type and magically displaying it in front of you, or deciding which video to show you next in your feed!

And they all lived happily ever after ~

We have almost reached the end of the article. To be fair, the last line was added mainly because it rhymes well with the beginning ;). Data scientists, product recommendation specialists, bank owners, etc., are happy with FL, but realistically we cannot say that anyone has reached the Happy Ending yet. FL still has some challenges to overcome, mainly due to the heterogeneity of the data and hardware of the parties that typically participate in federated learning. I will discuss this in another article.

Key takeaways

  • Machine learning (ML) techniques were initially applied to small datasets that fit on single-node systems.
  • With the advent of Big Data, distributed machine learning (DML) has been proposed to run ML on distributed systems, allowing ML to benefit from parallelization and high storage bandwidth and scale to large datasets.
  • Classical DML relies on some degree of centralization. To learn from datasets distributed across different locations, DML first aggregates all the data in a centralized location, exposing sensitive data to disclosure, attacks, and cyber risks.
  • Federated Learning (FL) is a paradigm of distributed machine learning. It allows a set of data owners connected by a network to learn a global shared model together without disclosing their data.

EDIT: You can find a detailed description of federated learning training, validation, and testing here: Simulation-To-Reality Gap in Federated Learning — Part 2.

References

[1] Accelerating Large-scale Data Exploration through Data Diffusion. https://www.researchgate.net/publication/220490151_Accelerating_Large-scale_Data_Exploration_through_Data_Diffusion
[2] Raicu, I., Foster, I., Szalay, A., & Turcu, G. (2006). AstroPortal: A science gateway for large-scale astronomy data analysis. In TeraGrid Conference.
[3] Gressin, S. (2017). The equifax data breach: What to do. Federal Trade Commission, 8.
[4] Sanger, D. E., Perlroth, N., Thrush, G., & Rappeport, A. (2018). Marriott Data Breach Traced to Chinese Hackers. The New York Times, A1-L.
[5] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273–1282). PMLR.
