Simulation-to-Reality Gap in Federated Learning — Part 1

A walk through some “existential” questions about training, validation, and test in federated learning

Yasmine Djebrouni
Apr 11, 2022 · 4 min read

Introduction

Simulation-to-reality gap, or reality gap, is a term that refers to the difficulty of transferring simulated experience into the real world. It also refers to the difference between the promised potential of a scientific or technical development and its actual performance or use in practice. In Federated Learning (FL), this gap is significant: many FL developers and researchers, especially beginners, learn a lot about how to run an FL simulation, but little about how to do real federated learning.

Before I go into more detail about the gap between simulation and reality in FL, you can learn about the concept of FL and its motivation at the following link.

What Is Federated Learning (FL)? Plot Twist: A Fairy Tale

FL tutorials and research papers are becoming more widespread. However, most of them are based on simulation. Beginners encounter FL simulations far more often than real deployments, which makes real FL less familiar and the simulation, shall I say, even more simulated?

After reading a lot of FL works, and once I had to set the hyper-parameters for my first FL model, train it, validate it, and test it all by myself, some FL “existential” questions popped into my head. I was a real beginner then, and I was not the only one asking these questions, which reassured me. After some research, I decided to share the questions, and the answers I found for them, in this article, for all FL beginners out there (how pretentious does that sound? 👀)

The closer you look, the less you see

A big part of the discrepancy between simulation and reality in FL comes from the virtual deployment used in simulation, which is very different from reality: hundreds of heterogeneous embedded devices with non-IID (heterogeneously distributed) data. This makes simulation performance (in terms of convergence time, consumed bandwidth, model quality, etc.) far from real performance. In this series, however, I do not want to focus on the performance gap, but on the process gap. More specifically, I will focus on how training, validation, and testing are performed in FL, in simulation vs. reality.

Federated Learning in Simulation

The main reason FL simulation is prevalent is that real FL systems are expensive to set up (up to hundreds of data owners, a real network, geographic distribution, data and hardware heterogeneity, etc.). Only a handful of papers, usually published by large companies like Google, deploy their FL process (or a part of it, like inference) on real production systems. An example of such work is [1], where the authors present an application of FL for mobile keyboard prediction.

Let us put on the cape of an FL developer who relies on simulation to build an FL model. The first step is setting up the data. All the labeled learning data is stored in one global dataset; some famous datasets used in ML research are also used in FL research, such as MNIST. The data is then distributed to shards in a non-IID way, to simulate the data heterogeneity of an FL environment. A shard represents one client's data, with all clients running on the same machine. To build validation and test datasets, either portions of the data are cut from the global dataset before being distributed to shards, or portions of the data are cut from each virtual client and combined to form the validation/test sets. The data remaining on the virtual clients represents the training data. Once the data is set up, training can start. After training, the model's performance is measured on the test set. Research papers lack details on hyper-parameter/model validation; at best, the hyper-parameters used for local training are mentioned.
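To make this concrete, here is a minimal sketch (in Python with NumPy) of the data setup described above. The dataset is synthetic (a stand-in for something like MNIST, so the snippet runs without downloading anything), and the numbers of clients, shards, and split sizes are arbitrary assumptions, not values from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a labeled dataset such as MNIST:
# 6,000 flattened 28x28 images, 10 classes. (Assumed sizes.)
num_samples, num_classes, num_clients = 6000, 10, 20
X = rng.normal(size=(num_samples, 28 * 28))
y = rng.integers(0, num_classes, size=num_samples)

# Option A: cut validation/test portions from the global dataset
# *before* distributing the rest to the virtual clients.
perm = rng.permutation(num_samples)
test_idx, val_idx, remaining = perm[:500], perm[500:1000], perm[1000:]

# Non-IID partitioning: sort the remaining data by label and slice it into
# contiguous shards, so each client ends up with only a few classes
# (the "pathological" split often used in FL simulations).
remaining = remaining[np.argsort(y[remaining], kind="stable")]
shards = np.array_split(remaining, 2 * num_clients)      # 2 shards per client
order = rng.permutation(len(shards))                      # assign shards randomly
client_data = {
    c: np.concatenate([shards[order[2 * c]], shards[order[2 * c + 1]]])
    for c in range(num_clients)
}

# Option B (alternative): cut a small slice from every virtual client
# and pool those slices to form a validation/test set.
pooled_val_idx = np.concatenate(
    [idx[: len(idx) // 10] for idx in client_data.values()]
)

# Inspect a few clients: each should only hold a couple of classes.
for c in range(3):
    labels = np.unique(y[client_data[c]])
    print(f"client {c}: {len(client_data[c])} samples, classes {labels}")
```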

Although all three phases, namely training, validation, and testing, exist in FL, the majority of research papers focus on the training phase. Moreover, research papers largely ignore the role of the FL developer, who is supposed to monitor the training, validate the model, and launch the test. [2] is a notable exception.

The process described above raises some intuitive questions that we can easily answer in the context of machine learning, but that are difficult to answer in the context of real FL.

  • In real FL, data is already distributed and private across different clients. One cannot extract sets for validation and testing as in simulation. Who performs those actions in real FL systems: the server or the clients? And which data is used for the purpose?
  • In real FL, data labeling is a problem. It cannot be done manually as in centralized ML, since we do not have access to the training data. How is data labeled in supervised FL?
  • And finally, how is hyper-parameter tuning really done in real FL scenarios?

In the next part of this article, I will try to answer these questions, relying on information and details I obtained from FL developers and/or found in research articles published by Google, where the settings of real-life FL systems are very much taken into account.

To be continued

EDIT: Don’t miss Part 2, available here!

References

[1] Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., … & Ramage, D. (2018). Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604.

[2] Lai, F., Zhu, X., Madhyastha, H. V., & Chowdhury, M. (2021). Oort: Efficient federated learning via guided participant selection. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) (pp. 19–35).

Useful Links

https://doc.fedml.ai/
