The Power of Federated Learning with Synthetic Data: A Perfect Symbiosis for Speed and Performance

Intel · Published in Intel Tech · Nov 2, 2023 · 9 min read

An experiment demonstrating the improvements in Federated Learning when utilizing Synthetic Data.


In the rapidly evolving field of machine learning, federated learning offers a privacy-preserving method for training models on shared data across many collaborators who cannot share their data, typically for legal reasons (for example, data protection laws, contractual restrictions, or user consent issues).

Imagine Hospital A wants to build the best COVID-19 mortality prediction model to improve patient care. It turns out Hospital B also wants to train the same model. If both hospitals work together and combine their data as collaborators, they can get a more accurate model than either hospital could individually. However, there is a caveat: due to privacy concerns and legal restrictions like HIPAA, Hospital B can’t easily send its data over to Hospital A (and vice versa). Federated learning allows these hospitals to collaborate on training a single model by aggregating independently trained model parameters and updates, rather than training on aggregated raw data (Fig. 1).

Fig. 1: Federated learning aggregates locally trained model updates from collaborators (here, hospitals with statistically different patient populations) rather than pooling raw data.

Just like any privacy-enhancing technology, this comes with tradeoffs. If hospital A has a statistically different patient population than hospital B, as shown in Fig. 1 (e.g., hospital A is in a rural town with limited healthcare access while hospital B is in an area with top-tier healthcare), training can slow down significantly because model aggregation must reconcile the local differences while generalizing to the global population.

While the populations of different collaborators’ data may be significantly different, there are still important benefits to training models on more global data. While locally trained models may initially be performant on their local subpopulation, they can learn improper patterns due to the bias in their data. For example, one hospital’s data entry system may introduce a unique local relationship in their data that is learned by a model instead of learning a true pathological relationship. When the data entry system changes, the model will begin to fail. Even without structural local data issues like this, models trained on more global data will likely be better suited to deal with the inevitable natural evolution and change in local populations.

As a result, it may take a long time for the global model to converge on the optimal parameters for the global population, which incurs trade-offs (compute and network traffic costs grow with training time) and delays time to value.

This blog post explores an experiment that addresses this common challenge in federated learning: supplementing each collaborator with privacy-preserving synthetic data that provides a global view of the project’s data, approximating a scenario in which the source data could be physically aggregated and trained on.

The Problem: The Limitations of Federated Learning

Federated learning works by allowing many collaborators to train models based on their local data, then periodically aggregating local model parameters to build a single generalized model that captures patterns learned by each collaborator. In a perfect world, all the local models are learning similar patterns so aggregating them is simple because all the models are ‘pointing in the same direction’. Unfortunately, in many scenarios, all the models are not pointing in the exact same direction since different collaborators may have fundamentally different populations with different patterns to learn in the data. The more the local models are ‘pulling in different directions’, the slower model convergence is.
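
To make the aggregation step concrete, here is a minimal sketch of weighted parameter averaging (FedAvg-style), one common way federated frameworks combine local updates into a global model. The function name and data structures are illustrative assumptions, not OpenFL’s actual API.

```python
import numpy as np

def federated_average(local_weights, local_sample_counts):
    """Combine per-collaborator model parameters into one global model.

    local_weights: one entry per collaborator, each a list of numpy arrays
                   holding that collaborator's layer parameters.
    local_sample_counts: number of local training samples per collaborator,
                         used to weight each contribution.
    """
    total = float(sum(local_sample_counts))
    fractions = [n / total for n in local_sample_counts]

    # Weighted average of each layer's parameters across collaborators.
    global_weights = []
    for layer in range(len(local_weights[0])):
        avg = sum(f * w[layer] for f, w in zip(fractions, local_weights))
        global_weights.append(avg)
    return global_weights

# Each round, collaborators train locally, send only parameters (never raw
# data), and receive the aggregated global model back for the next round.
```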

Slow convergence significantly hinders model development, as the time spent training often carries high infrastructure and networking costs, as well as slower iteration and experimentation time. The more time and money it costs to train a model, the harder it is to get the best model into production.

Hypothesis: Federated Learning using globally representative synthetic data will converge more quickly with comparable quality

The hypothesis was that if each collaborator (every site) could share a synthetic version of its data with the others, it would lead to faster convergence without sacrificing performance (accuracy).

The intuition behind this is that synthetic data could be used to help each collaborator get a better view of the global data distribution, alleviating the problems raised by the heterogeneous nature of each collaborator’s data.

Federated learning and private synthetic data help make up for each other’s downsides. Synthetic data loses some signal to preserve privacy, the effect of which is mitigated by using it as a supplement rather than a full replacement. Federated learning is slowed down by the population differences from site to site which means no collaborator knows the true global distribution. Properly anonymized synthetic data gives each collaborator a view into the global distribution without leaking private information.

What is synthetic data?

Synthetic datasets are row-level approximations of source datasets generated from distributions learned by generative models. These generative models can learn complex multivariate relationships that are key to machine learning applications.
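
As a toy illustration of the idea (not Subsalt’s actual method), a simple generative model such as a Gaussian mixture can be fit to numeric source data and then sampled to produce row-level synthetic records. The dataset and hyperparameters below are made up for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for sensitive numeric source data (e.g., age, balance, duration).
real_data = rng.normal(loc=[40, 1500, 250], scale=[10, 800, 100], size=(5000, 3))

# Learn an approximation of the joint distribution of the source data...
generator = GaussianMixture(n_components=8, random_state=0).fit(real_data)

# ...then sample row-level synthetic records from the learned distribution.
synthetic_rows, _ = generator.sample(2000)

# Note: nothing in this toy guarantees privacy; an overfit generator can
# memorize and leak real records, which is why auditing matters (see below).
```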

Synthetic datasets are not inherently private. Just like any ML model, a generative model can be overfitted and produce data that leaks private information or is re-identifiable. If collaborators want to safely share synthetic data with each other, they must prove that the synthetic data are sufficiently anonymized and meet the relevant legal standard (e.g. HIPAA or GDPR). Subsalt’s generative database facilitates the generation of synthetic data as well as automated audits to prove that the generated results are safe to share outside of traditional regulatory boundaries.

Experiment overview

We used OpenFL, an open source federated learning framework hosted by the Linux Foundation AI and Data and originally developed by Intel, to train all federated models in the experiment; Subsalt’s generative database was responsible for all synthetic data generation.

For this experiment, the target models were binary classifiers of outcomes from a publicly-available bank marketing dataset. We tested three key collaborator data setups to measure how convergence time (measured in epochs) would change in each setting.

  • Theoretical Max — Centralized Data: A single model can be trained on all the data at once. This is the optimal model training setting, as it requires no privacy enhancing technologies.
  • Control — Randomly Distributed Data: Simulate 5 different collaborators having ⅕ of the data each, selected at random. This is the optimal federated learning setup, as different collaborators all have data that resembles the global population.
  • Real World Simulation — Biased Distributed Data: Simulate 5 different collaborators having varying amounts of data from different biased clusters, where each collaborator’s local data will not closely resemble the true global population. This setup creates friction in federated learning because each collaborator’s data pulls its local model in a different direction based on the local distribution (a partitioning sketch follows this list).
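
Below is a rough sketch of how these setups could be simulated from a single dataset. The cluster-based biased split is an assumption about how the biased shards might be constructed; the post does not specify the exact partitioning method.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_split(X, y, n_collaborators=5, seed=0):
    """Control: each collaborator gets a random 1/5 of the data."""
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(X)), n_collaborators)
    return [(X[idx], y[idx]) for idx in shards]

def biased_split(X, y, n_collaborators=5, seed=0):
    """Real-world simulation: unequal, cluster-skewed shards per collaborator."""
    labels = KMeans(n_clusters=n_collaborators, n_init=10, random_state=seed).fit_predict(X)
    return [(X[labels == c], y[labels == c]) for c in range(n_collaborators)]

# Theoretical max: skip splitting and train a single model directly on (X, y).
```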

For the distributed sensitive data setups, we tested scenarios with and without supplemental synthetic data to measure the potential gains in convergence time. Each collaborator generates their own synthetic data, which is then used by other collaborators to mimic the global data distribution without ever sharing real data.

In all cases, we controlled for the federated model architecture and the number of model updates to ensure that there were no performance benefits due solely to having more data to train on at each site (e.g., more batch updates per epoch). We trained each model five times and took the average performance to ensure the results were repeatable.

Results: Synthetic Data + Federated Learning are Better Together

As expected, a single model trained with all data centralized in one place has the fastest convergence speed and optimal prediction accuracy, while federated models are slower to converge with minor prediction accuracy loss when data is biased across collaborators.

Regardless of how data is distributed across collaborators in a federated learning scenario, model convergence accelerates significantly when each site is supplemented with anonymized synthetic data. While models trained on centralized, fully synthetic data do converge the fastest, that speed comes at the cost of peak performance, demonstrating the synergy between FL and synthetic data.

In the control setting (each collaborator has randomly distributed data), the model trained on synthetic data converges the fastest, but at the lowest AUC. The model using only federated learning converges closer to the theoretical maximum performance in terms of accuracy, but takes nearly 3x as many epochs as the synthetic data model. The model combining both techniques gets closest to the theoretical maximum, and does so ~30% faster (in number of epochs) than the federated-learning-only model.

The results in the real world simulation (individual collaborator data is not a good approximation of the global distribution) are similar to those of the control. Synthetic data optimizes for speed, federated learning optimizes for performance, and the combination gets the best of both worlds.

Additionally, the ratio of synthetic data added to the training process allows federation operators to balance convergence speed with model quality; adding more globally-representative data to each collaborator significantly improved convergence speed, but adding in too much eventually caused minor losses in model performance.

In a setup where each collaborator has ⅕ of the data, to get the closest approximation of the global distribution you would need ⅘ of the data to be synthetic. This introduces a heavy imbalance of synthetic:real data, which creates a speed/performance tradeoff where more synthetic data dilutes the performance benefit of federated learning and amplifies the speed benefit of synthetic data.

We measured this tradeoff by using varying ratios of real:synthetic data, controlled by how much synthetic data each collaborator contributed to its partner collaborators. To maintain a 1:1 ratio of real:synthetic data with 5 collaborators, each collaborator should get 25% of the synthetic data from each other collaborator.
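
To spell out the arithmetic behind that 25% figure, assume each collaborator generates a synthetic set roughly the size of its own real shard: with 5 collaborators each holding ⅕ of the real data, sampling 25% of the synthetic data from each of the 4 partners yields 4 × 0.25 × ⅕ = ⅕ of the total data as synthetic rows, matching the ⅕ held locally as real rows. Below is a minimal sketch of this mixing step, with illustrative (not actual) function names:

```python
import numpy as np

def build_training_set(local_real, partner_synthetic, sample_fraction, seed=0):
    """Mix a collaborator's real shard with a sampled slice of each partner's
    synthetic data. With 5 equal shards, sample_fraction=0.25 gives roughly
    a 1:1 real:synthetic mix; larger fractions skew the mix toward synthetic."""
    rng = np.random.default_rng(seed)
    sampled = []
    for synth in partner_synthetic:
        n = int(len(synth) * sample_fraction)
        idx = rng.choice(len(synth), size=n, replace=False)
        sampled.append(synth[idx])
    return np.concatenate([local_real] + sampled, axis=0)
```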

In the control, this tradeoff became apparent as the ratio of real:synthetic data fell. While convergence speed strictly increased as the amount of synthetic data increased, using more than 25% of the synthetic data showed sub-optimal performance.

This trend also held true in the real-world simulation.

Conclusion

The results of this experiment support the hypothesis that federated learning efforts can be significantly accelerated with the use of privacy-preserving synthetic data, especially in scenarios with imbalanced datasets between collaborators. In practice, this results in more cost-effective and rapid model development within these settings.

Supplemental synthetic data can reduce the time and cost of building federated learning models that combine the insights of many different collaborators’ data, without ever having to transport sensitive data across sites.

There is also early evidence of other synergies between federated learning and synthetic data that deserve further research, such as the ability to pre-train models on synthetic datasets prior to federation for a “hot start,” or to provide generalized global validation datasets at model aggregation time.

About the authors

David Singletary (Subsalt): CRO
Luke Segars (Subsalt): CTO
Dylan Moradpour (Subsalt): Founding Data Scientist

Micah J. Sheller (Intel): Senior AI Research Scientist

Xin Chen (Intel): Machine Learning Software Engineer

Prashant Shah (Intel): Global Head of Artificial Intelligence, Health and Life Sciences

Get Involved

There are many ways to get involved with OpenFL, such as trying tutorials, reading blog posts that explain how to train a model using OpenFL, and checking out online documentation that can help you launch your first federation.

If you’re already an expert, we encourage you to contribute to the community by solving issues or writing a blog post. You can also join our monthly virtual community meetings in your region. You’ll find all the info in the GitHub* repo.

For more open source content from Intel, check out open.intel
