Federated Learning: The Next Big Step Ahead for Data Sharing

BCG GAMMA editor
GAMMA — Part of BCG X

by Arun Ravindran, Aparna Kapoor, Piyush Mishra and John Gomez

AI adoption and data maturity within enterprises have seen significant growth in the past decade. With each passing day, new enterprise AI use cases come to life in more organizations and more industries. Some enterprises, especially those at the more mature end of this spectrum, have made tremendous progress productizing their data science capabilities, and have set up large data and model capabilities to fuel their customer growth.

Many of these same companies are looking for ways to make more data accessible to their AI programs and increase business growth — but to do so in a way that protects data privacy. As this exploration continues, we are seeing two emerging industry trends. First, organizations that have leveraged most of their existing first-party data assets are searching for newer dimensions to help them understand their customers and grow their businesses (BCG, 2020). Second, organizations that continue to struggle with sourcing their first-party data (e.g., CPG companies) are looking for ways, particularly through external partnerships, to increase visibility into their customers using data enrichment.

Until a few years ago, purchasing second- and third-party data from external vendors was about the only way enterprises could access new information about their customers, stores, or products. Once they acquired the data, the companies then had to blend those assets to enrich their understanding. Platforms that offer data-exchange (Saulles, 2020) or data-sharing services (Data Republic, 2021) have definitely simplified access to such offerings and have, as a result, seen strong adoption within specific industries. Such exchanges, though, have multiple shortcomings, ranging from sub-optimal granularity (e.g., zip code level information instead of socio-demographic data on individual customers) to poor quality (e.g., slower, manual updates to the data, or sparse coverage). On paper, an inter-organization data consortium could allow multiple entities to exchange information so all parties could learn more about their customers and their growth drivers. In practice, however, there have been very few such partnerships, mainly because of each company’s concerns about privacy, differing governance structures between entities, poor infrastructure, high costs, and regulatory challenges.

While all this has been taking place in the foreground, Federated Learning has emerged in the background as a new paradigm for collaboration and partnership between enterprises. Federated Learning (FL) enables companies to pool what their data can teach in a “closed-loop system”: together they build a common, powerful machine learning model without actually exchanging the underlying data. This single capability may soon enable companies to vastly improve customer insight while addressing such critical issues as data privacy, data security, data-access rights, and access to heterogeneous data.

In this article, we highlight the current Federated Learning landscape. We do so by assessing FL’s current progress and offering high-level predictions of when different industries will be able to start using FL to share data and models to drive individual actionability and solve specific use cases. We also discuss obstacles that must be overcome before large-scale adoption of Federated Learning is possible. Some of these obstacles are technical and must be addressed by data scientists. Others are organizational, including overall strategy, with whom to enter into agreements, and whether the technology is sufficiently mature; their solutions will reside in the executive suite.

Data Sharing Without Data Exchange

Federated Learning is broadly defined as “a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider”[1] without clients having to actually exchange their data. In this setting, the central server drives a cyclical process in which:

1. The central server first sends out the current version of the global model to a subset of client devices.

2. Each device then computes an update to the model parameters from its own local dataset, for example via stochastic gradient descent.

3. The devices then send their client-model updates back to the central server.

4. The server aggregates the updates to produce an updated version of the global model.

5. The server then sends out the new version to the client devices, and the entire process repeats itself.
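The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a one-feature linear regression, the server aggregates by a dataset-size-weighted average of client weights (in the style of federated averaging), and the names `local_update`, `aggregate`, `federated_training`, and `make_client` are hypothetical, not taken from any FL framework.

```python
import random

def local_update(weights, data, lr=0.05, epochs=5):
    """Step 2: a client refines the global weights with SGD on its own data."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on one local sample
            w -= lr * err * x       # gradient step for the slope
            b -= lr * err           # gradient step for the intercept
    return (w, b)

def aggregate(updates, sizes):
    """Step 4: the server averages client weights, weighted by dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def federated_training(clients, rounds=50, clients_per_round=2, seed=0):
    rng = random.Random(seed)
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Step 1: send the current global model to a subset of clients.
        selected = rng.sample(clients, clients_per_round)
        # Steps 2-3: each client trains locally and returns only its weights.
        updates = [local_update(global_weights, data) for data in selected]
        # Step 4: aggregate; step 5 is the next pass through this loop.
        global_weights = aggregate(updates, [len(d) for d in selected])
    return global_weights

def make_client(rng, n):
    """Hypothetical private dataset drawn from y = 3x + 1 (assumption)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0) for x in xs]

data_rng = random.Random(42)
clients = [make_client(data_rng, 20) for _ in range(4)]
w, b = federated_training(clients)
print(f"learned slope={w:.3f}, intercept={b:.3f}")
```

Note that the raw `(x, y)` pairs never leave `local_update`; only the weight pair travels between client and server, which is the property the article describes.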

This cyclical process can be effective — but relying on a single server can be risky. The central server is a single point of failure. It needs to be able to support large amounts of data transfer between itself and the clients. And it has to be trusted by all parties in the network.[2]

There are still other major risks associated with FL, and with distributed machine learning in general. Bad actors may attempt to compromise the global model, for example through model poisoning,[3] or may try to access or reconstruct clients’ private datasets.[4] These risks can be mitigated by combining decentralized or blockchain-native approaches, such as randomized selection of verifier peers and validation of model updates, with traditional privacy safeguards.
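To make the idea of model-update validation concrete, here is a minimal sketch of one possible check. It is an assumption chosen for illustration, not the verifier-peer scheme from the cited work: the aggregator computes the coordinate-wise median of all submitted updates, rejects any update that lies much farther from that median than is typical, and averages the rest. The function names and the `tolerance` parameter are hypothetical.

```python
import math
from statistics import median

def coordinate_median(updates):
    """Median of each parameter across all submitted updates."""
    return [median(column) for column in zip(*updates)]

def distance(a, b):
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def validate_and_aggregate(updates, tolerance=3.0):
    """Drop updates far from the median update, then average the survivors."""
    center = coordinate_median(updates)
    dists = [distance(u, center) for u in updates]
    typical = median(dists)  # how far an ordinary update sits from the center
    accepted = [u for u, d in zip(updates, dists)
                if d <= tolerance * typical + 1e-12]
    rejected = len(updates) - len(accepted)
    mean = [sum(column) / len(accepted) for column in zip(*accepted)]
    return mean, rejected

# Four plausible updates plus one that tries to drag the model far away.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.21], [0.11, -0.19]]
poisoned = [[5.0, 5.0]]
aggregated, rejected = validate_and_aggregate(honest + poisoned)
print(aggregated, rejected)
```

A filter like this only helps while honest participants are the majority, and it does nothing against attempts to reconstruct private data, which is why the text pairs validation with other privacy safeguards.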

Decentralized FL on the Horizon

While traditional FL is well on its way to becoming an established technology with multiple industrial-grade frameworks (OpenMined, 2021) (http://www.fedai.org, 2021) (IBM, 2021), fully decentralized FL is still very much a work in progress. To hasten the development of decentralized FL, ongoing research is focusing on decentralization of the learning process via peer-to-peer distributed learning. As this research progresses, decentralized FL may be able to overcome the risks inherent in standard FL when a central server drives a cyclical process.[2,5,6]

Blockchain, a form of distributed ledger technology, could make fully decentralized FL possible. In a naïve implementation of one such approach, updates from each network participant could be written directly to the ledger. Participants could then download the updated ledger to update their local version of the global model. This decentralized approach has the benefit of reducing the batch size for global-model updates to the number of updates written to each block in the ledger, and it allows the model to be updated nearly continuously.
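The naïve ledger-based approach can be sketched as follows. This is a toy under heavy assumptions: a single in-memory, hash-chained list stands in for a real blockchain, with no consensus protocol, signatures, or networking, and the `Ledger` and `Participant` classes are hypothetical names. It shows only the two ideas from the paragraph above: updates are written to blocks, and each participant replays new blocks to refresh its local copy of the global model, applying one averaged step per block.

```python
import hashlib
import json

def block_hash(prev_hash, updates):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps({"prev": prev_hash, "updates": updates}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class Ledger:
    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, updates):
        """Write one block holding a batch of model updates."""
        prev_hash = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        self.blocks.append({"prev": prev_hash, "updates": updates,
                            "hash": block_hash(prev_hash, updates)})

    def is_valid(self):
        """Check that no block has been altered since it was written."""
        prev_hash = self.GENESIS
        for block in self.blocks:
            if block["prev"] != prev_hash:
                return False
            if block["hash"] != block_hash(block["prev"], block["updates"]):
                return False
            prev_hash = block["hash"]
        return True

class Participant:
    def __init__(self, dim):
        self.model = [0.0] * dim
        self.blocks_applied = 0

    def sync(self, ledger):
        """Replay unseen blocks; each block contributes one averaged step."""
        if not ledger.is_valid():
            raise ValueError("ledger failed hash-chain check")
        for block in ledger.blocks[self.blocks_applied:]:
            n = len(block["updates"])  # batch size = updates in this block
            for i in range(len(self.model)):
                self.model[i] += sum(u[i] for u in block["updates"]) / n
            self.blocks_applied += 1

ledger = Ledger()
ledger.append([[0.2, -0.1], [0.4, 0.1]])  # block with two updates
ledger.append([[0.1, 0.0]])               # block with a single update
alice = Participant(dim=2)
alice.sync(ledger)
print(alice.model)
```

Because every participant applies the same blocks in the same order, all local copies of the global model stay identical without a central aggregator.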

Before we see widespread industrial adoption of ledger-based decentralized FL, the research community will have to come together around the best approaches to solving the open questions about decentralized FL, and industries will have to adopt blockchain technology. Given the rapid rate of standard FL advancements, once these two steps are taken, decentralized FL platforms could become more widely available within the next 2–4 years. But some organizations won’t have to wait that long.

The Role of Blockchain

In our view, many of the inter-institution applications of FL need not wait for commercial-ready, fully decentralized FL, especially when only a small number of institutions are involved. Decentralization is needed most when the interacting parties don’t know each other and can’t agree on a trustworthy intermediary. But if, for example, a large bank wanted to collaborate with a large credit card company to build a fraud-detection machine learning model, they could proceed without using a fully decentralized model. Each company’s interest in the model’s success — and in protecting their reputations and reliability — could be enough to convince them to incorporate traditional FL using a mutually acceptable central server.

Potential Real-World FL Applications

Federated Learning promises to revolutionize a wide range of digital use cases. In healthcare,[7] it could, in principle, be applied to many state-of-the-art machine learning-driven tasks: training joint computer-vision models to predict patient outcomes such as complications and re-hospitalization, detecting anomalies and making diagnoses based on other medical-sensor data and electronic health records (EHR), and addressing a host of other cross-institutional digital-health issues. In marketing, sales, and pricing, FL could be used to reduce development costs while increasing the accuracy of products such as hyper-personalized recommendation systems and e-commerce pricing engines; improve the accuracy of personalization algorithms; and help companies vastly increase their understanding of consumer purchasing habits and, thus, increase sales.

Fraud analytics and anti-money laundering are ideal starting points for FL financial applications and are, in fact, an active area of both ongoing research and commercial implementation (FinRegLab, 2020). By training jointly, credit card companies and banks would expose their models to a broader distribution of illicit or fraudulent financial transactions than any one of them sees alone. In manufacturing and mining, federation between machinery suppliers and machinery purchasers or operators would very likely lead to improvements in predictive maintenance. Various suppliers might also, for example, federate their data to build better predictive models of overall supply chain performance. Federation among providers of smart home devices and utilities could result in mutually beneficial models for predicting power consumption and device usage. The result would be both more responsive, personalized devices and more efficient electrical-grid management. Similar joint-model development and insight sharing is possible for water, waste management, internet, and similar utilities and services.

Perhaps more than any other application of machine learning in general or deep learning in particular, computer vision and natural language processing require access to very large amounts of data. While several tech companies have built and continue to develop sophisticated natural-language processing systems, they are able to collect only a fraction of the myriad samples of human speech. Similarly, many companies build and develop computer-vision systems for applications such as self-driving cars. Once again, each individual company has access only to its own massive, but massively incomplete, datasets of the relevant segments of human experience.

Clearly, cross-institution Federated Learning in the above industries could result in more intelligent systems. One unanswered question in applications such as these, though, is whether jointly developed computer-vision and natural-language processing models would reduce the competitive advantage of each individual contributor. If the benefits of joint development are scientific rather than financial, tech companies may not be sufficiently motivated to enter into this sort of arrangement. As such, Federated Learning in the tech sector is likely to be limited in the near term to cross-device learning among a single company’s devices (Apple, 2021).

Potential Pitfalls and Outlook

In general, Federated Learning holds great potential across many social sectors to unlock the power of AI by giving it access to otherwise sensitive, siloed data. Even so, we anticipate that industry-wide adoption will occur in incremental steps. Healthcare, which perhaps stands to benefit most from Federated Learning, will likely be among its slowest adopters, given the enormous moral and legal implications of data-privacy breaches within this industry. While the discussion of FL security and privacy is ongoing,[8] industries such as healthcare are not likely to make large-scale investments in the technology until the risk/reward trade-off, as demonstrated in industries with lesser privacy concerns, becomes favorable enough to attract the healthcare industry’s attention. At the same time, many current large-scale commercial applications, such as Google’s Gboard,[10] already take advantage of FL’s technological benefits, including improved efficiency of model training and development.

Generally speaking, the success of Federated Learning efforts between companies will require sufficiently aligned motivations, goals, and data resources. Some ongoing research seeks to monetarily reward participants in a blockchain-based, fully decentralized Federated Learning network, using a cryptocurrency based on the same blockchain.[4] This approach alone probably won’t be motivating enough to convince companies to participate in Federated Learning arrangements. There first needs to be sufficient additional upside from the development of the model itself, and the associated cryptocurrency must first achieve widespread adoption.

Even when companies stand to benefit from the FL joint-model-development approach, there is no guarantee that the resulting model will be better able to address each individual company’s use cases. Simply put, different companies’ or individuals’ datasets are generally not independent and identically distributed (IID), a situation that violates a fundamental assumption of most modern machine learning techniques. While FL models based on non-IID datasets have been shown to be better for some applications than independently trained models,[1] there is no guarantee that this will always be the case. For example, if two credit card companies with non-overlapping customer sets were to federate their data to build a personalized credit card offer system, the model should theoretically be better able to generalize to the entire two-company customer dataset than either company’s model could on its own. But if the two customer populations behaved very differently, the joint model might perform worse than each company’s own model when used to serve offers to that company’s customers.
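A toy calculation makes the non-IID risk concrete. The sketch below is an assumption-laden illustration, not a federated training run: it fits a one-parameter least-squares model to pooled data as a stand-in for the best joint model that federation could reach, and the two hypothetical companies' datasets follow deliberately different relationships (`y = 2x` versus `y = 0.5x`).

```python
def fit_slope(data):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def mse(w, data):
    """Mean squared error of slope w on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

xs = [1.0, 2.0, 3.0, 4.0]
company_a = [(x, 2.0 * x) for x in xs]  # hypothetical: customers respond strongly
company_b = [(x, 0.5 * x) for x in xs]  # hypothetical: customers respond weakly

w_a = fit_slope(company_a)                  # model trained on A's data only
w_b = fit_slope(company_b)                  # model trained on B's data only
w_joint = fit_slope(company_a + company_b)  # stand-in for the joint model

# The joint slope lands between the two local slopes, so it fits
# neither company's own customers as well as that company's local model.
print("local  error on A:", mse(w_a, company_a))
print("joint  error on A:", mse(w_joint, company_a))
print("local  error on B:", mse(w_b, company_b))
print("joint  error on B:", mse(w_joint, company_b))
```

With real, noisy, and smaller datasets the comparison is less one-sided, since a joint model can also reduce overfitting; the point is only that the benefit is not guaranteed.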

The widespread use of Federated Learning faces many challenges. But FL holds tremendous potential to enable companies in virtually any industrial sector to gain insights that are all too often obscured in siloed data. With each passing day, more technology players expand their FL offerings, and more new players enter the FL market. And each day, FL continues to demonstrate its efficacy and security. Given these trends, we are very confident that FL adoption in areas of increasing privacy needs will increase, as has the adoption of machine learning, cloud computing, and IoT in recent years.

In fact, in many ways FL represents the convergence of these technologies, each of which has advanced our collective ability to learn from data in ways previously beyond our imagination. We expect that Federated Learning — and, subsequently, decentralized Federated Learning — will experience similar trajectories over the next few years, becoming de facto standards for distributed machine learning, especially in those industries where privacy is a major concern. In our estimation, all industries that rely on data would do well to pay close attention to emerging developments in Federated Learning security and privacy.

References

[1] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Eichner, H., . . . (2021). Advances and Open Problems in Federated Learning. arXiv, 1912.04977.

[2] Pappas, C., Chatzopoulos, D., Lalis, S., & Vavalis, M. (2021). IPLS: A Framework for Decentralized Federated Learning. arXiv, 2101.01901.

[3] Bhagoji, A. N., Chakraborty, S., Mittal, P., & Calo, S. (2019). Analyzing Federated Learning through an Adversarial Lens. arXiv, 1811.12470.

[4] Hitaj, B., Ateniese, G., & Perez-Cruz, F. (2017). Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. arXiv, 1702.07464.

[5] Ramanan, P., & Nakayama, K. (2020). BAFFLE: Blockchain Based Aggregator Free Federated Learning. arXiv, 1909.07452.

[6] Shayan, M., Fung, C., Yoon, C. J., & Beschastnikh, I. (2018). Biscotti: A Ledger for Private and Secure Peer-to-Peer Machine Learning. arXiv, 1811.09904.

[7] Rieke, N., Hancox, J., Li, W., Milletarì, F., Roth, H. R., Albarqouni, S., . . . Xu, D. (2020). The future of digital health with federated learning. npj Digital Medicine, 3, 119.

[8] Kaissis, G. A., Makowski, M. R., Rückert, D., & Braren, R. F. (2020). Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence, 2, 305–311.

[10] Yang, T., Andrew, G., Eichner, H., Sun, H., Li, W., Kong, N., . . . Beaufays, F. (2018). Applied Federated Learning: Improving Google Keyboard Query Suggestions. arXiv, 1812.02903.
