The Future Is Federated

Published in The Startup, Dec 21, 2020

Balancing the power of machine learning and privacy

The potential of machine learning for earlier disease detection was one of the first things that drew me to machine learning more broadly. When I was in high school, my dad found out he had kidney cancer, but only after a few years of what appeared to be random organ shutdowns and visits to many, many doctors who had no clue what was happening. The doctor who did make the diagnosis had access to a searchable repository that tied a few repeated cases of renal cell carcinoma (RCC) to these seemingly random shutdowns, which left me wondering why that information hadn’t been accessible sooner.

This drew me into the privacy-preserving machine learning space. When I was 19, I applied to the INOVA hospitals accelerator to research how ML could help us better understand biomarkers like miRNA concentration for better RCC diagnosis. As I came to understand HIPAA and the complex regulatory space medicine was stuck in, I began to research federated learning as well.

In 2017, Google published this research detailing how they use federated learning for Gboard (Apple shortly followed suit) to effectively train local models on search queries without sending the entirety of users’ personal data back to their servers, which pulled me deeper into the space.

A phone personalizes the model locally, based on your usage (A). Many users’ updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated.

In short: federated learning is one of many types of privacy-preserving machine learning. This approach specifically lets users’ personal data stay on their devices (or, in other use cases, lets data stay on servers) while the model is trained locally; only the model update is then sent to the cloud. Federated learning pairs the potential of machine learning with the benefits of distributing power (and data ownership) among users. For a broader overview of the variety of types, check out this explainer by OpenMined.
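To make the loop in the figure concrete, here is a minimal sketch of one common flavor of this idea, often called federated averaging. It is a toy of my own, not Google’s implementation: the function names (local_update, aggregate, make_device_data), the tiny linear model, and the synthetic data are all assumptions made for illustration. Each simulated device trains on data that never leaves it, and the server only ever sees weight deltas.

```python
import random

def local_update(weights, data, lr=0.05, epochs=5):
    """Step (A): train on one device's private data; return only the weight delta."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    # The raw (x, y) pairs stay here; only the change in weights is shared.
    return (w - weights[0], b - weights[1]), len(data)

def aggregate(weights, updates):
    """Steps (B) and (C): average the deltas, weighted by sample count, and apply them."""
    total = sum(n for _, n in updates)
    dw = sum(delta[0] * n for delta, n in updates) / total
    db = sum(delta[1] * n for delta, n in updates) / total
    return (weights[0] + dw, weights[1] + db)

def make_device_data(rng, n):
    """Hypothetical private data: noisy samples of y = 2x + 1."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 2 * x + 1 + rng.gauss(0, 0.01)))
    return data

rng = random.Random(0)
devices = [make_device_data(rng, rng.randint(10, 30)) for _ in range(8)]

global_weights = (0.0, 0.0)
for round_num in range(30):
    updates = [local_update(global_weights, d) for d in devices]
    global_weights = aggregate(global_weights, updates)

print(global_weights)  # should land close to the true slope 2 and intercept 1
```

Real systems layer a lot on top of this (device sampling, secure aggregation, compression of updates), but the shape of the loop is the same: train locally, send the update, average, repeat.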

How Then Shall We Live?

With questions of data privacy, who owns one’s data, and the power that comes with that responsibility more relevant than ever, we believe that now is the right time for an enterprise company to be built in this space, especially as the research surrounding FL has matured.

After the 2017 Google paper piqued our interest, we spent the next few years meeting with operators and researchers across data privacy and privacy-preserving machine learning broadly. As we built our thesis on both the application and timing of the space, we started to see the first wave of federated learning companies pop up and seek funding. What we saw, however, was that many would align with our thesis on the diverse potential customer base for this technology but would eventually end up in the narrow scope of fraud detection. That’s not a bad thing, but it was a sign for us that it’s perhaps still too early for the horizontal opportunity we thought was here.

(For broader notes on the future of compute architecture, see @mhdempsey’s “What kills Cloud Computing: A history of time shared computers and one device to rule them all”)

I like to daydream about the romantic ideals of the information structures of the future by examining the past. This is the ancient library of Alexandria, one of the largest libraries in the world.

Where the Future Lies

What makes us excited about this space is the number of critical industries that have understandably been unable to adopt machine learning because of privacy concerns and sensitive data, whether because of regulation (Europe’s GDPR and California’s CCPA), technical limitations, or concerns from stakeholders.

Highly complex internal teams, typical of government, pharma, and banking, pose particular challenges. Because of this, we think sending forward deployed engineers out for the first year or two to gain an internal understanding of these teams and accelerate product-market fit holds a lot of potential (not dissimilar to the way Palantir approached working with the government). Having a dedicated engineer think through how to architect your application of federated learning is a real administrative burden, so the industries best suited to FL are those whose regulatory or other privacy-related burden is high enough that they’re both economically and structurally motivated to spend the time implementing it. We see finance, pharma, and government as the likely first movers in this space, with a long tail of possibilities across healthcare broadly and other industries.

Currently, Doc.AI and Owkin are using FL with the intention of implementing cross-device FL for medical research, while Intel has focused on FL for medical imaging specifically. This piece lays out a simple framework for federated learning on vessel segmentation, if you want to try it out for yourself! This EU-funded paper and research details the potential of FL for drug discovery virtualization.

Musketeer is pushing forward use cases in smart manufacturing and medicine. Nvidia Clara is a reference application for distributed AI training that’s designed to run on Nvidia’s recently announced EGX intelligent edge computing platform. FedAI, Devron, Decentriq, and DataFleets are all focused on developing general enterprise federated learning platforms and frameworks.

Constraints, Challenges, and Open Questions

There are many different types of federated learning, and we’re excited to keep reading as the space deepens and solidifies, and as specific implementations become popular. (We’re often looking for teams in this space that are post-academia or spinning out of a research group, so it’s always exciting to receive whitepapers of new research they’re implementing at their company.) We came away with a few core questions about the challenges of the space:

What unique challenges do the constraints of the devices the model is trained on present? With cross-device federated learning, the devices that gather the data must also be able to train a model. There are also unique challenges around the varying fidelity of data that different devices collect, and around the speed at which they each train the model, so that they can deploy their updates to the cloud simultaneously if necessary.**
What new, and likely under-researched, security risks do FL systems present? One of the open questions in this space is the potential to reverse engineer details about private personal data from the high-level model update that is sent to the cloud. A Sybil attack, for example, represents some risk for FL. We’ll continue to follow along as the security research progresses in this space.
What level of parallel computing is possible? Current algorithms only work with device numbers in the hundreds; hopefully this number will grow as the algorithms progress.
How do we deal with non-IID data? The traditional statistical assumption made by many ML models (i.e., that the data is independently and identically distributed) doesn’t always hold in federated learning, so how we account for this in the ideal use cases is something we’re still thinking about. (This piece by DataFleets, a privacy-preserving data engine, gives a great illustrative example of non-IID data if you’re not familiar.) Edgify, for example, has proposed federated curvature, which adds a penalty term to the loss function, compelling all local models to converge to a shared optimum.
What more is possible for federated computing, outside of just machine learning?
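On the non-IID question above, here is a rough sketch of why a penalty term helps. To be clear about what this is: it uses a plain proximal penalty, (mu / 2) * (w - w_global)^2, which is a simpler stand-in and not Edgify’s federated curvature itself. The two clients, their data, and the local_train helper are hypothetical, chosen so that each client’s local optimum is different (a toy form of non-IID data).

```python
def local_train(w_global, data, mu, lr=0.1, steps=50):
    """Gradient descent on one client's data with a proximal penalty.

    Local objective (an assumption for this sketch):
        mean((w * x - y)^2) / 2 + (mu / 2) * (w - w_global)^2
    """
    w = w_global
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in data) / len(data)
        grad += mu * (w - w_global)  # gradient of the penalty term
        w -= lr * grad
    return w

# Hypothetical non-IID clients: same inputs, different underlying slopes.
xs = [0.5, 1.0, 1.5, 2.0]
client_a = [(x, 1.0 * x) for x in xs]  # this client's data alone prefers w = 1
client_b = [(x, 3.0 * x) for x in xs]  # this client's data alone prefers w = 3

w_global = 2.0
for mu in (0.0, 1.0):
    w_a = local_train(w_global, client_a, mu)
    w_b = local_train(w_global, client_b, mu)
    print(f"mu={mu}: client A -> {w_a:.3f}, client B -> {w_b:.3f}, "
          f"gap = {abs(w_a - w_b):.3f}")
```

With mu = 0, each client drifts all the way to its own optimum, and the local models disagree as much as the data does. With mu > 0, both stay closer to the shared model, so the gap between them shrinks. The cost is that each client fits its own data less closely, which is the tradeoff any penalty of this kind has to balance.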

It’s clear that privacy preserving ML, and federated learning especially, are a core part of the future we believe in — and we’re excited to play a part in it.

As always — feel free to tweet or message me questions, thoughts, disagreements, or pitches on twitter or at nicolewilliams@compound.vc

Appendix

** There’s a movement to better understand the tradeoffs involving communication costs, since end-user internet connections typically operate at lower rates (Yuchen Zhang, John Duchi, Michael I. Jordan, and Martin J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013).