How PyTorch enables medical breakthroughs with federated learning at Owkin

Published in PyTorch on Jul 26, 2023

An overview of AI BioTech company Owkin and how their FL framework Substra is used with PyTorch to enable drug discovery.

Owkin uses AI to find the right treatment for every patient. Their aim is to integrate the best of human and artificial intelligence to deliver better drugs and diagnostics at scale.

Owkin leverages PyTorch in combination with other technologies to build multimodal models that help researchers better understand complex biology through AI and discover new medical treatments.

One of the many innovations at Owkin is the use of federated learning to train more robust and representative models that facilitate scientific breakthroughs in a privacy-enhancing way. We’ll take a look below at some of their medical research powered by the open source federated learning software Substra, which was recently added to the PyTorch ecosystem.

How PyTorch helps Owkin in the fight against cancer

Owkin uses PyTorch to build models and pipelines for various medical research outcomes, such as accelerating clinical development, identifying new biomarkers, or building tools to help doctors diagnose patients more effectively. Owkin leverages PyTorch for most of their research projects, citing the flexibility and hackability of the framework as the main reasons. The design philosophy of PyTorch is even embodied in some of their own machine learning pipelines.

Case Study: PACpAInt

PACpAInt is a multi-step deep learning model recently co-developed by Owkin and Assistance Publique-Hôpitaux de Paris (AP-HP) that decodes the complexity of pancreatic cancer, potentially revolutionizing the diagnosis and treatment of a disease with the lowest survival rate of all common cancers. It predicts tumor subtypes on surgical and biopsy specimens and independently predicts patient survival. Without PACpAInt, molecular subtyping requires costly, lengthy, and complex RNA sequencing, which few patients can access. This level of analysis typically requires that the same data be manually analyzed by highly trained pathologists, who are currently in short supply at hospitals worldwide. AI tools like PACpAInt greatly increase the speed and accuracy of patient diagnosis, enabling doctors to tailor treatments more efficiently to the individual person.

How PACpAInt helps classify pancreatic cancer

Similar models built by Owkin such as HE2RNA help predict treatment responses and survival outcomes of patients based on H&E slides (medical image data). The code for this model has also been open sourced and is available here.

Owkin then uses federated learning, a privacy-enhancing technology (PET), to scale the training of these models on larger volumes of data and more diverse datasets. The HealthChain project was featured in Nature Medicine as a milestone achievement for how federated learning can empower AI in the medical domain, connecting real hospitals training models to answer medical questions in triple-negative breast cancer. Including diversity in data sources from the beginning, as opposed to only using additional datasets for validation, generates models that are more generalisable and less biased. Hence they can be more readily applied in real-world settings and lead to better outcomes for patients.

What is federated learning

Federated learning (FL) is a decentralized machine learning procedure to train models using multiple data providers. Instead of gathering data on a single server, the data remains locked on local servers as only the algorithms and models travel between the data holders.

The goal of this approach is to build models that benefit from a larger pool of more diverse data as compared to a single source. Not only does this method result in increased performance and improve the statistical robustness of the model trained, it also allows data scientists and researchers to use data in a way that respects individual data ownership and privacy. You can check out the Substra space on Hugging Face to run a quick FL experiment if you’re interested in exploring an example.

An animation of Federated Averaging (FedAvg)

The animation above shows how federated averaging, one of many FL strategies, works. In this strategy, a copy of the model trains on each dataset simultaneously on separate training nodes, and an aggregation node then averages the resulting models’ weights.
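To make the averaging step concrete, here is a minimal sketch of a few FedAvg rounds on a toy one-parameter model. It is written in plain Python rather than with PyTorch or Substra, and every name in it (local_update, fed_avg, NODE_DATA) is hypothetical. It also assumes the common formulation in which each node's weights are weighted by its number of samples, a detail the description above does not specify.

```python
# Toy FedAvg on a 1-D linear model y = w * x (pure Python, no framework).
# All names (local_update, fed_avg, NODE_DATA) are illustrative assumptions.

def local_update(w, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one node's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fed_avg(local_weights, sample_counts):
    """Average node weights, weighting each node by its sample count."""
    total = sum(sample_counts)
    return sum(w * n for w, n in zip(local_weights, sample_counts)) / total


# Each node keeps its own (x, y) pairs; only the weight w leaves the node.
NODE_DATA = [
    [(1.0, 2.0), (2.0, 4.0)],              # node A: 2 samples
    [(1.0, 2.0), (3.0, 6.0), (0.5, 1.0)],  # node B: 3 samples
]

global_w = 0.0
for _ in range(20):  # communication rounds
    # Training nodes start from the current global weight and train locally.
    local_ws = [local_update(global_w, data) for data in NODE_DATA]
    # The aggregation node sees weights and sample counts, never raw data.
    global_w = fed_avg(local_ws, [len(d) for d in NODE_DATA])
```

Both toy datasets follow y = 2x, so global_w moves toward 2.0 even though neither node ever shares its data points. A real PyTorch setup would apply the same averaging to every tensor in a model's state_dict.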

What is Substra

Substra is open source FL software developed by Owkin and now hosted by the Linux Foundation for AI and Data. Substra provides a proven framework to securely power the training and validation of models on distributed datasets. It includes a flexible Python interface for easily integrating FL into existing machine learning stacks, and it also comes with a web application to monitor and analyze the results of FL experiments.

Although Substra is machine learning framework agnostic and can be used with any framework on any data modality, it also comes with a special interface for PyTorch users. FL researchers often opt to use PyTorch due to its additional flexibility as compared with other tools, which is very valuable in FL due to the high amount of iteration involved in building models. You can find a Substra example leveraging PyTorch here.

Real world applications of PyTorch with Substra

Owkin is one of the leading companies in applied federated learning and has been working on projects since its introduction in medical research, including the flagship MELLODDY project, which was a pivotal moment in the field. This was the largest ever pharma-industry AI collaboration for federated drug discovery, where Substra securely connected 10 pharma partners supported by 7 tech partners, enabling a collaboration between 100+ experts. The project contained the world’s largest collection of small molecules (more than 10 million annotated) with known biochemical or cellular activity. This enabled more accurate predictive models in drug discovery, with results recently published in the proceedings of the AAAI conference on AI.

Because of the project’s unusual setup, in which highly competitive companies were collaborating on private business data, federated learning was the best approach. The partners chose to keep the head of their respective company models private while sharing a common body to build a base model. All the partners selected PyTorch for this because of the flexibility of its building blocks.
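The shared-body, private-head idea can be sketched as averaging only the parameters that belong to the common body. The snippet below is a hypothetical illustration, not MELLODDY's actual code. The "body." and "head." prefixes imitate PyTorch state_dict key names, the values are plain lists of floats so it runs without any framework, and the helpers average_shared and apply_shared are made up for this example.

```python
# Sketch: federate a shared "body" while each partner keeps a private "head".
# Parameter names mimic PyTorch state_dict keys; values are plain float lists.
# average_shared and apply_shared are hypothetical helpers for illustration.

def average_shared(states, shared_prefix="body."):
    """Element-wise mean of parameters whose name starts with shared_prefix."""
    shared_keys = [k for k in states[0] if k.startswith(shared_prefix)]
    averaged = {}
    for key in shared_keys:
        # Group the i-th element of this parameter across all partners.
        columns = zip(*(state[key] for state in states))
        averaged[key] = [sum(col) / len(states) for col in columns]
    return averaged


def apply_shared(state, averaged):
    """Return a copy of state with shared parameters replaced; head untouched."""
    updated = dict(state)
    updated.update(averaged)
    return updated


partner_a = {"body.weight": [1.0, 2.0], "head.weight": [10.0]}
partner_b = {"body.weight": [3.0, 4.0], "head.weight": [-5.0]}

# Only "body." parameters are sent for aggregation; heads stay local.
shared = average_shared([partner_a, partner_b])
partner_a = apply_shared(partner_a, shared)
partner_b = apply_shared(partner_b, shared)
```

After this step both partners hold the same body parameters, while each head.weight stays exactly as its owner trained it and is never included in the aggregated result.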

To learn more about the project and the outcomes, you can check out the MELLODDY open source GitHub organization, where repositories such as the MELLODDY Tuner provide useful tools for understanding the work required in data preprocessing for federated learning.

PyTorch and Substra are also being used in the Bridge2AI project, a unique National Institutes of Health-led collaboration in which different universities and researchers are seeking to collaborate on voice datasets to discover how voice data can be used as a biomarker. We’ll do a deep-dive on this project later this year to explore exactly how the PyTorch models used in the Bridge2AI project enable medical breakthroughs.

As FL research grows in popularity, Substra is also being deployed in some recently launched large-scale healthcare projects such as OPTIMA and EUCAIM.

How to get started today

If you’d like to learn more about FL, the best way to get started is to go through the PyTorch example in the Substra documentation.

Federated learning is still an evolving field and many different avenues for research are still open. Whether it be new strategies for how to federate centers or research in FL attacks, there’s a lot left to explore in this domain. Owkin also recently open sourced FLamby, a collection of ready-to-use multimodal datasets that can be easily used in experiments to evaluate different FL methods.

Meta’s own FAIR team also invests in federated learning and provides FLSim as an exploration of this space. It is exciting to see more projects and companies investing in federated learning and we look forward to collaborating more in open source moving forward.

Come join us on the Substra community on Slack if you have a federated learning project in mind or if you’d simply like to learn more.