Building Biomedical Data & ML-ops Platform: From Collection to Discoveries

Sahil Rai
Elucidata
May 14, 2024 · 7 min read

Life science research presents a series of hurdles from the moment you collect your data to the moment you gain meaningful insights. Biomedical data is messy and complex, and it comes in all shapes and sizes. Turning this data into groundbreaking discoveries takes serious work, and each step has its own hurdles.

First, we gotta clean the data… think fixing typos, resolving inconsistencies, and ensuring everything speaks the same language.
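
To make that concrete, here's a minimal cleaning sketch in Python. The sample table and the canonical vocabulary are made up, and a real pipeline would normalize far more than tissue labels:

```python
# A minimal cleaning sketch (hypothetical data, not a real pipeline):
# normalizing inconsistent tissue labels in a sample-metadata table.
import pandas as pd

samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "tissue": ["Liver", "liver ", "LIVER", "hepatic tissue"],
})

# Map every observed variant onto one canonical vocabulary term.
canonical = {"liver": "liver", "hepatic tissue": "liver"}
samples["tissue"] = samples["tissue"].str.strip().str.lower().map(canonical)

print(samples)  # all four samples now "speak the same language": liver
```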

Then we gotta annotate it with metadata… adding context to the data to give it meaning and relevance.
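
As a toy example, annotation can be as simple as attaching structured, ontology-backed fields to a dataset record. The record layout below is purely illustrative, and the ontology IDs are shown only to give a flavor of what "machine-readable context" looks like:

```python
# A toy annotation sketch: attaching ontology-backed metadata to a dataset
# record so downstream search and analysis can interpret it. The layout is
# invented for illustration; the ontology IDs are common public terms.
dataset = {"id": "GSE0000", "raw_path": "s3://bucket/GSE0000.h5ad"}

dataset["metadata"] = {
    "organism": {"label": "Homo sapiens", "ontology_id": "NCBITaxon:9606"},
    "tissue":   {"label": "liver",        "ontology_id": "UBERON:0002107"},
    "disease":  {"label": "hepatocellular carcinoma",
                 "ontology_id": "MONDO:0007256"},
}

print(dataset["metadata"]["tissue"])
```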

Next, the huge problem is doing all of this at scale, which demands robust systems… to keep everything organized, accessible, and secure.

Just when you think you’ve weathered the storm, there’s the complex task of running machine learning models or statistical analyses, each with its own set of intricacies and nuances to navigate.

Challenges Associated with Data Cleaning & Data Harmonization

At Elucidata, we have spent years collaborating with pharmaceutical companies of all sizes. Over this period, we have ingested, processed, curated, and securely stored vast quantities of both public and proprietary data, empowering clients to extract invaluable insights from it. Throughout the evolution of our ML-ops platform, Polly, we have encountered and recognized key challenges that present significant hurdles in data processing, management, and insight generation.

Let’s delve into an overview of these challenges below:

  • Scalable, cost-effective data processing
  • Ensuring high-quality data curation
  • Effective data storage and management

Let’s look at them, one by one!

Data Processing: Going from Solo Server to Supercharged Chaos!

So, you’ve got your pipeline humming along nicely on your trusty single server or personal computer. It’s all smooth sailing until you start dreaming big — like, a thousand times bigger. Suddenly, you’re facing a whole new set of challenges that make your initial setup look like child’s play. Sure, setting up your own pipeline seemed pretty straightforward at first. But when you try to replicate that magic on a much grander scale, things start to get interesting. And by interesting, I mean complicated.

One of the first speed bumps you’ll hit is infrastructure scaling issues. Your once-mighty server is now struggling to keep up with the tsunami of processing demands. And let’s not forget about storage. As your datasets balloon in size, storage limitations rear their ugly heads. Suddenly, you’re playing a game of Tetris with your data, trying to cram it all into whatever space you have left.

To tackle these challenges head-on, you’ll need to roll up your sleeves and get down to some serious optimization at the infrastructure level. We’re talking about fine-tuning every nook and cranny of your setup to squeeze out every last drop of efficiency. Then there’s the elephant in the room: cost. As your operations expand, so do your expenses.

Solution? Polly Pipelines

Polly Pipelines is a specialized workflow orchestration system tailored to the complexities of biomedical data processing. Given the vast and intricate nature of biomedical data, a pivotal aspect of the system is its capacity to seamlessly scale compute and storage to meet escalating data requirements.

For a visual overview of the high-level architecture of Polly pipelines, refer to the diagram below:

[Figure: High-level design (HLD) of Polly Pipelines]

Some key features of Polly Pipelines:

  • Flexible Scripting: Craft your pipelines with your choice of language — Polly Pipelines supports Nextflow and Polly Workflow Language (PWL). We have plans to support Snakemake too in the future!
  • Elastic storage: No need to pre-allocate storage while running your pipelines! Our elastic file storage seamlessly adapts to handle massive datasets while staying performant.
  • Cost-effective processing: Leveraging Spot instances, the infrastructure reduces expenses by as much as 70%, with fail-safe mechanisms in place to handle instances reclaimed by AWS.
  • Effortless scalability: The infrastructure dynamically adjusts its capacity to accommodate heavier workloads, including intensive pipelines and parallel executions, scaling down to zero when idle.
  • Data Importers: Effortlessly import data from diverse sources like S3 and Basespace, streamlining your workflow.
  • Resume post failure: The infrastructure automatically saves the pipeline’s state, allowing it to resume from the last saved checkpoint in case of failures.
  • GUI and Programmatic Interface: Choose between a user-friendly graphical interface or a powerful programmatic interface (Python SDK) for automated workflows (see the sketch just after this list).
  • Clear Visibility: Gain complete insights with our dedicated monitoring dashboard. It visualizes detailed logs for each pipeline run, ensuring transparency and smooth troubleshooting.
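
To give a flavor of the programmatic route, here is a hypothetical sketch of submitting and monitoring a run from Python. The client class, method names, and pipeline parameters below are invented for illustration and are not Polly's actual SDK API; consult the Polly documentation for the real interface:

```python
# Hypothetical pipeline-submission sketch. `PipelineClient`, its methods,
# and the pipeline/parameter names are stand-ins, not Polly's real SDK.
import time


class PipelineClient:
    """Stand-in for an orchestration SDK client."""

    def submit(self, pipeline: str, params: dict) -> str:
        print(f"Submitting {pipeline} with {params}")
        return "run-001"

    def status(self, run_id: str) -> str:
        return "SUCCEEDED"


client = PipelineClient()
run_id = client.submit(
    pipeline="rnaseq-harmonization",
    params={"input": "s3://bucket/raw/", "genome": "GRCh38"},
)

# Poll until the run reaches a terminal state; per the feature list above,
# a failed run would resume from its last checkpoint on resubmission.
while client.status(run_id) not in ("SUCCEEDED", "FAILED"):
    time.sleep(30)
print(f"Run {run_id}: {client.status(run_id)}")
```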

High-quality Data Curation at Scale

Welcome to the world of life science research, where knowledge is endless and data is everywhere. In this busy field, organizing data is crucial. Think of it like sorting through a huge library to find the right books: detailed information about each dataset helps scientists find what they need quickly. Structured, annotated metadata are the unsung heroes here, letting researchers navigate with precision and retrieve and reuse invaluable data with ease. To get there, researchers curate their data.

But, curation is hard.

Especially for biomedical data. Genomic, proteomic, and clinical data each come with distinct formats and standards, which makes curation challenging. Research projects often need customized data solutions, turning curation into a bespoke, precision-driven process that demands significant time and effort. On top of that, quality concerns like missing values and errors require meticulous attention and rigorous quality control.

Solution? AI-assisted Curation on Polly

High-quality data is paramount for deriving accurate and meaningful insights. Manual curation ensures superior quality but is time-consuming and hard to scale; automatic curation scales well but may compromise accuracy. Recognizing the need for both efficiency and precision, Polly curation adopts a hybrid approach: data is first auto-curated by AI models, and curators then verify the model's output. By using AI to assist curators, Polly expedites curation while upholding rigorous quality benchmarks, such as adherence to FAIR data principles, ontological accuracy, and comprehensive coverage. This hybrid model combines the strengths of manual and automatic curation, delivering both data quality and scalability.
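
Here is a minimal sketch of the hybrid idea, with a placeholder model and threshold standing in for the real curation stack: the model proposes a label with a confidence score, and anything below the threshold is routed to a human curator:

```python
# Hybrid-curation sketch: auto-accept confident predictions, route the
# rest to manual review. The "model" and threshold are placeholders.
REVIEW_THRESHOLD = 0.90


def auto_curate(description: str) -> tuple[str, float]:
    """Placeholder for a trained metadata-extraction model."""
    # Imagine an NER / classification model returning (label, confidence).
    return ("liver", 0.97) if "hepat" in description else ("unknown", 0.40)


for desc in ["hepatocyte RNA-seq, 3 donors", "uncharacterized biopsy"]:
    label, conf = auto_curate(desc)
    if conf >= REVIEW_THRESHOLD:
        print(f"auto-accepted: {desc!r} -> {label} ({conf:.2f})")
    else:
        print(f"sent to curator: {desc!r} (model guess {label}, {conf:.2f})")
```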

[Figure: High-level design (HLD) of Polly Curation]

Effective Data Storage and Management

How data is stored profoundly impacts the depth of insights it yields. Even with access to top-tier data, finding answers to all inquiries isn’t guaranteed. To excel in this arena, two key components are indispensable:

  1. A superior data model adept at organizing the wealth of information at your disposal (a sketch follows this list).
  2. An efficient storage infrastructure that can hold vast volumes of data while actively supporting discovery and analysis.
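
As a sketch of component (1), here is one way a minimal, explicit dataset record could look. The fields are illustrative, not Polly's actual schema:

```python
# An illustrative data model: a small, explicit record that every stored
# dataset must satisfy. Field names are made up for this sketch.
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    dataset_id: str
    data_type: str                 # e.g. "bulk RNA-seq", "proteomics"
    organism: str
    storage_uri: str               # where the processed matrix lives
    annotations: dict = field(default_factory=dict)  # ontology-backed terms


record = DatasetRecord(
    dataset_id="GSE0000",
    data_type="bulk RNA-seq",
    organism="Homo sapiens",
    storage_uri="s3://bucket/harmonized/GSE0000.h5ad",
    annotations={"disease": "hepatocellular carcinoma"},
)
print(record)
```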

Solution? Polly Atlases

Polly Atlas serves two primary functions: Firstly, it aids in the organization and management of data. Secondly, it facilitates the analysis and exploration of data in optimal ways.

Bioinformaticians rely on a diverse range of tools and methodologies to extract meaningful insights from biological data. For example, a bioinformatician exploring gene expression patterns under environmental stress conditions requires access to a search interface capable of retrieving relevant datasets matching specific experimental parameters and associated keywords. Similarly, in another scenario, bioinformaticians may develop machine-learning models to identify disease biomarkers. These are just a few examples of the many potential applications. Polly Atlases offer flexibility to accommodate numerous use cases, enabling users to effectively consume data and derive insights.
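
For illustration, the kind of metadata query that gene-expression example implies might look like the toy search below. A real atlas would back this with a proper search index rather than a list comprehension, and the records are made up:

```python
# Toy metadata search over an in-memory "atlas" of dataset records.
atlas = [
    {"id": "DS1", "condition": "heat stress", "organism": "A. thaliana"},
    {"id": "DS2", "condition": "drought",     "organism": "A. thaliana"},
    {"id": "DS3", "condition": "heat stress", "organism": "H. sapiens"},
]


def search(condition: str, organism: str) -> list[dict]:
    """Return datasets matching the given experimental parameters."""
    return [d for d in atlas
            if d["condition"] == condition and d["organism"] == organism]


print(search("heat stress", "A. thaliana"))  # -> [{'id': 'DS1', ...}]
```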

[Figure: High-level design (HLD) of Polly Atlases]

Polly as an ML-ops Platform

Polly’s robust infrastructure makes it an excellent ML-ops platform for curation, QC, and downstream consumption. Its architecture streamlines the entire machine learning lifecycle, from data preparation to model training, deployment, and monitoring of foundational models across your harmonized data, powering downstream use cases and analyses like patient stratification, meta-analysis, biomarker prediction, target identification, and more.
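
As a toy illustration of one of those downstream use cases, patient stratification, here is a sketch that clusters patients on synthetic expression profiles. A real workflow would start from a harmonized matrix pulled from the platform rather than random data:

```python
# Toy patient stratification: cluster patients on expression profiles.
# The data is synthetic (two shifted distributions standing in for two
# molecular subtypes); this is a sketch, not a validated analysis.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 20 patients x 50 genes, drawn from two different distributions.
expression = np.vstack([
    rng.normal(0.0, 1.0, size=(10, 50)),
    rng.normal(2.0, 1.0, size=(10, 50)),
])

strata = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expression)
print(strata)  # cluster label per patient, e.g. [0 0 ... 1 1]
```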

Drive successful ML initiatives, reach insights up to 75% faster, and unlock deep biomolecular insights from your harmonized data. Work with our domain experts to perform metadata-based exploration and differential expression, build knowledge graphs, develop interactive dashboards, and more, to dig deep into the data for robust insights.

To conclude, I’d say the challenges discussed above are just the tip of the iceberg. A well-established biomedical data platform requires many more components than this blog covers; we have not even touched on data security and governance. Maybe an idea for the next blog?
