Scaling Data Pipelines for High-throughput Bioinformatics

Saiful Khan
Elucidata
Jun 18, 2024 · 14 min read

Data pipelines are an essential tool in bioinformatics. They standardize data cleaning and complex data processing, which in turn improves reproducibility. Most data pipelines use some kind of framework for executing their tasks. Modern pipeline orchestration frameworks do a good job of abstracting and separating pipeline logic from infrastructure complexity, but that complexity still has to be managed somewhere.

Latent Challenges in Pipeline Management

Before we talk about the specifics of scalability, let us first consider a simplified model of the lifecycle of a bioinformatics data pipeline:

Lifecycle of a Bioinformatics Pipeline

Bioinformaticians are primarily trained and skilled at developing algorithms, performing data analysis, and interpreting the biological implications of data. Their expertise is central to understanding and solving biological problems through computational methods. Therefore, they will be highly involved in the development and interpretation stages of the pipeline. However, it is not a good use of their time to also be responsible for managing deployment and configuring execution environments.

Deploying and maintaining pipelines in a cloud environment requires an understanding of cloud services, configuration management, containerization technologies (like Docker), and CI/CD principles. These tasks are time-consuming and can divert bioinformaticians from their primary role – research and data analysis. Bioinformatics pipelines also have highly dynamic storage and computational needs. Efficiently managing computational resources to handle variable workloads, optimizing costs, handling scaling issues, and ensuring data security involves skills generally associated with cloud solutions experts. Getting involved with all of this would be yet another distraction.

A Collaborative Solution

This is where Polly Pipelines, Elucidata’s managed pipeline hosting and execution service, can help. You write the code; we take care of deployment and provide a fully managed, secure infrastructure to run it. You can monitor your executions and finally download the output and reports for further analysis.

Polly Pipelines manages deployment & execution stages completely

Providing such a service in a multi-tenant (or even a single-tenant) environment is not easy. We need to support highly concurrent workloads in both the development and the execution of pipelines. In this post we will talk about the pipeline deployment and execution architecture that lets us deliver the required throughput and concurrency.

Scaling Pipeline Deployment with Polly

The deployment of pipelines is often the most boring part of their development phase. It involves compiling or packaging the code, building and pushing execution containers, configuring environment variables, and registering or updating the pipeline’s metadata with the orchestration framework. It is a multi-step process that is automated in most production environments.
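As a rough illustration, a single pipeline’s deployment boils down to a short script like the sketch below. Everything in it is a stand-in – the repository layout, registry, and registration endpoint are hypothetical, not Polly’s actual API – but it captures the kind of steps the automation performs.

```python
import subprocess

import requests  # used here for a hypothetical HTTP metadata-registration call


def deploy_pipeline(name: str, version: str, registry: str, api_url: str) -> None:
    """Build, push, and register a single pipeline (illustrative only)."""
    image = f"{registry}/{name}:{version}"

    # 1. Package the pipeline's code into a container image.
    subprocess.run(["docker", "build", "-t", image, f"pipelines/{name}"], check=True)

    # 2. Push the image so the execution environment can pull it.
    subprocess.run(["docker", "push", image], check=True)

    # 3. Register or update the pipeline's metadata with the orchestrator
    #    (hypothetical endpoint, for illustration only).
    resp = requests.post(
        f"{api_url}/pipelines",
        json={"name": name, "version": version, "image": image},
        timeout=30,
    )
    resp.raise_for_status()
```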

In this section we will describe how our unique deployment strategies enable teams to scale both the number of pipelines and the number of contributors without being choked by operational blockers.

Isolated Deployment

Polly Pipelines gives us the ability to host all our pipelines in a central pipelines repository. This pattern can be replicated for any number of organizations, although we have not done it outside of Elucidata yet. Enterprises often adopt a “monorepo” approach to help centralize code governance as well as maintain common utilities & coding standards across pipelines. However, such repositories come with a major challenge – coupled deployment of pipelines.

At any given point in time, different pipelines in the repository can be at different stages of development. If we deploy the repository’s code because Alice has finished a new feature for Pipeline-1 while Bob is still testing a new feature in Pipeline-2, then Pipeline-2 gets deployed to production with unverified behavior. This is far from ideal. Trying to manage this situation through people processes often results in the delivery of one pipeline being blocked by another.

This is why we have carefully designed our continuous-deployment workflow so that each pipeline’s code (along with its dependencies) and container image are deployed in isolation from the others. If a developer makes changes to a pipeline and commits them to a live branch, they are presented with a prompt for approving the deployment of that pipeline. Therefore, if Alice and Bob make changes to Pipeline-1 and Pipeline-2, respectively, and commit them to the same branch, they will be presented with two separate approval holds. Alice will simply approve only Pipeline-1’s deployment, leaving Pipeline-2 unchanged in the live environment.

Isolated deployment ensures that only intended changes are deployed

If all pipeline developers practice this simple approach – approve only the pipelines they have made changes to – then there is no need for any manual coordination between members of a team or across teams.
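One way to implement this kind of isolation (shown here as a sketch, not as Polly’s actual workflow) is to diff the commit against the previous state and raise one approval gate per changed pipeline directory. The pipelines/<name>/ layout below is an assumption for illustration.

```python
import subprocess


def changed_pipelines(base_ref: str = "HEAD~1", head_ref: str = "HEAD") -> set[str]:
    """Return the names of pipelines whose files changed between two git refs."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", base_ref, head_ref],
        capture_output=True, text=True, check=True,
    ).stdout
    names = set()
    for path in diff.splitlines():
        parts = path.split("/")
        # Assumed layout: pipelines/<pipeline-name>/...
        if len(parts) >= 2 and parts[0] == "pipelines":
            names.add(parts[1])
    return names


# One approval hold per changed pipeline; untouched pipelines are never redeployed.
for name in sorted(changed_pipelines()):
    print(f"Awaiting approval to deploy: {name}")
```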

Multi-stage Deployment

While we support and recommend local testing of pipeline code before pushing it to the main repository, some pipelines simply cannot be tested offline on the developer’s local machine, most often due to their resource-intensive nature. Additionally, many teams also need a way to stage a new version of a pipeline in “testing” mode before moving it to production. This pre-production stage is where the pipeline may be used by internal peers or QA teams to make sure that its behavior satisfies the acceptance criteria set by the stakeholders.

To help with these requirements, Polly Pipelines' registry supports three stages for each pipeline. The repository has three live branches, one for each stage.

Each group (devs, QA specialists & end-users) gets a separate deployment

Merging a pipeline’s feature branch to the develop branch deploys the pipeline in “dev” mode. When executed, a pipeline in the “dev” stage runs in the development environment. The development environment is completely isolated from other environments. It supports small-scale execution of pipelines in a true cloud environment and is ideal for testing pipeline behavior during development.

Merging a pipeline's feature branch to the staging branch deploys the pipeline in “test” mode. When executed, these pipelines will run in a “test” environment, once again isolated from other environments. This environment is useful for providing a pre-production replica of the pipeline for verification by peers and QA team members.

Finally, once the team is confident that the pipeline is ready for production, they can merge their feature branch to the master branch, which will deploy the pipeline to the main production environment. This environment, while isolated from the dev and test environments, has all the compute capacity that Polly Pipelines has to offer, which is what we will discuss next.
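The branch-to-stage mapping itself is simple enough to capture in a few lines. The sketch below is illustrative; the environment labels are ours, not necessarily Polly’s internal configuration.

```python
# Hypothetical branch-to-stage mapping used by the CD workflow.
BRANCH_TO_STAGE = {
    "develop": "dev",    # small-scale, isolated development environment
    "staging": "test",   # pre-production replica for peers and QA
    "master": "prod",    # full-capacity production environment
}


def stage_for_branch(branch: str) -> str:
    """Pick the deployment stage for a live branch; anything else is not deployed."""
    try:
        return BRANCH_TO_STAGE[branch]
    except KeyError:
        raise ValueError(f"{branch!r} is not a live branch; nothing to deploy")
```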

Scaling Pipeline Execution with Polly

Although bioinformatics pipelines are specialized and adapted to the unique demands of biological data analysis, they can still be considered a type of ETL workload (a minimal code sketch follows the list below):

  1. Extract – This often involves retrieving raw biological data from various sources, such as publicly available sequencing data (GEO, etc.), public databases (like GenBank, ENSEMBL, etc.), or private lab-generated datasets.
  2. Transform – This is the main analytical part of the pipeline. It includes a series of computational steps and tools that process raw data into meaningful insights. This might include quality control, sequence alignment, variant calling, gene prediction, and other bioinformatics analyses.
  3. Load – Finally, the results, which could be annotated expression data, models, or analytical reports, are often stored in repositories and databases for further use and interpretation. Optionally, this output may itself become the input to yet another pipeline for other specific use cases.
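Here is that minimal sketch of the three stages. The function bodies are placeholders and the accession, paths, and bucket are made up; it only shows the shape of the workload.

```python
from pathlib import Path


def extract(accession: str, workdir: Path) -> Path:
    """Fetch raw data (e.g. from GEO or an S3 bucket) into the working directory."""
    raw = workdir / f"{accession}.fastq.gz"
    # ... download logic would go here ...
    return raw


def transform(raw: Path, workdir: Path) -> Path:
    """Run the analytical steps: QC, alignment/quantification, and so on."""
    results = workdir / "counts.tsv"
    # ... calls to the actual bioinformatics tools would go here ...
    return results


def load(results: Path, destination: str) -> None:
    """Persist the results to a durable store for downstream use."""
    # ... upload to S3, a database, or hand off to the next pipeline ...
    print(f"Would upload {results} to {destination}")


workdir = Path("/tmp/run-001")
load(transform(extract("GSE000000", workdir), workdir), "s3://results-bucket/run-001/")
```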

We are constantly innovating at each phase of this ETL process. This section describes how we are solving the most challenging issues practitioners encounter when running their bioinformatics pipelines.

Seamless Data Import

One of the bigger challenges of the data extraction phase is the migration of data from different sources. Each source has its own authentication mechanism. Additionally, a pipeline that is run often for different customers may need to use a different source of the same type (e.g., a different S3 bucket) in each run. Managing the code and authentication logic for so many disparate sources in each pipeline is a headache for developers.

Fortunately, Polly Pipelines comes with the ability to automatically import data from a variety of sources and make it easily accessible to your pipeline runs. In short, the user flow looks like:

  1. The user opens the pipeline execution form.
  2. For parameters that can accept files or folders as inputs, the user opens a file browser.
  3. If the source of files or folders has not been used previously, the user registers the new source with the necessary authentication details. For security purposes, we may invalidate auth tokens at pre-defined intervals, after which the user must update them again.
  4. Once registered, the user can browse files or folders from their source of choice. These sources remain available for all future runs.
  5. The user starts the pipeline run, and behind the scenes we bring the specified data into an S3 bucket path accessible to that run. Each run can only access data imported for it.
  6. After the import is completed, the pipeline’s actual execution begins, and it can simply do an aws s3 cp on the path (passed in place of the file or folder parameter).

This approach makes writing pipelines a lot simpler for bioinformaticians. They no longer have to manage dozens of secrets and authentication mechanisms. They can focus on the transformation processes of their pipeline instead.
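From a task’s point of view, the imported input is simply an S3 path parameter. The sketch below shows what consuming it might look like; the bucket, path, and local directory are hypothetical.

```python
import subprocess
from pathlib import Path


def stage_input(imported_s3_path: str, workdir: str = "./inputs") -> Path:
    """Copy a pre-imported, run-scoped S3 path to local storage for processing."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    # The run's credentials only grant access to data imported for this run.
    subprocess.run(
        ["aws", "s3", "cp", "--recursive", imported_s3_path, workdir],
        check=True,
    )
    return Path(workdir)


# The argument stands in for the value passed in place of the folder parameter.
inputs = stage_input("s3://run-scoped-bucket/run-1234/fastq/")
```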

The import feature does all the heavy-lifting of your data extraction process

For pipelines where multiple (potentially large) files are needed, we enable data import on a fleet of serverless workers. This greatly reduces the amount of time it would take to import large batches of files or folders.

As of this writing, we are starting with Amazon S3, Polly Workspaces, and Illumina BaseSpace as our first few sources. However, the Polly Pipelines architecture ensures that we can add support for new sources quickly.

Uniform Interface Across Executors

Most data pipelines are defined in a DSL or framework specifically crafted for that purpose. One such DSL is Nextflow. It has become increasingly popular in the scientific community due to its flexibility, scalability and ability to streamline complex computational workflows. For this reason, we also chose to go with Nextflow as our primary executor for all bioinformatics workloads.

However, since then, we have realized that not everyone will want to use Nextflow. For example, within Elucidata, we have data engineers who need to automate secondary or tertiary transformations and the loading of processed data. They prefer a Python-native stack and do not need any bioinformatics-specific utilities. To cater to their use cases, we created our own pipeline executor called PWL (short for “Polly Workflow Language”), written in Python.

At the same time, we often talk to customers who have already written and verified their data pipelines internally. Their choice of workflow language could be something else, like Snakemake or WDL, but we still want them to benefit from all the features and scalability that Polly Pipelines has to offer.

Very soon, it was evident that to support this diversity of skilled professionals we would need a multi-framework platform. Therefore, we have carefully crafted programmatic and graphical interfaces without letting any one framework dictate our choices. These interfaces allow us to bring in new pipeline executors based on customer demand.

Polly Pipelines’ API abstracts all executors behind a uniform interface

UX patterns are a neglected aspect of scalability. We firmly believe that providing a unified interface for running pipelines in any executor is the best way to serve the diverse and rapidly evolving field of bioinformatics.
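Conceptually, the uniform interface is a thin abstraction over executor-specific launch logic. The class and method names below are illustrative, not Polly’s actual SDK.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Common contract every executor (Nextflow, PWL, ...) must satisfy."""

    @abstractmethod
    def submit(self, pipeline_id: str, params: dict) -> str:
        """Launch a run and return a run ID."""

    @abstractmethod
    def status(self, run_id: str) -> str:
        """Report the run's current state (queued, running, succeeded, failed)."""


class NextflowExecutor(Executor):
    def submit(self, pipeline_id: str, params: dict) -> str:
        # Translate params into a `nextflow run` invocation here.
        return "nf-run-0001"  # placeholder run ID

    def status(self, run_id: str) -> str:
        return "running"  # placeholder status


# Callers (API, CLI, UI) only ever see the Executor interface, so adding a
# Snakemake or WDL executor would not change the user-facing experience.
```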

Horizontally Scaled Execution

Modern bioinformatics projects often deal with massive datasets, such as genomic sequences, proteomic data, and large-scale imaging data. Scaling out allows the distribution of these large datasets across multiple computational nodes, facilitating efficient processing and analysis. This scale-out can happen along any axis defined by the pipeline logic (most commonly “tasks” or “processes”). For example, transcriptomic datasets often come with multiple samples. The processing of each sample can be pushed to a separate compute node, significantly reducing the overall time required to complete the analysis. This is crucial for time-sensitive research.

Polly Pipelines automatically implements this scale-out pattern based on the definition of your workflow. Any tasks that can run in parallel will be run that way, without the pipeline developer having to specify anything other than how compute-intensive their tasks are. A child task only moves forward once all of its parent tasks have finished executing. The infrastructure thus works in a true MapReduce fashion, making the most effective use of available compute.
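The scatter-gather pattern itself is easy to express. The sketch below uses a local process pool purely for illustration – on Polly the fan-out happens across separate compute nodes – and process_sample stands in for whatever per-sample work the pipeline defines.

```python
from concurrent.futures import ProcessPoolExecutor


def process_sample(sample_id: str) -> dict:
    """Placeholder for the per-sample work (QC, alignment, quantification, ...)."""
    return {"sample": sample_id, "status": "done"}


def scatter_gather(sample_ids: list[str]) -> list[dict]:
    # Scatter: every sample is independent, so all of them can run in parallel.
    # Gather: the downstream (child) step only starts once every parent finishes,
    # which the implicit barrier at the end of pool.map models here.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(process_sample, sample_ids))


if __name__ == "__main__":
    results = scatter_gather([f"sample-{i}" for i in range(8)])
    print(f"{len(results)} samples processed")
```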

And since we source compute from large cloud providers, our reserve is potentially limitless. In practice, however, we do put certain restrictions on the number of tasks that we run in parallel. This is purely a cost-control mechanism, and we will increase these limits should we observe workloads reaching our maximum capacity.

The pipeline execution infrastructure scales relative to the demands of running jobs

To give a sense of the current scale of our infrastructure, we have estimated that Polly Pipelines can successfully process 5,000 GB-samples of transcriptomics data per week through our custom RNA-Seq pipeline, which uses kallisto for (pseudo-)alignment and quantification. In simple terms, this means we can process 5,000 samples weekly, where each sample is 1 GB in size.

Apart from raw compute power, we ensure that each executor has its own priority queues and execution clusters. Not only that, each stage of a pipeline gets its own execution cluster as well. Pipeline developers can expect each of these clusters to exhibit similar scalability semantics.

Bottomless Storage per Node

Bioinformatics pipelines have high storage capacity needs due to the nature of the data they handle and the computational processes involved. Each process may itself generate more data for the next process to use. Moreover, this I/O happens on the “local” storage of the nodes participating in pipeline execution. It is difficult to estimate how much storage a given pipeline run needs, and more difficult still to provision disks dynamically based on each run’s demands. As a result, jobs frequently failed with out-of-disk errors. We therefore had to come up with a new design for the storage attached to compute nodes.

All compute instances in the cloud come with block storage of some kind: either a disk directly attached to the instance or a volume connected over the local network. These volumes have a fixed capacity. If a process keeps writing data to the volume, it will eventually hit that capacity, and normally we would expect the process to fail at that point. However, if the process is running on Polly Pipelines' infrastructure, the user will never really find out that they ran out of capacity. Internally, as soon as the instance’s disk space is exhausted, we switch to a remote volume (using the NFS protocol). This remote volume can store an extreme amount of data, and our processes effectively never run out of disk space.

A tandem arrangement of local & NFS volumes maximizes storage capacity

The caveat here is that the NFS volume has a maximum throughput limit. If too many nodes are writing data concurrently to this volume, we may see slowdowns in pipeline runtime. To counter this effect, we simply provision more NFS volumes based on the load on our execution cluster. Tasks or processes from the same pipeline run, however, do share the same NFS volumes.
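A simplified way to picture the tandem arrangement: write to the local volume while it has headroom and spill over to the NFS mount once it does not. This is a conceptual sketch only – the actual switch happens transparently at the infrastructure level – and the mount paths and threshold are made up.

```python
import shutil
from pathlib import Path

LOCAL_SCRATCH = Path("/scratch")        # instance-attached volume (hypothetical mount)
NFS_SCRATCH = Path("/mnt/nfs-scratch")  # shared remote volume (hypothetical mount)
MIN_FREE_BYTES = 10 * 1024**3           # keep at least 10 GiB free locally


def choose_write_dir(expected_bytes: int) -> Path:
    """Write locally while there is headroom, otherwise spill over to NFS."""
    free = shutil.disk_usage(LOCAL_SCRATCH).free
    if free - expected_bytes > MIN_FREE_BYTES:
        return LOCAL_SCRATCH
    # Local capacity exhausted: the process keeps writing, just to a remote
    # volume, so it never fails with an out-of-disk error.
    return NFS_SCRATCH
```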

S3 for Intermediate Data

Data on disk is not the only product of a pipeline. Any non-trivial pipeline will have multiple tasks that execute in parallel as well as in succession, and each successive task needs the output of its parent task. So we need a data-exchange layer for storing this intermediate data. Finally, once a run completes, the generated output also needs to be saved durably.

For both of these needs we use Amazon S3. Intermediate data can be written by hundreds of tasks simultaneously, totalling terabytes in size. Most traditional storage devices will not be able to handle this kind of write throughput, and the same goes for reads. Amazon S3, being a distributed object store, has no problem under this kind of load; it is built precisely to handle large numbers of concurrent reads and writes. Moreover, it provides an eleven-nines durability guarantee.
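As an illustration of this data-exchange layer, each task can publish its outputs under a run-scoped prefix and downstream tasks read them back. The bucket name and key layout below are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "pipeline-intermediate-data"  # hypothetical bucket name


def publish(run_id: str, task_id: str, local_path: str, name: str) -> str:
    """Upload one task's output so that its child tasks can consume it."""
    key = f"runs/{run_id}/{task_id}/{name}"
    s3.upload_file(local_path, BUCKET, key)
    return f"s3://{BUCKET}/{key}"


def fetch(run_id: str, task_id: str, name: str, local_path: str) -> str:
    """Download a parent task's output before the child task starts."""
    key = f"runs/{run_id}/{task_id}/{name}"
    s3.download_file(BUCKET, key, local_path)
    return local_path
```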

Similarly, we also store the pipeline’s final output in S3. But based on the instructions of the pipeline, the developer is free to upload that data wherever they want and in whatever format they want.

So far we have never faced an issue with Amazon S3 in any of our pipeline runs, in terms of either concurrency or durability.

Network and Security in Polly Pipelines

One final aspect of Polly Pipelines worth highlighting is security through network isolation. Bioinformatics often involves dealing with sensitive data, and it is imperative to make sure that pipelines running on the cloud, especially in a multi-tenant system, operate with strict access controls in place. Therefore, we make sure that:

  • Pipeline runs remain unaware of other pipeline runs. They can never access the input, intermediate data, or output of other pipelines. Even logs generated by the pipelines are kept in dedicated log streams.
  • Once your data is imported into the execution system, it never leaves our private network. Of course, the pipeline developers can still choose to push it to other remote data stores should they wish to do so.
  • For production environments, all Elucidata personnel (except system administrators) are restricted from viewing or downloading any data processed or generated by the pipeline execution.
Polly Pipelines enforces strict access control and data isolation for each run
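A common way to enforce this kind of per-run isolation on AWS – shown here purely to illustrate the pattern, not as Polly’s actual policy – is to scope each run’s credentials to its own S3 prefix.

```python
import json


def run_scoped_policy(bucket: str, run_id: str) -> str:
    """Build an IAM-style policy that limits a run to its own S3 prefix."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/runs/{run_id}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(run_scoped_policy("pipeline-data", "run-1234"))
```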

We also understand that there are organizations that deal with PII data and have to follow stricter security policies. We recommend that they go with a single-tenant (enterprise) deployment of Polly Pipelines. This way, their data and their compute are completely owned and governed by the organization.

Future Work

There is still much work that we want to do to make sure that Polly is the best place to develop and run bioinformatics pipelines. A few ideas that we are currently entertaining are:

  • Enable a bring-your-own-pipeline journey – As mentioned previously, as of this writing, all pipelines we run are managed in a central repository by us. But bioinformaticians will often want to control their own code and SDLC for their pipelines. Letting them upload their pipeline code and register it through our programmatic interfaces (library or CLI) will help us achieve that outcome.
  • Support more executors – We plan to support more pipeline definition and orchestration frameworks, based on popular demand. The one we are currently ideating is Snakemake. Apart from Nextflow, it is probably the most widely used workflow management system in bioinformatics.
  • Improve local storage throughput – Because of the shared nature of our NFS volumes, they are prone to becoming points of slowdown for our pipelines. Worker nodes divide the overall throughput amongst themselves, and depending on the number of nodes, their read/write speed can suffer. To counter this, we want to introduce a further storage layer for our nodes such that I/O throughput remains independent of the number of concurrent workers. We have some ideas, but they will need a whole other post.

In this post we have shown how Polly Pipelines is a versatile, scalable, and developer-friendly platform for writing bioinformatics pipelines. Its rich feature set and attention to every stage of a pipeline’s lifecycle make it a good choice for bioinformaticians.
