Genomics Core: Accelerating genomics workflows in cloud

Marcel Caraciolo
genomics-healthcare-systems
15 min read · Jun 10, 2024

First, a little disclaimer!

In 2020, the laboratory where I worked was acquired by one of the largest and best-known hospital complexes in Brazil: Hospital Israelita Albert Einstein (HIAE). For the first two years, my goal was to guarantee the successful migration of all our bioinformatics workflows, laboratory systems, and staff to the hospital's internal processes and teams. At the end of 2021, I joined the bioinformatics team at HIAE, named VarsOmics.

The initiative grew out of a partnership between the hospital's innovation lab and bioinformatics specialist Murilo Cervato, from which Varstation was born: a platform for human genetic analysis that automates the complex, manual process of analyzing data generated by Next Generation Sequencing (NGS). As its activities expanded into other segments, Varsomics emerged, consolidating an evolution toward new products focused on genetic analysis and precision medicine.

At the end of 2022, I received an invitation from our manager, Murilo, to coordinate the bioinformatics, data science, and cloud teams. We faced huge challenges with several staff exits and had to rebuild the teams from the ground up (especially data science and cloud).

My next series of posts will focus on some of the projects and products I worked on with several teams and people at Varsomics from 2022 until the end of 2023. I will share the problems we faced and the solutions we built, without exposing any confidential or sensitive data. I hope you enjoy!

Introduction to the Problem

Genomic data processing typically relies on a wide set of specialized bioinformatics tools, such as sequence alignment algorithms, variant callers and statistical analysis methods. These tools are chained together as workflow pipelines, which can range from a couple of steps to many long toolchains executing in parallel.

Like any new startup, we started with Bash, Perl or Python scripts to orchestrate our pipelines. But as our pipelines became more complex, the requirements for maintainability, reproducibility and, especially, growth (more samples sequenced in parallel) kept rising, and the need for specialized orchestration tooling and workflow definition languages grew significantly.

Volume, pressure for fast growth and tight deadlines for the exams demanded automation and scalability.

Another challenge was to migrate our pipelines from on-premise servers to a cloud computing backend. Taking advantage of the flexibility of provisioning compute and storage resources on demand, in 2020 we scaled up our production workload to Amazon Web Services (AWS) to support our new pipeline demands, focusing on saving time, reducing costs and using the software-based pipeline accelerators for sequencing analysis available through Illumina's Dragen platform on AWS.

Genomics in the cloud: pros and cons

Going to the cloud is not always an obvious choice. Many questions regarding security, scalability, costs, processing turnaround and resource availability must be taken into account before migrating your pipelines to a public cloud. Here are some of the points that drove our decision to migrate.

PROS

  • Scalability: Greater scalability than on-premise platforms, with the ability to dynamically allocate and deallocate resources according to demand and availability. We came from an on-premise infrastructure with limited, fixed resources, where any new hardware request would go through a long approval process before becoming available to the team.
  • Flexibility: Cloud-based environments make it easy to install and update the latest versions of tools and libraries and to migrate to new instance families and operating systems. Another improvement is access to the many tools and libraries offered in the cloud as a service or as third-party marketplace software.
  • Security: Both cloud and on-premise platforms can provide granular control and protection over the data and the analysis processes. Nowadays it is more about the security governance rules and standards applied by your company. Does your company have access control standards? Does your cloud team follow best practices and standards for data security and privacy? How is data shared with other researchers and teams? Both environments have solutions to mitigate these issues, but they require knowledge, an infosec culture, and an investment of time and money.

CONS

  • Internet Dependency: Sequencers are still on-premise devices, so if you decide to process your data in the cloud, you must have a stable internet connection. Laboratory facilities with limited or unreliable internet access will struggle to run bioinformatics tasks in the cloud. Also look for providers with a direct connect solution: a dedicated, physical fiber interconnect to a cloud provider within a colocation facility. Well configured, it brings benefits in total operational cost, security, performance (low latency) and reliability.
  • Data Transfer: One significant concern in cloud computing is the transfer of raw genome data out of the cloud provider, both because of the costs associated with data transfer and because of the complexities of data processing. Given the vast amounts of genomic data generated, transferring this data out of the cloud can incur substantial expenses, especially in large-scale projects with clients and labs that have different infrastructures and cloud providers. You must carefully consider strategies to reduce or eliminate data transfer-out costs, such as offering genomics data lake solutions, multi-cloud federation, or making this cost part of your business plan with your clients.
  • Data Compliance and Regulations: Depending on your regional data compliance regulations and industry standards (e.g. storing patient and genomic data in compliance with healthcare regulations), you may need extra effort around how you store the data (frequent access, infrequent access, deep archive, etc.), how long it must be kept, and data anonymization (keeping your genomic data findable but not identifiable). Your cloud provider offers several services and tools to address these points quickly, but they come with an investment in an infra-cloud team with experience in these skills.
  • Costs: Cost is always the first word you hear when discussing a move to the cloud, and it is a valid concern. Costs can escalate rapidly without robust financial operations (FinOps) and controls. Without proper oversight and management, cloud expenses can jeopardize your budget allocations, especially given the computational intensity and sheer volume of genomic data. You must be prepared and set up your cloud environment correctly to avoid unpleasant surprises.
  • Provider dependency: The choice of a specific cloud provider can lead to vendor lock-in, where research workflows become fully integrated with the services and resources of that provider. Transitioning to another cloud provider or reverting to on-premises infrastructure can be challenging due to differences in architecture, tools, and services.

You might be inclined to stay away from the cloud, but depending on your budget, your infra-cloud team's skillset and the volume of sequencing you will be running in a short period of time, consider cloud platforms as a first choice. The headaches of expanding your datacenter and purchasing servers, storage and backup services can consume a lot of time on IT infrastructure instead of the business value: sequencing, genomics processing and building/validating bioinformatics pipelines for new sequencing tests.

A simplistic Eisenhower-style risks and challenges matrix where we categorized the pros and cons based on benefits and challenges.

For us at VarsOmics, migrating to the cloud was a strategic move, especially because our analysis platforms were already running in the cloud. With the volume we were receiving from public healthcare genomics sequencing projects and big labs, scalability and flexibility carried the most weight in our decision matrix.

Presenting Cromwell

One of the most important components in bioinformatics workflows is the orchestration tool responsible for workflow execution. Its job is to orchestrate the command-line and containerized tools, reading and interpreting the scripting-language files written by the bioinformaticians. We invested a long time in deciding which orchestration tool we would use at VarsOmics. We tested Snakemake, Nextflow, Cromwell and even AWS Step Functions. After many hours of discussions and POCs, we chose Cromwell.

The Cromwell server is responsible for parsing and executing the WDL scripts created by the bioinformaticians, delegating the work to an underlying infrastructure (cloud or local), with the inputs and outputs persisted as results after the pipeline completes.

Cromwell, developed by the Broad Institute, is specifically designed to address these orchestration tasks. It serves as a workflow execution engine, orchestrating both command-line operations and containerized tools. Crucially, it powers the GATK Best Practices genome analysis pipeline. Workflows in Cromwell are defined using the Workflow Definition Language (WDL, pronounced “widdle”), a versatile meta-scripting language. This enables researchers to concentrate on the essential components of their workflows, such as the tools for each step along with their respective inputs and outputs, without having to worry about the underlying infrastructure details. Furthermore, compared to the other orchestration tools, it has a self-hosted API server and a database to store workflow execution metadata, enabling integration with our services and components.

The first wave

As we started to plan and develop our first genomics pipeline in the cloud, we faced several questions that required our attention:

  • How do we host Cromwell with an AWS backend?
  • How do we provide the computing requirements and resources to execute the jobs submitted?
  • How do we deal with task failures and fault tolerance?
  • How do we persist the inputs and outputs required to process each task?
  • How do we guarantee that our data and pipelines are unreachable from the outside, and that the data does not travel over the public internet?
  • How do we get notified when Cromwell workflows fail or execute successfully?

How do we host Cromwell with an AWS backend?

The good news is that Cromwell has been officially supported in the AWS Cloud since 2018. The Cromwell server with the AWS backend can submit tasks as jobs to AWS Batch queues and read objects in S3 without downloading them. We use a Cromwell-AWS fork with some AWS-specific optimizations, such as disabling call caching (so a job recomputes from the start instead of copying previous S3 outputs that could be incorrect), limiting concurrent workflows and customizing the AWS Batch retry attempts parameter in case of task failures.
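To give an idea of how our services talk to Cromwell, below is a minimal Python sketch of a workflow submission through Cromwell's REST API. The /api/workflows/v1 path is Cromwell's standard submission endpoint; the server URL, WDL file and input names are hypothetical placeholders, not our production values.

```python
import json
import requests  # pip install requests

CROMWELL_URL = "http://cromwell.internal:8000"  # hypothetical internal DNS name


def submit_workflow(wdl_path: str, inputs: dict) -> str:
    """Submit a WDL workflow to the Cromwell REST API and return its workflow ID."""
    with open(wdl_path, "rb") as wdl_file:
        response = requests.post(
            f"{CROMWELL_URL}/api/workflows/v1",
            files={
                "workflowSource": wdl_file,
                "workflowInputs": json.dumps(inputs),
            },
            timeout=60,
        )
    response.raise_for_status()
    return response.json()["id"]


if __name__ == "__main__":
    # Hypothetical pipeline and input names, for illustration only.
    workflow_id = submit_workflow(
        "alignment.wdl",
        {"alignment.sample_fastq": "s3://my-bucket/sample_R1.fastq.gz"},
    )
    print("Submitted workflow:", workflow_id)
```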

How do we provide the computing requirements and resources to execute the jobs submitted?

AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, CPU- or memory-optimized instances). Provisioning is based on the volume and specific resource requirements of the batch jobs submitted. This means that each step of a genomics workflow gets the most suitable instance type to run on.

In the first versions of our architecture, we created EC2 launch templates, job queues and compute environments based on the pipelines' needs. For instance, we can have a priority job queue pointing to an on-demand compute environment with the m5 instance family, or a default job queue pointing to a compute environment with Spot instances. We can create other job queues and compute environments as needed. We even have job queues to support the Dragen software, which demands high-scale processing; in this scenario we work with the f1 instance family, which includes FPGA hardware.
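As an illustration of this setup, here is a hedged boto3 sketch of a Spot compute environment and a job queue. All names, subnet and security group IDs, ARNs and instance families are placeholders; the real environments are tuned per pipeline.

```python
import boto3

batch = boto3.client("batch")

# Spot-backed managed compute environment (all IDs/ARNs below are hypothetical).
compute_env = batch.create_compute_environment(
    computeEnvironmentName="genomics-spot-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 1024,
        "instanceTypes": ["m5", "r5"],
        "subnets": ["subnet-0abc1234"],
        "securityGroupIds": ["sg-0abc1234"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        # Launch template that bootstraps EBS autoscaling on the instances.
        "launchTemplate": {"launchTemplateName": "genomics-ebs-autoscale"},
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)

# Default queue pointing at the Spot compute environment.
batch.create_job_queue(
    jobQueueName="genomics-default-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": compute_env["computeEnvironmentArn"]}
    ],
)
```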

For task management, we use the Elastic Container Service (ECS), an AWS service for managing and orchestrating containers in the cloud. The container instantiated from each image is defined in an ECS task definition, along with its runtime requirements, environment variables, IAM permissions, CloudWatch log group name and the number of desired tasks.

Cromwell also supports Docker container images containing the tools required by the workflow (e.g. BWA, GATK, FastQC). We store all these images in private Amazon Elastic Container Registry (Amazon ECR) repositories, secured for access by Genomics Core workflow runs.

How do we deal with task failures and fault tolerance?

Cromwell works with the AWS Batch backend, which provides a method for tackling transient job failures: if a task fails due to a timeout, it can re-run the failed task without having to re-run the entire workflow.
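On the AWS Batch side, retries can also be expressed directly in the job definition. The sketch below is illustrative only (the image URI, resource values and retry conditions are placeholders), showing how a retry policy for transient failures such as Spot reclamations could look.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical job definition for an alignment step.
batch.register_job_definition(
    jobDefinitionName="bwa-mem-alignment",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/bwa:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32000"},
        ],
        "command": ["bwa", "mem", "Ref::reference", "Ref::reads"],
    },
    # Retry transient failures (e.g. Spot reclamations reported as "Host EC2...")
    # up to 3 times before marking the job as FAILED.
    retryStrategy={
        "attempts": 3,
        "evaluateOnExit": [
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            {"onReason": "*", "action": "EXIT"},
        ],
    },
)
```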

How do we persist the inputs and outputs required to process each task?

We use Amazon S3 to store the static output data produced by Cromwell, the WDL workflows and the workloads. The data persists even after all processing nodes have terminated.

Another challenge is how to deal with jobs that localize very large files from S3 or produce large amounts of intermediate files, especially when multiple jobs run on the same instance, since all containers share the same disk. The solution is Amazon EBS Autoscale, which attaches EBS volumes to an EC2 instance automatically in response to disk usage. EBS Autoscale constantly monitors the percentage of disk used; once it reaches a defined threshold, new volumes are attached. This continues until the job finishes or the maximum limits are reached.

How do we guarantee that our data and pipelines are unreachable from outside?

Our architecture uses an Amazon Virtual Private Cloud (VPC), an isolated network in an AWS environment, as a security best practice. We work with private subnets, which allow the solution to connect to the internet (or anywhere outside the VPC) while remaining unreachable from the outside. To avoid unpleasant costs, we use VPC endpoints, so traffic between the VPC and AWS services does not leave the Amazon network.
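A simplified boto3 sketch of the endpoint setup is shown below. The VPC, subnet, route table and security group IDs are placeholders, and the exact list of interface endpoints depends on which services your workflows call.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical network IDs for illustration only.
VPC_ID = "vpc-0abc1234"
ROUTE_TABLE_IDS = ["rtb-0abc1234"]

# Gateway endpoint: S3 traffic from the private subnets stays on the AWS
# network and avoids NAT data-processing charges.
ec2.create_vpc_endpoint(
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=ROUTE_TABLE_IDS,
)

# Interface endpoints for services the workflow engine and containers talk to
# (ECR API, ECR Docker registry, CloudWatch Logs).
for service in ("ecr.api", "ecr.dkr", "logs"):
    ec2.create_vpc_endpoint(
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        VpcEndpointType="Interface",
        SubnetIds=["subnet-0abc1234"],
        SecurityGroupIds=["sg-0abc1234"],
        PrivateDnsEnabled=True,
    )
```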

How do we get notified when Cromwell workflows fail or execute successfully?

Our architecture implements notifications to integrate its components and make it easier to report Cromwell workflow status. The key pieces are Lambda functions, responsible for extracting the relevant fields and assembling user-friendly messages, and Amazon SNS, an AWS service that provides real-time notifications between application components using a publish/subscribe model.

One of the destinations we use is Slack channels, which we created for our support and bioinformatics teams to monitor processing and failures. Cromwell notifications carry metadata about the sample being processed, such as its name, submission ID, workflow name, error messages and the URL to the sample in the application's administration website.
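A minimal sketch of such a notification Lambda is shown below. The message field names and the webhook environment variable are illustrative assumptions; the real function builds its message from the metadata extracted upstream.

```python
import json
import os
import urllib.request

# Slack incoming-webhook URL, provided as an environment variable (hypothetical name).
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]


def handler(event, context):
    """Triggered by SNS: turn a Cromwell status notification into a Slack message."""
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        # Field names below are illustrative; they depend on what the
        # upstream extraction step pulls from Cromwell's metadata.
        text = (
            f"Workflow *{message.get('workflowName')}* "
            f"({message.get('id')}) finished with status "
            f"`{message.get('status')}` for sample {message.get('sample')}."
        )
        payload = json.dumps({"text": text}).encode("utf-8")
        request = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)
    return {"statusCode": 200}
```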

Screenshot of our notifications pushed to one of our Slack channels. Using AWS SNS + Slack incoming webhooks, we can send Cromwell status and result logs.

The final architecture

The following image is a draft of the first Genomics Core architecture. We ran with this architecture for some months, until we faced rapidly increasing volume demands, such as the Brazilian rare genomes sequencing project Genomas Raros, which required around 1,000 genomes to be processed in the cloud with a 24-hour turnaround time.

The second wave

As more tasks were submitted, Cromwell started to suffer from timeout errors and error codes due to high memory usage, issues reported at the official Cromwell repository that were showing up more and more frequently for us. Our workaround was a Network Load Balancer (NLB) + AWS Fargate. The NLB is attached to the ECS cluster running the Cromwell service and provides a DNS name that can be used to send requests to Cromwell's API and retrieve metadata. It defines a health check that runs at specified intervals on a defined path (for Cromwell, engine/v1/status) and expects success return codes. If Cromwell returns an error code, the health check marks the task as unhealthy and another one is deployed to satisfy the desired number of healthy tasks.
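The sketch below shows, with boto3, how such a target group with an HTTP health check on Cromwell's engine/v1/status endpoint could be created. The VPC ID, port and thresholds are placeholder values, not our production configuration.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Target group for the Cromwell Fargate tasks behind the NLB
# (VPC ID and port are hypothetical).
target_group = elbv2.create_target_group(
    Name="cromwell-tg",
    Protocol="TCP",                        # NLB listener protocol
    Port=8000,
    VpcId="vpc-0abc1234",
    TargetType="ip",                       # required for Fargate tasks
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/engine/v1/status",   # Cromwell's health endpoint
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=3,
)
print(target_group["TargetGroups"][0]["TargetGroupArn"])
```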

Genomics Core uses the Fargate deployment mode for Cromwell, where we only take care of the containers' configuration, leaving the provisioning, management and configuration of the instances that run them to AWS. The goal is to guarantee that if the Cromwell process has any issue, terminating unexpectedly or not responding, Fargate comes into action and restarts the container.

We stress-tested our solution with research projects of up to 1,000 samples processed successfully in the cloud, and below we present some evidence of this huge achievement for us!

A screenshot of AWS Batch during one of the genome processing projects at the time. More than 100 jobs terminated successfully and some failed (due to wrong parameters or AWS Spot instance interruptions).

Costs are also a ghost that haunts us, especially with large-scale volume demands. By using Spot instances with AWS Batch we could save up to 40–50% on our compute costs compared to on-demand pricing. This of course comes with some caveats worth pointing out:

  • The runtime for your jobs is typically 60 minutes or less.
  • You can tolerate potential interruptions and job rescheduling as a part of your workload.
  • Long running jobs can be restarted from a checkpoint if interrupted.

The final architecture is presented in the image below.

Our final architecture for Genomics Core. It includes the Network Load Balancer (NLB) and AWS Fargate to manage our ECS cluster of Cromwell instances. AWS Batch comes with job queues to support Spot EC2 instances. We host our private bioinformatics tool images in Amazon ECR.

The calm sea

Gather Metrics Workloads

With our architecture validated and in production for some time, we invested in metrics ingestion from our solution. Several key metrics could be collected: Cromwell job submissions, EC2 metadata from the instances and costs from Cost Explorer. Bringing all these metrics together let us make them available to our data analytics tools.

To extract batch job metrics such as time to start, execution time and others, we use the CloudWatch log stream containing the log of the job processing. This is triggered by specific events on the default event bus of AWS EventBridge: when jobs terminate with status SUCCEEDED or FAILED, Lambda functions collect and send this raw data to our S3 landing buckets.

Draft of our Gather Metrics workload for AWS Batch job metrics ingestion. We use EventBridge to trigger specific jobs on certain status changes and AWS CloudWatch to get the log streams from the job processing.
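A hedged sketch of the EventBridge rule behind this flow is shown below. The rule name and the Lambda ARN are hypothetical, and the target Lambda still needs a resource-based permission allowing EventBridge to invoke it.

```python
import json

import boto3

events = boto3.client("events")

# Hypothetical ARN of the Lambda that parses the job event, reads the CloudWatch
# log stream details and writes the raw record to the S3 landing bucket.
METRICS_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:batch-metrics-collector"

# Match AWS Batch job state changes that reach a terminal status.
events.put_rule(
    Name="batch-terminal-jobs",
    EventPattern=json.dumps({
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
        "detail": {"status": ["SUCCEEDED", "FAILED"]},
    }),
    State="ENABLED",
)

# Route matching events to the collector Lambda (invoke permission via
# lambda.add_permission is omitted here for brevity).
events.put_targets(
    Rule="batch-terminal-jobs",
    Targets=[{"Id": "metrics-collector", "Arn": METRICS_LAMBDA_ARN}],
)
```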

The same logic is applied to extract EC2 metadata such as instance ID, instance type, launch and termination times and others. We collect all this data and store it as JSON files in our S3 landing buckets.

Draft of our Gather Metrics workload for AWS EC2 metrics ingestion. We use EventBridge to trigger specific jobs on certain status changes and query with API calls like "describeInstances" to get the instance info from the job processing.
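For illustration, a small sketch of this kind of collector is shown below, assuming a hypothetical landing bucket name; it reads the instance attributes with DescribeInstances and lands them as JSON in S3.

```python
import json
from datetime import datetime

import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")
LANDING_BUCKET = "genomics-metrics-landing"  # hypothetical bucket name


def collect_instance_metadata(instance_id: str) -> None:
    """Fetch instance attributes via DescribeInstances and land them as JSON in S3."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    record = {
        "instance_id": instance["InstanceId"],
        "instance_type": instance["InstanceType"],
        "launch_time": instance["LaunchTime"].isoformat(),
        "collected_at": datetime.utcnow().isoformat(),
    }
    s3.put_object(
        Bucket=LANDING_BUCKET,
        Key=f"ec2/{instance_id}.json",
        Body=json.dumps(record).encode("utf-8"),
    )
```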

In addition, Cromwell submission metadata is also available through the Cromwell API. Since the server sits in a private subnet, users have no direct access to the Cromwell API to get the metadata of a submission. We developed a Lambda function to access Cromwell, extract the metadata and save it as a JSON file in S3. It is triggered by an SNS topic that reports terminated submissions, whether Succeeded, Failed or Aborted. Using S3 allows local access to the metadata without querying Cromwell's database directly.

Draft of our Gather Metrics workload for Cromwell metadata ingestion. An SNS topic sends notifications when job statuses change and triggers our Lambda function, which gets the metadata for the specified job and stores it as JSON in our S3 metrics landing bucket.
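A minimal sketch of this Lambda is shown below. The Cromwell URL, bucket name and SNS message field names are assumptions for illustration; the metadata path (/api/workflows/v1/{id}/metadata) is Cromwell's documented endpoint.

```python
import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")
# Hypothetical configuration values.
CROMWELL_URL = os.environ.get("CROMWELL_URL", "http://cromwell.internal:8000")
METADATA_BUCKET = os.environ.get("METADATA_BUCKET", "genomics-metrics-landing")


def handler(event, context):
    """Triggered by the SNS topic reporting terminated submissions:
    fetch the submission metadata from Cromwell and archive it in S3."""
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        workflow_id = message["id"]  # illustrative field name
        url = f"{CROMWELL_URL}/api/workflows/v1/{workflow_id}/metadata"
        with urllib.request.urlopen(url) as response:
            metadata = response.read()
        s3.put_object(
            Bucket=METADATA_BUCKET,
            Key=f"cromwell/{workflow_id}.json",
            Body=metadata,
        )
```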

Our metrics landing bucket combines data from EC2, ECS, AWS Batch and Cromwell. We have an ETL pipeline that parses, organizes and prepares the data for our data analytics tools. Below is a screenshot of one of our dashboards in Amazon QuickSight, where our business, bioinformatics and infra-cloud teams can inspect costs, workflow job summary statistics and much more at a detailed level. A view where the business team (remember, we are a laboratory, not an IT company) can track sample processing cost is critical, since the exam cost (the final price for the patient) cannot be inflated by cloud processing costs.

Simple example of one of our dashboards using the metrics data from our Genomics Core workloads. We bring together EC2, Cromwell and AWS Batch metrics so our business team can track processing costs at the granularity of pipeline or sample.

AWS CDK

Genomics Core rapidly became our standard genomics workflow infrastructure, and more teams started to use it. Our biodevops team (a special name for a team with bioinformatics + DevOps skills) decided to write all the cloud infrastructure as code (IaC) with the AWS Cloud Development Kit. With the AWS CDK, you can put your infrastructure, application code and configuration all in one place, ensuring that you have a complete, cloud-deployable system. That means faster deployments and the ability to employ software engineering best practices such as code reviews, unit tests and source control to make your infrastructure more robust.

With the AWS CDK we wrote all our Genomics Core pipeline infrastructure as code, in order to facilitate deployment and future maintenance.
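To give a flavor of what that looks like, below is a minimal CDK (Python) sketch of a stack with the kind of shared resources described in this post: a VPC with an S3 gateway endpoint, a private ECR repository and a landing bucket. Construct names and properties are illustrative, not our actual stack.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecr as ecr
from aws_cdk import aws_s3 as s3
from constructs import Construct


class GenomicsCoreStack(Stack):
    """Sketch of shared infrastructure: private networking, image registry
    and a landing bucket for workflow outputs and metrics."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Private subnets plus an S3 gateway endpoint keep traffic off the internet.
        vpc = ec2.Vpc(self, "GenomicsVpc", max_azs=2)
        vpc.add_gateway_endpoint(
            "S3Endpoint", service=ec2.GatewayVpcEndpointAwsService.S3
        )

        # Private registry for the bioinformatics tool images (BWA, GATK, etc.).
        ecr.Repository(self, "ToolsRepository", repository_name="genomics-tools")

        # Landing bucket for pipeline outputs and gathered metrics.
        s3.Bucket(self, "LandingBucket", versioned=True)


app = App()
GenomicsCoreStack(app, "GenomicsCoreStack")
app.synth()
```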

Conclusion

The increasing volume of genetic sequencing tests has posed new challenges for precision medicine laboratories. Our mission was to ensure that accurate results are delivered to the physician and patient in the shortest time possible. To achieve that, we developed a cloud solution on AWS to automatically run bioinformatics pipelines (genetic variant identification and annotation of biological data) in a scalable and cost-effective manner, while ensuring result reproducibility and computational efficiency. Along this post, we showed how we built the Genomics Core, which utilizes managed AWS services such as Fargate, Batch and ECR, with Spot instances to reduce processing costs, and open-source tools like WDL and Cromwell. Our solution is embedded in our Varstation® and Varsmetagen® platforms, and is also used in research projects such as Rare Genomes and in bioinformatics consulting.

We recently presented this project at the AWS Public Sector Symposium Brasília 2024, together with AWS. A huge recognition for us!

Karla Militão (our TAM) and me presenting the Genomics Core solution from Genesis/Varsomics at the AWS Public Sector Symposium Brasília 2024.

Acknowledgment

My first credits go to the two creators of the Genomics Core: Wellinton Souza and Gabriel Moraes, amazing and intelligent guys whom I had the pleasure to mentor. To the bioinformatics coordinator Rodrigo Reis, who placed the first stone when he fought tooth and nail for the Cromwell solution against all the other orchestrator candidates. A special thanks to Murilo Cervato, our manager at the time, who believed in us and in our potential to create these outstanding and challenging products. Doing science in Brazil with competence is not easy! To our bioinformatics and DevOps teams, for their support, ideas for improvement and many hours of testing and debugging. Finally, to the AWS staff: Ladeira, Karla Militão, Leo Bernardes, Araly and many others who helped us on the journey to the cloud. This project started as an AWS Innovation Sandbox project, with credits to develop and test our POCs.
