Scale Out Your Genomics Analysis with AWS Batch

Jack
7 min read · Feb 19, 2020


AWS Batch is a fully managed batch computing platform. It enables scientists, developers, and engineers to simply and efficiently scale to hundreds of thousands of batch computing jobs on AWS. There’s also a bit of magic here in that AWS Batch handles all of the infrastructure required to run your jobs, meaning there are no servers or clusters to manage and no scheduling software to maintain.

Sounds great, right? Well, it is! Before we dive in further however, let’s quickly remind ourselves of the context of the problem batch computing is trying to solve.

Image Source: AWS

What is batch computing?

Formally speaking, batch computing is the “execution of a series of programs (“jobs”) on one or more computers without manual intervention” (AWS Batch FAQs, link).

When executing a batch job, you’ll typically define input parameters through scripts, command-line arguments, config or “control” files, or a proprietary job control language. This input is then passed to a “scheduler,” a program that oversees the availability of compute resources in a cluster. The scheduler is responsible for assigning jobs to nodes and managing compute resources. Jobs execute on compute nodes but rely on some sort of shared datastore from which they read raw data and to which they write their results.

Jobs may be individual components (or “steps”) of a larger process, and as such, a given batch job may depend on the completion of preceding jobs or on the availability of certain inputs. This makes the sequencing and scheduling of multiple jobs hugely important as your workload scales in complexity.

Batch computing, therefore, provides exciting benefits to organizations by increasing compute efficiency. Compute clusters are typically built from homogeneous commodity hardware that is managed automatically, reducing the manual intervention and supervision required to maintain the infrastructure. The scheduler can shift jobs to periods when more compute capacity is available, avoid idling compute resources, and prioritize jobs to align compute resources with business goals.

Enter, AWS Batch. Image Source: AWS

Why AWS Batch?

AWS Batch takes care of job execution and compute resource management for you, greatly reducing the difficulty of management and enabling you to focus on what matters: analyzing your data. Batch is optimized for a variety of use cases and workloads that take advantage of parallel processing — genomics analysis, financial risk models, deep learning, to name a few. Specifically in this article, we’re looking at some benefits to biotech firms leveraging AWS Batch for their genomics workloads.

Not only does AWS take care of the infrastructure deployments, it handles the job scheduling too. There are no servers to rack or manage, and iteration in a development environment is seamless. Even tuning the compute environment, something that would be extremely difficult (or impossible) in a traditional static HPC infrastructure, is as simple as adjusting your Compute Environment in AWS.
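As a rough illustration of how lightweight that tuning is, here’s a minimal sketch using the boto3 Batch API; the Compute Environment name and vCPU limits are placeholders for illustration, not values from this article.

```python
import boto3

batch = boto3.client("batch")

# Raise the capacity ceiling on an existing managed Compute Environment
# so more jobs can run in parallel. "genomics-ce" and the vCPU numbers
# below are illustrative placeholders.
batch.update_compute_environment(
    computeEnvironment="genomics-ce",
    computeResources={
        "minvCpus": 0,     # scale down to nothing when the queue is empty
        "maxvCpus": 1024,  # allow a larger fleet during heavy workloads
    },
)
```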

Basically, all you need to bring to AWS Batch is a container and a big pile of data; AWS can take it from there. Submitting a job to the queue and watching 100 EC2 instances flare up into existence can be a pretty magical (and expensive, we’ll get to that) moment!

AWS Batch and Genomics Secondary Analysis Workflows

AWS Batch is a natural fit for a number of workloads involving highly parallelized computation, including the secondary analysis workflows common in genomics research. In fact, AWS did a large write-up a few years ago to highlight this specific use case, which walks through, in truly epic detail, how to launch your own high-throughput genomics workflow on AWS Batch. This series, broken out into four parts, provides an excellent high-level (and subsequently very deep) overview of AWS Batch, the use case, and all of its moving parts. Plus, there’s a lab component which enables you to create your own pipeline. Neat!

Image Source: AWS

In short, “a genomics pipeline is similar to a series of Extract Transform and Load (ETL) steps that convert raw files from a DNA sequencer to a list of variants” for a person or a series of people (Building High-Throughput Genomics Batch Workflows on AWS, link). Most commonly, secondary analysis workflows take the raw data generated by a sequencing device and process it in a multi-step workflow or pipeline to identify the variation in a sample compared to a standard reference genome. These large, multi-step jobs can be difficult to manage and maintain. That’s where AWS Batch comes in.

AWS Batch needs only a container and a few simple configuration items to get cooking. Once you’ve published your container, created a Compute Environment (only a few clicks), and a Job Queue (even fewer clicks), you can submit a job and watch the magic happen. There’s no need to manage a scheduler, a compute fleet, or even the sizing of the resources. And because secondary analysis jobs are often less time sensitive, you can save money with smaller instance sizes or by leveraging Spot Instances (of course, you can tune the Compute Environment for higher performance if needed).
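To make that concrete, once the queue and job definition exist, a submission is a single API call. The sketch below uses the standard boto3 Batch API, but the queue name, job definition name, and command-line arguments are hypothetical placeholders, not part of the walkthrough above.

```python
import boto3

batch = boto3.client("batch")

# Submit one alignment job to an existing Job Queue. The names and the
# container command are illustrative placeholders.
response = batch.submit_job(
    jobName="align-sample-001",
    jobQueue="genomics-queue",
    jobDefinition="bwa-mem-align",  # points at your published container
    containerOverrides={
        "command": ["bwa", "mem", "ref.fa", "sample_001_R1.fq", "sample_001_R2.fq"],
    },
)
print("Submitted job:", response["jobId"])
```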

AWS Batch allows as much or as little control over your infrastructure as you’d like.

With a little bit of effort, you can build your own custom genomics workflow in AWS Batch with any software packages you like. The only hoop to jump through is that you must be able to containerize them (note: if your workflow cannot be containerized or requires a more “bare metal” style environment, AWS ParallelCluster may be a better fit for you). Iterating in a sandbox environment becomes quick and easy: when you test a new version of your container, everything else about your infrastructure remains the same. This approach enables development at a pace that simply isn’t possible with a traditional HPC cluster in a bare metal environment. And when you’re ready to move to production, it’s as simple as updating a Job Definition in the Job Queue.
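As a sketch of what “updating a Job Definition” might look like, registering a job definition creates a new revision, and submissions that reference it by name pick up the latest active revision. The ECR image URI, resource sizes, and script name below are assumptions for illustration only.

```python
import boto3

batch = boto3.client("batch")

# Register a new revision of a job definition that points at the new
# container version. All identifiers below are placeholders.
batch.register_job_definition(
    jobDefinitionName="variant-caller",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/variant-caller:v2",
        "vcpus": 8,
        "memory": 32000,  # MiB
        "command": ["call_variants.sh", "Ref::sample_id"],  # hypothetical entrypoint
    },
)
```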

Leveraging AWS Batch for your secondary analysis genomics workflows is a no-brainer; it’s absolutely something everyone should be looking at.

What AWS Batch Isn’t

Now that we’ve run through what AWS Batch is and how it can help, it’s important to remember what it isn’t.

It isn’t a thick OS

A key prerequisite for using the platform is a containerized application. If you’re not containerized yet, avoid the temptation of simply wrapping your current tools in one monolithic container. Take the opportunity to holistically review your workflow, and break your process apart into steps. These steps will become the basis for your containers, and subsequently the steps in your AWS Batch pipeline. (Worth noting, take a peek at AWS Step Functions while you’re at it.)
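Once the process is broken into per-step containers, AWS Batch can chain them natively with job dependencies, as in the minimal sketch below; the job definitions and queue name are hypothetical placeholders.

```python
import boto3

batch = boto3.client("batch")

# Each pipeline step lives in its own container / job definition.
# A later step declares a dependency on the jobId of the step before it.
align = batch.submit_job(
    jobName="step-1-align",
    jobQueue="genomics-queue",
    jobDefinition="bwa-mem-align",
)

call_variants = batch.submit_job(
    jobName="step-2-call-variants",
    jobQueue="genomics-queue",
    jobDefinition="variant-caller",
    dependsOn=[{"jobId": align["jobId"]}],  # runs only after alignment succeeds
)
```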

It isn’t persistent

AWS Batch is ephemeral by its nature. All of your containers must exit when their tasks complete, at which point the underlying EC2 instance will be terminated. This is the magic of AWS Batch, but means you must write your output somewhere else. Any persistent data, analysis, ETL output, etc., must be written elsewhere or be lost to the great hard drive in the sky. S3 is a great option for outputting data, and opens up other possibilities for further analysis (see AWS Glue, Amazon Athena, and Redshift Spectrum).
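In practice, the last thing a container often does before exiting is push its results to S3, along the lines of this minimal sketch (the bucket name and object key are placeholders).

```python
import boto3

s3 = boto3.client("s3")

# Persist the job's output before the container exits and the underlying
# instance is reclaimed. Bucket and key are illustrative placeholders.
s3.upload_file(
    Filename="/scratch/sample_001.vcf.gz",
    Bucket="my-genomics-results",
    Key="runs/2020-02-19/sample_001.vcf.gz",
)
```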

It isn’t a cost optimization silver bullet

Finally, AWS Batch is a great way to reduce operational costs and drive higher compute efficiency, but it absolutely is not free. In fact, depending on your job configuration, you might find a very surprising bill at the end of the month! First and foremost, ensure your container exits when its job is complete (otherwise the EC2 instance will never terminate). Next, run a few test jobs and begin to tune for optimization.

Perhaps your job is very memory-intensive and an R-class (memory-optimized) instance offers a better CPU/memory combination. Maybe a GPU instance will help push the analysis through more quickly, though GPU instances cost quite a bit more to run. AWS Batch is a tool to scale out your analysis and empower your team, which may not necessarily lead to reduced costs.
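One place to act on that tuning is the Compute Environment itself, where you can restrict Batch to Spot capacity and to instance families that fit your job’s profile. The sketch below uses the boto3 API, but every name, subnet, and IAM role is a placeholder, not a working value.

```python
import boto3

batch = boto3.client("batch")

# A managed, Spot-backed Compute Environment limited to memory-optimized
# instance types. All identifiers below are illustrative placeholders.
batch.create_compute_environment(
    computeEnvironmentName="genomics-spot-ce",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "bidPercentage": 60,  # pay at most 60% of the On-Demand price
        "minvCpus": 0,
        "maxvCpus": 512,
        "instanceTypes": ["r5.2xlarge", "r5.4xlarge"],
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-bbbb2222"],
        "instanceRole": "ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```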

Conclusions

AWS Batch is a fantastic platform to scale your batch computing workload with minimal operational overhead and strain. If you’re already containerized, AWS Batch may be a great fit to take your parallel processing to the next level. If you’re not containerized, this might be just the nudge you’re looking for to embrace a more modernized approach to batch computing.



Jack

Architecting on AWS with @privoit. Opinions and comments here are my own.