DNA Sequence Alignment as a Cloud Service

TL;DR: DNA sequence alignment is well suited to the containers/microservices deployment model. Use my GitHub code and Docker image to deploy an alignment service on Google Cloud (an illustrated tutorial follows) that scales automatically with the number of incoming DNA sequencer reads.

A typical analysis workflow for data coming from a DNA sequencer begins with Sequence Alignment. Most DNA sequencers emit data in batches, while the immediate downstream sequence alignment step can stream data.

DNA Sequencers: Type and Velocity of Data Output

Over the course of steady-state sequencing facility operations, DNA sequencing machines (e.g. those from Illumina) produce batches of data records in FastQ format. Each run lasts a few hours or days and then emits gigabytes to terabytes of data.

Data processing engineers will recognize this pattern of regular data pulses, and have specific techniques for handling “spiky” volumes of data over time. For engineers who are tasked with processing these data in a “resequencing” workflow, the first step in processing the sequencer reads is alignment.

Sequence alignment operations are stateless. Each operation takes a single FastQ record as input, compares it to a reference genome sequence to identify zero or more matches, and emits the coordinates of each match on the reference sequence. The operation is typically CPU-bound, with modest disk and memory requirements that grow with the size of the reference genome (on the order of a few GB for the human genome). An operation with these performance characteristics is a good candidate for an auto-scaling service.
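To make the "one record in, zero or more matches out" shape concrete, here is a minimal Python sketch. It is a toy: it uses exact substring search, whereas a real aligner such as BWA uses an index and tolerates mismatches. The names FastqRecord, parse_fastq, and align are made up for illustration.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class FastqRecord:
    name: str      # read identifier (header line without the leading '@')
    sequence: str  # base calls
    quality: str   # per-base quality string


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per 4-line FastQ entry: @name, sequence, '+', quality."""
    non_blank = (ln.strip() for ln in lines if ln.strip())
    while True:
        chunk = list(islice(non_blank, 4))
        if len(chunk) < 4:
            return  # end of input (or a truncated trailing record)
        header, sequence, _separator, quality = chunk
        yield FastqRecord(header[1:], sequence, quality)


def align(record: FastqRecord, reference: str) -> List[int]:
    """Return the 0-based start of every exact match of the read.

    The result depends only on the arguments, so any worker can
    process any record without shared state.
    """
    positions = []
    start = reference.find(record.sequence)
    while start != -1:
        positions.append(start)
        start = reference.find(record.sequence, start + 1)
    return positions
```

Because align keeps no state between calls, records from one sequencer batch can be spread across any number of workers, which is the property the autoscaling setup relies on.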

Cloud Service Autoscaling

What’s autoscaling? Wikipedia defines it well:

Autoscaling […] is a method used in cloud computing, whereby the amount of computational resources in a server farm, typically measured in terms of the number of active servers, scales automatically based on the load on the farm. It is closely related to, and builds upon, the idea of load balancing.

In the context of a DNA sequence analysis workflow, this means that the number of alignment operations required varies over time (because sequencers emit data in batches), and that we can use autoscaling to provision and release alignment resources according to demand.
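As a rough illustration of that provision-and-release decision, the sketch below computes how many instances a target-utilization policy would ask for. It is a simplified model, not Google's autoscaler algorithm; desired_instances and its parameters are hypothetical names.

```python
import math


def desired_instances(current: int,
                      observed_utilization: float,
                      target_utilization: float,
                      min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Instances needed so average utilization lands near the target.

    observed_utilization and target_utilization are fractions (0.6 = 60% CPU).
    """
    if current < 1 or target_utilization <= 0:
        raise ValueError("current must be >= 1 and target_utilization > 0")
    # Total work being done, expressed in "fully busy instance" units.
    total_load = current * observed_utilization
    # Round before ceil so floating-point noise does not add a spare instance.
    wanted = math.ceil(round(total_load / target_utilization, 9))
    # Never scale outside the configured bounds.
    return max(min_instances, min(max_instances, wanted))
```

With a 60% target, three instances running at 95% CPU would grow to five, and an idle group would shrink back to the configured minimum.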

Alignment Service in a Virtual Machine

In this section, I’ll describe a container image I built with Docker. When a container is started from this image, it runs a webserver. HTTP requests containing FastQ records can be sent to this webserver, and it responds with SAM records, the output record format for aligned sequences.
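For readers new to SAM: each alignment line has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below shows one way a client might pick out the useful ones from a response line. SamRecord, parse_sam_line, and is_mapped are illustrative names, and the sketch assumes header lines (those starting with '@') have already been skipped.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags
    rname: str  # reference sequence name, '*' if unmapped
    pos: int    # 1-based leftmost mapping position, 0 if unmapped
    mapq: int   # mapping quality
    cigar: str  # alignment description, e.g. '4M'
    seq: str    # read sequence
    qual: str   # read qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        seq=fields[9],
        qual=fields[10],
    )


def is_mapped(record: SamRecord) -> bool:
    """FLAG bit 0x4 is set when the read did not align."""
    return not (record.flag & 0x4)
```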

Here’s the Docker image: https://hub.docker.com/r/allenday/bwa-http-docker/, and it can be run like this:

docker pull allenday/bwa-http-docker
docker run -P \
--name=bwa-http \
-e BWA_FILES=gs://some/storage/path/* \
allenday/bwa-http-docker

The commands above pull the allenday/bwa-http-docker image to the local environment and then start a container with the variable BWA_FILES set to a URL that points to a BWA-formatted database. BWA_FILES recognizes gs://, http://, and https:// URLs, and accepts a space-delimited list of them. URLs that end in .tar.gz are decompressed automatically.
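The BWA_FILES rules can be summarized in a few lines of Python. This is a sketch of the behavior described here, not the image's actual code; plan_downloads is a hypothetical helper, and the bucket and host names in the usage note are placeholders.

```python
from typing import List, Tuple
from urllib.parse import urlparse

# Schemes BWA_FILES is described as accepting.
SUPPORTED_SCHEMES = {"gs", "http", "https"}


def plan_downloads(bwa_files: str) -> List[Tuple[str, bool]]:
    """Split a space-delimited BWA_FILES value into (url, needs_untar) pairs."""
    plan = []
    for url in bwa_files.split():
        scheme = urlparse(url).scheme
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported URL scheme in {url!r}")
        # Archives ending in .tar.gz are the ones that get unpacked.
        plan.append((url, url.endswith(".tar.gz")))
    return plan
```

For example, plan_downloads("gs://my-bucket/hg19.tar.gz https://example.org/ref.fa.bwt") marks the first URL for unpacking and the second as a plain download.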

Autoscaling DNA Alignment with Google Cloud

Now that we have the Docker image, we can use it to create an instance template and autoscaling instance group. This can be done programmatically, but for ease of understanding I’ll use the graphical cloud console and take screenshots of how to do it.

As a first step (this may be obvious), the BWA-formatted database needs to be created and hosted in Google Cloud Storage or on a webserver visible to Google Cloud virtual machines. Check out gsutil if you want to go the Cloud Storage route.

Create an instance template

Click the hamburger in the upper left, select “Compute Engine” > “Instance Templates”:

Click “Create Instance Template”:

Now “Create an instance template”:

  • Choose a name for the template, according to your preference.
  • Increase the disk size for instances that are created based on this template. For the human genome, 10 GB is a bit tight, but 20 GB should be sufficient. YMMV depending on the database/species you’re analyzing.
  • Click “Allow HTTP traffic” so that instances of this template can handle incoming HTTP requests. This is how we’ll be connecting to the aligner.

Continuing with the same “Create an instance template” form:

  • Define “Metadata”. There are two variables we need to define:
      1. BWA_FILES: one or more gs://, http://, or https:// URLs pointing to the BWA indexes for the reference genome(s) the aligner will use. URLs ending in .tar.gz are decompressed automatically.
      2. DOCKER_IMAGE: use allenday/bwa-http-docker for Allen Day’s image, or define your own.
  • Define “Startup Script”: In this field, we have some boilerplate code (lines 2–3) that pulls in the metadata (defined above) so that it’s available for logic on the compute instance to boot up the docker container (lines 5–6):

You don’t have to copy text from my screenshot; here’s the “Startup script” code:
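The block below is not the script from the screenshot. It is a minimal Python sketch of the same two steps: read the two metadata values from the Compute Engine metadata server, then start the container with them. It assumes Python 3 and Docker are available on the instance, and the 80:80 port mapping is an assumption, since the port the container listens on isn't stated here. The function names are made up for illustration.

```python
import subprocess
import urllib.request
from typing import List

# Compute Engine serves custom instance metadata under this path.
METADATA_BASE = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
)


def metadata_request(key: str) -> urllib.request.Request:
    """Build the request for one metadata key; the header is mandatory."""
    return urllib.request.Request(
        METADATA_BASE + key, headers={"Metadata-Flavor": "Google"}
    )


def read_metadata(key: str) -> str:
    """Fetch one metadata value (only works from inside an instance)."""
    with urllib.request.urlopen(metadata_request(key), timeout=5) as resp:
        return resp.read().decode("utf-8").strip()


def docker_run_command(image: str, bwa_files: str) -> List[str]:
    """Assemble the docker invocation; 80:80 is an assumed port mapping."""
    return [
        "docker", "run", "-d",
        "-p", "80:80",
        "-e", f"BWA_FILES={bwa_files}",
        image,
    ]


def main() -> None:
    image = read_metadata("DOCKER_IMAGE")
    bwa_files = read_metadata("BWA_FILES")
    subprocess.run(docker_run_command(image, bwa_files), check=True)
```

Calling main() on a booting instance would perform both steps; the helpers are split out so the request and the command line can be inspected without touching the network.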

Click “create”:

Create an instance group

Go to the hamburger again and choose “Compute Engine” > “Instance groups”:

Then “Create a new instance group”:

  • Define a name for your instance group
  • Define location information (optional)
  • Associate the instance template you defined previously to this instance group. This will cause instances of that template to be created and managed by the (system) process that manages this group.

Continuing with the same “Create a new instance group” form:

  • Define the properties that the autoscaling system will use to determine whether there are enough machines, too many, or too few. You can read more about autoscaling here: https://cloud.google.com/compute/docs/autoscaler/

Click “create”:

Now go back to the hamburger menu “Compute Engine” > “VM Instances”, and look at the list of VM instances available. You’ll see a single instance with a name based on the instance group name you defined, with a suffix of 4 random characters:

If you see an instance, congrats! The autoscaling instance group is working.

Create a load balancer

Note that in the instance group step, VM instances are created with auto-assigned IP addresses. While it’s possible to pre-allocate IP addresses for the maximum number of machines we might have (10 in this configuration) and track which of those are in use, that is a lot of bookkeeping work that effectively reimplements a routing table. Fortunately, we have a load balancer. The approach we’ll use is:

  • expose the LB as a single front-end IP address and port (internal port 80)
  • configure the LB to use the BWA instance group as the backend service

You can find the configuration under hamburger menu > “Networking” > “Load Balancing”. Here’s what the full configuration looks like:

  • The load balancer itself is simply a record with a name and the frontend/backend services that need to be wired together (see following images).
  • The load balancer’s configured “backend service” is the instance group we created in the previous section.
  • The load balancer’s configured “frontend service” is simply an IP and port (an internal IP in this case).

Using the alignment service

Send a simple HTTP request with curl to test the service. You’ll need to know the instance’s IP address (visible in the “VM Instances” console view). Then you can send an HTTP POST to the load balancer’s IP, as shown in my session below:
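For scripted use, the request can also be built in Python with only the standard library. The address 10.128.0.2, the root path, and the text/plain content type are placeholders and assumptions; the exact request format the image expects isn't specified here, so adjust them to match your deployment.

```python
import urllib.request


def build_alignment_request(base_url: str,
                            fastq_text: str) -> urllib.request.Request:
    """Wrap FastQ text in an HTTP POST aimed at the load balancer."""
    return urllib.request.Request(
        base_url,
        data=fastq_text.encode("ascii"),
        headers={"Content-Type": "text/plain"},  # assumed content type
        method="POST",
    )


def align_remote(base_url: str, fastq_text: str,
                 timeout: float = 60.0) -> str:
    """Send the reads and return the response body (expected to be SAM)."""
    request = build_alignment_request(base_url, fastq_text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("ascii", errors="replace")
```

A call such as align_remote("http://10.128.0.2/", "@r1\nACGT\n+\nIIII\n") would then return whatever SAM text the service produces for that read.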

Final Thoughts

This can also be implemented using Kubernetes and Google Container Engine (instead of instance templates and instance groups) for even more power and fanciness, but that’s outside the scope of this doc.

This web service can be built upon using other cloud services such as Cloud Dataflow and Cloud Functions. More about these later!