Bioinformatics with Pachyderm — Shell Scripts at Scale

Daniel Whitenack
3 min read · Nov 8, 2017

Biology and its sub-disciplines, like genomics, have become incredibly data-intensive in recent years. Methods like high-throughput sequencing and mass spectrometry generate huge amounts of data that need to be processed in a reproducible and scalable manner. However, many workflows in bioinformatics and genomics are driven by a series of shell scripts that are, at least in some cases, manually triggered.

This raises the question: how can bioinformatics and genomics professionals use the tooling that they are familiar with or need (i.e., shell scripts that wrap a variety of specialized tools) in a more sustainable, reproducible, and scalable manner? As it turns out, many are turning to Pachyderm as the answer!

Let’s take a common use case as an example. Suppose we want to do variant calling, which identifies variants in genome sequencing data relative to a reference, using tools from the Genome Analysis Toolkit (GATK). A typical workflow for this might include a couple of shell scripts to (i) find variant likelihoods for various input files, and (ii) perform joint genotyping.
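
As a rough sketch, those two scripts might look something like the following. The script and file names (call_variants.sh, joint_genotype.sh, /ref/reference.fa, and so on) are hypothetical, and the GATK4-style commands and flags are illustrative only; the exact invocations depend on your GATK version and workflow.

    #!/bin/bash
    # call_variants.sh <input_dir> <output_dir>
    # Stage (i): per-sample variant likelihoods with HaplotypeCaller in GVCF mode.
    set -euo pipefail
    in_dir="$1"; out_dir="$2"
    for bam in "$in_dir"/*.bam; do
        sample="$(basename "$bam" .bam)"
        gatk HaplotypeCaller \
            -R /ref/reference.fa \
            -I "$bam" \
            -O "$out_dir/${sample}.g.vcf" \
            -ERC GVCF
    done

    #!/bin/bash
    # joint_genotype.sh <input_dir> <output_dir>
    # Stage (ii): joint genotyping across all per-sample GVCFs.
    set -euo pipefail
    in_dir="$1"; out_dir="$2"
    gvcf_args=()
    for gvcf in "$in_dir"/*.g.vcf; do
        gvcf_args+=(-V "$gvcf")
    done
    gatk CombineGVCFs -R /ref/reference.fa "${gvcf_args[@]}" -O "$out_dir/combined.g.vcf"
    gatk GenotypeGVCFs -R /ref/reference.fa -V "$out_dir/combined.g.vcf" -O "$out_dir/joint.vcf"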

Of course, you could manually get the input files to these shell scripts, trigger them, and gather the results, but this is time-consuming and doesn’t scale to high-throughput scenarios. It’s also far from reproducible.

With Pachyderm, researchers can utilize the exact same shell scripts that they use locally, but in a scalable, automated, and reproducible data pipeline! To get this up and running, researchers just need to:

  1. Add input data into data repositories. Pachyderm versions all of the data it processes in collections called repositories (think “git for data”). Data can be added to these via the CLI, a script, a language client, or a cron job.
  2. Give Pachyderm a simple JSON specification. This specification declaratively tells Pachyderm which scripts to run on which data repositories (see the sketch after this list). The scripts can be run as is in an officially supported Docker image (e.g., one of the GATK Docker images) or in a custom Docker image.

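For the first stage, that specification might look roughly like the sketch below. The pipeline and repo names, the Docker image, and the script path are all hypothetical, and field names vary somewhat between Pachyderm releases (the “atom” input type used here, for example, is called “pfs” in newer versions). Pachyderm mounts the input repo under /pfs/samples and collects results from /pfs/out, which is why those paths are passed to the script.

    {
      "pipeline": {
        "name": "likelihoods"
      },
      "transform": {
        "image": "broadinstitute/gatk:latest",
        "cmd": ["/bin/bash", "/scripts/call_variants.sh", "/pfs/samples", "/pfs/out"]
      },
      "input": {
        "atom": {
          "repo": "samples",
          "glob": "/*"
        }
      }
    }
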
After that, Pachyderm takes care of all of the details related to scheduling the work on the underlying Kubernetes cluster nodes, versioning results, and tracking data provenance. Rather than manually running the scripts and trying to get the right data to the right code, any data fed into an input data repository will automatically trigger the necessary downstream processing.
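
Concretely, the whole flow can be driven from a handful of pachctl commands. This sketch uses the hyphenated 1.x-era subcommands (newer releases use forms like pachctl put file) and the hypothetical names from the sketches above, including a second, assumed pipeline spec, genotype.json, for the joint-genotyping stage:

    # One-time setup: create the input repo and the two pipeline stages.
    pachctl create-repo samples
    pachctl create-pipeline -f likelihoods.json
    pachctl create-pipeline -f genotype.json

    # Committing a new sample is all it takes to trigger both downstream stages.
    pachctl put-file samples master /sampleA.bam -f sampleA.bam

    # Watch the jobs Pachyderm schedules in response, then read the
    # joint-genotyping results out of the final stage's output repo.
    pachctl list-job
    pachctl get-file genotype master /joint.vcf > joint.vcf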

This allows researchers to focus on innovation and discovery rather than moving data from place to place and running commands in their terminals.

Researchers can even crank up the parallelism of each stage of their data pipeline by specifying a number of workers in the JSON pipeline specification. When this happens, Pachyderm will automatically start running the respective shell script on multiple workers (Kubernetes pods under the hood), shard the input data across these workers, and gather the results into a common output data repository.
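
In the JSON specification, that amounts to one more stanza; for example, the following sketch pins a stage to a fixed number of workers (Pachyderm also supports other strategies, such as scaling with the size of the cluster, and the exact fields depend on the release):

    "parallelism_spec": {
      "constant": 4
    }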

So if you’re ready to level up your shell scripts, check out our turnkey Pachyderm + GATK example, which can be spun up quickly using our local installation.
