Why we chose Luigi for our NGS pipelines

Our biotech startup runs all of its computational genomics on a workflow manager built by a streaming music app.

Jake Feala
Nov 9, 2015 · 8 min read

Our philosophy for choosing technology

When our company began operations in 2012, our tiny bioinformatics group started with a clean slate for how we were going to analyze the mountains of NGS data that would be at the core of our genomics-driven business model. Our platform would have to be able to scale to hundreds of batches, each with hundreds of samples. It would have to be robust and reproducible. But, most importantly, it would have to be flexible enough to adapt to the rapidly evolving genomics landscape.

1. No lock-in, no monoliths, no walled gardens. Build an ecosystem of components that do one thing well

This first tenet has been the most important. Without naming names, it’s fair to say the majority of commercial bioinformatics platforms are walled gardens at best — and black-box, monolithic, shrink-wrapped solutions at worst. The field is moving too rapidly for a monolithic solution, because it is highly unlikely that any one system can get all of the many aspects exactly right. We prefer a loosely-coupled architecture connected by solid APIs, similar to the microservices trend in IT, where it is easy to swap out algorithms, pipelines, data, and technologies when they become outdated.

2. Use general purpose tools when possible (not bioinformatics specific)

Tenet #2 addresses the tendency in our field to build custom, bioinformatics-specific solutions for problems that have more general solutions from the software and data science community. Genomics data is still data, and we don’t want to use anything that reinvents the wheel.

3. Favor open components with a large community

The third point is somewhat unfair to all of the fantastic, smaller projects out there, but to maintain some stability we would like to choose technology with a large following, so we can be sure it will continue to be maintained and updated for a long time. The preference for open-source is due to the obvious reason that we can improve any code that doesn’t fit our needs, but also because, in my experience, open-source projects seem to improve more quickly than commercial software.

The major components of an NGS platform

Before getting into Luigi, I want to clarify all of the components that I’m referring to with the somewhat vague term “platform.” An NGS platform, in our view, is a system for processing, analyzing, and visualizing DNA sequence data at every level, from metadata to raw sequence reads to the high-level, interactive visualizations that provide insights to our biologists.

All major genomics platforms need to do essentially the same things:
  1. Track sample metadata
  2. Manage environments and (sometimes clashing) tool dependencies
  3. Easily connect command-line tools into pipelines via intermediate files
  4. Scale up these single-node processes in parallel


Most NGS platforms, commercial and academic, are essentially composed of the following components to meet the above requirements. Here I outline our choices for each:

1. Scalable object storage

We chose Amazon S3 for our file storage. Although cloud-based NGS platform-as-a-service (PaaS) solutions also use S3 as their back end, they only allow data to be accessed through their API. These are therefore walled gardens that prevent us from performing operations such as HTTP access, bucket-to-bucket transfers, direct loading into other environments (e.g., Redshift or Spark), or even simply choosing a different GUI or CLI to manipulate the data. Keeping the data in our own S3 buckets has been a cheaper and more flexible solution.
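To illustrate the kind of direct access we want to keep available, here is a minimal sketch using boto3 (the bucket and key names are made up) of a bucket-to-bucket copy and a plain download, with no vendor API in the way:

```python
import boto3

# Hypothetical bucket and key names, purely for illustration
SRC_BUCKET = "acme-ngs-raw"
DST_BUCKET = "acme-ngs-archive"
KEY = "batch-042/S-00123_R1.fastq.gz"

s3 = boto3.client("s3")

# Server-side, bucket-to-bucket copy: the object never leaves S3
s3.copy_object(
    Bucket=DST_BUCKET,
    Key=KEY,
    CopySource={"Bucket": SRC_BUCKET, "Key": KEY},
)

# Or pull the object down locally to feed any tool or GUI we like
s3.download_file(SRC_BUCKET, KEY, "/tmp/S-00123_R1.fastq.gz")
```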

2. File metadata store

The metadata store can be any database or LIMS (Laboratory Information Management System), and for most biotech companies these are put in place on day one. Commercial systems usually bake a metadata database into the platform. We decided to use our internal LIMS since it has a simple API and was specially designed for tracking this type of highly relational metadata. We only had to add a special entity type for genomics samples and tack on a field linking to the URI where the sample data is stored.
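To make this concrete, here is a rough sketch of fetching such a record; the LIMS endpoint, sample ID, and field names below are hypothetical, since the only point is that the sample entity carries a URI pointing at the data in S3:

```python
import requests

# Hypothetical LIMS endpoint and sample ID, purely for illustration
LIMS_API = "https://lims.example.com/api/samples/"
sample_id = "S-00123"

record = requests.get(LIMS_API + sample_id).json()

# The genomics sample entity only needs one extra field: a pointer to the raw data
# e.g. {"id": "S-00123",
#       "entity_type": "genomics_sample",
#       "batch": "B-042",
#       "data_uri": "s3://acme-ngs-raw/batch-042/S-00123_R1.fastq.gz"}
print(record["data_uri"])
```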

3. Containerized environments for each tool

Environment management was also an easy decision. Containerization is the state of the art for managing and sandboxing a compute environment, and it has essentially become synonymous with Docker in recent years. We have fully adopted Docker to quickly build and deploy consistent environments for our finicky tools.
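As a rough sketch of what this looks like in a pipeline step (the image name, reference, and paths are hypothetical), a script can shell out to a containerized tool so the host machine never needs the tool's dependency stack installed:

```python
import subprocess

# Hypothetical image, reference, and paths, purely for illustration
IMAGE = "acmebio/bwa:0.7.12"
REF = "/data/refs/hg19.fa"
FASTQ = "/data/batch-042/S-00123_R1.fastq.gz"
OUT_SAM = "/data/batch-042/S-00123.sam"

# Mount the data directory into the container and run the aligner there,
# capturing the alignment output back on the host
cmd = (
    "docker run --rm -v /data:/data %s "
    "bwa mem %s %s > %s"
) % (IMAGE, REF, FASTQ, OUT_SAM)

subprocess.check_call(cmd, shell=True)
```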

4. Compute cluster and job scheduler

There are several good solutions for managing an elastic compute cluster and scheduling jobs. I’m not going to go into our choice for provisioning clusters of EC2 instances, because this component will likely be the next to be upgraded.

5. Workflow manager

At the center of this ecosystem of components I’ve just described is the data pipeline or workflow. For small workflows, this can be as simple as a bash or Python script, but above a moderate level of complexity there is a great advantage to having workflow management software.

Luigi ties it all together

A great workflow management solution provides:

  • Idempotency: This is a fancy word for producing the same output every time, but in practice it means that completed tasks are not run twice, so a failed workflow can be restarted from the middle, picking up where it left off.
  • Modularity: Workflows are broken up into tasks that can be swapped in and out, run on their own, or re-wired to different inputs and outputs.
  • Simple, elegant API: Tasks have requires(), output(), and run() methods. requires() returns other Tasks, output() returns Targets, and Targets either exist() or they don’t. You can build a giant, complex tree of dependencies from these rules while retaining a simplicity that anyone reading your code can easily follow (see the sketch after this list).
  • Flexibility: Everything is “just” Python, so Tasks can do anything, including wrap shell commands, call remote APIs, or dump to databases. Targets can also be anything: files, S3 objects, database records, whatever you want. Using Luigi guarantees that you’ll be able to glue all of the other components together, no matter which choices you make or constraints you might have.
  • Batteries included: There are Tasks and Targets for Hadoop, Spark, Redshift, MySQL, and pretty much any other data technology you might consider adopting. Luigi will be able to glue your current stack to the next technology that hasn’t been invented yet.
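Here is a toy sketch of that API (the task names, paths, and commands are made up; a real pipeline would wrap actual aligners and real data): two Tasks, where the second requires() the first, and neither re-runs if its output() already exists.

```python
import subprocess

import luigi


class DownloadFastq(luigi.Task):
    """Fetch raw reads for one sample (sample IDs and paths are hypothetical)."""
    sample_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("data/%s.fastq.gz" % self.sample_id)

    def run(self):
        # In reality this might pull from S3; here we just write a placeholder
        with self.output().open("w") as f:
            f.write("@read1\nACGT\n+\nFFFF\n")


class AlignSample(luigi.Task):
    """Align one sample's reads; completed Tasks are never run twice."""
    sample_id = luigi.Parameter()

    def requires(self):
        return DownloadFastq(sample_id=self.sample_id)

    def output(self):
        return luigi.LocalTarget("data/%s.bam" % self.sample_id)

    def run(self):
        # Wrap any command-line aligner here; echo stands in for the real tool
        subprocess.check_call(
            "echo aligned %s > %s" % (self.input().path, self.output().path),
            shell=True,
        )


if __name__ == "__main__":
    # Resolve the dependency graph and run it with the local scheduler
    luigi.build([AlignSample(sample_id="S-00123")], local_scheduler=True)
```

Run it a second time and nothing happens, because the output Target already exist()s; delete an intermediate file and only the affected Tasks are re-run. The Targets could just as easily be S3 objects via Luigi's bundled S3Target.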

Final thoughts

As much as I hate to advertise yet another workflow strategy for bioinformatics, I’ve really enjoyed building workflows on this stack. I think a huge advantage is the fact that it is a stack, rather than a one-size-fits-all, monolithic solution. Although managing multiple components brings additional overhead, the benefit of using components that “do one thing well” far outweighs the costs, in our experience.


Outlier Bio blog

Thoughts about the field of bioinformatics, and how to make it better

Written by Jake Feala

Full-stack genomics data engineer. Independent consultant. Entrepreneur in a love-hate relationship with the field of bioinformatics.
