
Why we chose Luigi for our NGS pipelines

Our biotech startup runs all of its computational genomics on a workflow manager built by a streaming music app.

Jake Feala
Nov 9, 2015 · 8 min read

Next-generation sequencing (NGS) is a rapidly moving field. From the sequencing technology itself, to the clever applications of genomics, to the downstream data analysis, the speed of recent advances has been as staggering as the volumes of data being produced. Choosing the right technology is crucial to keeping up with this field, and I want to share some of our experiences in putting together an NGS platform at our small biotech*. In particular, I want to highlight how Spotify’s open-source Luigi workflow manager has been the glue that holds our data pipeline platform together.

Our philosophy for choosing technology

Over that time our platform has gone through several iterations, including some bad choices and a few false starts. From these we learned that the specific technology choices will always be evolving, but we’ve converged on some key principles to drive our choices going forward.

1. No lock-in, no monoliths, no walled gardens. Build an ecosystem of components that each do one thing well.

2. Use general purpose tools when possible (not bioinformatics specific)

3. Favor open components with a large community

The major components of an NGS platform

Figure: All major genomics platforms use essentially the same major components.

Practically all analysis tools for NGS data converge on the Linux command line and share data with one another via files in custom formats. We can (and will in a future post) bemoan the state our field is in, but these are the constraints we face if we want to do any practical work with NGS. With these constraints in mind, any production system must, at a minimum:

  1. Store and distribute raw and processed data
  2. Track sample metadata
  3. Manage environments and (sometimes clashing) tool dependencies
  4. Scale up these single-node processes in parallel
  5. Easily connect command-line tools into pipelines via intermediate files
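
As a rough illustration of the last requirement, here is a minimal sketch of two command-line steps chained through intermediate files. The file names and the `run_tool` and `run_pipeline` helpers are hypothetical, and the Python interpreter stands in for real NGS tools (an aligner, a variant caller) so that the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tool(args, stdin_path, stdout_path):
    """Run one command-line step that reads one file and writes another."""
    with open(stdin_path) as src, open(stdout_path, "w") as dst:
        # check=True raises CalledProcessError if the tool exits non-zero.
        subprocess.run(args, stdin=src, stdout=dst, check=True)


def run_pipeline(workdir):
    """Chain two steps; the only link between them is a file on disk."""
    workdir = Path(workdir)
    raw = workdir / "reads.txt"
    upper = workdir / "reads.upper.txt"
    counted = workdir / "reads.count.txt"
    raw.write_text("acgt\nttga\nacgt\n")

    # Stand-ins for real tools: each is just a command that filters stdin.
    to_upper = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    count_lines = [sys.executable, "-c",
                   "import sys; print(len(sys.stdin.readlines()))"]

    run_tool(to_upper, raw, upper)        # step 1: raw -> intermediate
    run_tool(count_lines, upper, counted)  # step 2: intermediate -> result
    return upper.read_text(), counted.read_text().strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(tmp))
```

Even at this size the weak points are visible: nothing records which steps have already finished, and a crash mid-step leaves a half-written file behind. Those are the gaps a workflow manager is meant to fill.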


Most NGS platforms, commercial and academic, are composed of essentially the following components to meet the above requirements. Here I outline our choices for each:

1. Scalable object storage

2. File metadata store

3. Containerized environments for each tool

4. Compute cluster and job scheduler

For a workflow composed of Docker containers, there are also a few different choices for allocating resources and deploying containers across a cluster. We chose EC2 Container Service. When an AWS service meets our requirements, we generally choose it, because AWS services are often inexpensive, well integrated with one another, and quick to add new features.

5. Workflow manager

For better or worse, there are a ridiculous number of options for bioinformatics workflow software. There is a home-grown solution for just about every major academic sequencing center, and many, many smaller, bioinformatics-specific workflow tools built in academic labs. There are workflow tools for every skill level, from biologist to hard-core programmer, and general-purpose workflow libraries for every programming language. Python alone has three that I’m aware of (Luigi, Snakemake, and Ruffus).

With so many options, I’m sure there is more than one excellent choice. And since it is impossible to try them all, it feels a bit unfair to advocate for a single one.

But I’m going to stump for Luigi anyway.

Luigi ties it all together

  • Atomicity: Only successful jobs produce an output, so there are no partially complete data objects from failed jobs.
  • Idempotency: This is a fancy word for producing the same output every time, but in practice it means that completed tasks are not run twice, so a failed workflow can be restarted from the middle, picking up where it left off.
  • Modularity: Workflows are broken up into tasks that can be swapped in and out, run on their own, or re-wired to different inputs and outputs.
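
To make the first two properties concrete, here is a minimal standard-library sketch. It is not how Luigi implements them. The `run_step` helper and the file names are hypothetical. The output is written to a temporary file and renamed into place only on success, and a step whose output already exists is skipped.

```python
import os
import tempfile
from pathlib import Path


def run_step(output_path, compute):
    """Run compute() at most once and publish its result atomically.

    Returns True if the step ran, False if its output already existed.
    """
    output_path = Path(output_path)
    if output_path.exists():
        # Idempotency: a finished step is skipped when the workflow re-runs.
        return False

    # Atomicity: write to a temp file in the same directory, then rename.
    # The final path only ever appears with complete contents.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(compute())
        os.replace(tmp_name, output_path)
    except BaseException:
        # A failed step leaves nothing behind, not even the temp file.
        os.unlink(tmp_name)
        raise
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        out = Path(workdir) / "variants.txt"
        print(run_step(out, lambda: "chr1\t100\n"))  # first call runs the step
        print(run_step(out, lambda: "chr1\t100\n"))  # second call is skipped
```

Modularity shows up in the signature: `run_step` knows nothing about what `compute` does, so any step can be swapped for another that writes to the same path.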

Workflow software often includes additional features such as monitoring, scheduling, and a graphical interface for chaining together tasks, but the bullets above are the big ones. Restricting our options to those that meet these three requirements narrows the field. Among the options we explored, Luigi far exceeds the rest for several reasons:

Simple, elegant API:

Flexibility:

Batteries included:

Final thoughts

Philosophically, the idea is like microservices for data pipelines, in that the containers running individual data processing tasks are stateless, independently scalable, and have well-defined APIs (i.e. the Luigi requires() and output() interface between tasks).
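
The requires() and output() interface can be illustrated with a toy sketch that uses only the standard library. This is not Luigi's code. The method names mirror Luigi's, but the `Task` base class, the `build` function, and the two example tasks are hypothetical, outputs are plain paths rather than Luigi targets, and writes are not atomic, for brevity.

```python
import tempfile
from pathlib import Path


class Task:
    """Toy base class: a task declares its dependencies and its output."""

    def requires(self):
        return []  # upstream tasks that must be complete first

    def output(self):
        raise NotImplementedError  # Path of the file this task produces

    def run(self):
        raise NotImplementedError  # the actual work

    def complete(self):
        # A task counts as done when its output exists.
        return self.output().exists()


def build(task, log):
    """Depth-first: satisfy dependencies, then run the task if needed."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


class WriteReads(Task):
    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def output(self):
        return self.workdir / "reads.txt"

    def run(self):
        self.output().write_text("ACGT\nTTGA\n")


class CountReads(Task):
    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return [WriteReads(self.workdir)]

    def output(self):
        return self.workdir / "read_count.txt"

    def run(self):
        # The upstream task's output() is this task's input.
        reads = self.requires()[0].output().read_text().splitlines()
        self.output().write_text(str(len(reads)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        log = []
        build(CountReads(workdir), log)
        print(log)
```

Each task only needs to know where its upstream tasks put their output, which is what lets the tasks run in separate, stateless containers.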

In parallel, the data engineers at AdRoll converged on a similar Luigi/Docker/S3 architecture for their petabyte-scale workflows, except that they also restrict all container inputs and outputs to be immutable S3 objects. In their blog post, they are able to draw interesting comparisons to functional programming, with its idempotent, stateless, side-effect-free properties. I highly recommend the video and slideshare as well.

If any of this sounds cool to you, please let me know in the comments. I will be open-sourcing some example code for simple, common bioinformatics pipelines in this framework, and I’d love to hear feedback from users or contributors.

*Disclaimer: I have since left this company to become an independent consultant.

Outlier Bio blog

Thoughts about the field of bioinformatics, and how to make…

Jake Feala

Written by

Digitizing biology from hypothesis to experiment and back. Currently Head of Computational Biology at Generate Biomedicines. Twitter @FealaJake
