Why we chose Luigi for our NGS pipelines
Our biotech startup runs all of its computational genomics on a workflow manager built by a music-streaming company.
Next-generation sequencing (NGS) is a rapidly moving field. From the sequencing technology itself, to the clever applications of genomics, to the downstream data analysis, the speed of recent advances has been as staggering as the volumes of data being produced. Choosing the right technology is crucial to keep up with this field, and I want to share some of our experiences in putting together an NGS platform at our small biotech*. In particular I want to highlight how Spotify’s open-source Luigi workflow manager has been the glue that holds our data pipeline platform together.
Our philosophy for choosing technology
When our company began operations in 2012, our tiny bioinformatics group started with a clean slate for how we were going to analyze the mountains of NGS data that would be at the core of our genomics-driven business model. Our platform would have to be able to scale to hundreds of batches, each with hundreds of samples. It would have to be robust and reproducible. But, most importantly, it would have to be flexible enough to adapt to the rapidly evolving genomics landscape.
Since then, our platform has gone through several iterations, including some bad choices and a few false starts. From these we learned that the specific technology choices will always be evolving, but we’ve converged on some key principles to guide our choices going forward.
1. No lock-in, no monoliths, no walled gardens. Build an ecosystem of components that do one thing well
This first tenet has been the most important. Without naming names, it’s fair to say the majority of commercial bioinformatics platforms are walled gardens at best — and black-box, monolithic, shrink-wrapped solutions at worst. The field is moving too rapidly for a monolithic solution, because it is highly unlikely that any one system can get all of the many aspects exactly right. We prefer a loosely-coupled architecture connected by solid APIs, similar to the microservices trend in IT, where it is easy to swap out algorithms, pipelines, data, and technologies when they become outdated.
2. Use general purpose tools when possible (not bioinformatics specific)
Tenet #2 addresses the tendency in our field to build custom, bioinformatics-specific solutions for problems that have more general solutions from the software and data science community. Genomics data is still data, and we don’t want to use anything that reinvents the wheel.
3. Favor open components with a large community
The third point is somewhat unfair to all of the fantastic, smaller projects out there, but to maintain some stability we would like to choose technology with a large following, so we can be sure it will continue to be maintained and updated for a long time. The preference for open-source is due to the obvious reason that we can improve any code that doesn’t fit our needs, but also because, in my experience, open-source projects seem to improve more quickly than commercial software.
The major components of an NGS platform
Before getting into Luigi, I want to clarify all of the components that I’m referring to with the somewhat vague term “platform.” An NGS platform, in our view, is a system for processing, analyzing, and visualizing DNA sequence data at every level, from metadata to raw sequence reads to the high-level, interactive visualizations that provide insights to our biologists.
Practically all analysis tools for NGS data converge on the Linux command line and share data via files in custom formats. We can (and will, in a future post) bemoan the state our field is in, but these are the constraints we face if we want to do any practical work with NGS. With this restriction in mind, any production system must, at a minimum:
- Store and distribute raw and processed data
- Track sample metadata
- Manage environments and (sometimes clashing) tool dependencies
- Scale up these single-node processes in parallel
- Easily connect command-line tools into pipelines via intermediate files
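At its most basic, the last requirement — chaining command-line tools through intermediate files — is just this pattern, sketched below in plain Python. The "tools" here are portable Python one-liner stand-ins, not real NGS programs, so the sketch runs anywhere:

```python
# The file-passing pattern that NGS tools impose: each step is a shell
# command that reads one file and writes the next intermediate file.
import subprocess
import sys

def run_step(cmd, out_path):
    """Run one 'tool', capturing its stdout into an intermediate file."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)

# Step 1: a stand-in "sequencer" emits raw reads.
run_step([sys.executable, "-c", "print('ACGT'); print('TTGA')"],
         "reads.txt")

# Step 2: a stand-in "aligner" consumes the intermediate file
# and writes the next one.
run_step([sys.executable, "-c",
          "[print(l.strip().lower()) for l in open('reads.txt')]"],
         "aligned.txt")
```

Everything that follows — scheduling, retries, atomic outputs — is layered on top of this humble read-a-file, write-a-file contract.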
Most NGS platforms, commercial and academic, are composed of essentially the same basic components to meet the above requirements. Here I outline our choices for each:
1. Scalable object storage
We chose Amazon S3 for our file storage. Although cloud-based NGS platform-as-a-service (PaaS) solutions also use S3 as their back end, they only allow data to be accessed through their APIs. That makes them walled gardens, preventing operations such as HTTP access, bucket-to-bucket transfers, direct loading into other environments (e.g., Redshift and Spark), or even simply choosing a different GUI or CLI to manipulate the data. Keeping the data in our own S3 buckets has been a cheaper and more flexible solution.
2. File metadata store
The metadata store can be any database or LIMS (Lab Information Management System), and for most biotech companies these are put in place on day one. Commercial systems usually bake a metadata database into the platform. We decided to use our internal LIMS since it has a simple API and was specially designed for tracking this type of highly relational metadata. We only had to add a special entity type for genomics samples, and to tack on a field with links to a URI where the sample data is stored.
3. Containerized environments for each tool
Environment management was also an easy decision. Containerization is the state of the art for managing and sandboxing compute environments, and it has essentially become synonymous with Docker in recent years. We have fully adopted Docker to quickly build and deploy a consistent environment for our finicky tools.
4. Compute cluster and job scheduler
There are several good solutions for managing an elastic compute cluster and scheduling jobs. I’m not going to go into our choice for provisioning clusters of EC2 instances, because this component will likely be the next to be upgraded.
For a workflow composed of Docker containers, there are also a few different choices for allocating resources and deploying containers across a cluster. We chose EC2 Container Service (ECS). When an AWS service meets our requirements, we generally choose it: AWS services tend to be inexpensive, well integrated with one another, and rapidly gain new features.
5. Workflow manager
At the center of this ecosystem of components I’ve just described is the data pipeline or workflow. For small workflows, this can be as simple as a bash or Python script, but above a moderate level of complexity there is a great advantage to having workflow management software.
For better or worse, there are a ridiculous number of options for bioinformatics workflow software. There is a home-grown solution for just about every major academic sequencing center, and many, many smaller, bioinformatics-specific workflow tools built in academic labs. There are workflow tools for every skill level, from biologist to hard-core programmer, and general-purpose workflow libraries for every programming language. Python alone has three that I’m aware of (Luigi, Snakemake, and Ruffus).
With so many options, I’m sure there is more than one excellent choice. And since it is impossible to try them all, it feels a bit unfair to advocate for a single one.
But I’m going to stump for Luigi anyway.
Luigi ties it all together
A great workflow management solution provides:
- Atomicity: Only successful jobs produce an output, so there are no partially complete data objects from failed jobs.
- Idempotency: This is a fancy word for producing the same output no matter how many times a task is run. In practice, it means completed tasks are not run twice, so a failed workflow can be restarted from the middle, picking up where it left off.
- Modularity: Workflows are broken up into tasks that can be swapped in and out, run on their own, or re-wired to different inputs and outputs.
Workflow software often includes additional features such as monitoring, scheduling, and a graphical interface for chaining tasks together, but the bullets above are the big ones. Restricting our options to those that meet these three requirements narrows the field considerably. Among the options we explored, Luigi far exceeds the rest, for several reasons:
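As a rough illustration, the first two properties can be sketched in plain Python without any framework: write to a temporary file, rename on success, and skip any step whose output already exists. (Luigi's file Targets follow essentially this pattern under the hood.)

```python
# Atomicity + idempotency without a framework. A failed compute() leaves
# no output file behind, and an existing output file means "done".
import os
import tempfile

def atomic_idempotent_step(out_path, compute):
    if os.path.exists(out_path):           # idempotency: done means done
        return
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(out_path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(compute())             # a failure here leaves no out_path
        os.replace(tmp, out_path)          # atomic rename: all or nothing
    except BaseException:
        os.remove(tmp)
        raise
```

A workflow manager's job is to apply this discipline uniformly across hundreds of tasks, plus the dependency wiring that the modularity bullet describes.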
Simple, elegant API:
Tasks have requires(), output(), and run() methods. requires() returns other Tasks, and output() returns Targets, which either exist() or they don’t. You can build a giant, complex tree of dependencies from these rules, while retaining a simplicity that anyone reading your code can easily follow.
Everything is “just” Python, so Tasks can do anything, including wrap shell commands, call remote APIs, or dump to databases. Targets can also be anything: files, S3 objects, database records, whatever you want. Using Luigi guarantees you’ll be able to glue all of the other components together, no matter which choices you make or which constraints you face.
There are Tasks and Targets for Hadoop, Spark, Redshift, MySQL, and pretty much any other data technology that you might consider adopting. Luigi will be able to glue your current stack with the next technology that hasn’t been invented yet.
As much as I hate to advertise yet another workflow strategy for bioinformatics, I’ve really enjoyed building workflows on this stack. I think a huge advantage is the fact that it is a stack, rather than a one-size-fits-all, monolithic solution. Although managing multiple components brings additional overhead, the benefit of using components that “do one thing well” far outweighs the costs, in our experience.
Philosophically, the idea is like microservices for data pipelines, in that the containers running individual data processing tasks are stateless, independently scalable, and have well-defined APIs (i.e. the Luigi requires() and output() interface between tasks).
In parallel, the data engineers at AdRoll converged on a similar Luigi/Docker/S3 architecture for their petabyte-scale workflows, except that they also restrict all container inputs and outputs to be immutable S3 objects. In their blog post, they are able to draw interesting comparisons to functional programming, with its idempotent, stateless, side-effect-free properties. I highly recommend the video and slideshare as well.
If any of this sounds cool to you, please let me know in the comments. I will be open-sourcing some example code for simple, common bioinformatics pipelines in this framework, and I’d love to hear feedback from users or contributors.
*disclaimer: I have since left this company to become an independent consultant