miniwdl, a runtime and developer toolkit for the bioinformatics Workflow Description Language

Mike Lin
Nov 25, 2019

Miniwdl is a new runner and Python developer toolkit for the Workflow Description Language (WDL), used to specify bioinformatics data processing workflows in projects like the Human Cell Atlas. This open-source project is in beta testing now, and we invite the community to try it out and build on it.

Modern sequencing and imaging assays generate vast raw data streams, which must be refined and normalized for further scientific analysis. These automated bioinformatics workflows typically employ numerous software tools from open-source and in-house authors, organized into complex data processing pipelines, and carrying a range of software dependencies, hardware requirements, and parallelization potential. This makes it challenging to reuse complete workflows across diverse computing platforms.

WDL is a language for specifying such pipelines — it’s portable across platforms and easy to read & write for bioinformaticians and software engineers alike. Originally incubated to facilitate the Broad Institute’s production bioinformatics, the language is now under community-driven stewardship and among the open technologies adopted within Global Alliance for Genomics and Health standards.

Miniwdl extends the WDL ecosystem with developer productivity tools, including a local runner and source code linter, as well as a Python library for programmatic access to its WDL parser, static analysis framework, and runtime system. Beyond supporting workflow developers, miniwdl provides a foundation to build new WDL tools and platform-specific runners. Lastly, it supports the OpenWDL specification community with an accessible codebase for prototyping new language features.
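
To give a flavor of the library, here is a minimal sketch of loading and inspecting a WDL file with miniwdl's Python package (imported as WDL; the filename is hypothetical, and attribute names reflect the current beta and may evolve):

import WDL

# Parse the document, resolve its imports, and typecheck it;
# problems raise exceptions from the WDL.Error module.
doc = WDL.load("my_workflow.wdl")
for task in doc.tasks:
    print("task:", task.name)
if doc.workflow:
    print("workflow:", doc.workflow.name)
    # available_inputs is an environment of the workflow's input declarations
    for binding in doc.workflow.available_inputs:
        print("  input:", str(binding.value.type), binding.name)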

Structure and interpretation of bioinformatics workflows

Key to WDL’s usability is its domain-specific language (DSL), which feels natural to workflow developers and analysts, but also makes WDL significantly more complex for tools to parse and interpret, potentially constraining its open-source ecosystem. Let’s illustrate this with a simple example.

Starting with each individual processing task, the developer specifies a Docker image encapsulating its software dependencies, along with its inputs & outputs, command-line invocation, and hardware requirements. For example, the following WDL task uses samtools to count genome sequence read mappings in a specified genomic region from an indexed BAM file.

task samtools_count {
  input {
    File bam
    File bai
    String? region
  }
  command <<<
    samtools view -c "~{bam}" "~{region}"
  >>>
  output {
    Int count = read_int(stdout())
  }
  runtime {
    docker: "biocontainers/samtools:v1.9-4-deb_cv1"
  }
}
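
Saving this task in a file (hypothetically, samtools_count.wdl), we can immediately exercise miniwdl's linter on it, which reports syntax and type errors along with style warnings:

miniwdl check samtools_count.wdl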

Tasks are then arranged in sequential or parallel stages of an end-to-end workflow, which can be used for data production or further composed as a step in a higher-level workflow. Here we use a WDL “scatter” stage to parallelize the samtools_count task across the 22 human autosomes.

version 1.0

workflow bam_chrom_counter {
  input {
    File bam
    File bai
    Int num_chrom = 22
  }
  scatter (i in range(num_chrom)) {
    String chrom = "chr~{i+1}"
    call samtools_count {
      input:
        bam = bam, bai = bai,
        region = chrom
    }
  }
  output {
    Array[Pair[String,Int]] counts = zip(chrom, samtools_count.count)
  }
}
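
With the task and workflow saved together as bam_chrom_counter.wdl, the local runner executes the whole pipeline in one command (the BAM filenames here are hypothetical):

miniwdl run bam_chrom_counter.wdl bam=sample.bam bai=sample.bam.bai

The runner launches each scattered task in its own Docker container, and collects the output files and an outputs.json under a freshly created run directory.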

From just this brief sample, we can already observe some of the DSL’s features that help bioinformaticians write robust yet concise workflows: parametric data types, polymorphic functions, string interpolations with embedded expressions, and automatic dependency analysis. This expressiveness requires WDL interpreters to not only parse the DSL, but also typecheck the syntax tree and evaluate its functional expressions, before running even one command. In fact, miniwdl’s DSL front-end library has several times more code than its runtime system for actually executing tasks and workflows!
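
Miniwdl exposes this typechecked syntax tree to Python programmers. As a hedged sketch (class names such as WDL.Tree.Call and WDL.Tree.Scatter reflect the current beta), the following walker prints the calls nested within scatter and conditional sections of the workflow above:

import WDL

def walk(nodes, depth=0):
    # a workflow body is a list of declarations, calls, and nested sections
    for node in nodes:
        if isinstance(node, WDL.Tree.Call):
            print("  " * depth + "call " + node.name)
        elif isinstance(node, (WDL.Tree.Scatter, WDL.Tree.Conditional)):
            print("  " * depth + type(node).__name__.lower())
            walk(node.body, depth + 1)

doc = WDL.load("bam_chrom_counter.wdl")
walk(doc.workflow.body)

For bam_chrom_counter, this prints a scatter section containing the samtools_count call.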

Miniwdl’s reusable Python library, alongside Cromwell’s Scala/JVM stack, should help to lower this barrier for developers to contribute new tools to the WDL ecosystem. Early examples building with miniwdl include WDL DevTools, providing instant error-checking in many code editors through Language Server Protocol, and WDL-AID, which generates workflow documentation based on source code annotations.

Miniwdl itself benefits from Lark to handle low-level lexing/parsing, as well as Python static analysis tools like Pylint and Pyre, which not only support its own code quality, but also provide meta-inspiration for its analogous WDL tools.

Scaling up and yet staying “mini”

Miniwdl’s runtime system currently orchestrates containers on the local host (which could be a powerful server supporting many parallel tasks). As mentioned, it enjoys a fairly compact codebase — and we’d like to keep that as a core value, befitting its name, even while scaling up for larger workloads.

One way we’ve kept it tight so far has been to reuse Docker’s built-in Swarm mode for many aspects of container scheduling, rather than reimplementing parallel resource allocation and queuing logic. This could extend to multi-node Swarm clusters with appropriate configuration and without further dependencies. In the future, miniwdl will also schedule through Kubernetes, and expose an interface to enable plugging into others like YARN or traditional batch queues (though those additional integrations might not come built-in to miniwdl itself).
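
Concretely, Docker’s own CLI handles cluster formation, so joining additional worker nodes looks something like the following (the token and address are placeholders):

docker swarm init                                            # on the manager node
docker swarm join --token <worker-token> <manager-ip>:2377   # on each worker

miniwdl then submits task containers to Swarm, leaving placement and queuing decisions to Docker.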

Besides container scheduling, cloud-enabled workflow runners have historically incurred significant complexity in adapting bioinformatics tools’ filesystem assumptions to cloud object storage services. The recent availability of cloud-managed filesystems like Amazon FSx, Azure Files, and Google Cloud Filestore will streamline this aspect of scaling miniwdl as well.

Miniwdl is now in beta testing. It accelerates the WDL code/test/debug cycle for bioinformaticians; its runner can be employed for small-to-medium data production use cases; and its library helps Python developers create further complementary tools. There’s a Getting Started tutorial and we welcome all issue reports, feedback, and contributions on the MIT-licensed repository.
