Tools for big genomics are finally here
tl;dr Although genomics has become big data, its mainstream analysis toolkits have not kept up. Hadoop and other big data technologies are better suited, but their implementation has remained out of reach for bioinformatics researchers in small labs and companies. This will quickly change, however, because the right APIs and technologies are starting to arrive. Researchers can now unlock new insights from petabytes of under-utilized genomics data.
The single-node mindset
Against all odds, even as the field floods with data, a single-node mindset has survived in bioinformatics. We build “pipelines”: complex series of analysis steps, each performed on a data file by a bespoke command-line tool on a single server, with each step reading the data, transforming it, and saving it back to disk before the next step begins. Even when we use multi-node clusters to parallelize our workflows, we are essentially running the same single-node workflow, just with a different sample on each node. Our field has stubbornly stuck with this approach even as the data far outgrew it.
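The pattern above can be caricatured in a few lines: every step re-reads the whole dataset from disk and writes it all back before the next step can begin. The file names, steps, and transforms here are invented purely for illustration.

```python
# Caricature of the single-node pipeline pattern: each step round-trips
# the entire dataset through disk. Steps and data are made up.
import os
import tempfile

def step(in_path, out_path, transform):
    """One pipeline stage: read everything, transform, write everything."""
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            f_out.write(transform(line))

workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "reads.txt")
with open(raw, "w") as f:
    f.write("acgt\nttga\n")

trimmed = os.path.join(workdir, "trimmed.txt")
aligned = os.path.join(workdir, "aligned.txt")
step(raw, trimmed, lambda line: line.strip()[:3] + "\n")  # toy "trim" step
step(trimmed, aligned, lambda line: line.upper())         # toy "align" step

with open(aligned) as f:
    print(f.read())  # two disk round-trips for just two steps
```

With real NGS data, each of those round-trips can mean hundreds of gigabytes written and re-read per sample.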
Meanwhile, the tech industry has forged ahead. Hadoop and its surrounding ecosystem solved the big data problem by distributing each file across multiple nodes, performing computations for a particular slice of data on the same computer storing that data (“moving compute to the data”, in tech terminology), and making parallelism implicit and unified.
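A toy sketch of the contrast, assuming nothing about Hadoop's actual APIs: each "node" computes a small partial summary over the partition of data it already stores, and only those small summaries cross the network.

```python
# Toy illustration of "moving compute to the data": instead of shipping
# all reads to one machine, each node reduces its own partition locally
# and only small partial summaries travel. Data here is hypothetical.
from collections import Counter

# Pretend each inner list is a file block stored on a different node.
partitions = [
    ["chr1:A", "chr1:T", "chr2:G"],
    ["chr1:A", "chr2:G", "chr2:G"],
    ["chr1:T", "chr1:T", "chr3:C"],
]

def local_count(partition):
    """Runs where the data lives; returns a small partial summary."""
    return Counter(read.split(":")[0] for read in partition)

def merge(counters):
    """Only these small Counters cross the network."""
    total = Counter()
    for c in counters:
        total += c
    return total

coverage = merge(local_count(p) for p in partitions)
print(coverage["chr1"])  # 5 reads touching chr1, across all partitions
```

The framework, not the analyst, decides which partition runs where; that is what makes the parallelism implicit.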
But, stuck in academic silos, genomics has held on to the old paradigm of lugging files through pipelines based on single nodes, reimplementing parallelism with each new tool, and forcing our infrastructures to adapt through truly impressive feats of engineering.
Genomics analysis = data reduction
One consequence of the single-node, command-line tool culture is that bioinformatics scientists explore only a massively reduced subset of their genomics data, pared down by many orders of magnitude by our processing pipelines — which are essentially static, enormously complex data reduction algorithms. Data reduction is fine in principle and common practice in data science, but the problem in genomics is that the process is usually done exactly one time. Once the data have been reduced to manageable size (say, a list of variants), the raw data are never touched again. So you’d better hope your first data reduction algorithm is the best (spoiler — it’s not).
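To make the "static data reduction" point concrete, here is a deliberately tiny, hypothetical variant caller: it collapses per-position base observations (the big raw data) into a short list of variant calls (the small reduced data). The threshold and data are made up for illustration.

```python
# A toy "data reduction": reduce a pileup of raw base observations to a
# short variant list. The 0.5 threshold is an arbitrary illustration of
# the kind of fixed choice baked into a real pipeline.
REF = {"pos1": "A", "pos2": "C", "pos3": "G"}

# Raw observations: many bases piled up at each reference position.
pileup = {
    "pos1": ["A", "A", "A", "A"],  # matches reference
    "pos2": ["T", "T", "T", "C"],  # mostly non-reference
    "pos3": ["G", "G", "A", "G"],  # one stray mismatch
}

def call_variants(pileup, ref, min_alt_fraction=0.5):
    """Keep only positions where enough bases differ from the reference."""
    variants = []
    for pos, bases in pileup.items():
        alts = [b for b in bases if b != ref[pos]]
        if len(alts) / len(bases) >= min_alt_fraction:
            variants.append((pos, ref[pos], max(set(alts), key=alts.count)))
    return variants

print(call_variants(pileup, REF))  # [('pos2', 'C', 'T')]
```

Lowering `min_alt_fraction` to 0.25 would also call the stray mismatch at pos3 — which is exactly the point: a single fixed choice of such parameters, run once and never revisited, can silently drop real variants.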
Last year, researchers reported a mutation in the gene RNF43 found in nearly 1 in 5 colorectal and endometrial cancers, making it one of the most common mutations in these tumors. However, the original variant-calling algorithm missed these mutations. Re-running the dataset with multiple algorithms likely would have found the mutation, but this is a monumental task with the current state of our pipelines. Luckily, in this case, the Broad Institute has a framework in place to re-run their pipelines. But most labs and enterprises dealing with sequencing data are not equipped to repeatedly tweak their pipelines and reprocess large numbers of samples.
Imagine the lost opportunities from neglecting terabytes of data. Next generation sequencing data contain multiple, rich layers of information that can best be understood by iteratively exploring the raw sequence data. For example, RNA-Seq is used to measure gene expression, but mutations, mRNA editing, novel transcripts, and complex splicing patterns can also be seen in the raw alignments. With big data technology, machine learning might be used to extract structure from raw DNA sequences, rather than relying on predefined gene models and heuristics of the data reduction steps.
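As a minimal sketch of why the raw alignments hold more than expression counts: simply scanning aligned reads against the reference also surfaces candidate mutations or mRNA-editing sites. The alignments below are heavily simplified (no indels or CIGAR strings, made-up coordinates).

```python
# Sketch: raw RNA-Seq alignments carry more than read counts. Scanning
# reads against the reference reveals recurrent mismatches -- candidate
# mutations or editing sites. Alignments are simplified and hypothetical.
reference = "ACGTACGTAC"

# (start offset on the reference, read sequence)
alignments = [
    (0, "ACGTA"),
    (2, "GTGCG"),  # mismatch at reference position 4 (A -> G)
    (4, "GCGTA"),  # same mismatch observed again
]

def mismatch_sites(reference, alignments):
    """Collect reference positions where any read disagrees with the reference."""
    sites = {}
    for start, read in alignments:
        for i, base in enumerate(read):
            pos = start + i
            if reference[pos] != base:
                sites.setdefault(pos, []).append(base)
    return sites

print(mismatch_sites(reference, alignments))  # {4: ['G', 'G']}
```

A pipeline that reduces these alignments to per-gene counts throws that recurrent A-to-G signal away; keeping the raw data queryable is what makes it recoverable later.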
Furthermore, even the reduced datasets will soon grow beyond our ability to process locally. Single-cell sequencing is revealing heterogeneity in the genomes of different cells in the same organism, and scientists have known for years that cancer genomes vary across time and space in the same patient. When you consider the possibility that in the near future, we will all have our genomes sequenced — and not just once, but many times — even the reduced lists of variants, now of manageable size, will themselves become unwieldy.
Genomics is, and is not, special
I am nowhere near the first to describe this problem. “Genomics data is not special,” argued Uri Laserson in a talk that started a small scuffle among bioinformaticians: the data is “just” data and should be treated as such. Some argued that genomics is special, and of course they’re right — but in the same way that click-through rate data is special, astronomy data is special, geospatial data is special, and in fact any data is special in terms of the particulars of its own domain. Uri’s point is that it’s still data, and while the details of the analysis and interpretation are of course vastly different, we shouldn’t be storing it, formatting it, accessing it, and moving it through workflows differently than in any other domain.
Skeptics of this line of thinking also claim that genomics is not actually big data yet. And they are right, it isn’t yet — at least for most of the problems that people are currently working on. But this is partly due to the Streetlight Effect (i.e., looking for your lost keys only under the streetlamp, because that’s where the light is). Researchers naturally concentrate on the data that are available to them. Sequencing consortia such as The Cancer Genome Atlas (TCGA) are producing undeniably “big” data and making it public, but only the reduced, preprocessed outputs are being fully utilized by the community. When the “big,” lower-level data do become more accessible through better technology, I expect an accompanying explosion of new research questions that are currently intractable and are therefore ignored.
TCGA was just the beginning. With the UK’s 100k patient sequencing initiative, Human Longevity, Inc.’s goal of 40k human genomes per year, and several other DNA sequencing initiatives in the works, there will soon be a major divide between the “haves,” who can glean insights from genomics at this scale, and the “have-nots,” who are not equipped to handle big genomics. That is, unless the bioinformatics world undergoes a massive upheaval in the way we approach data.
As a field, we need to be very scared of this amount of data — nervous enough to consider other paradigms, and to ease ourselves away from custom file formats, command-line tools, and Perl scripts. It’s time to take the first step and admit we might have a problem, and to think about how to treat big data the way the rest of the world treats big data.
Data APIs over CLIs and file formats
The transition to big genomics data will be made easier by the emergence of genomics APIs. The Global Alliance for Genomics and Health (GA4GH) is building data APIs for querying NGS reads and variants across the internet.
The beauty of a data API is that it decouples data engineering and data munging (boring, but comprising most of our work) from the analysis step (fun, but only a small portion of the work). Data APIs shift the “munging” from something that every data consumer does every time, to something that the data producers do only once. When data APIs are commonplace, they will free up vastly more time for the creative, productive, unique analyses and visualizations that make science fun.
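To make that decoupling concrete, here is a hedged sketch of a consumer-side region query. The request shape loosely follows a GA4GH-style reads search, but the field names, values, and endpoint are illustrative assumptions, not taken verbatim from the specification.

```python
# Sketch of a data-API consumer: the client states *what* slice of data
# it wants; the producer handles storage formats and access behind the
# API. Field names and values below are illustrative, not authoritative.
import json

def reads_search_request(read_group_ids, reference_name, start, end):
    """Build the JSON body for a region query against a reads endpoint."""
    return json.dumps({
        "readGroupIds": read_group_ids,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })

# A few lines describe the query; no BAM files, indexes, or downloads.
body = reads_search_request(["sample-123"], "chr17", 1000, 2000)
print(body)
```

The same few lines work whether the server stores BAM files, Parquet, or something else entirely — that is the decoupling.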
As of right now, Google Genomics is the biggest, most accessible implementation of GA4GH, but there will soon be competitors and solutions to host your own data behind these APIs using any cloud platform. Last year, the National Cancer Institute (NCI) announced its “Cloud Pilot” program, in which three institutions were chosen to build cloud-based data platforms and APIs (using GA4GH) for making cancer genomics data accessible to the community. The three recipients of NCI funding are a mix of commercial entities and academic institutions:
- Seven Bridges Genomics, a Cambridge, Mass. company building a genomics workflow platform on Amazon cloud
- a joint consortium between genomics stalwarts Broad Institute, UC Santa Cruz, and UC Berkeley
- the Seattle-based Institute for Systems Biology (ISB), in partnership with Google
These cloud platforms are now online and ready for early adopters. Additionally, companies such as SolveBio are pouring resources into data APIs for variants as well as other types of genomic data.
Eventually, all of the major public genomic data will be accessible through standard APIs, which will allow researchers to scale their queries across petabytes of data, using high-level languages such as R and Python. I can’t wait for the day that I will be able to run the same query across all of TCGA, GTEx, or the 1000 Genomes Project using just a few lines of code.
Exploring big genomics with Spark and ADAM
The limitation of APIs, however, is that the data still have to travel from the cloud datacenter to the client performing the request. This is fine for small, specific queries, but we still need a viable new solution for batch analytics on big genomic datasets. This is where a project called ADAM, stemming from Berkeley’s AMPLab, will be revolutionary. A platform based on ADAM will comprise a major part of the NCI Cloud Pilot deliverable from the Broad/UCSC/Berkeley group.
ADAM is a technology stack that essentially decouples the data model from the implementation of compute and storage. As with GA4GH, the project uses the general-purpose data modeling language Avro to develop a schema for variants and alignments (the schemas produced by the two efforts will presumably merge). Then, on top of this schema, it provides APIs for distributed computation with Apache Spark and efficient storage with Apache Parquet.
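As a concrete (and much-simplified) illustration of what a storage-independent data model looks like: an Avro record schema is just JSON, with no dependence on any particular tool or storage engine. The fields below are a hypothetical sketch, not ADAM's actual variant schema.

```python
# A minimal Avro-style record schema, expressed as plain JSON. The field
# list is a simplified illustration, not ADAM's real Variant schema.
import json

variant_schema = {
    "type": "record",
    "name": "Variant",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "position", "type": "long"},
        {"name": "referenceAllele", "type": "string"},
        {"name": "alternateAllele", "type": "string"},
    ],
}

# Any Avro-aware system -- Spark, Parquet, or code in another language
# entirely -- can consume this same schema definition.
schema_json = json.dumps(variant_schema)
print(schema_json)
```

Because the schema lives apart from any one tool, the compute layer (Spark) and the storage layer (Parquet) can evolve independently underneath it.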
Spark is a cluster computing system widely believed to be the successor to Hadoop; it achieves a performance advantage by caching data in memory rather than writing to disk. For data scientists, Spark allows interactive exploration of massive datasets distributed across a large cluster, using APIs for high-level languages such as Python, R, and SQL.
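For a flavor of that style without requiring a cluster, the sketch below mirrors a typical Spark chain (filter, map, reduceByKey) using plain-Python stand-ins; in real PySpark the same shape would start from something like `sc.parallelize(...)` and run distributed. The variant data is invented for illustration.

```python
# Local stand-in for the Spark style: chained transformations over a
# dataset instead of an explicit pipeline of files. In PySpark this would
# be sc.parallelize(variants).filter(...).map(...).reduceByKey(...).
variants = [
    ("chr1", 0.01), ("chr1", 0.40), ("chr2", 0.35), ("chr2", 0.02),
]

common = filter(lambda v: v[1] >= 0.05, variants)  # keep common variants
keyed = map(lambda v: (v[0], 1), common)           # (chromosome, 1) pairs
counts = {}
for chrom, n in keyed:                             # reduceByKey(add)
    counts[chrom] = counts.get(chrom, 0) + n

print(counts)  # {'chr1': 1, 'chr2': 1}
```

The analyst expresses *what* to compute; where each partition runs, and whether intermediate results hit disk, is the framework's problem rather than the pipeline author's.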
On one hand, the Spark/ADAM stack may not change your life if all you want to do is run standard, best-practice NGS pipelines on a few samples. To be sure, variant calling with ADAM is more elegant, not to mention 28 times faster than the standard command-line tools, but it may not be worth the learning curve.
For big genomics data, however, ADAM and Spark will be a game-changer. Scientists with novel research questions will be empowered to perform interactive and batch NGS analyses that would be unthinkable within the single-node, command-line paradigm.
A welcome new era for bioinformatics
Both Spark/ADAM and the GA4GH APIs are still in early stages, but they have arrived. What both of these technologies share is the revolutionary potential to shift the paradigm from “pipelines of command-line tools” to “transformations of data models.” This will do three things:
- Free us from the scourge of file formats and tools, bringing focus back to the data,
- Lower the barrier to entry to NGS analysis for biologists and data scientists from other fields, and
- Open the world of big data to bioinformatics scientists, allowing them to quickly test small hypotheses on huge datasets.
These changes will shift our perspective of genomics data and will contribute to a welcome new era for our field.
Notes:

1. For example, the Broad Institute has processed petabytes of data through its custom Firehose platform, which wraps shell workflow steps as LSF batch jobs.
2. In the case of genomics, after alignment to a reference genome, common data reduction steps include differential comparison to the genome (i.e., variant calling), or counting reads that overlap a predefined set of features (e.g., gene expression for RNA-Seq data).