“I defy the tyranny of precedent. I go for anything new that might improve the past.” — Clara Barton
If you haven’t followed the world of sequencing in the last few years, don’t worry. It can be easily summarized as “everything has changed.” Indeed, we face a major inflection point: a transition from a ship-your-sample, short-read-only world to a world of long reads at the edge.
Orders of magnitude longer than short reads, nanopore long reads can produce high-resolution sequences of parts of the genome previously hidden from us. Nanopore sequencing can provide epigenetic information by directly sequencing modified bases and teasing out chromatin conformation, shed light on centromere diversity, sequence RNA, and even sequence proteins. The technology has expanded modalities and simplified sample preparation, allowing farmers to diagnose their sick crops and doctors to use cell-free DNA to zero in on specific leukemias affecting patients in developing nations.
Don’t Train in Vain
Sequencing at the edge means computing at the edge. Long-read technology company Oxford Nanopore has integrated GPUs and deep learning for basecalling into its sequencers to keep up with the signals coming out of them. The deep learning-based basecalling application, Guppy, currently has two AI options called Pirate and Flip-Flop. Users can also retrain Guppy’s networks with another application, Taiyaki, to better fit the particular organism being sequenced. Retraining deep learning networks gives researchers the flexibility to improve their basecalls in a way that traditional methods like hidden Markov models (HMMs) cannot.
The hackable sequencer
Jeff Nivala from the University of Washington called the MinION a “molecular-scale hackable device” at London Calling (Oxford Nanopore’s user conference). His talk introduced a new method to sequence proteins directly. Nanopore sequencers are no longer just for DNA and RNA; proteins now join the growing list of modalities they can tackle. And with new modalities comes data, lots of new data.
Oxford Nanopore has begun shipping a PromethION with 48 flow cells, affectionately known as the P48 in the nanopore community. In a single run using all 48 flow cells at once, the P48 has demonstrated an output of 7.3 trillion DNA bases. As software and chemistry improve, the community expects that output to increase. Next to it sits a computer equipped with four NVIDIA GV100 32GB GPUs, which perform basecalling fast enough to keep up with that amount of data.
We fought the (Moore’s) law, we won
Once DNA sequencing completes, assembling reads into a genome comes next. Many DNA sequencer users want to generate new reference genomes. To assist them, we chose to contribute to this phase of the analysis pipeline by accelerating Racon. The high memory bandwidth and high levels of parallelism of NVIDIA GPUs, plus the opportunities presented by deep learning, make GPUs uniquely suited to genomics.
Dr. Mike Vella, deep learning and genomics engineer at NVIDIA, presented the team’s work on accelerating the partial order alignment (POA) graph algorithm at London Calling’s Data for Breakfast session. “The end of Moore’s law,” Dr. Vella explained to the packed room, “threatens computing, but we can keep going if we take advantage of the inherent parallelism of the GPU architecture.”
He went on to describe how GPU features like warp shuffle allow one to carry out POA computation efficiently. Finally, Dr. Vella presented benchmarking results of the team’s work and outlined their plans for the future.
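For readers who have not met POA before, here is a minimal CPU sketch in Python of its core step: scoring a read against a graph of bases with dynamic programming. This is not the team’s GPU code and not Racon’s implementation. The function name `align_to_graph`, the graph encoding (a list of bases in topological order plus a predecessor map), and the scoring values are all assumptions chosen for illustration. My hedged reading of where the GPU fits: the cells of one row are largely independent, so they could be spread across the threads of a warp, and the one left-neighbor dependency is the kind of value exchange warp shuffle is designed for.

```python
from typing import Dict, List

# Illustrative scores only (assumed here, not taken from Racon or the talk).
MATCH, MISMATCH, GAP = 1, -1, -1


def align_to_graph(bases: List[str], preds: Dict[int, List[int]], seq: str) -> int:
    """Global alignment score of `seq` against a DAG of bases.

    bases[v] is the base stored at node v; nodes must be listed in
    topological order. preds[v] lists the predecessors of v (missing or
    empty means v is a start node).
    """
    m = len(seq)
    if not bases:
        return m * GAP  # empty graph: every read base is an insertion

    # Virtual start row: consuming j read bases before any graph node.
    start = [j * GAP for j in range(m + 1)]
    rows: List[List[int]] = []
    has_successor = [False] * len(bases)

    for v, base in enumerate(bases):
        pred_ids = preds.get(v, [])
        for p in pred_ids:
            has_successor[p] = True
        pred_rows = [rows[p] for p in pred_ids] or [start]

        # Column 0: node v is aligned against a gap in the read.
        row = [max(pr[0] for pr in pred_rows) + GAP]
        for j in range(1, m + 1):
            sub = MATCH if base == seq[j - 1] else MISMATCH
            # Diagonal (match/mismatch) and vertical (skip read base for
            # this node) moves, taken over every predecessor row.
            best = max(max(pr[j - 1] + sub, pr[j] + GAP) for pr in pred_rows)
            # Horizontal move: extra read base inserted after node v.
            best = max(best, row[j - 1] + GAP)
            row.append(best)
        rows.append(row)

    # A global alignment must end at a node with no outgoing edges.
    sinks = [v for v in range(len(bases)) if not has_successor[v]]
    return max(rows[v][m] for v in sinks)


if __name__ == "__main__":
    # Tiny graph with one bubble: A -> (C or G) -> T
    bases = ["A", "C", "G", "T"]
    preds = {1: [0], 2: [0], 3: [1, 2]}
    for read in ("ACT", "AGT", "AT"):
        print(read, align_to_graph(bases, preds, read))
```

The inner `max` over predecessor rows is what separates this from plain sequence-to-sequence alignment: a node in a bubble can be reached from more than one path, so every incoming row has to be considered.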
Accelerating De Novo Assembly
The idea that sequencing could be democratized using a device you can carry in your pocket seemed absurd just a few years ago, let alone a sequencer that can output over 7 trillion DNA bases worth of data. Every talk I attended at London Calling felt revolutionary and defiant, the themes being “we couldn’t have done this before” followed by “there’s so much more we would like to do.” The examples ranged from sequencing cassava plants in the field to sequencing a plant in just 24 hours. We have jumped in head first by creating libraries that will help developers build new tools in support of this kind of groundbreaking science.
At NVIDIA, we want to build upon this work as we learn more about other applications that could use an acceleration boost from such libraries. We intend to make Clara Genomics a foundation that enables the acceleration of ‘omic analysis, starting with de novo assembly, and we invite developers to give us feedback. At London Calling, we witnessed Oxford Nanopore deliver universal access to sequencing, to anyone, anywhere, and we strive to make the compute portion of the genomics workflow equally accessible to everyone.
Author: Fernanda Foertter, Deep Learning Alliance Manager and GPU Developer Advocate for Bioinformatics, HPC and AI