Genome Assembly at Warp Speed
This piece represents ARK’s newest form of written content: “Works in Progress (WIP)”. WIP content is a snapshot of our internal research process. These pieces are unfinished and generally more technical than ARK’s blog content. The goal of WIP content is to open our thinking as it evolves such that we can gather feedback and learn quickly if we are wrong. Please see ARK’s general social media disclosures, which also apply to WIP content: https://ark-invest.com/terms/#twitter. Note ARK’s Adviser disclosures at the end of this piece as well.
De novo genome assembly is becoming one of the most powerful methods to analyze genetic variation. We feel strongly that the combination of novel genome assembly algorithms with accurate, long-read sequencing (LRS) technology will transform genomics. Recently, the Telomere-to-Telomere (T2T) Consortium used this approach to produce — for the first time ever — the entire end-to-end sequence of a human genome.
Broadly, genome assembly is a computational task that involves stitching together small sequence reads into full genomes — similar to solving a jigsaw puzzle. These algorithms come in two flavors: (1) reference-based assembly and (2) de novo assembly. Reference-based approaches use a reference genome (Fig. 1, Left) as a guide to align sequence reads to their corresponding loci. Despite their speed, reference-based algorithms tend to generate fragmented, lower-quality assemblies. Because of this limitation, the genome assembly field has converged on de novo implementations.
De novo assembly algorithms do not rely on a reference genome. Using the puzzle analogy — these algorithms solve jigsaw puzzles not by checking against a completed image (the reference), but by comparing individual puzzle pieces and finding pairwise relationships between them. Think for a moment about how you’d try to solve a 1,000-piece puzzle without a reference image. Would you look at the pieces’ shapes or colorations?
Many de novo assembly approaches (e.g. overlap layout consensus (OLC)) are slower and more resource-intensive because they require an all-to-all read comparison step. With heuristics such as seeding, this comparison step’s runtime can scale roughly linearly with the number of reads; in the worst case, without such shortcuts, it grows quadratically with the number of reads.
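To make the scaling concern concrete, here’s a minimal Python sketch (my own illustration; the function names are mine, not any assembler’s): without heuristics like seeding, every read gets compared against every other read, so the number of comparisons grows quadratically with read count.

```python
# A minimal sketch (my own, not from any assembler) of the naive all-to-all
# comparison at the heart of overlap-based assembly.

def suffix_prefix_overlap(a, b, min_len=3):
    """Longest suffix of `a` matching a prefix of `b`; 0 if shorter than min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def all_vs_all(reads, min_len=3):
    """Compare every ordered pair of reads: the number of comparisons grows
    quadratically with the number of reads (n * (n - 1) for n reads)."""
    comparisons, overlaps = 0, {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            comparisons += 1
            olen = suffix_prefix_overlap(a, b, min_len)
            if olen:
                overlaps[(i, j)] = olen
    return comparisons, overlaps

reads = ["ATGGCGT", "GCGTGCA", "TGCAATT"]
print(all_vs_all(reads))  # (6, {(0, 1): 4, (1, 2): 4})
```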
Although both short- and long-read sequencing data can be input into de novo algorithms, we believe that accurate, long-read data generates the highest quality results. This is especially true near ‘hard-to-sequence’ stretches of the genome, including areas of high GC-content, homopolymeric sequences, and paralogous loci — many of which are highly medically relevant.
Importantly, de novo assembly algorithms purpose-built for accurate, long-read data are relatively young. As such, we think there’s significant headroom to improve de novo assembly on the basis of speed, compute intensity, automatability, etc. To that end, I’d like to highlight mdBG — a novel genome assembly algorithm that I’ve been poring over since the preprint came out a few weeks ago.
What Does Graph Theory Have to do With Genome Assembly? Also, What’s Graph Theory?
mdBG is a de novo genome assembly algorithm optimized for highly accurate (<4% error) long-read sequencing data. Strikingly, mdBG is ~80–340X faster and runs on ~20X less memory than other state-of-the-art methods. Despite being ultra-fast and highly resource-frugal, mdBG boasts top-shelf performance as well.
Similar to other de novo assemblers, mdBG borrows concepts from graph theory — a highly visual field of mathematics. Graphs are conceptual data structures that seek to model objects and the relationships between those objects. Individual sequence reads often are represented by nodes in the graph whereas pairwise relationships (e.g. overlaps) between reads are represented by edges, as shown below. Many algorithms work by trying to find a unique ‘path’ through the graph where each edge or node is traversed just once — ideally, this solution would yield a single, contiguous sequence of bases for each chromosome in a genome.
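As a toy illustration of that idea (again my own sketch, not production assembler code), reads can be stored as nodes and sufficiently long suffix-prefix overlaps as directed edges; walking a path that visits each node once then merges the reads into a single contiguous sequence.

```python
# Toy overlap graph: reads are nodes, suffix-prefix overlaps are directed edges.

def overlap(a, b, min_len=4):
    """Length of the longest suffix of `a` matching a prefix of `b` (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

reads = ["ATGGCGT", "GCGTGCA", "TGCAATT"]    # nodes

edges = {(i, j): overlap(a, b)               # edges, keyed by (from, to) node
         for i, a in enumerate(reads)
         for j, b in enumerate(reads)
         if i != j and overlap(a, b)}
print(edges)                                 # {(0, 1): 4, (1, 2): 4}

# Traverse the unique path 0 -> 1 -> 2, appending only the non-overlapping part
# of each successive read.
path = [0, 1, 2]
assembly = reads[path[0]]
for prev, curr in zip(path, path[1:]):
    assembly += reads[curr][edges[(prev, curr)]:]
print(assembly)                              # ATGGCGTGCAATT
```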
That’s Not Just Any Old Graph, It’s a de Bruijn (“duh-BROO-in”) Graph
mdBG does not use a standard graph approach. Instead, it uses a special type of graph called a de Bruijn graph (hence the -dBG suffix in the name). De Bruijn graphs do not use sequence reads as nodes. Instead, reads are digitally fragmented into short, overlapping subsequences of a fixed length (k). These fragments, called k-mers, can be represented as either nodes or edges. The authors of mdBG take a node-centric approach. For an unambiguous path through the graph to be traced, each k-mer should overlap its neighbors by k-1 bases — that is, in all but one nucleotide — as shown below.
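Here’s a minimal sketch of that node-centric construction (my own, with a toy k of 4): each read is chopped into k-mers, and an edge connects two k-mers whenever the (k-1)-base suffix of one matches the (k-1)-base prefix of the other.

```python
from collections import defaultdict

def build_debruijn(reads, k=4):
    """Node-centric de Bruijn graph: nodes are k-mers, edges connect k-mers
    whose (k-1)-base suffix and prefix match."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    by_prefix = defaultdict(list)      # index k-mers by their (k-1)-base prefix
    for km in kmers:
        by_prefix[km[:-1]].append(km)

    edges = {km: by_prefix[km[1:]] for km in kmers}  # suffix of km -> matching prefixes
    return kmers, edges

kmers, edges = build_debruijn(["ATGGCGTGCA"], k=4)
for km in sorted(kmers):
    print(km, "->", edges[km])
# ATGG -> ['TGGC'], TGGC -> ['GGCG'], ..., TGCA -> []
```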
One advantage of the de Bruijn implementation is that the number of nodes in the graph scales with the size and complexity of the target genome — not the number of sequence reads. However, one disadvantage is that de Bruijn graphs are very sensitive to coverage gaps and sequencing errors, which can generate ‘false’ k-mers, tangling the graph and making it more challenging to traverse. Also, storing lists of unique k-mers can be very memory-intensive. One recent study estimated that storing the full set of 5000-mers from a single genome-scale dataset would take roughly 4TB of memory.
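A quick way to see the first property is to simulate it (a toy check of my own, using error-free reads): sampling more and more reads from the same sequence barely changes the number of distinct k-mers, because that number is capped by the genome itself.

```python
import random

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(2000))
k, read_len = 21, 100

def sample_reads(seq, n_reads):
    """Draw n_reads random, error-free substrings of length read_len from seq."""
    starts = (random.randrange(len(seq) - read_len + 1) for _ in range(n_reads))
    return [seq[p:p + read_len] for p in starts]

for n_reads in (100, 1_000, 10_000):
    reads = sample_reads(genome, n_reads)
    nodes = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    print(f"{n_reads:>6} reads -> {len(nodes)} distinct {k}-mers (graph nodes)")
# The node count plateaus near len(genome) - k + 1 = 1980 regardless of read count.
```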
mdBG Operates in Minimizer Space
The ‘m-‘ in mdBG stands for ‘minimizer-space’. Instead of generating k-mers in sequence space, that is, over the DNA alphabet, mdBG adds a conversion step. mdBG’s authors have built rules to help users generate a set of minimizers. Though there’s no universal definition, in this case minimizers behave like single symbols in a new, smaller alphabet, with each symbol standing in for a short stretch of nucleotide bases (~8–12 on average).
The authors give the example where each minimizer (a) begins with the base A and (b) has a length of two bases. Therefore, the full list of minimizers would be AA, AC, AG, and AT. Effectively, this is a form of data compression. By lightening the data on the front end, the often-burdensome task of creating, compacting, and traversing the de Bruijn graph is eased substantially.
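To make the toy example concrete, here’s a small sketch (mine, not the authors’ code) that scans a read with that example rule and records each minimizer along with its position:

```python
MIN_LEN = 2  # toy minimizer length from the authors' example

def is_minimizer(lmer):
    """Toy selection rule from the example: the subsequence must begin with 'A',
    so the full minimizer alphabet is AA, AC, AG, AT."""
    return lmer.startswith("A")

def to_minimizer_space(read, l=MIN_LEN):
    """Slide a window of length l across the read and keep every window that
    satisfies the rule, along with its position."""
    return [(i, read[i:i + l])
            for i in range(len(read) - l + 1)
            if is_minimizer(read[i:i + l])]

print(to_minimizer_space("GGACTTAGCAATG"))
# [(2, 'AC'), (6, 'AG'), (9, 'AA'), (10, 'AT')]
```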
As shown in the diagram below, each sequence read is encoded into an ordered list of minimizers, which are then grouped into overlapping windows of k consecutive minimizers called k-min-mers. K-min-mers are functionally the same thing as k-mers but are made up of minimizers instead of DNA letters. Importantly, not all bases in the reads are contained in minimizers. It’s my understanding that these subsequences are held in base-space and appended back to the final genome assembly. We’ll come back to this later.
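Continuing the toy example, the ordered minimizers recovered from a read can be grouped into overlapping windows of k consecutive minimizers to form k-min-mers. A hypothetical sketch (the real implementation differs in its data structures):

```python
def to_kminmers(minimizers, k=3):
    """Group an ordered list of minimizers into overlapping windows of k:
    the minimizer-space analog of chopping a read into k-mers."""
    return [tuple(minimizers[i:i + k]) for i in range(len(minimizers) - k + 1)]

# Ordered minimizers recovered from one read with the toy rule above.
minimizers = ["AC", "AG", "AA", "AT"]
print(to_kminmers(minimizers, k=3))
# [('AC', 'AG', 'AA'), ('AG', 'AA', 'AT')]
```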
Next, the k-min-mers follow a similar solution path to that of a standard de Bruijn graph implementation. K-min-mers are stored as nodes whose overlaps in minimizer space are represented as edges (Box 4, Below). mdBG creates, compacts, trims, and attempts to traverse the de Bruijn graph in minimizer space. Once a solution is produced, the consensus sequence of k-min-mers is resurfaced into base-space following the original encoding schema that was used to create the minimizer set in the first place (Box 5, Below). Finally, the remaining DNA letters not captured in minimizers are added back to the assembly. The end result is a series of long, contiguous stretches (contigs) of the genome in base-space, which can be sent to any number of subsequent pipeline modules for further analysis. I’ve attached an overview below and saved the discussion of error handling for a bit further down.
1 = Import Sequence Reads
1.5 = Store Subsequences Not Contained in Minimizers
2 = Parameterize Minimizers
3 = Convert Reads to k-min-mers (here k = 3)
3.5 = Execute POA Error Correction if (1% < Error Rate < 4%)
4 = Construct, Compact, and Traverse de Bruijn Graph in Minimizer Space
5 = Decode Minimizers and Append Subsequences
6 = Final Assembly (Contig)
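To tie the numbered steps together, here is a compressed, error-free toy run of the whole flow (my own sketch using the same toy minimizer rule as above; it skips graph compaction, trimming, and POA error correction, and it simply drops the bases flanking the first and last minimizers rather than storing and re-appending them as in Boxes 1.5 and 6):

```python
L, K = 2, 3   # toy minimizer length and k-min-mer size

def minimizer_positions(read):
    """Steps 2-3: ordered positions of minimizers (2-mers beginning with 'A')."""
    return [i for i in range(len(read) - L + 1) if read[i] == "A"]

def build_nodes(reads):
    """Steps 3-4: k-min-mer nodes. Each node keeps one base-space 'witness'
    (the bases its minimizers span) plus the offset where new bases begin
    when the node is appended to a growing path."""
    nodes = {}
    for read in reads:
        pos = minimizer_positions(read)
        for i in range(len(pos) - K + 1):
            window = pos[i:i + K]
            key = tuple(read[p:p + L] for p in window)   # the k-min-mer itself
            witness = read[window[0]:window[-1] + L]     # bases spanned by its minimizers
            merge_from = window[-2] - window[0] + L      # end of the (K-1)-minimizer overlap
            nodes.setdefault(key, (witness, merge_from))
    return nodes

def traverse_and_decode(nodes):
    """Steps 4-6: walk the path (consecutive nodes share K-1 minimizers), then
    decode back to base-space by splicing the stored witnesses. Assumes a
    simple, cycle-free path, which holds for this error-free toy."""
    suffixes = {key[1:] for key in nodes}
    current = next(key for key in nodes if key[:-1] not in suffixes)  # no incoming edge
    contig = nodes[current][0]
    while True:
        successors = [key for key in nodes if key[:-1] == current[1:]]
        if not successors:
            return contig
        current = successors[0]
        witness, merge_from = nodes[current]
        contig += witness[merge_from:]

# Three overlapping, error-free reads drawn from the toy genome "GGTACGTAGCCATGGTAACCTT".
reads = ["GGTACGTAGCCATG", "GTAGCCATGGTAAC", "CCATGGTAACCTT"]
print(traverse_and_decode(build_nodes(reads)))
# ACGTAGCCATGGTAAC -- the toy genome minus the flanking bases GGT / CTT
```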
Sequencing errors that occur outside minimizers (contained in Box 1.5 above) do not propagate into minimizer space. However, they are appended back to the final (Box 6) assembly, necessitating further polishing. Errors that occur inside minimizers can generate false k-min-mers, requiring an additional error correction step that occurs in minimizer space prior to graph construction. Currently, mdBG’s error tolerance is <1% without error correction and <4% with error correction. Sequencing data with an error rate >4% causes a precipitous decline in assembly performance.
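mdBG’s actual correction step applies partial-order alignment (POA) in minimizer space (Box 3.5 above). As a much simpler, hypothetical stand-in, the sketch below shows why correction is needed at all: a single base error inside a minimizer changes that minimizer, spawning low-abundance ‘false’ k-min-mers that a coverage-based filter can flag before graph construction.

```python
from collections import Counter

L, K = 2, 3

def kminmers(read):
    """k-min-mers of a read under the toy 'length-2, starts with A' rule."""
    pos = [i for i in range(len(read) - L + 1) if read[i] == "A"]
    return [tuple(read[p:p + L] for p in pos[i:i + K])
            for i in range(len(pos) - K + 1)]

true_read = "GGTACGTAGCCATGGTAACCTT"
noisy_read = true_read[:8] + "T" + true_read[9:]   # G -> T error inside the 'AG' minimizer

counts = Counter()
for read in [true_read] * 9 + [noisy_read]:        # ~10x coverage, one erroneous copy
    counts.update(kminmers(read))

for kmm, n in counts.items():
    print(kmm, n, "keep" if n > 1 else "flag as likely error")
```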
So What Happens Now?
Algorithms like mdBG make it easier for resource-constrained researchers and laboratories to generate high-quality genome assemblies across a range of organisms. This helps level the playing field with more well-funded entities in producing high-impact research.
Though clinical sequencing pipelines are dominated by reference-based approaches today, de novo genome assembly may become more common in some of them. Specifically, we think whole genome sequencing (WGS) — especially for diagnosing rare diseases — is an excellent beachhead use case. Labs likely will optimize for variant discovery across all categories including single-nucleotide variants (SNVs), insertions/deletions (indels), and structural variants (SVs). Still, assembly-based variant calling isn’t yet strictly superior to reference-based variant calling. However, given the rates of improvement for highly accurate long-read sequencing and assembly algorithms, we think de novo assembly (especially diploid assembly) could generate the highest diagnostic yields for inherited diseases. As such, algorithms like mdBG make it more likely that these methods can be implemented more widely, including in care settings with fewer resources.
©2021, ARK Investment Management LLC. No part of this material may be reproduced in any form, or referred to in any other publication, without the express written permission of ARK Investment Management LLC (“ARK”).
The information provided is for informational purposes only and is subject to change without notice. This WIP piece does not constitute, either explicitly or implicitly, any provision of services or products by ARK, and investors should determine for themselves whether a particular investment management service is suitable for their investment needs. All statements made regarding companies or securities are strictly beliefs and points of view held by ARK and are not endorsements by ARK of any company or security or recommendations by ARK to buy, sell or hold any security. Historical results are not indications of future results.
Certain of the statements contained in this WIP piece may be statements of future expectations and other forward-looking statements that are based on ARK’s current views and assumptions and involve known and unknown risks and uncertainties that could cause actual results, performance, or events to differ materially from those expressed or implied in such statements. The matters discussed in this presentation may also involve risks and uncertainties described from time to time in ARK’s filings with the U.S. Securities and Exchange Commission. ARK assumes no obligation to update any forward-looking information contained in this WIP piece. ARK and its clients as well as its related persons may (but do not necessarily) have financial interests in securities or issuers that are discussed. Certain information was obtained from sources that ARK believes to be reliable; however, ARK does not guarantee the accuracy or completeness of any information obtained from any third party.