HiFi Assembler Series, Part 1: hifiasm, a fast, haplotype-resolved genome assembler

Published in PacBio. 4 min read. Feb 12, 2020

UPDATE: Hifiasm preprint is out. Currently under review!

NOTE 1: This is a four-part series on genome assemblers using PacBio HiFi reads.

NOTE 2: Hifiasm is currently under active development. As such, methods & results described in this interview may become obsolete in the future. I will attempt to keep the contents of this blog updated.

Drosophila melanogaster (fruit fly), one of the species assembled with hifiasm. Photo credit: Hannah Davis.

Since the release of highly accurate long reads, known as HiFi reads, in 2019, the bioinformatics community has been developing new tools that take advantage of them for genome assembly and variant calling. At the Plant and Animal Genome (PAG) Conference 2020, we saw the introduction of HiCanu (by @sergekoren), Peregrine (by @infoecho), and Nighthawk-Falcon (by the PacBio assembly team), all employing HiFi reads towards a haplotype-resolved genome assembly.

As PAG came to a close, hifiasm joined the growing family of HiFi assemblers.

One key focus of hifiasm is to ensure that reads from different haplotypes are kept separate during the error correction step. This is only possible now thanks to HiFi accuracy; Heng thinks it is very challenging for this approach to work on traditional long reads that have an error rate over 10%. Currently, haplotyping is done using only SNPs, but Haoyu thinks it may be possible to haplotype based on indels in the future.
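To make the idea concrete, here is a toy Python sketch of SNP-based read separation. It is not hifiasm's implementation: the function names, the `min_support` threshold, and the assumption that all reads are already aligned to the same coordinates and have equal length are mine, chosen only for illustration. A column counts as a candidate SNP when at least two different bases are each seen in several reads; a base seen only once is treated as a likely sequencing error.

```python
from collections import Counter


def informative_columns(reads, min_support=2):
    """Return column indices where at least two different bases each
    appear in `min_support` or more reads (candidate heterozygous SNPs)."""
    columns = []
    for i in range(len(reads[0])):
        counts = Counter(read[i] for read in reads)
        well_supported = [base for base, n in counts.items() if n >= min_support]
        if len(well_supported) >= 2:
            columns.append(i)
    return columns


def same_haplotype_reads(target, reads, min_support=2):
    """Keep only the reads that agree with `target` at every candidate SNP."""
    columns = informative_columns([target] + reads, min_support)
    return [r for r in reads if all(r[i] == target[i] for i in columns)]


# Hypothetical pre-aligned reads covering the same 8 bp window.
target = "ACGTACGT"
overlapping = ["ACGTACGT", "ACGAACGT", "ACGAACGT", "ACGTACGA"]
kept = same_haplotype_reads(target, overlapping)
print(kept)
```

In this example, column 3 is a candidate SNP (T versus A, each well supported), so the two reads carrying A are excluded from correcting the target read. The lone A in the last column of the fourth read has no support from other reads, so it is not treated as a SNP and that read is kept.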

Part of the attraction of hifiasm comes from its versatile outputs. It outputs the overlap files, which enable users to fiddle with different assembly parameters without having to redo the time-consuming overlap step. It outputs the assembly in GFA format (e.g., Figure 1) as well as primary/alternative contigs similar to FALCON-Unzip’s output.
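Because GFA is a plain tab-separated text format, a graph can be summarized with a few lines of Python. The sketch below is an assumption-laden illustration rather than an official tool: the function name `summarize_gfa` is mine, and it reads only the `S` (segment) and `L` (link) records of GFA 1, using the optional `LN:i:` tag when a sequence is given as `*`.

```python
def summarize_gfa(lines):
    """Return ({segment_name: length}, number_of_links) for GFA 1 text lines."""
    segments = {}
    n_links = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":
                # Sequence omitted; fall back to the LN:i: tag if present.
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            segments[name] = length
        elif fields[0] == "L":
            n_links += 1
    return segments, n_links


# A tiny hand-written GFA, not real hifiasm output.
example = [
    "H\tVN:Z:1.0",
    "S\tutg1\tACGTAC",
    "S\tutg2\t*\tLN:i:10",
    "L\tutg1\t+\tutg2\t-\t3M",
]
segments, n_links = summarize_gfa(example)
print(len(segments), sum(segments.values()), n_links)
```

For a real assembly, the same function can be called on an open file handle, for example `summarize_gfa(open("assembly.gfa"))`, where `assembly.gfa` is a placeholder file name.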

***Figure 1. Bandage plot showing the assembly graph for F1 drosophila using hifiasm.*** *The red arrow points to a likely mis-joining of telomeres between two chromosomes.*

Challenges remain, however. Heng points out, for example, that the F1 drosophila assembly (Figure 1) likely contains a mis-join of telomeres between two chromosomes (red arrow), where the other haplotype of the second chromosome became separated from the other arm (blue arrow), likely due to high heterozygosity. In contrast, a medium level of heterozygosity on one of the other chromosomes (green arrow) resulted in a series of “bubbles”. A universal assembly graph that works across all levels of heterozygosity is thus the remaining problem Haoyu and Heng are trying to solve.

Below are additional Q&A with Haoyu and Heng from our interview.

Liz: Would hifiasm work on polyploid organisms?

Heng: Yes and no. Hifiasm generates unitig graphs (*_utg.gfa) with no assumption on ploidy. It aims to faithfully represent input data with little or no loss of information. This graph can encode more than two haplotypes resulting from polyploid organisms, different strains or clonal somatic mutations. However, some subsequent graph operations assume the input sample is diploid.

Liz: Do you think hifiasm will do well with highly repetitive genomes like maize?

Heng: One of the advantages of HiFi reads is the high accuracy. As long as the repeat length is shorter than the read length, we would be able to tile over heterogeneous repeats.

Liz: How is hifiasm different from miniasm?

Heng: Hifiasm shares some source code with miniasm, but the key difference is that read overlaps in miniasm can be inexact, whereas hifiasm requires nearly identical overlaps, which dramatically speeds up the overlap step and resolves local phasing.
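A toy Python example can show why strict overlaps help with phasing. This is a naive suffix-prefix comparison written for illustration only; it is not how hifiasm or miniasm compute overlaps, and the function name and the `min_len` and `max_mismatches` parameters are my own assumptions.

```python
def longest_overlap(a, b, min_len=3, max_mismatches=0):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most `max_mismatches` substitutions; 0 if there is none
    of length `min_len` or more."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


read = "TTACGTAC"
same_hap = "CGTACGGA"   # overlaps the end of `read` exactly
other_hap = "CGAACGGA"  # same locus, but carries one SNP (T -> A)

print(longest_overlap(read, same_hap))
print(longest_overlap(read, other_hap))
print(longest_overlap(read, other_hap, max_mismatches=1))
```

With no mismatches allowed, the read from the other haplotype does not overlap at all, so the two haplotypes stay apart. Once a single mismatch is tolerated, the SNP is absorbed into the overlap and the haplotype signal is lost.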

Liz: What are some of the remaining challenges you are working on?

Haoyu: It remains difficult to distinguish segmental duplications and certain repeats; as a result, both copies may end up in the primary contig. We want to find a way to “pop” these bubbles.

Heng: Long term we would like to integrate other data types such as ultra-long reads and Hi-C data so we can achieve chromosomal-scale assembly.

Liz: Taking a step back to look at the big picture…what are your thoughts on the community moving towards pan-genome projects? How can researchers benefit from pan-genomes?

Heng: I’ve written two blog posts on this (part 1, part 2). I think different researchers will have different preferences, but the majority of users may still stick with a linear reference (with blacklisted regions). Others may brave graph-based mapping.

Written by Liz T