HiFi Assembler Series, Part 2: HiCanu, near optimal repeat resolution using HiFi reads

Published in PacBio, Feb 20, 2020

NOTE 1: This is a four-part series on genome assemblers using PacBio HiFi reads. Read part 1 on hifiasm here.

NOTE 2: HiCanu is currently under active development. The content of this blog is based on Sergey Koren’s PAG 2020 talk slides.

NOTE 3: The HiCanu preprint is now on bioRxiv.

HiCanu is the latest member in the Canu assembler family that utilizes long-read data. Based on the Celera Assembler, the original Canu was modified to work with long reads that had higher error rates by adapting a weighted MinHash-based overlapper with a sparse assembly graph. Later, TrioCanu was developed to use parental data to “bin” (separate) long reads into different haplotypes before assembly.

With the release of PacBio HiFi reads, the Canu authors developed HiCanu. Compared to PacBio CLR reads, HiFi reads are slightly shorter but more accurate (> 99% read accuracy). In his PAG 2020 talk, Sergey (@sergekoren) described three tricks to further reduce the ~1% error rate of HiFi reads.

The first trick is run length encoding (RLE), aka homopolymer compression. As the majority of HiFi read errors occur in homopolymer regions, RLE compresses homopolymers to a single base, allowing for faster overlap and error correction.

**Schematic of run length encoding** where homopolymers (HP) are collapsed for faster overlap detection and subsequent error correction. Figure provided by R Grothe.
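To make the idea concrete, here is a minimal Python sketch of homopolymer compression. It is not HiCanu's code; the function names and the toy reads are made up for illustration. It only shows that two reads that differ solely in homopolymer length become identical once each run is collapsed to one base.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(seq))


def run_length_encode(seq: str) -> list[tuple[str, int]]:
    """Return (base, run length) pairs, so the original read can be rebuilt."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


# Toy reads (hypothetical): read_b has two homopolymer-length errors
# relative to read_a (TTT -> TT and AAA -> AA).
read_a = "GATTTACAAAG"
read_b = "GATTACAAG"

compressed_a = homopolymer_compress(read_a)
compressed_b = homopolymer_compress(read_b)
print(compressed_a, compressed_b, compressed_a == compressed_b)
```

Keeping the run lengths alongside the compressed sequence, as `run_length_encode` does, is one way to recover the uncompressed read later; whether HiCanu stores them this way is not stated in the talk.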

The second trick is fixing remaining non-homopolymer errors (e.g. removing spurious errors that occur in only one read) and the third trick is ignoring mapping ambiguities arising from dinucleotide repeats.

After the three tricks, the corrected HiFi reads achieve a median pairwise alignment identity of 100%, allowing for filtering to retain only perfect overlaps and near-optimal repeat resolution.

**Three tricks (run length encoding, fixing residual errors, and ignoring systematic errors) corrected HiFi reads to near 100% identity**, allowing for perfect overlaps. Slide from Sergey’s PAG 2020 talk.
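As a rough illustration of why compression helps when only perfect overlaps are kept, the sketch below looks for an exact suffix-prefix match between two toy reads, before and after homopolymer compression. The brute-force search, the `min_len` threshold, and the reads are assumptions chosen for clarity; a real overlapper uses indexing and alignment rather than this loop.

```python
from itertools import groupby


def compress(seq: str) -> str:
    """Collapse every homopolymer run to a single base."""
    return "".join(base for base, _run in groupby(seq))


def perfect_overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that exactly equals a prefix
    of `right`; returns 0 if no exact match of at least `min_len` exists."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


# Toy reads (hypothetical) that disagree only in homopolymer lengths.
left_read = "ACGGTTTACCA"
right_read = "GTTACCAGGT"

raw_overlap = perfect_overlap_length(left_read, right_read)
compressed_overlap = perfect_overlap_length(compress(left_read), compress(right_read))
print(raw_overlap, compressed_overlap)
```

In this toy case the raw reads share no exact overlap, while the compressed reads do, which is the effect the slide attributes to run length encoding.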

Applying HiCanu to three human HiFi datasets (CHM13, HG00733, and NA12878) resulted in the fewest errors against the reference compared to Peregrine assemblies with HiFi, Canu assemblies with ONT, and 10x Supernova assemblies. Importantly, segmental duplications, which are repeat regions in the human genome of particular biological interest, were resolved at BAC resolution using 30-fold coverage of 20 kb HiFi reads. Sergey reported that >90% of the segmental duplications are resolved and suspects the remaining ones might actually be errors in the BACs.

Resolving human segmental duplication at BAC resolution using 30-fold 20 kb HiFi reads. Slide from Sergey’s PAG 2020 talk.

In addition, without polishing, HiCanu achieved >Q55 (3 errors in 1 million bases) accuracy for the haploid sample, CHM13. The higher accuracy also allowed better haplotyping in diploid samples, reducing phase switches from 0.15% using FALCON-Unzip with CLR reads to 0.03% using HiCanu with HiFi reads.
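The "Q55 is about 3 errors per million bases" figure follows from the standard Phred definition, Q = -10 · log10(p), where p is the per-base error probability. A short Python check (function names are illustrative):

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(p)


errors_per_million = phred_to_error_rate(55) * 1_000_000
print(f"Q55 ~ {errors_per_million:.2f} errors per million bases")
```

By the same formula, Q30 corresponds to 1 error in 1,000 bases and Q50 to 1 error in 100,000 bases.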

Number of remaining collapsed bases from HiCanu assembly. Slide from Sergey’s PAG 2020 talk.

While centromeres remain a challenge, Sergey reports that near-optimal repeat resolution for human-level heterozygosity is now achievable using HiFi reads. Sergey also mentioned that HiCanu works on metagenomic data, too! Applied to a HiFi sheep rumen metagenomics dataset, HiCanu generated 126 complete genomes (compared to 61 for Canu 1.9 and 55 for Peregrine), albeit with a 15% contamination rate that the authors are working to reduce in upcoming releases.

As of this writing, HiCanu is available at the tip of the GitHub development branch (https://github.com/marbl/canu/issues/1601).


Written by Liz T