Towards a Truly “Personalized” Genome Part 2:

De novo assembly and haplotype phasing of diploid human genomes using long High-fidelity reads and non-trio phasing approaches

DNAnexus Research Lab

Contributors: Arkarachai Fungtammasan, Jason Chin

The world of genome assembly has moved into a new frontier since our last blog post. Two recent publications in Nature Biotechnology (“Chromosome-scale, haplotype-resolved assembly of human genomes” and “Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads”) demonstrate how to assemble and fully phase human genomes. In this blog post, we discuss some of the progress the community has made in recent months along these lines. Specifically, we provide high-level intuition about why read properties (length and accuracy) and non-trio-based phasing methods matter for genome assembly, and how they can ultimately lead to high-resolution human genomes for precision medicine. We discuss the methods used in both Garg et al (2020) and Porubsky, Ebert et al (2020), but emphasize the background related to Garg et al since we were involved in that research.

High Fidelity Long Reads and Repeats

High-accuracy, medium-length reads known as Circular Consensus Sequencing (CCS) reads were introduced at around the same time as Continuous Long Reads (CLR), when Pacific Biosciences debuted the first single-molecule DNA sequencer in 2011. However, the first CCS reads were not very long and their utility was limited. As a result, CLR data became the major workhorse for genome assembly. CLR-based assembly works by building a consensus across reads from different molecules through the hierarchical genome assembly process (e.g., Falcon, Canu). Only recently has PacBio begun generating much longer CCS reads (around 15 kbp), which are now being widely adopted for genome assembly under new branding: High Fidelity long reads (HiFi). The initial promise of HiFi reads for genome assembly is an order of magnitude lower compute cost, because there is less need for error correction before assembly and for polishing the draft genome after assembly. These two tasks account for roughly 80–90% of the total compute requirement.

In our previous blog post, we mentioned that read length is crucial for resolving repeats in genome assemblies, because long reads can anchor or bridge a repeat using its non-repetitive flanking regions. However, repeats that are not 100% identical can also be assembled correctly if the reads are long and accurate enough to span multiple markers (sites that differ between repeat copies) (Figure 1). Depending on the repeat structure, some regions can be resolved by noisy ultra-long reads and others by long, highly accurate reads. The two technologies complement each other well for creating more complete genome assemblies.

Figure 1: The interplay of read length and accuracy to resolve repeats. (Source: Adam Phillippy, Plant and Animal Genomics conference, 2020)

In the figure above, the arrows R1, R2, and R3 denote high-similarity repeats in the genome, and the vertical marks denote marker mutations found in some, but not all, of the three repeats. In this example, a marker at the same position on different repeats is the same mutation. The colored horizontal lines are non-repetitive flanking regions. The vertical bars in the raw reads denote a mixture of true marker mutations and sequencing errors. Ultra-long Nanopore reads can resolve the repeats by anchoring to the flanking regions. HiFi data can also resolve the repeats if the read length spans the marker mutations. Illumina and CLR data, however, cannot fully resolve these three repeats: Illumina reads are too short to span multiple marker mutations, and CLR reads are too noisy to confidently detect them.
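
To make the intuition concrete, here is a minimal Python sketch of the idea, not the logic of any real assembler. The marker positions, the 0/1 encoding (1 means the marker mutation is present), and the function name `compatible_copies` are all made up for illustration. A read is compared with each repeat copy at every marker it spans, and the copies that agree everywhere are reported.

```python
from typing import Dict, List

# Hypothetical marker pattern for three near-identical repeat copies.
# Keys are offsets within the repeat; 1 = marker mutation present, 0 = absent.
REPEAT_MARKERS: Dict[str, Dict[int, int]] = {
    "R1": {1000: 1, 6000: 0, 11000: 1},
    "R2": {1000: 1, 6000: 1, 11000: 0},
    "R3": {1000: 0, 6000: 1, 11000: 1},
}


def compatible_copies(read_start: int, read_end: int,
                      observed: Dict[int, int]) -> List[str]:
    """Return the repeat copies that agree with the read at every spanned marker.

    `observed` maps marker offsets to the state the read shows there.
    A read that spans no marker is compatible with every copy.
    """
    hits = []
    for name, markers in REPEAT_MARKERS.items():
        spanned = [pos for pos in markers if read_start <= pos < read_end]
        if all(observed.get(pos) == markers[pos] for pos in spanned):
            hits.append(name)
    return hits


# A short read spanning a single marker is ambiguous between two copies.
print(compatible_copies(900, 1150, {1000: 1}))            # ['R1', 'R2']

# A long, accurate read spanning two markers picks out one copy.
print(compatible_copies(0, 8000, {1000: 1, 6000: 0}))     # ['R1']

# A long read with one marker flipped by a sequencing error matches no copy.
print(compatible_copies(0, 8000, {1000: 0, 6000: 0}))     # []
```

Real assemblers work on overlaps between reads rather than on a known marker table, so this only illustrates why both span and accuracy matter.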

The higher accuracy of HiFi data enables algorithmic decisions in new assemblers like Peregrine, Hifiasm, HiCanu, and IPA that are not feasible with high-error reads. These algorithms further lower computing costs, support greater continuity of the assembled genome, and improve the ability to phase large regions of low heterozygosity.

Non-trio-based Phasing

Diploid organisms like humans inherit one set of chromosomes from each parent. Phasing determines which variants lie on the same physical chromosome, and hence come from the same parent. Using parental information to bin the reads prior to assembly (trio-binning) is convenient and can indicate which haplotype came from the father and which came from the mother. The challenge, however, is that parental information is not always available and may not be accessible in future clinical applications.
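
As a rough illustration of trio-binning, the Python sketch below assigns a read to a parent by counting k-mers that occur in only one parent. The k-mer approach is an assumption of this sketch (one common way to use parental data), and the toy sequences, the value of `k`, and the function names are invented for the example.

```python
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_read(read: str, paternal_only: Set[str],
             maternal_only: Set[str], k: int) -> str:
    """Assign a read to the parent whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    pat = len(read_kmers & paternal_only)
    mat = len(read_kmers & maternal_only)
    if pat > mat:
        return "paternal"
    if mat > pat:
        return "maternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy parental sequences that differ at a single base (hypothetical data).
K = 5
father = "ACGTTGCAAGGCT"
mother = "ACGTTGCATGGCT"
paternal_only = kmers(father, K) - kmers(mother, K)
maternal_only = kmers(mother, K) - kmers(father, K)

print(bin_read("TGCAAGG", paternal_only, maternal_only, K))  # paternal
print(bin_read("TGCATGG", paternal_only, maternal_only, K))  # maternal
print(bin_read("ACGTTGC", paternal_only, maternal_only, K))  # unassigned
```

Reads from regions where the parents are identical stay unassigned, which is why such reads need separate handling in practice.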

An early breakthrough in algorithms for non-trio phasing is the Falcon-Unzip method, which profiles local variants and then aligns long reads to those variants to phase them. The Falcon-Phase methodology adopts the same framework and expands the phasing block size by using Hi-C data to sort local phasing blocks from the same parent into the same group. These tools were designed to work with PacBio CLR data, and both assembly and initial phasing are handled by the same tool. In addition, newer tools such as IPA, Hifiasm, and HiCanu jointly perform phasing and assembly using HiFi data. We will reserve our discussion of these methods for another blog post.

Recent research on phased genome assemblies

Figure 2: Recent publications by Garg et al (2020) and Porubsky, Ebert et al (2020) describing research on phased genome assemblies

Two recent publications (Figure 2; Garg et al (2020) and Porubsky, Ebert et al (2020)) describe new methods for creating high-continuity, high-accuracy, and fully phased human genome assemblies. The methods described in both papers share one interesting algorithmic decision, a procedure that we refer to as assemble-phase-assemble. First, an unphased genome assembly is created, so no reference genome is required. Second, reads are mapped to this unphased (“squashed”) assembly and binned by haplotype. In both publications, the authors reported that they could also bin the reads by chromosome and phase haplotypes along entire chromosomes. Both are major advantages over previous methods such as Falcon-Unzip. Finally, an assembly is created from each group of binned reads. The two main benefits of this paradigm are:

1) The phasing is performed on the de novo assembled genome rather than a typical reference-guided assembly.

2) The phasing algorithm is independent of the assembly algorithm. This lets the user choose which algorithms they want to use. They could even pick different algorithms for unphased and phased assemblies.

Decoupling these two steps also simplifies the job of algorithm developers. There are, of course, risks to decoupling these steps, as incorrect binning could lead to a misassembled genome in the final step (see the discussion by Haoyu Cheng and Heng Li), but the benchmarks reported in these two publications show that phasing with this approach is highly accurate.
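
The read-binning step in the middle of assemble-phase-assemble can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in either paper: the phased variants, the read representation (position to observed allele), and the names `assign_haplotype` and `bin_reads` are hypothetical.

```python
from typing import Dict, List, Tuple

# Phased heterozygous sites on the squashed assembly:
# position -> (allele on haplotype 1, allele on haplotype 2).
PhasedVariants = Dict[int, Tuple[str, str]]


def assign_haplotype(read_alleles: Dict[int, str],
                     phased: PhasedVariants) -> int:
    """Return 1 or 2 for the better-supported haplotype, 0 if tied or uninformative."""
    votes = [0, 0]
    for pos, allele in read_alleles.items():
        if pos not in phased:
            continue  # the read covers a site that was not phased
        hap1, hap2 = phased[pos]
        if allele == hap1:
            votes[0] += 1
        elif allele == hap2:
            votes[1] += 1
    if votes[0] > votes[1]:
        return 1
    if votes[1] > votes[0]:
        return 2
    return 0


def bin_reads(reads: Dict[str, Dict[int, str]],
              phased: PhasedVariants) -> Dict[int, List[str]]:
    """Group read names by assigned haplotype (0 = unassigned)."""
    bins: Dict[int, List[str]] = {0: [], 1: [], 2: []}
    for name, alleles in reads.items():
        bins[assign_haplotype(alleles, phased)].append(name)
    return bins


# Hypothetical example data.
phased = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}
reads = {
    "read_a": {100: "A", 250: "C"},   # agrees with haplotype 1 twice
    "read_b": {250: "T", 400: "A"},   # agrees with haplotype 2 twice
    "read_c": {500: "C"},             # covers no phased site
    "read_d": {100: "A", 250: "T"},   # one vote each: tie
}
print(bin_reads(reads, phased))
# {0: ['read_c', 'read_d'], 1: ['read_a'], 2: ['read_b']}
```

Each haplotype bin would then be passed to the assembler of choice. How to treat the unassigned reads is a design decision that this sketch leaves open.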

In Garg et al (2020), the authors used Hi-C data to scaffold the assembly into 23 or 24 chromosomes. This way, the reads mapped to each chromosome can be processed independently. Phasing is then done with WhatsHap, combining local phasing signals with long-distance phasing information from Hi-C and HapCUT2.

The use of Hi-C data is an interesting choice. The data is not too complicated to generate and can be used for both scaffolding and phasing. Scaffolding the unphased assembly is also important with this data type, since the final assembly is performed on each cluster of reads. Without this step, a cluster would not represent a whole chromosome, so phasing would not cover whole chromosomes unless a reference genome were used to determine which clusters belong to the same chromosome. It is also important to note that the heterogametic sex chromosomes are not properly phased at this moment, but this is something that can be solved in future studies.
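
The way long-distance Hi-C signal can connect local phase blocks can be pictured with a toy majority vote. This is only an intuition aid and is not how HapCUT2 works internally. The input format (one pair of haplotype labels per Hi-C read pair that touches both blocks) and the name `orient_block` are assumptions of this sketch.

```python
from typing import List, Tuple

# One entry per Hi-C read pair linking two phase blocks:
# (haplotype label seen in block 1, haplotype label seen in block 2),
# where labels 0 and 1 are arbitrary within each block.
HiCLink = Tuple[int, int]


def orient_block(links: List[HiCLink]) -> str:
    """Decide whether block 2's labels should be kept or swapped to match block 1.

    Hi-C contacts are mostly within the same chromosome copy, so if most
    links pair equal labels the blocks already agree ("keep"); if most
    pair unequal labels, block 2 should be swapped ("flip").
    """
    cis = sum(1 for a, b in links if a == b)
    trans = len(links) - cis
    if cis > trans:
        return "keep"
    if trans > cis:
        return "flip"
    return "undetermined"


print(orient_block([(0, 0), (1, 1), (0, 1), (1, 1)]))  # keep
print(orient_block([(0, 1), (1, 0), (0, 1)]))          # flip
print(orient_block([]))                                # undetermined
```

A production tool would weigh each link by its likelihood and account for noise, rather than counting votes.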

In Porubsky, Ebert et al (2020), the segregation of reads into haplotypes and the phasing are enabled by a special sequencing protocol called Strand-seq. This method tracks the template strand after mitotic cell division by using a thymidine analog. Using the patterns of inherited strands across multiple single-cell libraries, a separate tool, SaaRclust, bins the contigs into chromosomes. Then a combination of WhatsHap and information from Strand-seq is used to phase the reads. The major challenge of this method is the difficulty of generating Strand-seq data, which is not yet commercially available. However, since phasing is guided by Strand-seq, the initial assembly contigs do not have to be ordered into complete chromosomes. Furthermore, the method has been shown to work for all types of long-read sequencing data (Oxford Nanopore, PacBio CLR, and PacBio HiFi).
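
The idea of binning contigs into chromosomes from inherited strand patterns can be sketched in a few lines of Python. This toy version groups contigs whose strand states match exactly across cells. It is not SaaRclust's algorithm, and the strand-state encoding (WW, CC, WC per cell), the example profiles, and the function names are assumptions made for illustration. A contig assembled in the reverse orientation shows the mirrored pattern (WW and CC swapped), so each profile is reduced to a canonical form first.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Profile = Tuple[str, ...]  # one strand state per single cell

# Strand state seen if the contig is in the opposite orientation.
MIRROR = {"WW": "CC", "CC": "WW", "WC": "WC"}


def canonical(profile: Profile) -> Profile:
    """Return an orientation-independent form of a strand-state profile."""
    mirrored = tuple(MIRROR[state] for state in profile)
    return min(profile, mirrored)


def cluster_contigs(profiles: Dict[str, Profile]) -> List[List[str]]:
    """Group contigs that share a strand-state pattern across all cells."""
    groups: Dict[Profile, List[str]] = defaultdict(list)
    for contig, profile in profiles.items():
        groups[canonical(profile)].append(contig)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical strand states for four contigs in four single cells.
profiles = {
    "ctg1": ("WW", "WC", "CC", "WC"),
    "ctg2": ("CC", "WC", "WW", "WC"),  # same pattern as ctg1, reversed contig
    "ctg3": ("WC", "WW", "WC", "CC"),
    "ctg4": ("WC", "WW", "WC", "CC"),
}
print(cluster_contigs(profiles))  # [['ctg1', 'ctg2'], ['ctg3', 'ctg4']]
```

Real Strand-seq data is sparse and noisy, so exact matching would be replaced by a probabilistic clustering over many more cells.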

Figure 3: A screenshot from Garg et al 2020 showing assembled HLA and KIR3DL3 genes

The Impact

High-quality phased genome assemblies allow us to study complex human genes at unprecedented resolution. Figure 3 above, from Garg et al (2020), shows fully phased HLA (the first trigger for innate and adaptive immune response) and KIR3DL3 (an important receptor of natural killer cells) genes. These genes are not only complicated to assemble, but also very diverse. Thus, phasing is a crucial tool for properly characterizing them for functional studies or clinical diagnosis.

Both Garg et al (2020) and Porubsky, Ebert et al (2020) are important landmarks in deciphering human genome diversity. In an ideal world, we would be able to assemble and phase every base on all chromosomes in the human genome. Currently, there are initiatives working on making this vision a reality. The Telomere-to-Telomere (T2T) consortium, for example, is assembling the haploid CHM13 genome from telomere to telomere semi-manually, which could lead to a more complete reference genome. As part of this work, they are developing new automated methods for assembling genomes. We are optimistic that a telomere-to-telomere phased and assembled diploid genome will be coming soon. We think it will have an impact equal to or greater than that of the current genome sequencing model, which only identifies small variants against a common reference.

Assembling a genome the right way is not yet simple or economical. But phasing gives us much more resolution for complex genes than short-read technology offers. Easier data generation, active algorithm development, and the benefits of knowing the sequences of these complex and diverse genes will help accelerate new paradigms in genomics and biomedicine.

Consortia like the Human Genome Structural Variation Consortium and the Human Pangenome Reference Consortium are major driving forces in this regard, but there are ample opportunities for research and development. Some questions still need to be answered. What kinds of data and algorithms are needed to create personalized assembled genomes in the future? When might these become a reality? Could we understand the implications of all variants and make this information available to people through smartphones? We invite you to join our community and help us make this vision happen.

We would like to thank Tobias Marschall and Heng Li for their comments and suggestions to improve this blog post. Thanks to Adam Phillippy for Figure 1.

--

Arkarachai (Chai) Fungtammasan, PhD
DNAnexus Science Frontiers

Genomics & Bioinformatics Researcher at a Silicon Valley startup. Special interest in emerging NGS technology, ML, and large-scale data processing