Towards a Truly “Personalized” Genome:

Part 1: De novo assembly and haplotype phasing diploid of human genome using noisy ultra-long read and trio-binning approach

Arkarachai (Chai) Fungtammasan, PhD

Published in

DNAnexus Science Frontiers

9 min readJun 10, 2020

Contributors: Arkarachai Fungtammasan, Junzhou Wang, Chiao-Feng Lin, Naina Thangaraj, Jason Chin

We plan to have a series of scientific blog posts about the personalized genome. This concept has been around for decades, so why do we want to discuss this again? As opposed to personalized genomes based on whole-genome shotgun sequencing and variant calling using a reference genome, we will focus on de novo assembly of an individual human genome which just became economically feasible recently. This approach is more suitable to discover large and novel structural variants or avoid bias from the reference genome.

The challenge of variant calling from the assembled genome is the heterozygous regions, which are usually collapsed during the assembly and obscures the variant calls. Falcon-unzip (https://www.nature.com/articles/nmeth.4035) is one of the first tools that has been developed to report the primary contig and haplotype contig assemblies on high heterozygous regions. The variants on haplotype contigs are also phased, determining alleles assortment of multiple heterozygous loci on the same haplotype contig. Besides calling variants, genomic phasing is also critical to study allele-specific regulation, compound heterozygous mutations, or population genomics. For these reasons, the de novo assembly and phasing have become increasingly important in the study of personalized genomics and uncover human large scale structural variation.

In this series of blog posts, we will explore several sequencing technologies and assembly & haplotype phasing methods that we have experimented with.

Trio-binning using ultra-long reads

In this first blog post, we are interested in the ultra-long Oxford Nanopore Technology (ONT) data. These are ONT data that were generated using special sequencing protocol [1,2] to increase the read length. In general, the longer read length is helpful in spanning across repetitive regions of the genome which would increase the assembly continuity. For this particular work, we are interested in using ultra-long reads to increase phasable read percentage with a trio-binning approach. The trio-binning approach for long read phasing and genome assembly was first described in 2018 [3] to use parental specific k-mer to separate long reads data into paternal, maternal, and unknown prior to the assembly. This method has been widely used for de novo genome assemblies of many vertebrate species using either PacBio Continuous Long Read (CLR) [4] or Circular Consensus Sequencing (CCS) mode [5]. In the human genome, the heterozygosity is low, so the percentage of reads that overlap with parental specific k-mer and can be classified as paternal and maternal reads are about 60–70% [3,5].

Our main interest is to increase phasable read percentage by using ultra long reads which would increase the probability of matching with parental specific k-mer. In particular, could we create fully phased assembled genomes if the reads are ultra-long?

Figure 1: The trio-binning and assembly of ultra-long ONT reads

We gathered ultra-long ONT data from UCSC ( https://s3-us-west-2.amazonaws.com/human-pangenomics/agbt2019/fastq/GM24385.fastq) and Genome In A Bottle consortium (GIAB) [6] of HG002 individual and filtered to retain only reads that are at least 40 kbp (Figure 1). Using the parental specific17-mer to classify data, we could phase almost 100% of the reads (See Supplemental Note).

We assembled paternal and maternal reads into two 3 Gbp genomes using WTDBG2 with NG50 at 6.08 Mbp for paternal assembly and 8.69 Mbp for maternal assembly. As a quick check for phasing accuracy, we use the informative SNPs from our earlier work [7] to determine if the phased SNPs are accurate (Figure 2). We found that the phasing accuracies of long assembled contigs are slightly higher than the trio-binning assembled genome using shorter but higher accuracy PacBio CCS.

Figure 2: Phasing accuracy inferring from ratio of parsimoniously assigned SNVs using trio information. Blue plus symbol is this study, while red cross symbol is Wenger 2019 study

Polishing Benchmarking

We further improved the quality of assembled genomes by aligning paternal and maternal ultra-long ONT reads back to assembled genomes and polishing with various algorithms. Our polishing strategies include WTDBG2-POA v.2.4, Medaka v.0.7.1, MarginPolish+HELEN v.1.1.0, and combinations of these three tools on both assembled paternal and maternal genomes. We evaluated these polishing results in terms of continuity (NG50 and number of contigs), completeness (complete Vertebrate BUSCO percentage), and accuracy (Qualimap2 mappability and error rate).

Overall, the WTDBG2-POA followed by the HELEN has the highest quality across all three categories (Figure 3 and Table 1). Other interesting observation includes:

All the polished genomes are smaller than the original assemblies with the median of 2.8 Gbp (Table 1).
The WTDBG2-POA tool does not affect the total number of contig, while the rest alters the contig number (Table 1).
Both Medaka and HELEN can significantly increase the BUSCO score from 62–63% to 82–83% in one round of polishing (Figure 3B)
The Medaka tool significantly reduces contig continuity (Figure 3A). The quality of assembled genome also gets worse when combined with WTDBG2-POA either before or after Medaka polishing (Figure 3C)

A

B

C

Figure 3 Benchmarking polishing tools in terms of continuity, completeness, and accuracy. A) The assembly continuity measured by NG50 B) The assembly completeness measured by complete BUSCO of Vertebrate C) The assembly accuracy measured by mappability and error rate of parental specific trio-binned CCS reads. The higher mappability and lower error rate are considered higher accuracy.

We also inspected the WTDBG2-POA + HELEN using the K-mer Analysis Toolkit using both trio-binned CCS reads for paternal and maternal polished assemblies and unpartitioned HG002 Illumina reads (Figure 4). In both cases we found a large proportion of k-mer are not included in the final assemblies. Although our initial assemblies look promising, and the polished assemblies have reasonable BUSCO scores, we need more work to create higher quality genomes.

Figure 4: KAT plot of maternal haplotype assemblies and both assembles against partitioned maternal CCS and unpartitioned Illumina reads respectively.

Caveats

Our work here demonstrates an interesting opportunity to get a personalized genome utilizing the latest sequencing technology. However, as these technologies and approaches are relatively new, we think it will help the community to look at our current approach critically. These haplotype phasing and assembled genomes look promising, but still far from high quality assembled genomes. What could we improve to create higher quality assembled genomes?

Binning

Although we were able to partition almost all the ultra-long reads and the long contigs in assembled genomes have high phasing accuracy, this does not guarantee that the phased reads are correctly partitioned. In the human genome, there are long stretches of homozygous regions over 50 kbp or even 500 kbp. Reads that came from those regions should be classified into neither paternal nor maternal reads. Theoretically, the ultra-long reads we used in this study should increase the phasable reads percentage, but a certain percentage of reads should remain in unknown due to the homozygous region in the human genome. The bigger concern is how many reads were classified into the incorrect parents of origin. We suspect that the sequencing error is one of the main reasons for misclassification, and we would need a better algorithm to accurately classify the noisy data.

Assembly

We chose WTDBG2 in this study because it was an economical choice and popular assembler when this project started. However, other assemblers including Flye, Shasta and a newer version of Canu are now available. It would be interesting to test if we can improve the assemblies further.

Polishing

The main concern about the polishing results is whether the reads after partitioning have sufficient sequencing depth to be optimally utilized by the polishing algorithm. Usually, the recommended sequencing depth is 50X for noisy reads. In this experiment, we have around 30X sequencing depth for each haploid after binning. It may be interesting to start with higher sequencing depth data and downsampling to test performances of each polishing tool with different sequencing depth.

These issues would need further investigation in the up-coming blog post which we will deep dive into some of those issues and explore other sequencing technologies and algorithms that might provide higher quality assembled genomes. Stay tuned!

Final Note About the Project

This project is part of the science frontiers program at DNAnexus which we allocate 10% of time for our scientists and engineers to collaborate in various scientific and technological fields outside the primary responsibility. These blog posts are personal opinions and experiments. The DNAnexus company disclaims responsibility for any analyses, interpretations, or conclusions.

Supplemental Note

Binning and Assembly

We retrieved the ultra-long ONT data from UCSC [8] and Genome In A Bottle consortium (GIAB) [6] for a HG002 individual on 21 May 2019 and filtered to retain only reads that are at least 40 kbp. Assuming a human estimated genome size of 3 Gbp, we have 67x sequencing depth with N50 at 85 kbp after filtering by length. The trio-binning was performed using the example code from https://github.com/skoren/triobinningScripts. The 51x and 49x sequencing depth Illumina data of father and mother of HG002 were retrieved from GIAB repository [6]. The 17-mer sequence for these two Illumina datasets were quantified using meryl. The two sets of 17-mer were subtracted from each other and then the 17-mers that were present less than 5 times were removed. This leaves 9.8M and 12.7M paternal and maternal specific 17-mer respectively. Each ONT read was evaluated for the ratio of matched k-mer over the overall parental specific and classified based on majority voted into paternal, maternal, and unknown reads. Using the described parameters, we were able to phase almost all reads with 56.65% paternal reads, 43.34% maternal reads, and <0.01% unknown reads. There are 6 unknown reads in total and they all consist of more than 40kbp near perfect repeat of either (A)n or (GT)n repeat. Overall, we concluded that the phasable read percentage is very high at almost 100%

Qualimap2 analysis

Note that rather than the Illumina reads, the mapped reads for Qualimap2 analysis came from the paternal and maternal PacBio CCS reads for HG002, which are generated from the same protocol as described above except that the 18-mer were used. The long CCS reads have higher accuracy than the noisy CLR or ONT reads and are also long enough to be segregated using 18-mer from trio-binning. We classified reads into 40.44% paternal reads, 40.89% maternal reads, and 18.66% unknown reads.

Table 1 Assembly contig length statistics from different polished results. The order of tools in each experiment reflects the order that the polishing tools were used. The NG50 for WTDBG2-POA followed by Medaka could not be estimated because the assembled genome size is smaller than half of the estimated genome size.