An exploration of machine-learning based variant callers for SNV and small-indel using PacBio HiFi data

Yih-Chii Hwang, Ph.D.

Published in

DNAnexus Science Frontiers

9 min readDec 16, 2021

Prepared by: Yih-Chii Hwang, PhD, Arkarachai Fungtammasan, PhD, and Jason Chin, PhD

Machine-learning based variant callers for SNV and small-indel using PacBio HiFi data

Abstract

In this blog, we highlight two solutions, DeepVariant by Google and DNAscope by Sentieon®, for detecting small variants using PacBio® HiFi reads. Both algorithms applied machine learning methods for genotyping based on mapped sequencing reads. We report our experience with both solutions and their performance. We also share our thoughts about long-read technology for variant calling and how machine learning helps advance the genomics field.

Genotyping using short-read and long-read

To date, many clinical genetic tests [1] are performed with high-throughput short-read sequencing (150–300 bp) such as targeted sequencing, whole-exome sequencing (WES) and whole-genome sequencing (WGS) for detecting single nucleotide variants (SNVs) and short insertions/deletions (INDELs). The bioinformatics analysis for calling these genetic variants has played a major role as the standard practice in genotyping individuals. In addition, many variant calling tools have been applied for population-scale studies. For example, TOPMed calls variants of WGS data for ~158,000 samples using GATK HaplotypeCaller [2]; UK Biobank WES consortium selected DeepVariant [3] for variant detection of WES data among ~500,000 participants; and All of Us chose DRAGEN-GATK [4] for genotyping of up to 1,000,000+ individuals with WGS data sets. Numerous studies (including those from us) have compared different variant calling tools for their relative performance. Many of these tools achieved high-quality results — with an average ~99% for precision and recall, suggesting both the short-read sequencing data and bioinformatics software are reliable.

In the last decade, long-read sequencing data (10,000–100,000 bp) offers a great scaffold for de novo assembly, haplotype phasing, structural variant detection, RNA isoform detection and metagenomics [5]. However, it is not typically used for detecting small variants as short-read next generation sequencing, due to its high sequencing errors and higher cost of sequencing. Recently (in 2019), PacBio® introduced HiFi sequencing (with CCS Phred quality score > 20), where they improved the base calling accuracy (>99.9%) by combining multiple consecutive observations (subreads) of a DNA molecule [6, 7] and making it possible for high accuracy genetic variant calling [8, 9]. Last year (2020), in the latest precisionFDA truth challenge V2, PacBio® HiFi data surpassed other sequencing technologies in detecting small variants in terms of precision and recall. Only the combinations of reads from multiple sequencing technologies outperformed PacBio® HiFi.

The two winners of calling variants using PacBio® HiFi data

DeepVariant and DNAscope both won the precisionFDA challenges under the PacBio® HiFi data category. One year after the challenge concluded, we got access to these tools and data to evaluate their performance metrics. We are sharing our first-hand user experience in this blog.

DeepVariant

DeepVariant is an open-sourced project developed and maintained by the Genomics team at Google Health. It was originally developed for short-read variant calling in 2018 [10] using deep convolutional neural networks. It was later modified and released for detecting variants on PacBio® HiFi data in 2019 [11]. DeepVariant produces pileup image tensors from aligned reads, classifies each tensor using a convolutional neural network, and reports the detected variants.

DNAscope

DNAscope is a commercial software developed and maintained by Sentieon®, which focuses on highly optimized algorithms for bioinformatics applications. DNAscope was introduced after Sentieon’s DNAseq® solution (matching GATK) with an improved local assembler. It also applies machine-learning based ensemble methods for genotyping based on the INFO annotations and the genomics context (Don Freed, personal communication, Oct 15, 2021). Originally developed for short-read sequencing, DNAscope was adapted for PacBio® HiFi data for the precisionFDA challenge.

Our hands-on experience with the solutions

HiFi sequencing data from The Genome in a Bottle Consortium (GIAB)

We downloaded raw PacBio® HiFi reads of three samples, HG003, HG006, and HG007 from the NIST GIAB FTP site. We skipped the other four GIAB samples, HG001, HG002, HG004, and HG005, as they were used as training and evaluation sets for DeepVariant and/or DNAscope. Each sample has an average coverage at 41.6X, 39.2X, and 36.4X; and read length N50 at 14,977bp, 15,689bp, and 17,112bp, respectively. All the sample’s HiFi reads were generated by “Circular Consensus Sequencing” analysis in SMRT Link v8.0 (ccs 4.0.0), using the Sequel II System with 2.0 chemistry.

Pipelines for short variant calling

We aligned raw HiFi reads to GRCh38 (GRCh38 primary contigs + decoy contigs, but no ALT contigs nor HLA genes) using pbmm2 (version 1.7.0) (with parameters -c 0 -y --preset HIFI), and collected mapped and sorted reads in BAM format.

For generating variant call sets using DeepVariant, we followed the tutorial and implemented the optional 2 passes of DeepVariant (version 1.2.0) with phasing and haplotagging using whatshap (version 1.1) [12]. In parallel, we used DNAscope (version 202010.04) to perform an independent variant calling. DNAscope’s variant calling pipeline also consists of a phasing step and a second pass. These are all wrapped by Sentieon® and offered to run as a single command, dnascope_HiFi.sh. We then evaluated the precision (ratio of true positive calls among all called variants) and recall (ratio of true positive calls among the truth set) using Haplotype Comparison Tools (hap.py) against the GIAB benchmark set. We implemented each software as an DNAnexus applet and the full analysis as a workflow on the DNAnexus Platform (Figure 1.).

Figure 1. Small variant calling analysis workflow for PacBio® HiFi reads. Every box contains the software used for read alignment, variant calling (two passes), phasing, and variant benchmarking. We performed two pipelines, DeepVariant and DNAscope (dnascope_HiFi.sh) in this project.

Precision and recall

We downloaded benchmarked variants in the high-confidence regions from NISTv4.2.1 [13] as “truth sets” for all three samples. Each of the three samples has 3,269,859 to 3,327,494 SNPs, and 428,045 to 504,356 INDELs. For this report, we used the “PASS” variants filtered by each variant caller’s default setting.

Figure 2. Performance metrics for small variant calling from PacBio® HiFi reads by DeepVariant (pass1 and pass2) and DNAscope (with two passes). Top left: false negative rate (FNR) for SNPs; bottom left: false discovery rate (FDR) for SNPs; Top right: false negative rate (FNR) for INDELs; bottom right: false discovery rate (FDR) for INDELs. Smaller values of FNR and FDR (x-axes) demonstrate higher performance.

Both solutions perform well with precision, recall, and F-1 score between 0.9991 and 0.9994 for SNPs, and 0.9927 and 0.9972 for INDELs. Suggesting that machine learning based methods are providing great accuracy for calling variants (Figure 2.).

We also found out that haplotype-phasing helps performance in both precision and recall for INDELs remarkably. Between DeepVariant pass 1 and pass 2, we discovered FNR is reduced 46.9% — 59.7%, and FDR is reduced 42.8% — 57.2%. While less significant, phasing also increases the performance in SNPs, with an exception that HG007 has a higher FDR after phasing. This is expected, as phasing, in short, assigns variants discovered along the same read into the same haplotype, which comprises the variants inherited together from the same parent. Compared to short reads, each long read contains more consecutive variants for better phasing. As each candidate variant is assigned to a certain haplotype, it helps reduce any ambiguity of being mixed with the other haplotype, specifically for calling heterozygous INDELs sitting along repetitive and homopolymorphic regions.

Turnaround time

We are interested in achieving the most cost effective experience and to mimic a real use case for calling variants in a production setting. For each analysis, we chose an instance type that fulfills each software’s minimum hardware requirements that is available on the DNAnexus Platform.

We chose m5d.8xlarge for DeepVariant (with an exception of requiring to use a larger instance, m5d.12xlarge, for HG007, as the DeepVariant went out of memory), and c5d.9xlarge for whatshap. For DNAscope, c5d.9xlarge works well on all samples. For getting the most efficient runtime, we exhausted the instance of choice by using all threads available for all software that supports multi-threading (i.e. DeepVariant and DNAscope). We orchestrated the parallelization by chromosomes of whatshap as it doesn’t support multi-threading (i.e., whatshap).

We collected the end-to-end runtime of the variant calling pipelines by summing up the runtime of each pipeline, starting from mapped reads to variants being called. Each step’s runtime is calculated by subtracting the starting time of the analysis from the stop time (Figure 3.).

Figure 3. Wall-clock runtime for calling variants using mapped PacBio® HiFi reads. *: the sample (HG007) was processed by DeepVariant with a larger memory capacity instance type that comes with more threads than the other two samples (HG003 and HG006) due to memory requirements.

User experiences

We had a smooth experience setting up and running both pipelines. The documentation for the software was extremely helpful. DNAscope provided a more streamlined experience, as Sentieon® enables a single-step script (dnascope_HiFi.sh) as a single command. The command takes mapped reads (in BAM/CRAM format) as input, launches pass-1 of variant calling and phasing, as well as pass-2 of variant calling. We learned that the DeepVariant team is working on eliminating the need for users to run two passes (Andrew Carroll, personal communication, Oct 14, 2021) and we are looking forward to a more straight-forward experience and a faster turnaround time.

Summary

Long-read might become a new standard for genotyping

We found that using exclusively PacBio® HiFi reads we can detect SNPs and INDELs to extreme high accuracy (Figure 2.) which can be competitive with short-read technologies (data not shown). This suggests HiFi reads are error-less data with extremely low error rates (i.e. high base calling quality), in addition to having a significant advantage across some repetitive regions which are difficult to map with high-quality short reads. While we were working on this blog, Kishwar S et al. published an haplotype-aware variant calling pipeline that is based on a recurrent neural network for filtering candidate SNPs, and used DeepVariant for identifying variants. The pipeline has great accuracy calling variants using PacBio® HiFi data and runs faster than the DeepVariant pipeline. It is also shown the ability to achieve great performance with SNPs (F1-scores between 0.9969 and 0.9977) and INDELs (F1-scores at 0.7257 and 0.7128), by processing Oxford Nanopore Technologies’s long reads (100kb+ bp) sequencing data. Long-read sequencing technologies along with their associated post processing variant calling pipelines are showing breakthroughs in accuracy for short variant detections.

Cost, however, can be a key factor if the major purpose is to detect small variants using PacBio® HiFi data ($2K-3K per Sequel II SMRT Cell 8M with 30-hour movie on 2021/11/29 according to UW PacBio Sequencing Services). That said, long reads can support more information such as phasing, better mappability, structural variant detection, and genome assembly, which can not be easily done by short-read data. We look forward to future cost reductions as the technology becomes more broadly applied, and we are optimistic that more long-read sequencing data will be applied for genotyping in clinical practices.

Machine learning provided a fast “starting-point” for new sequencing data

While both DeepVariant and DNAscope models were first established for processing mapped reads from short-read sequencing platforms, with the knowledge gained and reusing the frameworks, they were retrained to models with great performance for long-read sequencing data. These models, which were trained previously for existing sequencing data, can be considered as “starting-points’’ for training new models when new sequencing technologies are introduced. This can cut down the effort and speed up the time required to develop a variant calling solution from scratch. We foresee more and more machine learning-based methods facilitating genotyping with new types of data.

Insights from the machine-learning models can help us build optimized methods

We recognized that DNAscope performs much faster than DeepVariant (Figure 3). This is a result from many possible factors, such as I/O, the pre- and post- processing operations, software engineering, pipeline implementation, hardware choice, machine learning models, etc. One hypothesis is that boosting-based machine learning methods utilize less parameters that are more efficient for the task than deep- learning neural networks initially designed for classifying images. We are interested in reviewing more insights discovered from the models. By understanding “how they work”, these models can help humans determine which features and parameters are essential and how we can utilize them efficiently. From these insights, we look forward to seeing more simple and concise solutions developed with an optimized set of features along with (human) explainable methods.

Challenging Medically-Relevant Autosomal Genes

In precision medicine, one of the applications utilizing sequencing technology is to detect clinically actionable genetic variants, which help guide preventive interventions and clinical decision making. GIAB has recently characterized HG002 benchmark variant sets that focus on medically relevant genes which have been challenging to identify due to their repetitiveness or polymorphic complexity and therefore low mappability [14]. We are interested in evaluating how long-read data along with promising variant calling tools help close this gap, when the benchmark set expands to other GIAB samples.

Acknowledgements

We thank GIAB for hosting and maintaining the refined benchmarked set of variants for the GIAB samples. It is only possible to train machine learning based models with truth sets being established. We thank Andrew Carroll from the Google Genomics team for collaborating with us on usage of DeepVariant. We also thank Rafael Aldana and Don Freed at Sentieon® for collaborating with us on usage of DNAscope and providing us early access to the DNAscope pipeline. All analyses were performed on the DNAnexus Platform, and are sponsored by DNAnexus.

All registered trademarks belong to their respective owners. Sentieon and DNAseq are trademarks of Sentieon, Inc. PacBio is a trademark of Pacific Biosciences of California, Inc.