Benchmarking state-of-the-art secondary variant calling pipelines
Writer’s note: Most of the work presented here occurred in early 2020. Since then, there have been developments in the community such as the GIAB v4.2 release. We mention some of these in the text, and plan to publish a follow-up report with research updates and fresh data points.
We have collaborated extensively with researchers around the world who analyze genomics and high-throughput sequencing data, and one question scientists often ask us is how well variant analysis pipelines perform. We want to share our experience with these tools with the community. In this report, we explore the capabilities of state-of-the-art germline variant calling pipelines for SNPs and small INDELs, looking specifically at their accuracy and runtimes under default testing conditions.
Genetic variant baseline for seven genomes by the Genome in a Bottle (GIAB) Consortium
Building a genetic variant “truth set” has been challenging. This is mostly due to the lack of orthogonal evidence, but also because reads from current sequencing instruments still have unavoidable error rates ranging from 0.1% to 1%. In 2015, the Genome in a Bottle Consortium (GIAB), led by NIST, released high-confidence SNPs, INDELs, and homozygous reference regions that were generated with various sequencing technologies. GIAB refined genotype calls for each genome, and minimized bias toward any particular sequencer by aggregating calls from multiple technologies¹. They sequenced DNA from seven individuals: the Coriell cell line GM12878 (HG001)², the Ashkenazi Jewish trio (HG002, HG003, and HG004), and the Han Chinese trio (HG005, HG006, and HG007).
For the analysis reported here, we used the high-confidence regions from NISTv3.3.2 as variant “truth sets” for all seven samples. Each whole exome sequencing (WES) sample has 31,343 to 42,449 SNPs, and 2,208 to 4,843 INDELs overlapping with the exome captured regions. Each whole genome sequencing (WGS) sample has 2,882,677 to 3,209,316 SNPs, and 390,158 to 499,697 INDELs.
We downloaded raw HG001 WES reads with ~141X on-target coverage from the NIST FTP site. Raw WES reads for HG002, HG003, HG004, and HG005 were downloaded from the SRA study SRP047086, where each sample has around 200X on-target coverage. All seven WGS samples were prepared with Illumina TruSeq (LT) DNA PCR-Free Sample Prep Kits. We downloaded the data from NIST’s FTP site and downsampled each sample to 35X coverage from its original 100X–300X. All samples were sequenced on an Illumina HiSeq 2500 as paired-end reads.
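Downsampling to a target depth amounts to keeping each read pair with probability target-over-original coverage. A minimal sketch of the idea in Python (the actual downsampling tool used is not specified in this report; the function below is illustrative only, and keeping whole pairs rather than single mates preserves the paired-end structure):

```python
import random

def downsample_pairs(pairs, original_cov, target_cov, seed=42):
    """Bernoulli-subsample read pairs so the expected depth drops to target_cov.

    pairs: list of (read1, read2) tuples; original_cov/target_cov in X coverage.
    """
    frac = target_cov / original_cov  # e.g. 35X from 100X -> keep ~35% of pairs
    rng = random.Random(seed)         # fixed seed for reproducibility
    return [p for p in pairs if rng.random() < frac]

# Illustrative input: 100,000 read pairs from a 100X sample, downsampled to ~35X
pairs = [(f"r{i}/1", f"r{i}/2") for i in range(100_000)]
kept = downsample_pairs(pairs, original_cov=100, target_cov=35)
```

In practice the same idea is applied streamwise over FASTQ or BAM records rather than an in-memory list.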
State-of-the-art germline variant callers and integrations on DNAnexus
Table 1 summarizes the variant callers we evaluated, including the versions and instance types used and the algorithms implemented. For NVIDIA Clara Parabricks Pipelines (pbdeepvariant and pbgermline) and Sentieon (sentieon_fastq_to_vcf), we used off-the-shelf DNAnexus apps available in the tools library. DNAnexus apps are containers with their dependencies preinstalled that run on instances in the cloud environment. We packaged two additional apps, BWA-MEM + GATK and Strelka2, for the analysis. All apps were implemented to run on a single AWS instance.
We processed all five WES and seven WGS samples through each FASTQ-to-GVCF pipeline using default parameters. Each analysis included mapping, marking duplicates, and variant calling; base recalibration and default filtering were optional steps. We called variants against both the GRCh38 build (primary contigs plus decoy contigs, but no ALT contigs or HLA genes) and the hs37d5 build. For WES samples, we provided the capture interval files in BED format specifying target regions for hs37d5, and used liftOver to convert them to the hg38 build.
Accuracy: Genetic variant benchmark best practices
We followed the benchmark standards established by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team³,⁴. We used Haplotype Comparison Tools (hap.py) to measure recall (the fraction of the truth set recovered as true positive calls) and precision (the fraction of called variants that are true positives), comparing the called variants to GIAB’s high-confidence benchmark set. For this report, we used the PASS variants filtered by each variant caller’s default settings.
We discovered that the HG001 WES sample has notably worse accuracy metrics across all variant callers and both reference builds than the other WES samples (details discussed here). Suspecting problems with sample quality, we excluded it from the rest of the benchmark. For each of the other four WES samples and all seven WGS samples, we report the false negative rate (FNR, defined as 1 − Recall) and false discovery rate (FDR, defined as 1 − Precision) (Figures 1 and 2).
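The reported metrics all follow from the true positive, false negative, and false positive counts produced by a comparison such as hap.py. A minimal sketch (the counts below are illustrative, not results from this benchmark):

```python
def benchmark_metrics(tp, fn, fp):
    """GA4GH-style accuracy metrics from confusion counts.

    tp: truth variants matched by the query (true positives)
    fn: truth variants missed (false negatives)
    fp: query variants absent from the truth set (false positives)
    """
    recall = tp / (tp + fn)       # fraction of the truth set recovered
    precision = tp / (tp + fp)    # fraction of calls that are correct
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "FNR": 1 - recall,        # false negative rate
        "FDR": 1 - precision,     # false discovery rate
        "F1": f1,
    }

# Illustrative SNP counts for a hypothetical WGS sample
m = benchmark_metrics(tp=3_150_000, fn=9_000, fp=6_300)
```

hap.py reports these same quantities per variant type (SNP/INDEL) and per stratification region in its summary output.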
All six variant calling pipelines performed competitively (average recall and precision above ~99%) for calling SNPs and INDELs in both WES and WGS samples. While the results show that some variant callers provide slightly better accuracy than others, this may be due to limitations of our sample collection: all samples came from a single lab, one hybrid-capture protocol for WES, and one type of library prep on one version of a short-read platform. Filtering thresholds can also be fine-tuned, which shifts the balance between recall and precision. In our experience, researchers interested in a pipeline's performance on particular genome categories or loci weight those areas more heavily than genome-wide accuracy.
We observed significant differences in all metrics when we called the samples against GRCh38 vs. hs37d5. We expected this, since GRCh38 includes more sequences from repetitive regions and many patches that do not exist in hs37d5. But while GRCh38 is more complete, it can pose more challenges for variant callers in difficult regions of the genome⁵. For example, looking at the GA4GH/GIAB genome-stratification⁶ results from hap.py, we found a 3- to 4-fold enrichment of false negative and false positive calls in difficult-to-map regions in GRCh38 compared to hs37d5 in WES samples. We suspect the performance difference between the two builds reflects their different genome contexts. Newer versions of the truth set generated with linked- and long-read sequencing technologies, along with improved pipelines, could mitigate some of these differences (see PrecisionFDA Truth Challenge V2).
We calculated the turnaround time for each pipeline by subtracting the start time of the analysis from the stop time. This window covers input/output between an S3 bucket and an EC2 instance, environment setup, and software processing time. Figure 3 presents the F1 scores for SNPs and the turnaround time for each pipeline. Note that the parabricks_deepvariant and strelka2 pipelines start from intermediate mapped BAM input; we inferred parabricks_deepvariant's runtime from parabricks_germline's and omitted strelka2's runtime. We found that most pipelines have notably faster turnaround times than GATK.
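Concretely, the turnaround time is just the wall-clock difference between the job's stop and start timestamps, so it includes everything the job does on the instance, not only compute. A small sketch with hypothetical timestamps:

```python
from datetime import datetime

def turnaround_hours(start, stop, fmt="%Y-%m-%dT%H:%M:%S"):
    """Wall-clock turnaround in hours: stop minus start.

    The interval covers S3 input/output, environment setup,
    and software processing time on the instance.
    """
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600

# Hypothetical WGS job: started 08:00, finished 10:45 the same day
hours = turnaround_hours("2020-07-16T08:00:00", "2020-07-16T10:45:00")
```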
There are many factors that directly impact the accuracy of variant calls and reported turnaround times, and our benchmark only shows a limited set of scenarios. A few important ones to keep in mind are:
- Not all software runs on the same hardware. As shown in Table 1, we intentionally selected instance types with a similar range of hourly rates based on AWS's on-demand list prices as of July 16th, 2020. The pipelines use different hardware (CPU and GPU), and on-demand rates are set by the cloud market, which can be dynamic. If different instance types are used, the turnaround times will change. Pipelines may be faster or slower on an instance with more or fewer cores, though speed may not scale linearly with core count.
- Optimization is an ongoing process. Implementing efficient apps requires attention to detail. To get the fastest possible runtimes, we worked with our app vendors to optimize compute resource utilization (i.e., cores, memory, storage) on the instance. There will always be room for improvement, and we continue to work on developing the most efficient software implementations.
- Parameter combinations vary. For the software used in this report, we only show the performance obtained with the suggested default parameters. Actual runtimes can vary significantly depending on the use case. For example, we have seen reduced runtimes for most software when the output is just variant positions in a VCF, rather than a report of all sites in the genome in gVCF format.
We hope this report serves as a reference resource for what runtime and accuracy look like in typical use cases. As indicated in the text, different variant callers may be better suited for different domains and situations. If you have a particular use case that you would like to discuss, please reach out (via email@example.com). We would be happy to talk with you.
We will continue to work with variant caller vendors to provide updates and new versions of their software. The field is moving quickly: as we worked on this report, GIAB released a new version of the benchmark set for the Ashkenazim trio on the GRCh38 build (NISTv4.2), which addresses challenging variants, such as segmental duplications, by using short-, linked-, and long-read mapping-based variant calling and assemblies of the MHC region. We look forward to the release of pipeline improvements based on the results of the PrecisionFDA Truth Challenge V2, and to GIAB's v4 benchmark set covering all seven NIST samples.
We thank the NVIDIA Clara Parabricks team, Ankit Sethia and Seth Onken, and Sentieon's Rafael Aldana for their feedback and collaborative work, which allowed us to demonstrate and feature their tools and performance metrics.