Accelerating Germline and Somatic Genomic Analysis of Whole Genomes and Exomes with NVIDIA Clara Parabricks v.3.6

Johnny Israeli
Aug 24, 2021


By Eric Dawson and Johnny Israeli

As we enter the next decade of genomics, the field looks more exciting than ever. Whole-genome sequencing studies regularly scale into the tens of thousands to millions of genomes. Long sequencing reads are growing longer, more accurate, and more abundant in the community. AI-based variant calling programs are now state-of-the-art for short- and long-read sequencing platforms. To capitalize on this vast and complex data, bioinformaticians need accelerated software that doesn’t sacrifice accuracy, AI methods that scale, and quality control methods that work seamlessly with the genomics analyses of today and tomorrow.

This is the soul of NVIDIA Clara Parabricks version 3.6, our newest genomics analysis software release at NVIDIA. This release marks our most comprehensive yet, with the addition of new variant callers, annotation and filtering tools, and quality control reports that enable fast and easy germline and somatic analysis.

A comprehensive suite of variant callers for rapid whole genome germline and somatic analysis

Figure 1: NVIDIA Clara Parabricks v3.6 has added an accelerated version of variant caller LoFreq for inferring SNVs and indels from NGS data. This addition brings the total number of somatic callers to 4 and the total number of germline callers to 3. In addition, v3.6 has an easy-to-use vote-based VCF merging tool (VBVM), a database annotation tool (VCFANNO), a new tool for quickly filtering a VCF by allele frequency (FrequencyFiltration), and 2 tools for VCF quality control (VCFQC and VCFQCbyBAM).

Parabricks employs the same underlying algorithms as the community tools but runs significantly faster by leveraging the massively parallel processing power of NVIDIA GPUs. GATK’s HaplotypeCaller, for instance, has been the community’s gold standard for genomic analysis pipelines for over a decade, but it takes up to 30 hours to process one whole human genome. Parabricks has accelerated HaplotypeCaller, reducing the total runtime to approximately 22 minutes while generating the same variant calls (Fig. 1). We’ve given Google’s DeepVariant the same treatment: our implementation runs up to 15x faster. NVIDIA Clara Parabricks version 3.6 can run read alignment with an accelerated BWA-MEM, and variant calling with HaplotypeCaller, DeepVariant, and Strelka. On readily available hardware, this pipeline is 60x faster than the same pipeline built from the open-source versions alone, and the software performs just as well on a local cluster as in the cloud.

Figure 2: Runtimes for open-source DeepVariant (blue) and GPU-accelerated Clara Parabricks (green). Runtimes for 30X Illumina short read data are on the left; runtimes for PacBio 35X long read data are on the right. GPU-accelerated DeepVariant in Clara Parabricks is 10–15x faster than the open-source version (blue “DeepVariant” bars vs. green “DeepVariant” bars). A single Google Cloud virtual machine (N1 family) with 32 vCPUs, 120 GB of RAM, a “balanced” disk for data storage and 4 attached NVIDIA V100 GPUs was used when running Clara Parabricks. The same instance was used for benchmarking open-source DeepVariant but with only a single attached V100 to reduce costs, as a non-Parabricks DeepVariant instance cannot utilize more than a single GPU. Virtual machines such as the one used here are easy to request and fit well within most default quotas on GCP.

In addition to germline callers, Parabricks includes accelerated versions of Mutect2, SomaticSniper and LoFreq for somatic analysis that run roughly 40x, 16x and up to 10x faster, respectively. We also include a convenient interface for running the Strelka somatic workflow. This means the same multi-caller somatic pipeline that would take 75 hours on CPU takes less than 5 hours using Parabricks on GPUs.
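The arithmetic behind an end-to-end figure like this can be sketched in a few lines of Python. When callers run one after another, the overall speedup is the total CPU time divided by the sum of the accelerated times, so the least-accelerated tools weigh most heavily. The per-tool CPU hours below are hypothetical placeholders chosen only to add up to 75 hours; they are not measurements, and `pipeline_speedup` is an illustrative helper, not part of Parabricks.

```python
def pipeline_speedup(stages):
    """Overall speedup for tools run one after another.

    stages: list of (cpu_hours, speedup) pairs, one per tool.
    Returns (cpu_total_hours, accelerated_total_hours, overall_speedup).
    """
    cpu_total = sum(hours for hours, _ in stages)
    accelerated_total = sum(hours / speedup for hours, speedup in stages)
    return cpu_total, accelerated_total, cpu_total / accelerated_total


# End-to-end totals quoted in the text: 75 CPU hours vs. under 5 GPU hours.
minimum_overall = 75 / 5  # so the pipeline is at least 15x faster overall

# HYPOTHETICAL per-tool CPU hours (not measurements), paired with the
# per-tool speedups quoted in the text for Mutect2, SomaticSniper and LoFreq.
stages = [(50.0, 40.0), (15.0, 16.0), (10.0, 10.0)]
cpu_total, accelerated_total, overall = pipeline_speedup(stages)

print(f"quoted totals imply at least {minimum_overall:.0f}x end to end")
print(f"hypothetical split: {cpu_total:.0f} h -> {accelerated_total:.2f} h "
      f"({overall:.1f}x overall)")
```

Under these assumed numbers the tool with the smallest speedup accounts for about a third of the accelerated runtime even though it is a small share of the CPU runtime.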

Figure 3: Analysis runtimes for open-source CPU-based somatic variant calling tools versus Parabricks-accelerated versions. Relative to the community versions, Clara Parabricks accelerates LoFreq by 6x, SomaticSniper by 16x, and Mutect2 by 42x. These benchmarks were run on 50X WGS matched tumor-normal data from the SEQC-II benchmark set. Parabricks runs used 4 NVIDIA V100 GPUs.

For structural variant calling, Clara Parabricks already included Manta, and the newly released v3.6 adds smoove. Smoove simplifies calling and genotyping structural variants for short reads within an individual and across populations. Based on Lumpy, it improves specificity by removing alignment signals that are indicative of low-level noise and would otherwise contribute to spurious calls.

In addition to speeding up analysis of new samples, acceleration also makes it practical to reprocess archived data in the BAM format. When combined with the Parabricks BAM2FASTQ and FASTQ2BAM tools, a 30X genome aligned to hg19 can be realigned to GRCh38 and variants called in approximately two hours.

An integrated toolkit for post-calling VCF merging, annotation, filtering and quality control

No matter which variant callers you use, the calling process is just the first step in analyzing genomic variants. Inspired by amazing community tools, the NVIDIA team has built easy-to-use tools for annotating, filtering and performing quality control of VCF files post-calling. Efficient, informative quality control is essential to good science, especially when working with accelerated AI computing techniques that surpass the performance of traditional software.

Consensus calling, where a suite of variant callers is applied to an input dataset and the union or intersection of the resulting call files is used for analysis, is a common method for improving results over a single variant caller. Our vote-based VCF merger (VBVM) tool provides a simple-to-use interface for merging VCF files from multiple callers and filtering the results by the minimum number of callers supporting a specific variant call. The VBVM tool automatically copies over each variant caller's INFO fields with a suffix to prevent information loss during the merging step. We believe VBVM is the simplest tool available for merging multiple VCF files. Stay tuned, as we expect to add several interesting features in future versions of VBVM.
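To make the voting idea concrete, here is a minimal Python sketch of consensus calling over plain VCF text. It is not VBVM's implementation or interface: the function names, the choice to key variants on (chromosome, position, REF, ALT), and the toy records are all assumptions made for illustration, and the sketch ignores INFO fields entirely.

```python
from collections import defaultdict


def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Header lines are skipped and multi-allelic records are split per ALT.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def vote_merge(calls_by_caller, min_votes):
    """Keep variants reported by at least `min_votes` callers.

    Returns {variant_key: sorted list of supporting caller names}.
    """
    supporters = defaultdict(set)
    for caller, keys in calls_by_caller.items():
        for key in keys:
            supporters[key].add(caller)
    return {key: sorted(names) for key, names in supporters.items()
            if len(names) >= min_votes}


# Toy records (hypothetical data), one list of VCF lines per caller.
calls = {
    "mutect2": variant_keys(["#CHROM\tPOS\tID\tREF\tALT",
                             "chr1\t100\t.\tA\tT", "chr1\t200\t.\tG\tC"]),
    "strelka": variant_keys(["chr1\t100\t.\tA\tT", "chr2\t50\t.\tC\tG,T"]),
    "lofreq": variant_keys(["chr1\t100\t.\tA\tT", "chr1\t200\t.\tG\tC",
                            "chr2\t50\t.\tC\tT"]),
}

majority = vote_merge(calls, min_votes=2)   # supported by two or more callers
unanimous = vote_merge(calls, min_votes=3)  # intersection of all three
for key in sorted(majority):
    print(key, majority[key])
```

Setting `min_votes` to 1 gives the union of all call sets, and setting it to the number of callers gives their intersection, which are the two extremes described above.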

Databases like dbSNP, 1000 Genomes and ClinVar provide information that’s useful for filtering VCF variants and interpreting their novelty and significance. Our new VCFANNO tool wraps the open-source vcfanno variant annotation program, providing VCF annotation from COSMIC, ClinVar, dbSNP, 1000 Genomes and gnomAD with zero configuration required: just pass the VCF files on the command line. Clara Parabricks v3.6 also includes a new FrequencyFiltration tool for filtering on numeric fields such as allele count or allele frequency.
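As a rough illustration of what filtering on a numeric INFO field involves, the Python sketch below keeps records whose allele frequency is at or below a threshold. It is not the FrequencyFiltration tool or its interface: the function names, the `AF` key, the 0.01 cutoff, the toy records, and the decision to drop records that lack the field are all assumptions made for this example.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict; flag entries map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def filter_by_info(vcf_lines, key, max_value):
    """Yield header lines plus records whose INFO[key] values are all <= max_value.

    Records without a numeric value for `key` are dropped (an assumed policy).
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        info = parse_info(line.rstrip("\n").split("\t")[7])
        raw = info.get(key)
        if raw is None or raw is True:
            continue
        values = [float(v) for v in raw.split(",") if v != "."]
        if values and max(values) <= max_value:
            yield line


# Toy records (hypothetical data) with an AF annotation in the INFO column.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t50\tPASS\tDP=30;AF=0.001",
    "chr1\t200\t.\tG\tC\t60\tPASS\tDP=28;AF=0.25",
    "chr2\t50\t.\tC\tG,T\t40\tPASS\tAF=0.002,0.3",
    "chr3\t75\t.\tT\tA\t45\tPASS\tDP=12",
]

rare = list(filter_by_info(records, key="AF", max_value=0.01))
for line in rare:
    print(line)
```

For multi-allelic records the sketch requires every listed value to pass, which is the stricter of the two obvious choices; a real workflow might instead split alleles before filtering.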

Lastly, we’ve built two easy-to-use tools for generating quality control reports for individual VCF files. Our VCFQC tool builds reports from an input VCF file and supports plotting metrics such as AD, DP, and MAPQ when they’re present in VCF INFO fields. The second tool, VCFQCbyBAM, takes a VCF and matching BAM file and uses our accelerated pileup implementation to plot these metrics even when the input VCF lacks these fields. An overview of our full calling and post-calling tools using example data is available here as a GitHub Gist:

What’s next?

We are already hard at work on new features for the next release of Parabricks and expect to release it in early 2022. We’d love to collaborate with you and hear your feedback on what additional tools and advancements are needed for your studies. You can visit our GitHub Gist to see how we ran our experiments. To learn more about the v3.6 release of NVIDIA Clara Parabricks check out our developer site, and to learn more about GPU use cases in genomics check out the presentations from this year’s NVIDIA GTC Conference (registration is free):

About the Authors

Eric Dawson is a bioinformatics scientist developing new methods for faster, more accurate genome analysis at NVIDIA. Prior to joining NVIDIA, he completed his PhD at the University of Cambridge and the National Cancer Institute, where he contributed to open-source tools for cancer genome analysis, graph genomes, and structural variant calling.

Johnny Israeli is a manager of genomics and drug discovery software at NVIDIA. He received a Ph.D. from Stanford University, where his thesis focused on deep learning in genomics.
