Accessible and uniform benchmarking

Karl Sebby · Truwl · Oct 14, 2021

Do you remember GCAT? You know what I’m talking about, the Genome Comparison and Analytic Testing platform. If you remember, congratulations, you have earned your bioinformatics veteran badge. GCAT, first released in 2013, was a popular tool for benchmarking short-read genome mappers and variant callers. However, in a fast-growing field that keeps bringing in new practitioners, it is no longer as widely familiar as it once was. In a recent bioinformatics benchmarking survey we sent out, the first 11 respondents had no recollection of ever using it.

I happen to belong to the “newer practitioners” group and wasn’t familiar with the platform, but have now spent a bit of time studying what GCAT had to offer and what made it popular.

So what was GCAT exactly? GCAT was a platform to standardize genome analysis in a way that enabled comparisons between the performance of bioinformatics tools used for the identification of genetic variation.¹ In other words, it provided a standardized way to measure the performance of short-read mappers and variant callers. It worked like this: you downloaded a test data set, did your analysis locally, and then uploaded your results to the platform, which would create a report showing how well your alignments and variant calls performed with respect to a truth set and in comparison to other methods. Two major innovations provided by GCAT were the standardization of metrics and the open accessibility of the tool.

The GCAT process.

New bioinformatics tools are being made all the time, to the tune of nearly 2,500 in 2017.² That’s almost 7 new tools per day. Inevitably, many tools perform the same or similar functions. To justify the effort of creating a new tool, and to give others the information they need to choose the right tools for their projects, you need to show that the tool works better than others in some way. Maybe it runs faster. Maybe it uses less memory and can be used on your laptop. Maybe it gives more accurate results in all or certain situations. For variant calling, the primary metrics are derived from comparing the variants you detect to the “real” variants in a simulated or extensively characterized dataset, such as those provided by the Genome in a Bottle (GIAB) consortium. From that comparison you determine the variants you called that were correct (true positives), the variants you called that were wrong (false positives), and the variants that you missed (false negatives). You can then derive all sorts of statistics from these measures, segregated by genome region, variant type, and confidence level (more on that another time).
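To make those relationships concrete, here is a minimal sketch of how the headline metrics fall out of those three counts. This is not GCAT’s code, and the counts in the example are made up purely for illustration:

```python
# Minimal sketch: deriving the headline benchmarking metrics from the three
# counts described above. The example counts are invented for illustration.

def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of true variants that were called (sensitivity)."""
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical SNV callset compared against a GIAB-style truth set.
tp, fp, fn = 3_890_000, 12_000, 45_000
print(f"precision = {precision(tp, fp):.4f}")   # ~0.9969
print(f"recall    = {recall(tp, fn):.4f}")      # ~0.9886
print(f"F1        = {f1_score(tp, fp, fn):.4f}")  # ~0.9927
```

The same three counts, tallied separately per genome region or variant type, are what drive the stratified statistics mentioned above.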

The problem is that tool developers can introduce bias by picking certain metrics and datasets and tuning parameters to get good results. GCAT helped level the playing field by standardizing metrics and datasets used to evaluate tools so methods could be directly compared. According to GIAB consortium leader Justin Zook in a 2013 interview: “It’s the first tool that I’ve seen that allows you to compare lots of different methods in the same way, as opposed to everyone doing their own validation of their own methods, so it’s nice to have a centralized resource like this.”³

GCAT was freely available on a website called bioplanet.com. Providing uninhibited access helped make GCAT very popular. An early snapshot showed that GCAT had been viewed more than 90,000 times by visitors from 144 countries and had processed over 700 reports. This was all by design, as GCAT was meant to be a welcoming, community-driven platform where as many people as possible could connect and collaborate on methods development and standardization.

So what happened to GCAT? The site was freely supported for years, and the company behind it, Arpeggi, was acquired by consumer DNA testing company Gene by Gene, which at the time was expanding into the medical and research testing space. Benchmarking is a ubiquitous task that needs to be done by all clinical laboratories. To maintain certifications and compliance and to demonstrate continued proficiency, genomic testing laboratories need to show that they can reliably detect the variants their tests rely on. This would make the benchmarking capabilities developed by Arpeggi attractive to any testing lab that did not already have its own system in place. As Gene by Gene had different goals than Arpeggi, at some point it seems to have decided to stop supporting GCAT, and the Arpeggi team moved on to other things.

GCAT addressed problems faced by analysts, who have to assess a huge diversity of tools when planning genomic variant detection, and by methods developers, who need a fair assessment of their methods. Overcoming these obstacles is a prerequisite to developing bioinformatics workflows and genomic tests with high accuracy and robustness. With the explosion of bioinformatics methods to choose from and the increase in genomic assays for clinical applications, the need for a GCAT-like tool is greater now than it was then.

So what has happened in benchmarking since? For one thing, the GIAB consortium has made great strides in expanding and improving the datasets used as gold-standard “truth sets”. The other big development is the work done by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team to develop standards for how variants are described and compared. This work was described in “Best practices for benchmarking germline small-variant calls in human genomes” in 2019.⁴ Along with the publication, the team released a reference implementation of their recommendations and made it available as an app on https://precision.fda.gov/.

PrecisionFDA, an implementation of the DNAnexus platform hosted by the FDA, is best known for hosting challenges such as the Truth Challenges for benchmarking. Anybody can request access to the platform, and once your request is approved you can run compute jobs on what I understand is the FDA’s dime. The platform has a dedicated Comparisons section that offers two apps: app-pfda-comparator and GA4GH benchmarking. Both apps let you select a query dataset and a benchmark truth set, but the GA4GH benchmarking app gives you more options. The comparison apps produce several files, the most important of which is an HTML report that plots precision vs. recall at different quality-control thresholds and tabulates some of the metrics that come out of the GA4GH comparator program hap.py.
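For a sense of what comes out of such a run, here is a small sketch of pulling the headline numbers from a hap.py summary file. The file name is made up, and the column names (Type, Filter, METRIC.Precision, METRIC.Recall) are based on the summary.csv layout hap.py typically writes, so treat them as assumptions and check them against your version:

```python
# Sketch: extract headline precision/recall per variant type from a hap.py
# summary CSV. File name and column names are assumptions; verify against
# the output of your hap.py version.
import csv

def headline_metrics(summary_csv: str) -> dict[str, tuple[float, float]]:
    """Return {variant_type: (precision, recall)} for PASS calls."""
    results = {}
    with open(summary_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["Filter"] == "PASS":  # skip the unfiltered "ALL" rows
                results[row["Type"]] = (
                    float(row["METRIC.Precision"]),
                    float(row["METRIC.Recall"]),
                )
    return results

if __name__ == "__main__":
    for vtype, (prec, rec) in headline_metrics("benchmark.summary.csv").items():
        print(f"{vtype}: precision={prec:.4f} recall={rec:.4f}")
```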

While the current capabilities for benchmarking are great to have, we think the community continues to be underserved given how widespread the need for benchmarking is, its range of use cases, and its importance. According to Justin Zook:

“While GIAB and GA4GH have published robust benchmarks and best practices for benchmarking small variants,⁴-⁵ few users have the time and expertise to take full advantage of these methods. In particular, the benchmarking tools can be challenging to run, and they can output 10’s of thousands of performance metrics when stratifying by variant type and genome context. This can give a granular view of where a method performs well or poorly, but better reports and visualization tools are needed to make this accessible to most users. Beyond germline small variants, there are needs for better benchmarks and benchmarking tools for more challenging variants, including complex structural variants, copy number variants, tandem repeats, and somatic variants. Unlike small variants, larger and complex variants have few standards for representation or tools for comparing different representations, which will be needed to enable robust benchmarking for the whole genome. In an ideal world, users could upload their variant callset for a benchmark sample and get a report that gives a summary of the most important performance metrics, including variant types and genome regions where the method has limitations, and how the metrics compare to other methods.”

At Truwl, we’re turning some of our efforts towards these important benchmarking methods as part of our mission to make bioinformatics capabilities as broadly accessible as possible and to work towards Justin Zook’s “ideal world”. We’re taking some inspiration from GCAT, as our values align with the openness and accessibility of that platform, but there have been a lot of great technological advances since then that can help make the experience better than ever. We’ll be working to make the input editor interface as intuitive as possible, improving and adding visualizations, simplifying comparisons across many benchmarking jobs, and adding capabilities that simplify selecting and optimizing workflows and monitoring their performance over time in production environments.

Truwl’s simplified benchmarking input editor

We’re working with some of our early access users now on the first iteration of our variant calling benchmarking workflow, interface, and comparison table and will enable general availability soon. We’re excited to get these capabilities into the community’s hands and start getting feedback on how to make it fit end users’ needs and best serve the community going forward.

Truwl comparison table for viewing metrics across runs.
UpSet plots comparing a query VCF to ‘competitor’ VCFs for three genomic regions.

While we’re starting with mapping and variant calling, we don’t plan on stopping there. As the number and types of genomic assays grow and become integrated into clinical tests, so does the need to standardize and productionize benchmarking methods. For a brief overview, take a look at this great repo from Jared Andrews: https://github.com/j-andrews7/awesome-bioinformatics-benchmarks. And if you’re interested in benchmarking and standards, you shouldn’t miss the recent issue of Nature Biotechnology (https://www.nature.com/nbt/volumes/39/issues/9) that includes a suite of papers from SEQC (https://sites.google.com/view/seqc2/home) and the Association of Biomolecular Resource Facilities (https://www.abrf.org/).

And by the way, after our survey had been available for a bit longer, the bioinformatics veterans familiar with GCAT did start to make a solid showing.

**Special thank you to David Mittelman for historical insights on GCAT. David was co-founder and Chief Science Advisor for Arpeggi and is now founder and CEO of Othram Inc., which is solving cold cases using advanced DNA sequencing technology. I highly encourage you to follow the fantastic work that Othram is doing at https://dnasolves.com/ and https://substack.com/profile/1965923-david-mittelman

** Thank you to Justin Zook for providing a quote for this post.

References:

1. Highnam, G. et al. An analytical framework for optimizing variant discovery from personal genomes. Nat. Commun. 6, 6275 (2015).

2. Clément, L. et al. A data-supported history of bioinformatics tools. arXiv:1807.06808 [cs] (2018).

3. Arpeggi Adds Genome in a Bottle Consortium Data to GCAT. Pubs — Bio-IT World https://www.bio-itworld.com/news/2013/07/15/arpeggi-adds-genome-in-a-bottle-consortium-data-to-gcat.

4. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).

5. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. bioRxiv 2020.07.24.212712 (2020) doi:10.1101/2020.07.24.212712.
