A brief overview of benchmarking in genomics

Karl Sebby
truwl
Nov 12, 2021

There's always a steady stream of genomics benchmarking papers coming out, but recently it's been like drinking out of a firehose. If you like benchmarking, though, it's a firehose filled with yummy goodness. For a few high-profile examples, see the September issue of Nature Biotechnology, which featured results from the MAQC-IV/SEQC2 consortium and the Association of Biomolecular Resource Facilities (ABRF), and this paper in Nature Methods¹ showing the performance of the PEPPER-Margin-DeepVariant pipeline with nanopore reads. So what are these benchmarking papers all about, why are there so many, and why do people care?

Benchmarking in genomics

Benchmarking an experiment is analogous to taking an exam in school and getting a grade: you make a best attempt at finding the right answers and then compare your answers to a grading sheet to see how well you did. In genomics, you perform an experiment on a reference sample of 'known' composition and compare how close you come to measuring and calculating the known values. The most well-known use case is germline variant calling to detect small variants (SNPs and indels) with reference samples and datasets such as those from the Genome in a Bottle (GIAB) consortium.

As with all tests with multiple parts, a single score can't tell the whole story, so there is a whole host of statistics that can be generated to give more detailed insight. Taking the school exam example further, a student might do really well on a particular type of question but poorly on others. And what about partial credit? If a math exam involves solving for the roots of quadratic equations (two answers for each question), it's possible to get one of the two roots correct for every question on the exam: a score of 50% if partial credit is given. How should that be differentiated from solving half of the equations completely correctly and not providing an answer at all for the other half, another way a score of 50% could be assigned? These two test takers will obviously need to make different changes to their strategies to perform better next time; my guess would be that the first is making some systematic error and the second is working so slowly that they run out of time before finishing. This case is actually pertinent for human genomics. Since the genome is diploid, there are two correct values (if only considering point mutations) for each position, so you can get the right answer on one allele and the wrong one on the other, or get them both correct or incorrect. Multiply that by a whole bunch of positions, throw in more complex mutation types, and things get complicated pretty quickly.
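
To make the partial-credit idea concrete for diploid genotypes, here is a minimal Python sketch. The positions and alleles are made up and the scoring scheme is deliberately simplified; it is not the logic of any real comparison tool such as hap.py. It just distinguishes full genotype matches, single-allele matches, outright mismatches, and no-calls.

```python
from collections import Counter

def score_genotypes(truth, query):
    """Tally full-credit, partial-credit, and missed calls for diploid genotypes.

    truth and query map position -> a tuple of two alleles, e.g. {100: ("A", "G")}.
    Positions missing from query are treated as no-calls. This is a toy
    illustration, not a substitute for dedicated comparison tools.
    """
    tally = Counter()
    for pos, true_gt in truth.items():
        called_gt = query.get(pos)
        if called_gt is None:
            tally["no_call"] += 1                      # no answer at all
        elif sorted(called_gt) == sorted(true_gt):
            tally["genotype_match"] += 1               # both alleles correct
        else:
            # Partial credit: count alleles shared between the call and the truth.
            shared = sum((Counter(called_gt) & Counter(true_gt)).values())
            tally["allele_match" if shared else "mismatch"] += 1
    return tally

truth = {100: ("A", "G"), 200: ("C", "C"), 300: ("T", "A")}
query = {100: ("A", "A"), 200: ("C", "C")}             # position 300 is a no-call
print(score_genotypes(truth, query))
# Counter({'allele_match': 1, 'genotype_match': 1, 'no_call': 1})
```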

Seeing how things measure up

Of course, having an answer sheet gives you the opportunity to experiment with your strategy and find one that helps you get the right answers consistently, much like a practice test. Do well on the practice tests, and you can be pretty sure that you'll do okay on the real test. This is exactly what is done with genomics experiments.

The genomics experimental workflow can be broken into two main categories: data generation and computational analysis (bioinformatics), and strategies in both areas can be adjusted in nearly limitless ways to tune performance. For the data generation step, there are tons of options for preparing the sequencing library, including how nucleic acids are extracted and amplified, and then additional options for how the measurements are made, including which instrument is used. The bioinformatics that follows sequencing can involve dozens of different programs, each with its own set of adjustable parameters. Because library preparation methods, instrumentation, and bioinformatics methods are constantly being updated, and completely new technologies keep being introduced, there is a constant need for benchmarking studies to see how new and updated methods improve or worsen the performance of measurements in different situations. So when something new comes off the sequencing technology assembly line, you use benchmarking to see how things measure up. There is an additional complication, though: the answers (the reference set) aren't always right.
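
To illustrate what the computational half of this tuning can look like, here is a hedged Python sketch of a benchmark loop over pipeline configurations. The parameter names and the run_pipeline and compare_to_truth helpers are hypothetical placeholders, standing in for a real alignment and variant-calling workflow and a comparison step (for example, one that wraps a tool like hap.py).

```python
import itertools

# Hypothetical knobs for the bioinformatics half of the workflow.
PARAM_GRID = {
    "mapper":        ["bwa-mem", "minimap2"],
    "min_base_qual": [10, 20],
    "caller":        ["deepvariant", "gatk-haplotypecaller"],
}

def run_pipeline(reads, config):
    """Placeholder: a real version would align reads and call variants per config."""
    return []                                            # pretend list of variant calls

def compare_to_truth(calls, truth_vcf, confident_bed):
    """Placeholder: a real version would run a comparison tool and return metrics."""
    return {"precision": 0.0, "recall": 0.0, "f1": 0.0}  # dummy metrics

def benchmark(reads, truth_vcf, confident_bed):
    """Score every configuration against the truth set and rank by F1."""
    results = []
    keys = sorted(PARAM_GRID)
    for values in itertools.product(*(PARAM_GRID[k] for k in keys)):
        config = dict(zip(keys, values))
        calls = run_pipeline(reads, config)
        metrics = compare_to_truth(calls, truth_vcf, confident_bed)
        results.append((config, metrics))
    return sorted(results, key=lambda r: r[1]["f1"], reverse=True)

best_config, best_metrics = benchmark("reads.fastq", "truth.vcf.gz", "confident.bed")[0]
print(best_config, best_metrics)
```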

Reference materials

Accurate benchmarking depends on having an accurate 'ground truth' to compare to. Enter reference materials. To benchmark an experiment from start to finish, reference materials are a must. One set of reference materials from GIAB is a son, father, and mother trio of Ashkenazim Jewish ancestry known as samples HG002, HG003, and HG004. These cell lines have been measured with nearly every measurement technique in the sequencing, array, and genome mapping arsenal. In 2016, the GIAB consortium described the characterization of these and other cell lines using 12 different technologies across a dozen or so labs.² The measurement of these and other samples is ongoing and the benchmark datasets continue to improve, but they are still not perfect. Detecting some types of variants is still difficult, and some regions of the genome, such as highly repetitive regions and GC-rich regions, are particularly problematic. To mitigate this, genomic regions are defined where there is more or less confidence in the correctness of the benchmark sets. Benchmarking within high-confidence regions is likely to yield better performance metrics than benchmarking within lower-confidence regions.
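
In practice this means comparisons are usually restricted to the high-confidence regions distributed alongside the truth sets as BED files. A minimal sketch of that filtering step, using toy intervals and positions rather than real files, and assuming half-open BED-style coordinates:

```python
from bisect import bisect_right
from collections import defaultdict

def build_regions(bed_intervals):
    """Group half-open (chrom, start, end) intervals by chromosome, sorted by start."""
    regions = defaultdict(list)
    for chrom, start, end in bed_intervals:
        regions[chrom].append((start, end))
    for chrom in regions:
        regions[chrom].sort()
    return regions

def in_high_confidence(regions, chrom, pos):
    """Return True if 0-based position pos falls inside a confident interval."""
    intervals = regions.get(chrom, [])
    starts = [s for s, _ in intervals]
    i = bisect_right(starts, pos) - 1      # last interval starting at or before pos
    return i >= 0 and pos < intervals[i][1]

confident = build_regions([("chr1", 1000, 5000), ("chr1", 9000, 12000)])
calls = [("chr1", 1500), ("chr1", 7000), ("chr1", 9500)]
kept = [c for c in calls if in_high_confidence(confident, *c)]
print(kept)   # [('chr1', 1500), ('chr1', 9500)]
```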

There are different reference materials for different types of benchmarks. A timely editorial³ describes how the SEQC2 project is organized into six themes that are important for clinical applications: genome sequencing (germline variant calling), cancer genomics (somatic variant calling), circulating tumor DNA, targeted RNA sequencing, DNA methylation, and single-cell sequencing. Each of these requires its own set of standards. While characterization of the GIAB samples for germline variant calling has been under development for some time, other standards are less mature.

Recently, the GIAB samples were thoroughly probed for 5-methylcytosine modifications to expand their use as reference materials for DNA methylation assays.⁴ For cancer genomics, including circulating tumor DNA, where important mutations can be present at a wide range of variant allele frequencies, a mixture of 10 cancer cell lines was recently developed and proposed as a standard.⁵ Single-cell experiments aim to detect all the different cell types in a sample and require the ability to dependably group individual cells that are similar to each other (mixability) and separate cells that are different from each other (clusterability) based on their transcription profiles. This requires samples with known cell types that are both similar to and different from each other. A few reference samples have been used recently, including mixtures of a B-lymphocyte cell line and a breast cancer cell line from the same donor,⁶ a mixture of 5 cell types from human, mouse, and dog,⁷ and mixtures of either cells or RNA from 5 different cancer cell lines.⁸
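
For the single-cell case, clusterability against a designed mixture is often quantified by comparing a pipeline's cluster assignments to the known cell-type labels with a metric such as the adjusted Rand index. A minimal sketch using scikit-learn, with toy labels rather than data from the cited studies:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Known cell-type labels from a designed mixture (toy example).
truth_labels   = ["lineA", "lineA", "lineB", "lineB", "lineC", "lineC"]
# Cluster assignments produced by some single-cell pipeline.
cluster_labels = [0, 0, 1, 1, 1, 2]

# ARI is 1.0 for a perfect grouping and near 0 for a random one.
print("ARI:", adjusted_rand_score(truth_labels, cluster_labels))
print("NMI:", normalized_mutual_info_score(truth_labels, cluster_labels))
```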

The whole point

So why is this all so important? There are multiple purposes, such as validating new technology development as mentioned earlier, but the main push is to ensure that genomic tests can be reliably used for clinical applications. Analytical validation is a prerequisite for widespread adoption of any clinical test, and genomics is foundational to the promised land of personalized medicine, where diagnoses and treatments are tailored to each patient in routine care and in clinical trials. But genomic tests have unique challenges compared to clinical tests that measure a few well-understood values, like cholesterol levels and metabolic panels, or even genetic tests that look for one to a few known mutations that are common in certain populations and conditions. Instead of measuring a few values, a sequencing experiment can measure millions or billions of positions throughout the genome, each with different confidence levels and clinical utility. If you want to use this information to inform clinical decisions, it is of paramount importance to know what you can detect confidently and reproducibly, and what you cannot.

Regulatory and accreditation organizations are particularly interested in the evaluation of analytical performance, as they require this information to assess the usefulness of a test. It is no surprise that organizations like the FDA and the College of American Pathologists (CAP) lead or are intimately involved in these kinds of studies. Diagnostic companies are also deeply involved. Without being able to show that their tests are useful and perform with known characteristics, they can't market their products, or worse, they risk reporting wrong results back to patients and care providers. They also use reference materials and benchmark datasets to optimize their test performance. For variant detection, there is a balance between detecting as many variants as you can (sensitivity) and not getting greedy and ending up with too many false positives; reference materials and benchmarking studies are necessary to optimize and validate testing workflows.
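
That balance between sensitivity and false positives is usually summarized with precision, recall, and F1, computed from the true-positive, false-positive, and false-negative counts that a benchmarking comparison produces. A quick sketch with made-up counts to show the trade-off:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions used when comparing variant calls to a truth set."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # fraction of calls that are real
    recall    = tp / (tp + fn) if tp + fn else 0.0   # fraction of real variants found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A permissive caller finds more variants but makes more false calls...
print(precision_recall_f1(tp=4900, fp=400, fn=100))   # high recall, lower precision
# ...while a strict caller misses more true variants but calls fewer false ones.
print(precision_recall_f1(tp=4500, fp=50, fn=500))    # high precision, lower recall
```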

What still needs to be done?

There's always more to be done, and there are a few big ideas to highlight. The first is that more reference materials need to be established and continuously improved. The number of types of genomics assays and applications continues to grow, and well-characterized, widely available reference materials are needed to facilitate their development and improvement. Another issue that is near and dear to our hearts is that the computational methods required to do benchmarking studies also need to be improved and made accessible to researchers. Publications can provide good guidance and snapshots of the state of the art, but methods need to be evaluated in the context of the situation where they will be used, and researchers should be able to benchmark methods quickly and confidently.

1. Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 1–11 (2021) doi:10.1038/s41592-021-01299-w.

2. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).

3. Mercer, T. R., Xu, J., Mason, C. E., Tong, W., & on behalf of the MAQC/SEQC2 Consortium. The Sequencing Quality Control 2 study: establishing community standards for sequencing in precision medicine. Genome Biol. 22, 306 (2021).

4. Foox, J. et al. The SEQC2 Epigenomics Quality Control (EpiQC) Study: Comprehensive Characterization of Epigenetic Methods, Reproducibility, and Quantification. bioRxiv 2020.12.14.421529 (2021) doi:10.1101/2020.12.14.421529.

5. Jones, W. et al. A verified genomic reference sample for assessing performance of cancer panels detecting small variants of low allele frequency. Genome Biol. 22, 111 (2021).

6. Chen, W. et al. A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. Nat. Biotechnol. 39, 1103–1114 (2021).

7. Mereu, E. et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 38, 747–755 (2020).

8. Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
