Demystifying benchmarking: A guide to germline variant calling metrics

Karl Sebby
truwl
Dec 13, 2021

We recently developed and published a germline variant calling pipeline on Truwl that follows the GA4GH best practices using hap.py. This workflow outputs many metrics that can be examined in a table, and we provide a glossary that describes what all these metrics are. However, that doesn't make it abundantly clear what all of these ACTUALLY mean. I initially struggled to wrap my head around some of them, so we've put together this guide to clarify the metrics and help you interpret their meaning. Most of the content here was derived from the hap.py user manual and the Best practices for benchmarking germline small-variant calls in human genomes paper from the GA4GH benchmarking team. There are a few things to understand before diving straight into explanations of the metrics.

Glossary of benchmarking metrics.

Query and truth sets

Benchmarking depends on comparisons, so there need to be at least two things to compare. These two things are the query set and the truth set, which are both variant call sets from the same reference sample. The truth set can also be called the "reference callset" or "high confidence set". I'll stick to truth set here because the metric names use that term, even if "truth" is a little misleading. The truth set is an agreed-upon ground truth, presumed to be correct, of where and how the reference sample varies from a reference genome such as GRCh37 or GRCh38 (watch out for the overloading of the term 'reference' here). Generating a high quality truth set can involve multiple labs measuring the same reference sample with multiple technologies.

The benchmarking workflow. A truth set is developed through many measurements of a sample to determine high-confidence ground truth variant calls. Methods can be evaluated by generating a set of variant calls (query set) and comparing against the truth set. The query set can be generated from a reference sample or a set of sequencing reads used to create the truth set.

The query set is the variant call set determined from the reference sample with (usually) less exhaustive measurement than the truth set. The query set is then compared to the truth set to see how close it is to the 'right' answers in the truth set. To assess the analytical performance of an entire sequencing workflow, the query set is generated by starting with the reference sample (cell line or sequencing library), doing the sequencing experiment, and performing the computational analysis to create the variant call file. To determine the performance characteristics of just the computational steps, sequencing read files from multiple technologies can be downloaded and the variant call file generated from that starting point.

Regions

The genome is big. And diverse. It also contains plenty of regions that are not so diverse. Either way, determining variants in different parts of the genome varies in difficulty, and the confidence that the truth set is correct also varies by region. For example, the truth sets provided by the Genome in a Bottle Consortium define "high confidence regions" as regions where the concordance between two methods is 99.7%. Metrics can be separated by regions to give researchers the ability to focus on the regions of the genome that they care about. For an oncogene panel, you might not care about how well you can call variants in regulatory regions. Or you might only care about how well your assay performs in the region for the major histocompatibility complex (MHC) to the exclusion of all else.

Positives and negatives

Positives and negatives are the base metrics from which the other metrics are derived. A true positive is a variant detected in the query set that is in the truth set; a false positive is a variant that was called in the query set but is not in the truth set; and a false negative is a variant that is in the truth set but not in the query set (the variant was missed). Completing all combinations of true/false with positive/negative would include true negative, which isn't practical or useful here. Imagine dealing with all the non-existent 'variants' that you didn't detect because they weren't there.
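To make these definitions concrete, here's a minimal sketch in Python that counts TPs, FPs, and FNs by treating variants as exact (chrom, pos, ref, alt) keys. The naive exact matching is my simplification for illustration only; hap.py does haplotype-aware comparison that tolerates different representations of the same variant.

```python
# Toy variant records: (chrom, pos, ref, alt). Illustration only;
# real comparisons must also handle genotype and representation differences.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "T", "C"), ("chr2", 50, "G", "GA")}
query = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"), ("chr3", 77, "C", "T")}

tp = truth & query   # called and in the truth set
fp = query - truth   # called but not in the truth set
fn = truth - query   # in the truth set but missed

print(f"TP={len(tp)} FP={len(fp)} FN={len(fn)}")  # TP=2 FP=1 FN=1
```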

Match types

Comparing variants at a specific locus in a query set to a truth set is not a straightforward endeavor that results in definitive True/False answers. Variants do not live in isolation; they exist in the context of nearby variants on the same chromosome, and a sample can be heterozygous or homozygous for each variant. To accommodate this, the GA4GH benchmarking team defined three types of matches.

1. Genotype match: This is the most stringent match type. To be counted as a genotype match, the query set has to match the truth set for both alleles. A genotype match is the best you can do.
Genotype match example.

2. Allele match: This match type means that the correct alleles were called but the genotype does not match; the zygosity differs between the query and truth sets (for example, a heterozygous call where the truth set is homozygous).

Allele match example.

3. Local match: This match type is the least stringent and counts a match if the query and truth sets both differ from the reference at the same location, even if the query does not have either of the same variants as the truth set.

Local match example.

Counting true positives, false positives, and false negatives

Genotype matching is used by hap.py for determining the number of true positive variant calls. Rather than marking allele matches and local matches as completely incorrect, these partial matches are labeled as false positives at the genotype match level (FP.GT) and false positives at the allele match level (FP.AL), respectively.

  • FP.GT = call is a false positive at the genotype match level. The call is an allele match.
  • FP.AL = call is a false positive at the allele match level. The call is a local match.

True positives are the goal, but FP.GT means that one of the alleles is correct, and FP.AL means that a variant was at least called at a locus where a variant was expected, even if it wasn't the correct one(s).

For the examples in the Match Types section, only the genotype match would be counted as a true positive (TP). The allele match example is counted as BOTH a false positive and a false negative, as well as an FP.GT, so it contributes to the counts in 3 columns. The local match example is likewise counted as both a false positive and a false negative, as well as an FP.AL.

Counting of match types.
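Here's a minimal sketch of that bookkeeping, assuming the match type at each locus shared by both sets has already been determined (determining it is the hard part, and it's what hap.py actually does):

```python
from collections import Counter

def tally(match_type, counts):
    """Update metric columns for a locus where both truth and query have a call.
    match_type is one of "genotype", "allele", or "local", as defined above.
    (An unmatched query call would add only to QUERY.FP, and an unmatched
    truth call only to TRUTH.FN.)
    """
    if match_type == "genotype":
        counts["TRUTH.TP"] += 1
        counts["QUERY.TP"] += 1
    else:
        counts["QUERY.FP"] += 1      # partial matches count as FP...
        counts["TRUTH.FN"] += 1      # ...and as FN
        if match_type == "allele":
            counts["FP.GT"] += 1     # right alleles, wrong zygosity
        elif match_type == "local":
            counts["FP.AL"] += 1     # a call at the right place, wrong alleles

counts = Counter()
for m in ["genotype", "allele", "local"]:
    tally(m, counts)
print(dict(counts))
# {'TRUTH.TP': 1, 'QUERY.TP': 1, 'QUERY.FP': 2, 'TRUTH.FN': 2, 'FP.GT': 1, 'FP.AL': 1}
```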

Unknowns

There are regions where there are no confident calls in the truth set. If the query set contains variants in these regions, the truth set cannot confidently say whether they are TPs or FPs, so they are counted as unknown (UNK).

Transitions and Transversions

Some of the metrics specify the ratio of transition to transversion mutations. Transitions are mutations where a purine base is mutated to another purine base (A<->G) or a pyrimidine base is swapped for another pyrimidine (C<->T). Transversions are when a purine base is mutated to a pyrimidine base or vice versa (A<->C, A<->T, G<->C, G<->T). Transition mutations are less likely to cause an amino acid change. If mutations were random, the transition to transversion ratio would be near 0.5, since there are twice as many possible transversions as transitions. However, mutations are not random, and transitions occur at a higher frequency, especially in coding regions. Because transitions and transversions are only defined for SNPs, the transition to transversion ratio is undefined for indel variant types.

Transition mutations shown as gold dashed arrows and transversion mutations shown as solid blue arrows.
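The rule is simple enough to code up directly; a minimal sketch (this is just the definition, not how hap.py computes the ratio):

```python
# Classify SNPs as transitions (ti) or transversions (tv) and compute
# the Ti/Tv ratio for a list of (ref, alt) pairs. Illustrative only.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)

def titv_ratio(snps):
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("nan")  # undefined with no transversions

snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
print(titv_ratio(snps))  # 1.0: two transitions, two transversions
```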

Rows in the metric csv

One of the main outputs of hap.py is a delimited file with the metrics in the columns, where each row contains a unique combination of stratification values. The main stratification categories are type, subtype, subset, filter, and genotype.

Type: Type can have values of SNP or INDEL.

Subtype: If the mutation is a SNP, the variant subtype can be a transition (ti) or transversion (tv), and indels can have subtypes of complex (C), insertion (I), or deletion (D). The letter representing an indel variant type is followed by a variant length range, e.g. C1_5 (1–5 bp complex variant), C6_15, C16_PLUS, D1_5, D6_15, D16_PLUS, I1_5, I6_15, I16_PLUS.

Subset: Subset of the genome stratification region. This is usually the name of the bed file(s) used to specify regions. Many regions can be specified with bed files for a single benchmarking experiment.

Filter: Variant filters. Values include PASS and ALL, as well as filters from the query VCF such as q10.

Genotype: Genotype of benchmarked variants which include het (heterozygous variant calls), hetalt (heterozygous alternate variant calls), and homalt (homozygous alternate variant calls).

The above stratification categories can be combined in different ways to get sets of metrics, each of which is recorded in a row of the output file. For example, a single row may contain metrics for SNPs (type) that are transitions (subtype) in MHC regions (subset), that pass all filters (filter), and are heterozygous (genotype). The * value is also often present to specify that the metrics in that row are for all values of the stratification category. There are a lot of possible combinations here, especially if many regions (subsets) are defined, so there can be a lot of rows and the files can get to a decent size. For examples, check out the extended csvs from the PrecisionFDA V2 challenge submissions.

Example stratification values for a row in the benchmarking metrics table. Stratification values are followed by columns of metric values.
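To pull a row like that out of the extended csv, something like the following works with pandas. The file name here is hypothetical, and the header names (Type, Subtype, Subset, Filter, Genotype, and the metric columns) are as described in this post; check your own output for the exact headers.

```python
import pandas as pd

df = pd.read_csv("benchmark.extended.csv")  # hypothetical file name

# Heterozygous transition SNPs in the MHC region that pass all filters.
# The subset name depends on the bed file(s) supplied to hap.py.
row = df[
    (df["Type"] == "SNP")
    & (df["Subtype"] == "ti")
    & (df["Subset"] == "MHC")
    & (df["Filter"] == "PASS")
    & (df["Genotype"] == "het")
]
print(row[["TRUTH.TOTAL", "TRUTH.TP", "QUERY.FP", "METRIC.F1_Score"]])
```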

Metrics

Okay. With this background in place, we can now finally understand what all the output metrics mean. The descriptions of the metrics are repeated here and an explanation is provided. It is important to remember that each of these metrics is provided for the designated set of stratification values (type/subtype/subset/filter/genotype) for the row. All the metrics here are simply counts or ratios of counts.

TRUTH.TOTAL

Description: Total number of truth variants
Explanation: This is the total number of variants in the truth set.

TRUTH.TP

Description: Number of true-positive calls in truth representation (counted via the truth sample column)
Explanation: This is the number of calls in the truth set which are a genotype match to calls in the query set.

QUERY.TP

Description: Number of true positive calls in query representation (counted via the query sample column)
Explanation: This is the number of calls in the query set which are a genotype match to calls in the truth set. The difference between TRUTH.TP and QUERY.TP is subtle and accounts for the two ways which true positives can be counted — starting with the truth set and looking for matches in the query set (TRUTH.TP) or starting with the query set and looking for matches in the truth set (QUERY.TP). These numbers can differ if complex changes are represented by a single call in one set and multiple calls in another set. For example, if variant A is a single row in the truth set but is equivalent to variants A₁ plus A₂ in the query set this would result in 2 counts in the query representation and one in the truth representation. In practice, TRUTH.TP and QUERY.TP are pretty close or equal.

TRUTH.FN

Description: Number of false-negative calls = calls in truth without matching query call
Explanation: This is a count of the variants in the truth set that were not seen in the query set (missed calls).

QUERY.TOTAL

Description: Total number of query calls
Explanation: This is a count of variants in the query set.

QUERY.FP

Description: Number of false-positive calls in the query file (mismatched query calls within the confident regions)
Explanation: This is a count of variants that were called in the query set that are not present in the truth set. Allele and local matches are included in the false positive count.

QUERY.UNK

Description: Number of query calls outside the confident regions
Explanation: This is the count of unknown calls in the query set. See the explanation of Unknowns above.

FP.GT

Description: Number of genotype mismatches (alleles match, but different zygosity)
Explanation: This is a count of allele matches as described in the Match Types section. Allele matches are also counted as QUERY.FP and TRUTH.FN.

FP.AL

Description: Number of allele mismatches (variants matched by position and not by haplotype)
Explanation: This is a count of local matches as described in the Match Types section. Local matches are also counted as QUERY.FP and TRUTH.FN.

TRUTH.TOTAL.TiTv_ratio

Description: Transition / Transversion ratio for all truth variants
Explanation: The ratio of transition mutations to transversion mutations in the truth set. See the Transitions and Transversions section above.

TRUTH.TOTAL.het_hom_ratio

Description: Het/Hom ratio for all truth variants
Explanation: This is the ratio of all heterozygous variants to all homozygous variants in the truth set.

TRUTH.FN.TiTv_ratio

Description: Transition / Transversion ratio for false-negative variants
Explanation: The ratio of transition mutations to transversion mutations that exist in the truth set but were not in the query set (missed variants). See the Transitions and Transversions section above.

TRUTH.FN.het_hom_ratio

Description: Het/Hom ratio for false-negative variants
Explanation: This is the ratio of heterozygous variants to homozygous variants that exist in the truth set but were not in the query set. Comparing this with TRUTH.TOTAL.het_hom_ratio shows whether heterozygous or homozygous variants are missed more often.

TRUTH.TP.TiTv_ratio

Description: Transition / Transversion ratio for true positive variants
Explanation: The ratio of transition mutations to transversion mutations in the truth set that were called in the query set.

TRUTH.TP.het_hom_ratio

Description: Het/Hom ratio for true positive variants
Explanation: The ratio of heterozygous variants to homozygous variants in the truth set that were called in the query set.

QUERY.FP.TiTv_ratio

Description: Transition / Transversion ratio for false-positive variants
Explanation: The ratio of transition mutations to transversion mutations that were called in the query set but do not exist in the truth set (called variants that do not exist). See the Transitions and Transversions section above.

QUERY.FP.het_hom_ratio

Description: Het/Hom ratio for false-positive variants
Explanation: This is the ratio of heterozygous variants in the query set that are not in the truth set to homozygous variants in the query set that are not in the truth set.

QUERY.TOTAL.TiTv_ratio

Description: Transition / Transversion ratio for all query variants
Explanation: The ratio of transition mutations to transversion mutations that were called in the query set. This includes both true positive and false positive calls.

QUERY.TOTAL.het_hom_ratio

Description: Het/Hom ratio for all query variants
Explanation: This is the ratio of all heterozygous variants to all homozygous variants in the query set. This includes both true positive and false positive calls.

QUERY.TP.TiTv_ratio

Description: Transition / Transversion ratio for true positive variants (query representation)
Explanation: The ratio of transition mutations to transversion mutations in the query set that were present in the truth set. This can differ from the TRUTH.TP.TiTv_ratio because complex variants can be represented as multiple variants in one set and a single variant in another. See the explanation in the QUERY.TP section.

QUERY.TP.het_hom_ratio

Description: Het/Hom ratio for true positive variants (query representation)
Explanation: This is the ratio of heterozygous variants to homozygous variants in the query set that were present in the truth set. This can differ from the TRUTH.TP.het_hom_ratio because complex variants can be represented as multiple variants in one set and a single variant in another. See the explanation in the QUERY.TP section.

QUERY.UNK.TiTv_ratio

Description: Transition / Transversion ratio for unknown variants
Explanation: The ratio of unknown transition mutations to unknown transversion mutations in the query set. See the explanation of Unknowns above.

QUERY.UNK.het_hom_ratio

Description: Het/Hom ratio for unknown variants.
Explanation: This is the ratio of unknown heterozygous variants to unknown homozygous variants in the query set. See the explanation of Unknowns above.

Subset.Size

Description: When using stratification regions, this gives the number of nucleotides contained in the current subset
Explanation: This size comes from the length of the regions specified in stratification bed files.

Subset.IS_CONF.Size

Description: This gives the number of confident bases (-f regions) in the current subset
Explanation: This is a count of how many high-confidence base calls are in the truth set for the subset region.

Derived statistics

Some things are better together, and that includes metrics. Looking at the simple metrics above doesn't provide a quick view of performance, so there are a few key statistics that are commonly used to gauge it.

METRIC.Precision

Description: Precision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP)
Explanation: Precision is the fraction of calls (between 0 and 1) in the query set that are true positives. A high precision indicates few false positives, but gives no indication of the fraction of true variants that were detected. Precision is also known as positive predictive value and is sometimes (loosely) called specificity in this context.

METRIC.Recall

Description: Recall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN)
Explanation: Recall is the fraction (between 0 and 1) of known variants that are detected. This assesses the ability to detect the true variants but says nothing about false positives. Recall is also referred to as sensitivity.

METRIC.F1_Score

Description: Harmonic mean of precision and recall = 2 * METRIC.Recall * METRIC.Precision / (METRIC.Recall + METRIC.Precision)
Explanation: Neither precision nor recall is very useful on its own for determining performance. This is where the F1 score comes in. For example, perfect precision could be obtained by only calling the single highest-confidence variant, resulting in one TP and zero FP: 1 / (1 + 0) = 1. Sure, no false positives were called, but hardly anything was detected either. This is an extreme case of being too conservative to be useful. Likewise, perfect recall can be obtained by calling everything a variant. There will be no false negatives: TP / (TP + 0) = 1. Not a single variant will be missed, but there will be too many false positives to have useful results. This is an extreme case of being too aggressive. Overall performance is a balance between precision and recall, and the desired balance will depend on the use case: are you more worried about missing a variant or calling a variant that isn't real? The F1 score, the harmonic mean of precision and recall, provides a good standardized measure of the two together. If either precision or recall is low, the F1 score will be low, and an F1 score near 1 means that both precision and recall are near 1.
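The arithmetic behind those two extreme cases is easy to verify; a minimal sketch:

```python
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Too conservative: one confident TP, nothing else called, 99 variants missed.
p, r = precision(1, 0), recall(1, 99)        # p = 1.0, r = 0.01
print(round(f1(p, r), 3))                    # ~0.02: perfect precision, low F1

# Too aggressive: every truth variant called, plus 900 false positives.
p, r = precision(100, 900), recall(100, 0)   # p = 0.1, r = 1.0
print(round(f1(p, r), 3))                    # ~0.182: perfect recall, low F1
```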

Wrapping up

Thanks for sticking it out until now and I hope this helps clarify germline benchmarking metrics a bit. I’ll likely come back to it as a useful reference for myself. There’s a lot of content here and there’s a good chance that one or two things aren’t quite right. If you catch something please let me know so I can fix it. And when you’re ready to do your own benchmarking experiment, check out the germline variant benchmarking workflow on Truwl and let us know what you think.
