Validating bioinformatics methods: How reproducing results identified what I did wrong

Karl Sebby · Published in Truwl · Feb 10, 2022

A recommended way to validate or verify genomic lab tests is to share samples between labs and ensure that the results are comparable. On the computational side of things, data can be shared, and if a reproducible bioinformatics workflow is in place, the results can be duplicated exactly. We’ve done this with workflows on Truwl to ensure that everything is set up correctly AND to flesh out what parameters should be used. We first did this with the ENCODE pipelines. ENCODE data is publicly available, and the Uniform Resource Identifiers (URIs) for the data can be obtained from experiment pages on the ENCODE portal. We made a video that demonstrates this by reproducing an ENCODE ChIP-seq experiment.

Our first attempts to reproduce results for ENCODE ATAC-seq experiments, however, were unsuccessful. The workflow ran just fine, but the counts didn’t match. The reason: we hadn’t selected a single option that was needed, auto_detect_adapter. It is not set by default, but it is used in most if not all of the ENCODE experiments. We made a ‘forkable’ example of an ENCODE ATAC-seq experiment that successfully reproduced the ENCODE results, so anyone can copy that experiment into their own workspace, reproduce it for themselves, or swap in their own data.
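To make that fix concrete, here is a minimal sketch of the kind of input file involved. The key names follow the ENCODE ATAC-seq pipeline’s “atac.*” input-JSON convention, but treat them as illustrative and check the pipeline documentation for the version you run; the file paths are placeholders.

```python
import json

# Illustrative ENCODE ATAC-seq pipeline input fragment (key names assume the
# pipeline's "atac.*" convention; verify against the pipeline version you run).
inputs = {
    "atac.title": "ENCODE ATAC-seq reproduction",
    "atac.paired_end": True,
    "atac.fastqs_rep1_R1": ["gs://.../rep1_R1.fastq.gz"],  # placeholder URIs
    "atac.fastqs_rep1_R2": ["gs://.../rep1_R2.fastq.gz"],
    # The option we missed on our first attempt: without it, adapter trimming
    # differs from the ENCODE runs and the counts no longer match.
    "atac.auto_detect_adapter": True,
}

with open("atac_inputs.json", "w") as fh:
    json.dump(inputs, fh, indent=2)
```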

We’ve been talking a lot about our Variant Benchmarking workflow lately, especially since we released a free ‘community edition’. Our first attempts to reproduce some results with the full-featured workflow were also unsuccessful, and unsurprisingly, the problem once again was not with the workflow itself but with the inputs and parameters we used. In “Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads”, the authors use Genome in a Bottle (GIAB) samples to benchmark the PEPPER-Margin-DeepVariant (P-M-DV) workflow. In the Data Availability section there was good news:

We have made the analysis data available publicly (variant calling outputs, genome assemblies) in: https://console.cloud.google.com/storage/browser/pepper-deepvariant-public/analysis_data.

Yes! Publicly available VCFs with URIs. Thank you, authors. In the Google bucket there are VCF files for GIAB samples HG003 and HG004, generated with different variant callers: Medaka, Clair, Longshot, and P-M-DV. Benchmarking metrics are tabulated in the supplementary information.
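If you want to grab the same files, something like the following sketch lists the VCFs in that public bucket. It assumes the google-cloud-storage Python client and that anonymous access works because the bucket is public; the prefix comes from the paper’s Data Availability section.

```python
from google.cloud import storage  # pip install google-cloud-storage

# List the published VCFs in the public bucket (anonymous client is assumed
# to be sufficient because the bucket is publicly readable).
client = storage.Client.create_anonymous_client()
for blob in client.list_blobs("pepper-deepvariant-public", prefix="analysis_data"):
    if blob.name.endswith(".vcf.gz"):
        print(f"gs://pepper-deepvariant-public/{blob.name}")
```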

My first attempt to reproduce the benchmarking metrics was close, but not right on. The supplementary information shows how the P-M-DV team ran the benchmarking tool, hap.py; their commands matched our benchmarking workflow, and we were using the same version of hap.py. That made me suspect we were using a different version of the reference set or of the benchmarking regions. I emailed the corresponding authors, and Andrew Carroll got right back to me and looped in the first author, Kishwar Shafin.

With Kishwar’s help, I quickly found the issue: a wrong assumption I had made about the benchmarking regions. GIAB provides stratification regions so that you can measure the performance of your variant calling in specific sections of the genome. For instance, you might only care how well you call variants in genes included in an oncogene panel or in the MHC region. The region set most frequently used and reported, including here, is the ‘benchmarking regions’, which cover all the regions where there is high confidence in the reference set. What I had done was use the same region definition for HG002, HG003, and HG004. But in contrast to other genome stratification regions, the ‘benchmarking regions’ for these three samples are subject-specific, because of structural variants that are unique to the individuals in the trio. With this knowledge in hand, I started a new experiment on Truwl and successfully reproduced the recall, precision, and F1-score with the first VCF I tried. In the same experiment (Truwl groups jobs together in ‘experiments’) I ran the benchmarking workflow on the other VCFs provided by the P-M-DV team.
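As a concrete illustration, here is a minimal sketch of the kind of hap.py invocation involved (the file names are placeholders I’ve made up; the flags follow hap.py’s standard interface). The important detail is that the confident-regions BED passed with -f must be the subject-specific GIAB benchmarking BED for the sample being evaluated, not one borrowed from another member of the trio.

```python
import subprocess

# Sketch of a hap.py run; file names are placeholders.
sample = "HG003"
subprocess.run(
    [
        "hap.py",
        f"{sample}_GIAB_truth.vcf.gz",         # GIAB truth VCF for this sample
        f"{sample}_pepper_margin_dv.vcf.gz",   # query VCF from the variant caller
        "-r", "GRCh38.fa",                     # reference genome
        "-f", f"{sample}_GIAB_benchmark.bed",  # subject-specific benchmarking regions
        "-o", f"{sample}_happy",               # output prefix
        "--engine", "vcfeval",
    ],
    check=True,
)
```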

We’ve found that tracking metrics across workflow jobs is extremely useful, and we’ve developed a syntax for defining the values (inputs and/or outputs) we want to track for any workflow. For the benchmarking workflow, that includes precision, recall, and F1-score (and many others). We then pass these values to our Performance Metrics comparison table to see everything in one place.
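This isn’t Truwl’s tracking syntax, but as a rough sketch of where those numbers come from, the headline metrics can be pulled straight out of hap.py’s summary output. The column names assume hap.py’s summary.csv layout, and the file name continues the placeholder from the sketch above.

```python
import csv

# Illustration only: extract the headline metrics from a hap.py summary file.
def headline_metrics(summary_csv):
    metrics = {}
    with open(summary_csv) as fh:
        for row in csv.DictReader(fh):
            if row["Filter"] == "PASS":  # one PASS row each for SNP and INDEL
                metrics[row["Type"]] = {
                    "recall": float(row["METRIC.Recall"]),
                    "precision": float(row["METRIC.Precision"]),
                    "f1": float(row["METRIC.F1_Score"]),
                }
    return metrics

print(headline_metrics("HG003_happy.summary.csv"))
```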

[Image: Truwl Performance Metrics comparison table for the PEPPER-Margin-DeepVariant benchmarking experiment, showing metrics for SNPs in the HG003 benchmarking region.]

Because you can look at benchmarking metrics by region, by variant type, and so on, there are drop-down menus on the Performance Metrics page to select these choices. Pictured here, the table is showing results for SNPs in the HG003 benchmarking region, which should match the published results for SNPs in HG003, and they do. The table also shows results for the HG004 samples, but those don’t match the published results, because they are metrics for the HG004 VCFs evaluated in the HG003 benchmarking region. Toggling the region to ‘HG004 benchmarking region’ gives metrics that match the published results for HG004. Toggling from SNPs to INDELs (and choosing the appropriate region) also gives the correct results for the remaining combinations.

Just like the ATAC-seq workflow example, we’ve made a publicly shared example job from this experiment, so others can fork the example into their own experiment and get going from a sound starting point.

Summing up

When setting up methods, it is extremely useful to reproduce someone else’s trusted results first. Just because the workflow runs and produces reasonable results doesn’t mean everything is A-okay. Knowing the right parameters and input files is critical to making your results reliable.
