Faster NGS Duplicate Marking

Eric Scott
Published in grail-eng
Mar 1, 2021 · 9 min read

Introduction

As next-generation sequencing (NGS) continues its transition from academia to the clinic, there is a renewed need to revisit core bioinformatics tools to meet a new set of challenges. Foremost is the need to address the computational costs of NGS that accompany the scale of large clinical trials. These costs come not only in the time taken to analyze data, but also in the computational resources consumed. In both cases, the inefficiencies of currently available bioinformatics tools impede the field’s ability to iterate quickly during development and increase the costs of any prospective NGS test.

At GRAIL, we use NGS to find cancer signals in blood. After sequencing is complete, the raw sequencing data enters our custom genomics pipeline, where the sequencing reads go through alignment, duplicate marking, artifact removal, quality control, and a number of other custom tools.

While many of these components are commonplace for NGS pipelines, the building of this pipeline has been non-trivial due to several factors:

  • Number of patient samples: There are a large number of individual samples to process, because GRAIL’s clinical studies involve thousands to tens of thousands of participants.
  • Amount of data to process: Many samples are sequenced at high depth, resulting in a large amount of data per sample.
  • Cost of scaling: As the amount of data per sample grows and more hardware is added, the traditional algorithms and tools commonly used in genomic analysis scale poorly and become costly.
  • Cost of making upstream changes: Because we modify and improve the pipeline software often, we periodically re-run the pipeline on all samples so that the updated outputs are available to downstream users.

We address the computational cost using several approaches. In this article we focus on how we’ve improved the efficiency of the duplicate marking stage of the pipeline by writing an in-house duplicate marking tool called Doppelmark.

Duplicate Marking Background

There are a number of existing approaches currently used to identify and mark duplicate read pairs. Doppelmark considers two read pairs to be duplicates if the pairs have matching pre-clipped 5’ positions and orientations; this is the same approach used by Picard¹ and Sambamba².

In the following example figure, A represents a read pair with two reads oriented towards each other. The two 5’ endpoints are represented by the circular points, and the arrows show the direction of the sequencing read. Read pairs A and B are not duplicates of each other because their 5’ positions do not match. In contrast, C and D are duplicates because they share the same endpoints and read orientations. E shares a left endpoint and orientations with C and D, but E has a different right endpoint due to some soft-clipping on the right side; after adding the soft-clipped bases back to the read, E shares the same right endpoint as C and D, so E is considered a duplicate as well. F is not a duplicate of E because F’s read orientations do not match E’s, even though their endpoints match.
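Because duplicate comparison uses pre-clipped positions, recovering the unclipped 5’ endpoint from a read’s CIGAR string is a core primitive. Below is a minimal Go sketch under a simplified CIGAR model (the cigarOp type is hypothetical; hard clips and Doppelmark’s actual record types are omitted):

```go
package doppelsketch

// cigarOp is a simplified CIGAR operation: Len bases of kind Op,
// e.g. {'S', 5} for five soft-clipped bases.
type cigarOp struct {
	Op  byte
	Len int
}

// unclipped5Prime recovers the pre-clipped 5' position used when
// comparing read pairs. For a forward-strand read, the 5' end sits
// before the alignment start by the leading soft-clip length; for a
// reverse-strand read, it sits past the last reference base the
// alignment consumes by the trailing soft-clip length.
func unclipped5Prime(pos int, reverse bool, cigar []cigarOp) int {
	if !reverse {
		if len(cigar) > 0 && cigar[0].Op == 'S' {
			return pos - cigar[0].Len
		}
		return pos
	}
	end := pos
	for _, c := range cigar {
		switch c.Op {
		case 'M', 'D', 'N', '=', 'X': // reference-consuming ops
			end += c.Len
		}
	}
	end-- // last aligned reference position
	if n := len(cigar); n > 0 && cigar[n-1].Op == 'S' {
		end += cigar[n-1].Len
	}
	return end
}
```

With this computation, read pair E in the figure maps back to the same right endpoint as C and D.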

For read pairs that have one mapped read, and one unmapped read, Doppelmark considers the single mapped read a duplicate if the pre-clipped 5’ position of the mapped read matches a mapped read from another read pair X, whether X has both reads mapped or just one read mapped. In the following example figure, H (a single-mapped read pair) is considered a duplicate of G (a dual-mapped read pair) because H and G share a pre-clipped 5’ position and orientation for one of their reads. Similarly, I and J are both single-mapped read pairs that are considered duplicates of each other.

Doppelmark Approach

Doppelmark identifies duplicate read pairs using the above criteria in an efficient and scalable way by dividing the input data into partitions (shards) based on genomic position so that each CPU core can process shards in isolation from other shards as much as possible. Having each core process shards in isolation makes Doppelmark scale well with additional cores because there are few interactions between cores. Within each shard, Doppelmark does the following:

  1. Decompresses the input BAM file data.
  2. Identifies any read pair where R1 is in one shard and R2 is in another shard.
  3. Identifies the duplicate reads.
  4. Identifies optical (pad hopping) duplicates.
  5. Marks duplicate reads with new SAM flags and auxiliary tags.
  6. Compresses the duplicate marked output BAM records to be concatenated into the output BAM file.
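These per-shard steps map naturally onto a worker pool. The following Go sketch (not Doppelmark’s actual code; processShard is a hypothetical stand-in for steps 1–6) shows the shape of the parallelism:

```go
package doppelsketch

import (
	"runtime"
	"sync"
)

// Shard is a hypothetical genomic interval; Doppelmark's real shard
// type also carries padding and index information.
type Shard struct {
	Ref        string
	Start, End int
}

// processShard stands in for steps 1-6 above: decompress, resolve
// distant-mates, mark duplicates, and compress the shard's output.
func processShard(s Shard) []byte { return nil }

// processAll fans shards out to one worker per CPU core. Workers
// share no state except the output slice, which they write by shard
// index, so the final outputs stay in input order.
func processAll(shards []Shard) [][]byte {
	out := make([][]byte, len(shards))
	work := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for idx := range work {
				out[idx] = processShard(shards[idx])
			}
		}()
	}
	for idx := range shards {
		work <- idx
	}
	close(work)
	wg.Wait()
	return out
}
```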

In addition to carefully isolating shards, Doppelmark also avoids sorting the input BAM file because of the O(n log n) runtime cost of comparison sorting BAM records, where n is the number of BAM records. Instead, Doppelmark uses a hashmap to identify read pairs with identical 5’ coordinates and orientations, which results in an O(n) runtime while incurring a higher memory cost to keep each active shard’s hashmap in memory.

Design Detail

Sharding: Doppelmark divides the reference genome into a set of non-overlapping shards. Doppelmark can create shards that have equal genomic width, but this often leads to an uneven number of reads in each shard because of differences in sequencing coverage over the genome. Instead, Doppelmark uses the BAM index to create shards with approximately the same number of records in each shard. The .bai BAM index format does not contain enough resolution for truly even sharding, so we have also created a new .gbai index type for BAM files that Doppelmark can use to choose more evenly sized shards, which helps improve concurrency.
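As a rough illustration of record-balanced sharding (a sketch, not the actual .gbai reader), suppose an index yields approximate record counts for fixed-width genomic bins; shard boundaries can then be chosen greedily so each shard holds about the same number of records:

```go
package doppelsketch

// evenShards picks shard boundaries over index bins so each shard
// holds roughly total/numShards records. binCounts[i] is the
// approximate record count in bin i (hypothetical; the real .gbai
// index offers finer resolution than .bai).
func evenShards(binCounts []int, numShards int) [][2]int {
	total := 0
	for _, c := range binCounts {
		total += c
	}
	target := total / numShards
	var shards [][2]int // [startBin, endBin) pairs
	start, acc := 0, 0
	for i, c := range binCounts {
		acc += c
		if acc >= target && len(shards) < numShards-1 {
			shards = append(shards, [2]int{start, i + 1})
			start, acc = i+1, 0
		}
	}
	return append(shards, [2]int{start, len(binCounts)})
}
```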

Identifying duplicates: Doppelmark processes each shard independently. Within each shard, Doppelmark scans each BAM record and appends the record to a list stored as the value in a hashmap where the key is the tuple:

(left-reference, left-position, left-orientation, right-reference, right-position, right-orientation)

The resulting hashmap’s values contain the records that are duplicates of each other. After Doppelmark reads through all the records in the shard and adds each record to the hashmap, it then scans through all values in the hashmap and sets the duplicate flags and auxiliary tags of those records.
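A minimal sketch of this hashmap-based grouping, with hypothetical pairKey and Record types standing in for Doppelmark’s actual BAM record handling:

```go
package doppelsketch

// pairKey mirrors the tuple above: unclipped 5' reference, position,
// and orientation for each end of the read pair.
type pairKey struct {
	leftRef, leftPos   int
	leftReverse        bool
	rightRef, rightPos int
	rightReverse       bool
}

// Record is a stand-in for a parsed BAM record.
type Record struct {
	Key       pairKey
	Duplicate bool
}

// markDuplicates groups records by pairKey in O(n) and flags all but
// one record in each group as a duplicate. (Which pair is retained as
// the non-duplicate "primary" is more involved in practice.)
func markDuplicates(records []*Record) {
	groups := make(map[pairKey][]*Record)
	for _, r := range records {
		groups[r.Key] = append(groups[r.Key], r)
	}
	for _, group := range groups {
		for _, r := range group[1:] { // keep the first, mark the rest
			r.Duplicate = true
		}
	}
}
```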

Distant-mates: One special case to take into account is a read pair that straddles a shard boundary: we call such mates distant-mates. Distant-mates need special handling because when processing R1, Doppelmark needs the unclipped 5’ position and orientation of R2, but because R2 resides in a different shard, Doppelmark must look outside of R1’s shard to find R2 (and vice-versa). To make R2 available when processing R1, Doppelmark uses the following two strategies.

  1. Doppelmark reads a fixed number of genomic positions (called the padding distance) before and after each shard’s boundaries; this makes most distant-mates available during shard processing. For example, if shard S covers the region chr3:10000–20000 and the padding distance is 100, then Doppelmark reads records aligned from chr3:9900–20100 for shard S. In the following example figure, records A1 and B1 both have their respective mates outside of shard S; Doppelmark will process A2 and B2 while processing shard S.
  2. In a pre-scanning phase, Doppelmark finds each record C2 that is both a distant-mate of record C1 and outside the padding region of C1’s shard S. Doppelmark saves distant-mate records like C2 and makes them available when determining the duplicates for shard S. The number of distant-mates is typically small, so Doppelmark loads these records into memory when processing shard S and then frees them after completing shard S. During the pre-scanning phase, Doppelmark performs a sharded read of the entire input file to find and save these distant-mates.

The following figure illustrates the padding for an example shard, and how each read falls into the shard, the padding, or outside the padded-shard. Most read pairs are like A and B; both read and mate are within shard S or S’s padding. For read pair C, one read C1 resides in the shard, and its mate C2 is a distant-mate, residing outside of the padded shard. The figure is not drawn to scale as the shard typically covers a much larger genomic region than the padding.
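A small sketch of this classification, using the example from above (shard chr3:10000–20000 with a padding distance of 100); the helpers are hypothetical, not Doppelmark’s API:

```go
package doppelsketch

// paddedBounds widens shard [start, end) by pad positions on each
// side, clamping at zero: chr3:10000-20000 with pad 100 becomes
// chr3:9900-20100.
func paddedBounds(start, end, pad int) (int, int) {
	s := start - pad
	if s < 0 {
		s = 0
	}
	return s, end + pad
}

// classify reports where a mate aligned at pos on the same reference
// falls relative to shard [start, end) with padding pad.
func classify(start, end, pad, pos int) string {
	ps, pe := paddedBounds(start, end, pad)
	switch {
	case pos >= start && pos < end:
		return "shard" // like reads A and B in the figure
	case pos >= ps && pos < pe:
		return "padding" // mates available during shard processing
	default:
		return "distant-mate" // saved during the pre-scanning phase
	}
}
```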

Writing Output BAM Records: After Doppelmark identifies all the duplicate read pairs, it writes the output BAM with the appropriate duplicate flags and auxiliary tags set. Doppelmark preserves the sequence of records as it reads them from each shard, so there is no need to sort the reads again because the input BAM is already sorted by position. The BAM file format uses BGZF block compression, in which many smaller blocks are individually compressed and then concatenated together. The BGZF format lends itself to parallel compression, and Doppelmark takes advantage of this by having each shard-processing worker compress the shard’s output BAM records in isolation from other shards, to scale well to many CPU cores. Doppelmark then concatenates the compressed shards together into the final output BAM file.
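The sketch below illustrates the idea with the standard library’s compress/gzip as a stand-in for a real BGZF block writer; like BGZF blocks, gzip members remain valid when concatenated, which is what makes per-shard compression safe:

```go
package doppelsketch

import (
	"bytes"
	"compress/gzip"
	"sync"
)

// compressShards compresses each shard's serialized records in
// parallel and joins the results in shard order. Error handling is
// elided for brevity.
func compressShards(shards [][]byte) []byte {
	compressed := make([][]byte, len(shards))
	var wg sync.WaitGroup
	for i, data := range shards {
		wg.Add(1)
		go func(i int, data []byte) {
			defer wg.Done()
			var buf bytes.Buffer
			zw := gzip.NewWriter(&buf)
			zw.Write(data) // each shard compresses independently
			zw.Close()
			compressed[i] = buf.Bytes()
		}(i, data)
	}
	wg.Wait()
	return bytes.Join(compressed, nil)
}
```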

Additional Features

In addition to the base duplicate marking functionality, Doppelmark also supports optical duplicate detection, which marks duplicates as “optical” when they are within a threshold distance from each other on the flowcell.
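As a sketch of what such a check can look like (assuming Illumina-style read names of the form machine:run:flowcell:lane:tile:x:y, and a Picard-style rule where duplicates on the same lane and tile within a pixel distance on both axes count as optical; neither detail is Doppelmark’s exact implementation):

```go
package doppelsketch

import (
	"strconv"
	"strings"
)

// flowcellLoc holds the lane, tile, and pixel coordinates parsed from
// an Illumina-style read name.
type flowcellLoc struct {
	lane, tile, x, y int
}

// parseLoc extracts fields 4-7 (lane:tile:x:y) from a read name.
func parseLoc(name string) (flowcellLoc, bool) {
	f := strings.Split(name, ":")
	if len(f) < 7 {
		return flowcellLoc{}, false
	}
	var v [4]int
	for i, s := range f[3:7] {
		n, err := strconv.Atoi(s)
		if err != nil {
			return flowcellLoc{}, false
		}
		v[i] = n
	}
	return flowcellLoc{lane: v[0], tile: v[1], x: v[2], y: v[3]}, true
}

// isOptical reports whether two already-identified duplicates are
// close enough on the flowcell to be counted as optical duplicates.
func isOptical(a, b flowcellLoc, dist int) bool {
	return a.lane == b.lane && a.tile == b.tile &&
		abs(a.x-b.x) <= dist && abs(a.y-b.y) <= dist
}

func abs(v int) int {
	if v < 0 {
		return -v
	}
	return v
}
```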

Evaluation

To evaluate the correctness of Doppelmark, we verified that Doppelmark’s output matches that of Picard and Sambamba, ignoring the auxiliary tags that only Doppelmark uses. For Sambamba, the comparison used purely position- and orientation-based duplicate marking because Sambamba does not support optical duplicates. When verifying against Picard, we also checked optical duplicate counts.

To evaluate the computational performance of Doppelmark, we ran a large 246GB whole-genome BAM file through Picard 2.11.0, Sambamba 0.6.6, and Doppelmark, and measured runtimes, CPU utilization, and memory use. These experiments were run one at a time, on the same dedicated physical machine: a 28-core (56-hyperthread) 2.6GHz Xeon E5-2690 v4 with 256GB of memory and NVMe storage.

Picard was unable to complete the run within 24 hours, so we do not have its full runtime information. Sambamba completed in 7.5 hours; Doppelmark completed in 36 minutes, a 12.5-times speedup over Sambamba.

The following plot shows CPU utilization by Sambamba and Doppelmark while processing the test BAM file; “gbai” is the plot for Doppelmark, using a .gbai index. The plot shows that Doppelmark is able to use more than 40 of the available hyperthreads efficiently for the majority of its runtime. When Doppelmark has fewer shards than available hyperthreads, the CPU utilization drops because the hyperthreads become idle. Sambamba is able to use roughly 4 hyperthreads for most of its runtime, except for the last 75 minutes when it uses approximately 24 hyperthreads. Doppelmark is able to use the multi-core hardware at much higher utilization than Sambamba, giving Doppelmark its shorter runtimes on hardware with many available cores.

Using a hashmap to resolve duplicates comes with a higher memory cost. Sambamba has a peak memory consumption of 27GB for the 246GB test input file, while Doppelmark’s peak memory use is 76GB.

Source Code

The Doppelmark source code and .gbai index code are available on GitHub.

Conclusion

The Doppelmark NGS duplicate marker is designed to take advantage of hardware with many cores to minimize runtime. Its design principles include:

  • Dividing the input BAM into shards that can be processed with high parallelism.
  • Using linear scans and hash maps instead of sorting.
  • Minimizing cross talk between threads, even during BAM compression.

These techniques allow Doppelmark to be both faster and less computationally expensive. Doppelmark is 12.5 times faster than Sambamba, the fastest existing duplicate marking tool we have tried, and it greatly reduces GRAIL’s computational costs in both time and expense.

In addition to higher efficiency, Doppelmark also supports added functionality, including optical duplicate detection. As of early 2021, Doppelmark has been in production at GRAIL for more than three years and is currently part of every production pipeline running within GRAIL.

References

  1. “Picard Toolkit.” 2018. Broad Institute, GitHub repository. http://broadinstitute.github.io/picard/
  2. Artem Tarasov, Albert J. Vilella, Edwin Cuppen, Isaac J. Nijman, Pjotr Prins. 2015. Sambamba: fast processing of NGS alignment formats. Bioinformatics 31(12):2032–2034. https://doi.org/10.1093/bioinformatics/btv098


Eric Scott
Associate Director of Bioinformatics at GRAIL Inc.