Visualizing the Chaos of Cancer, One Tool at a Time
Earlier this year, the paper “Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line” was published in Genome Research. The lead author, Maria Nattestad, was a PhD student advised by Mike Schatz at Cold Spring Harbor Laboratory (note: Schatz has since moved to John Hopkins University).
In summary, the authors sequenced both the genome and the transcriptome of the SK-BR-3 breast cancer cell line using PacBio SMRT Sequencing. They discovered more than 17,000 structural variants (of >50 bp in length), many being insertions that were undetectable using short reads. They were also able to identify multi-hop fusion genes that were confirmed by both gDNA and cDNA data. The result is an unprecedented in-depth look into the number of re-arrangements that could happen in just a single cancer cell line, perhaps serving as a viewing window into the difficult reality of cancer research.
What was missing from the media coverage — and to some extent, the paper itself — was just how much was going on under the hood in the software development arena to generate the results shown in this paper.
At the time, calling structural variation using long reads in cancer genome was uncharted territory. No one knew if existing SV callers for short reads would work (short answer: they didn’t). No one knew if long reads would reveal additional information that short reads didn’t (short answer: they did).
By the end of the SK-BR3 project, at least five new tools had been created. NGMLR as a long read aligner; Sniffles as a long read SV caller; Ribbon as a long read SV visualization tool; SplitThreader as a cancer rearrangement visualization tool; and Assemblytics as an assembly-based variant-caller. For those interested in a (recent enough) overview of long read tools, I highly recommend Sedlazeck et al. “Piercing the dark matter: bioinformatics of long-range sequencing and mapping”.
I was minimally involved in the SK-BR-3 project by helping run the bioinformatics analysis for the Iso-Seq (PacBio transcriptome) data. Once I mapped it back to the genome and made a list of candidate fusion genes, my job was done. After the paper was published, I wanted to talk to Maria and understand the tremendous amount of development work that went into it.
Structural Detection: Assemble or Remap?
According to Maria, the original idea for the project was to assemble the PacBio long reads and then call structural variants using the first tool Maria built for this project, Assemblytics. But it quickly became clear that using a remapping approach was a better idea. Maria confirms that she continues to consider remapping a better approach for SV calling for several reasons. First, assembly would require more coverage than remapping; rarely would a genome assembly project be done with less than 60-fold coverage, but using remapping it’s possible to get decent sensitivity at as low as 10-fold coverage. Second, it requires assemblers to “do the right thing” and not create artificial chimeric joins or break offs at SV locations. The authors confirmed this by doing a parallel analysis of assembly-based SV calling, where the assembly-based approach missed many of the long-range variants exactly because assemblers (in this case, FALCON), would break off at such branch points. By their account, the short-read assembly was even worse. Of course, the downside of using a remapping-based approach is if something doesn’t map well, or is not represented in the reference genome at all, then it won’t be reported.
In fact, most of the work done by Maria, Fritz, Philipp — the three developers for this project — is spent on going back and forth on fine-tuning the tools.
From BWA-MEM & Lumpy to Sniffles & NGMLR
Initially, the team had started out using BWA-MEM to map the PacBio reads with Lumpy for variant-calling, worked with those variant calls for a long time, then realized they were missing real structural variants. The problem was Lumpy would not split a read more than once, but PacBio reads often split multiple times! As Maria showed using her Ribbon tool, there can be 2-hop or even 3-hop fusions, so a single long read can be split multiple times.
Fritz Sedlazeck then joined the Schatz lab. I remember seeing him at ASHG 2015 and hear him talk about problems with trying to map PacBio reads that split multiple times. Shortly after, he developed Sniffles.
It was also at this time when Maria discovered the “HER2 story”. She still remembers manually tracking down the from-s and to-s of each long read for each HER2 copy and jotting them down on a piece of paper, before finally writing SplitThreader to automate the process.
When Fritz’s lab mate Philipp Rescheneder joined the lab for the summer, Maria & Fritz were toiling over BWA-MEM aligning straight through large structural variants despite the sudden onset of super-high error rates. No amount of tweaking the BWA-MEM gap penalty settings would make the aligner recognize large variants in a sea of small sequence errors. Traditionally, alignments have used either linear or affine gaps. Affine gaps use a gap open and gap extend penalty, which is problematic when trying to differentiate real SVs (long continuous gap) from small, 2–5 bp indel sequencing errors. During one coffee break, Maria recalled seeing a convex gap penalty in a particular implementation of Smith-Waterman and brought it up. Philipp implemented a more efficient heuristic version of that convex gap, which became the foundation for the NGMLR aligner.
With all the new tools in place, they finally settled on a long-read SV calling strategy: align using NGMLR, call variants with Sniffles, and visualize the results using Ribbon and SplitThreader.
So what did they find?
Heterozygosity in Copy Number Variations
Using a diploid baseline, the authors determined a varied number of copy number variations across the genome, including the SNTB1 (~69.2 copies), MYC (~38 copies), ERBB2 (~33.6 copies), TPD52 and (~24.8 copies). Some of these amplifications also become gene fusions, such as the SNTB1-KLHDC2 3-hop fusion gene shown earlier.
Structural Variation, Visualized
The NGMLR (align) → Sniffles (SV calling) → Ribbon (visualization) approach identified in the PacBio data 76,776 variants, of which 17,313 are SVs of 50 bp or larger. The strongest difference between the long read and short read data is far fewer insertions were called (51% were insertions in long read, vs 1.4% in short).
The authors found long-range variants — variants that are between different chromosomes, >10 kb apart on same chromosomes, or inversions — to be a strong suit of long reads. While long-range variants that were called by both long and short reads (using Sniffles and Survivor, respectively) were well-validated by PCR, when looking at unique variants, long reads had a higher validation rate (48.2%) than short reads (21.3%).
The HER2 Story, Reconstructed
The HER2 (also known as ERBB2) story is a remarkable story. Originally on chromosome 17, the long read data showed that it contained 5 translocations and 1 inversion to chromosome 8. Here is where Maria painstakingly used data and visualization to attempt to “reconstruct” the cell line history. Here is her hypothesis on the order of events:
· The orange segment (A-F) was the first translocation
· The yellow segment (A-E) came next
· The green segment was derived from the yellow segment, with an inversion placing a piece of chr8 inside the segment, just in time for the whole segment to undergo an inverted duplication
· The purple segment could be derived from any of the three segments above, sharing only variant site A
Fusion Genes, Confirmed
Using PacBio Iso-Seq (full-length cDNA) data, the authors identified 53 putative gene fusions, of which 39 had supporting genomic evidence. Of the 39, 15 are gene fusions with a genomic path between gene bodies; 19 overlap the first 15; 5 have gene paths >10 kb. Some examples of the Iso-Seq fusions included the CPNE1-PREX1 fusion that was previously found using RNA-seq data. The CYTH1-EIF3H fusion was also previously known, but it was not known it was a “2-hop” gene that weaved through a series of two variants.
And of course, there’s the “3-hop” gene that Maria said she could not believe “until 37 long reads showed all 3 variants in a row”. Previously reported as a 2-hop (600kb), the KLHDC2-SNTB1 fusion turned out to start from KLHDC2 through CEP112 to ZHX2 then SNTB1. There’s abundant gDNA and cDNA long read data supporting this 3-hop novel gene.
Long Read SV Calling, Then and Now
Like many studies using cutting-edge technology, by the time the work is published, things have changed. This work was done on the PacBio RS II platform; the Sequel platform has a lot more throughput, and a newer version is promised to further increase yield. On the tools side, NGMLR and Minimap2 are both popular long read aligners; PacBio now has its own SV calling tool (PBSV) while Fritz continues to improve Sniffles (and maybe I’ll just mention that the great ape paper authors had their own flavor called SMARTIE-SV).
After graduating from CSHL, Maria started OMGenomics, wrote visualization software and made amazing tutorials for a year or so, then joined DNAnexus and continues to bring complex data to life through data visualization. She may not be working on structural variation anymore, but I have no doubt she will continue to make great contributions to science, and that many will find the important work she’s done for the SK-BR-3 project invaluable!
NOTE: The list of paper and tools mentioned in this post, including PacBio data, can be downloaded at http://schatz-lab.org/publications/SKBR3/