Hidden in Plain Sight — A Bioinformatics Journey into Structural Variation Calling in the Long Read Sequencing Era
I first met Fritz Sedlazeck (@sedlazeck) in 2015 at the American Society of Human Genetics (ASHG) meeting. It was at a bar in Baltimore outside the conference center where a few of us from PacBio and folks from the Schatz lab (then at CSHL) were hanging out. At the time, Fritz and Maria Nattestad were just beginning to work on an SKBR3 cancer dataset generated on the PacBio RS II. As this was the early days of long reads, the project presented multiple computational challenges, including the lack of tools for long-read alignment, structural variation calling, evaluation, and visualization.
Those challenges turned into the development of multiple tools, including NGMLR (long read aligner), Sniffles & Assemblytics (SV calling), and Ribbon & SplitThreader (visualization). The SKBR3 study has since been published and its informatics journey written up in my Medium interview with Maria.
Five years later, Fritz is now a professor at Baylor College of Medicine and continues to focus on structural variation (SV) detection using long-read technologies. A few months ago, his postdoc Medhat Mahmoud (@MedhatHelmy7) authored the review paper “Structural variation calling: the long and short of it”, detailing the state of SV calling using long and short reads. The paper is organized and accessible, and I recommend reading it before this post, but it presents only a static snapshot of the ever-shifting landscape of bioinformatics tool development and cutting-edge research. With this blog post, in addition to providing insights into structural variation research, I hope to paint bioinformatics tool development in a humanized light and highlight some of the questions that are rarely asked: What is the birth story of a bioinformatics tool? Why do certain tools live, while others die?
Structural Variation: The Long and Short of It
Structural variants are defined as genomic variants that are 50 bp or larger. They matter in both human disease and crop improvement. For example, X-linked Dystonia-Parkinsonism, a disease endemic to the island of Panay in the Philippines, was linked to a SINE-VNTR-Alu retrotransposon insertion in an intron of TAF1 that causes intron retention and reduces TAF1 gene expression (Aneichyk et al. 2018). Maria’s SKBR3 cancer study showed how chromosomal rearrangements lead to fusion genes. In tomato, a tandem duplication corrected an undesirable branching trait brought on by previous breeding efforts (Soyk et al. 2019).
The history of SV detection is as old as the sequencing age itself. Yet, until the advent of long reads, it had been difficult to detect these large-scale events. As detailed in the review, short-read mapping is often ambiguous for structural variants larger than the read length, especially in repetitive regions of the genome. Long reads span many more large variants, but they brought different kinds of computational challenges. Hence, Sniffles was born.
Sniffles: A Comprehensive SV Caller Made for Long Reads
Liz: Sniffles and NGMLR started because of the SKBR3 cancer genome project, right? What were you working on before you joined Mike Schatz’s lab?
Fritz: I was already working on structural variants in another lab, which formed the basis of the SURVIVOR tool. I think my expertise in bioinformatics and SVs probably prompted Mike to hire me on the spot. Having already developed a short read mapper for my PhD (NextGenMap, NGM), I was hesitant to develop a new method. Meanwhile, Maria was struggling to get Lumpy to work. It was clear that Lumpy wasn’t designed for long reads, so I quickly put together Sniffles to robustly identify translocations in the HER2 region.
Then, we started to notice that BWA-MEM was failing to align properly at certain regions. Around that time, Philip Rescheneder was looking for a summer internship, and within two months he built the prototype for NGMLR.
Over the next year, we worked to refine the methods, which was a tedious process since there’s no ground truth. We were forced to use IGV to manually verify the alignments. At some point Mike was a bit annoyed with me, to say the least, for spending so much time on IGV.
Liz: The life of every bioinformatician!
Fritz: He wanted me to work on the tool in a systematic fashion, but I would be spending hours and hours on IGV trying to decide if an alignment was a false positive or not.
Liz: Since NGMLR and Sniffles have been published, what has changed with the advances in long reads in terms of both accuracy and length?
Fritz: The development of NGMLR is currently stalled while we look for a developer to update it. We are working on a hybrid method together with Todd Treangen (Rice University) that combines the faster minimap2 with NGMLR to speed up the alignment process. For Sniffles, the biggest change is that it now allows forced genotyping given a user-supplied VCF, which is necessary for branching into the population space.
De Novo Assembly vs Mapping: Which is Better for Structural Variation Detection?
Liz: There are two approaches to SV detection. One is de novo assembly, the other is mapping (resequencing). How do you see the pros and cons for each approach?
Fritz: One of the main points I hope comes across in the review paper is that, while we might think an assembly-based approach would be free of the mapping bias we see in resequencing, it actually pushes the issue downstream, where it becomes a genome alignment problem.
Liz: I like how you are pointing that out!
Fritz: Right? Whole genome alignment is not as solved as the mapping problem.
Liz: We could almost say that where Table 1 in the review paper lists “de novo assembly”, it is actually “assembly then alignment”, which is a more precise description of the route it takes.
RNA Sequencing: A Functional Take on Structural Variation Detection
Liz: The review paper had a section on RNA-seq (short and long) for SV calling. As someone who works exclusively on long-read RNA-seq, I don’t think of RNA-seq as an SV calling approach. Rather, I think of it as a tool for confirming SV calls, because it detects only a subset of the SVs and is limited by gene expression, which is tissue- and condition-dependent.
Fritz: I think of RNA as a spin on the SV story. One may argue that while comprehensive detection of SVs is an interesting problem, most SVs may have no direct functional impact. In cancer, for example, gene fusions play an important role, and in many cases they are easier to detect from RNA than from DNA.
Liz: Right. There’s the example from the XDP paper on the TAF1 gene that showed an intronic SV affecting splicing. Then we have the SKBR3 paper showing all the gene fusions.
Fritz: In fact, Mike’s lab just put out a preprint (Aganezov et al.) showing SVs in medically relevant genes that were identified using long reads but missed by short reads. Figure 6 of Aganezov et al. shows four examples of insertions that long reads identified but short reads missed.
Liz: Do I understand correctly that with short reads, the challenges are mappability, calling insertions and interpreting complex events? Also, it tends to have a high false positive rate?
Fritz: Yes. Another potential source of false positives is that aligners such as BWA-MEM will sometimes report a higher mapping quality than they should. This is because the mapping quality (MQ) is boosted when the two reads of a pair map close to the expected distance from each other.
A Current View on the Landscape of SV Tools
Liz: Let’s have an honest discussion about Table 1 in the review paper, where you listed all these SV tools for long and short reads! Just how many of them do you think are still being actively developed and used?!
Medhat: The original table was a lot bigger!
Fritz: We cannot forget that structural variation calling didn’t just come about in the last few years — it started almost at the same time as sequencing became available. The early SV calls were questionable due to lack of ground truth and being short read-only.
Medhat: The SV tools also matured as time went on. At first, SV detection was based only on deviation from the estimated insert size. Then, other information (coverage and split reads) was incorporated. Finally, different approaches to SV detection emerged: assembly-based, mapping-based, and graph-based. The original table attempted to capture the progression and popularity of the different detection approaches, but it was not practical to include all of them in this review paper.
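To make the earliest signal Medhat describes concrete, here is a minimal Python sketch of flagging read pairs whose insert size deviates from the library's typical value. It is an illustration only, not how any tool in the review implements it: the function name `flag_discordant_pairs`, the median/MAD statistics, the cutoff of 5 scaled MADs, and the toy insert sizes are all assumptions made for this example.

```python
from statistics import median


def flag_discordant_pairs(insert_sizes, n_mads=5.0):
    """Return indices of read pairs whose insert size is far from the library median.

    Hypothetical helper: uses the median and the median absolute deviation (MAD)
    so that a few extreme pairs do not distort the estimate of "normal".
    """
    med = median(insert_sizes)
    mad = median([abs(size - med) for size in insert_sizes])
    # 1.4826 rescales the MAD to be comparable to a standard deviation
    # for normally distributed data; fall back to 1.0 if the MAD is zero.
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [
        i for i, size in enumerate(insert_sizes)
        if abs(size - med) > n_mads * scale
    ]


# Toy data (made up): most pairs sit near 500 bp, one is far too long
# (consistent with a deletion) and one far too short (consistent with an insertion).
sizes = [480, 510, 495, 505, 490, 500, 2500, 515, 485, 120]
print(flag_discordant_pairs(sizes))
```

A real caller would cluster such discordant pairs by position and combine them with coverage and split-read evidence, as described above, rather than trusting single pairs.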
Liz: Not only that, many of the tools are probably no longer in use!
Medhat: Precisely. Some tools died very soon. We trimmed the table down to tools that are still being maintained and could be suitable for today’s users. Most importantly, we tried to reflect on the uniqueness of each tool.
Fritz: The table isn’t trying to be comprehensive.
Liz: Right. Let’s go through this table section by section! De novo assembly — I didn’t know about SGVar, which is a graph-based short read caller. I knew about Assemblytics from Maria’s project. Smartie-sv was developed by Zev Kronenberg and I don’t think it’s maintained anymore. Paftools is from Heng Li, right?
Fritz: Assemblytics is still good, despite limitations. Paftools (part of minimap2) requires an assembly from a closely related species to work. We included SGVar and Hysa because they construct the whole graph and call SVs directly on the graph, which is distinctly different from, say, Assemblytics, which takes MUMMER alignments to call SVs.
Liz: Let’s move on to the short-read mapping-based tools. The list here is quite long. Which of the tools have you heard the most noise about?
Liz: Now my favorite topic — long-read mapping! There’s NanoSV, PBHoney, PBSV, SMRT-SV, and your own tool Sniffles. I just went to the GitHub page for SMRT-SV and it says it’s no longer maintained. Oh wait, that’s because it was originally written by Mark Chaisson when he was a postdoc at the Eichler lab. There’s now SMRT-SV2 which is being maintained actively by Peter Audano.
Fritz: PBHoney was really one of the earliest long-read callers; it didn’t catch on, likely because it was ahead of its time and there was limited long-read data. The next tool may have been Sniffles, which is almost 5 years old now! Since then, a lot has happened. We now have Genome In A Bottle (GIAB), which is giving us truth sets. You guys at PacBio developed PBSV, which was started by Aaron Wenger and brought up to another level by Armin Töpfer. Am I offending Aaron by saying that?
Liz: Not at all. Armin also re-wrote my prototype for IsoSeq and he made it much better, too! Switching topics — if I were to make a table for long read de novo assemblers, right now that table would be much, much bigger than the few tools we just listed. Why do you think that is?
Fritz: Long-read sequencing came along around 2012, and the first application where it caught on was de novo assembly, which continues to be the major use of long reads.
Liz: Do you think it’s also because there’s less awareness of structural variation as something to study?
Fritz: Likely. I would say before 2013, there were few papers highlighting SVs. In the last four or five years, the number of papers on SV has really spiked.
From Egypt to Poland to Texas: Medhat’s Scientific Journey
Switching to a lighter topic, I asked Medhat how he became Fritz’s postdoc and what he was doing before. Medhat and Fritz exchanged looks and immediately started laughing. The story turned out to be an adventure of its own.
Medhat: I first contacted Fritz when I was doing my PhD in Poland and analyzing structural variation in maize using PacBio data. Sniffles was unpublished at the time, but I had seen a poster. I emailed him and we started communicating back and forth. He was interested in what I was doing — I had trouble interpreting the results and he was like “What the heck are you doing?” and I kept asking him about Sniffles to get better results so I could complete my PhD project.
Fritz: I mean, this poor guy was tasked to do a maize assembly on like, 3x PacBio data!
Liz: So basically, you’re saying Fritz helped you complete your PhD!
Medhat: Yes, he helped me tackle the problem of SV calling, as I was new to the field at that time! After my PhD, I worked at a company for a while, and by then Fritz had his own lab. I wanted to do a postdoc, but avoided contacting him, because he had…certain requirements that he listed, and I saw things as black and white. I thought, “I would not fit in his lab. I will not apply.” But I still liked his work, and that was the problem. After two months of back and forth, I decided, “OK, I’m going to gamble. I’m going to tell him what I can do and what I cannot do and see if we could work or not.”
Fritz: I’m going to say that I agree with Medhat. My whole life is “go and gamble”. The worst thing that can happen is a “no”. It cannot hurt to apply to a position that you think is too high for you. You can learn something from it.
Liz: I really like that. When occasionally asked to offer career advice, I always say “the job description is just there to fill the space.”
Medhat: So, Fritz and I had a Skype interview in early 2018 and he accepted me into his lab. The problem was we didn’t know how long it would take for me to obtain my visa. I worked in a company in Poland for eight months where I developed a metagenomics workflow for long reads, then my visa was granted, and I came to the US.
Fritz: I want to mention that one of the reasons I hired Medhat is his passion for science. He’s an Egyptian, you know, and if you read his CV, you see that he’s spent his life doing (sometimes) strange projects and degrees to find his niche in science. Like, identifying fauna in the desert [laughs]. After a few bachelors and masters and a stint in software engineering, he tried to do a PhD. But that was during the Arab Spring movement and opportunities domestically were limited, so he went from Egypt to Poland, because that was where he could do a PhD. To me, that shows his passion for science. I mean, I don’t know anyone who would move from Egypt to Poland for a PhD!
Liz: Quite a change in weather, too!
Fritz: The stories of him in Poland would fill up another blog post! There’s apparently a giant picture of him in the town where he did his PhD, because he was one of the few non-European foreigners in the town.
Medhat: I was quite unique!
PRINCESS: A push-button framework for scalable SNV and SV calling at population level
Liz: Tell me how the PRINCESS project came about. Whose idea was it?
Medhat: Fritz had asked me to create a framework that would take in long reads in FASTA/FASTQ format and extract as much SV information from them as possible. I wanted to make it user-friendly, with minimal installation requirements, and have it produce results that are easy to digest.
Fritz: After Medhat incorporated my tools (NGMLR, Sniffles), we wanted to see what the next big thing would be. We were collaborating on a project for calling phased SNVs in Mendelian cohorts, which sparked the idea that our tool needs to provide comprehensive detection — including phasing, methylation, and SNV/SVs — for sequencing data.
Liz: In your recent presentation at PAG 2020, you benchmarked PRINCESS on ONT and PacBio data, showing performance at different coverages. What is your current recommended coverage for SV calling?
Fritz: This was our initial foray into answering how much coverage we need to do population-level SV calling. For PacBio CLR and ONT, I would recommend 15-fold coverage. With PacBio HiFi, which has higher accuracy, it should be possible to go down to 10-fold coverage.
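For a sense of what those coverage figures mean in sequencing output, the arithmetic is simply coverage × genome size. The short Python sketch below works it through; the genome size of roughly 3.1 Gb (human) and the function names are assumptions for illustration and do not come from the interview.

```python
import math

# Assumption for illustration: an approximate human genome size of 3.1 Gb.
HUMAN_GENOME_BP = 3_100_000_000


def bases_needed(genome_size_bp: int, fold_coverage: int) -> int:
    """Total sequenced bases required to reach the target fold coverage."""
    return genome_size_bp * fold_coverage


def reads_needed(genome_size_bp: int, fold_coverage: int, mean_read_length_bp: int) -> int:
    """Approximate number of reads required, given a mean read length."""
    return math.ceil(bases_needed(genome_size_bp, fold_coverage) / mean_read_length_bp)


for label, fold in (("PacBio CLR / ONT", 15), ("PacBio HiFi", 10)):
    total = bases_needed(HUMAN_GENOME_BP, fold)
    print(f"{label}: {fold}-fold coverage needs about {total / 1e9:.1f} Gb of sequence")
```

Under that genome-size assumption, dropping from 15-fold to 10-fold coverage cuts the required output per sample by a third, which adds up quickly at population scale.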
Looking to the Future — Phased Assemblies and Population SV Calling
Liz: What do you see in the future for SV calling?
Fritz: We now have tools for phased assembly, but it is neither push-button nor scalable. Taking a phased assembly and doing a whole genome alignment — that’s the next thing that has to be solved.
Liz: That’s a really good point! Developing a tool for phased assembly would require having phased genomes to work with and we are just beginning to see data in both the human and AgBio space. Coming out of PAG 2020, it would be interesting to see how phased assembly works for polyploid plants. There are also these vertebrate genomes that are coming out of the VGP project.
Fritz: MUMMER has done a great job at making genomic alignments easier, but the next challenge is multi-species comparison. For the VGP project, having a multi-species alignment would allow tracing structural variation. Then, once you have the genomes annotated, you can see which SVs happen in the gene space.
Liz: How do you see the use of long and short reads in population level SV calling?
Fritz: In the near term, large population studies will likely continue to be done on short reads. We can then take a subset of the samples, sequence them with long reads, and use those to validate the short-read calls, since, as we know, short-read calls can have high false positive rates. The long reads would also call additional SVs missed by short reads. We have been working with a group at Illumina (Mike Eberle) on an efficient SV genotyper (Paragraph) that can then be run on the short reads. We are currently writing up this entire pipeline, with examples, for CCDG, a cardiovascular disease cohort.
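The validation step Fritz outlines can be pictured as a comparison of two call sets from the same sample. The Python sketch below is a deliberately simplified illustration and not the pipeline being written up: the `SVCall` record, the matching rule (same chromosome, same SV type, breakpoints within 500 bp), and the toy calls are all assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class SVCall(NamedTuple):
    """Hypothetical minimal SV record: chromosome, start position, and type."""
    chrom: str
    pos: int
    svtype: str


def is_supported(call: SVCall, others: List[SVCall], max_dist: int = 500) -> bool:
    """True if any call in `others` has the same chromosome and type within max_dist bp."""
    return any(
        o.chrom == call.chrom
        and o.svtype == call.svtype
        and abs(o.pos - call.pos) <= max_dist
        for o in others
    )


def compare_callsets(
    short_calls: List[SVCall], long_calls: List[SVCall], max_dist: int = 500
) -> Tuple[List[SVCall], List[SVCall], List[SVCall]]:
    """Split calls into validated short-read calls, unsupported short-read calls,
    and calls seen only in the long-read data."""
    validated = [c for c in short_calls if is_supported(c, long_calls, max_dist)]
    unsupported = [c for c in short_calls if not is_supported(c, long_calls, max_dist)]
    long_only = [c for c in long_calls if not is_supported(c, short_calls, max_dist)]
    return validated, unsupported, long_only


# Toy call sets (made up) for one sample sequenced with both technologies.
short_calls = [SVCall("chr1", 10_000, "DEL"), SVCall("chr2", 50_000, "DUP")]
long_calls = [SVCall("chr1", 10_120, "DEL"), SVCall("chr3", 7_500, "INS")]

validated, unsupported, long_only = compare_callsets(short_calls, long_calls)
print("validated:", validated)
print("unsupported short-read calls:", unsupported)
print("long-read only:", long_only)
```

Real comparisons also have to account for SV length, breakpoint uncertainty in repeats, and genotype, so a position-and-type match like this is only a starting point.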
Liz: Thank you both for the interview. SV detection using long reads is a fast-growing application, and I’m excited to see what lies ahead!