SMRT Leiden Bioinformatics: SV, amplicon, and Iso-Seq

Published in PacBio, Jun 14, 2018

We are on the last day of SMRT Leiden, a three-day conference bringing the SMRT Community together to share their scientific discoveries and novel analytical achievements using PacBio sequencing. You can read about Day 1 and Day 2 here.

This is part 2 of the Developers Conference covering structural variation, amplicon sequencing, and Iso-Seq. For part 1 covering de novo assembly, please read here.

Fritz Sedlazeck (Baylor) presented his experience scaling up SMRT Sequencing and other technologies to large cohorts for studying structural variants. He described improved sample-selection techniques for getting the most novel discoveries out of the individuals sequenced. He illustrated how DNA quality strongly affects 10X results, whereas SMRT Sequencing gives more robust and consistent results for large-scale structural variant inference. He highlighted a number of tools he developed for structural variant inference. Finally, Fritz recently wrote a review article titled “Piercing the dark matter: bioinformatics of long-range sequencing and mapping” that does an excellent job of laying out the current bioinformatics landscape of the long-read field for genome assembly, structural variation, and transcriptome sequencing.

Software mentioned in this talk:

· SVCollector (preprint)

· CrossStitch (requires HapCut2)

· Brief mentions of NGMLR, Sniffles, SURVIVOR, Clairvoyante

David Heller (Max Planck) illustrated his graph-based approach, SVIM, for calling structural variants using long reads. By decomposing alignment information into basic evidence clusters, evidence from different regions can be combined in a composable fashion to yield high precision at low coverage. David showed that SVIM outperforms competing tools such as Sniffles.

Software mentioned in this talk:

· SVIM

The structural variation session then opened up for discussion. The panelists included David Heller and Armin Toepfer, with Fritz as moderator. Here is some of the discussion I was able to capture:

· I asked if there’s any kind of confidence value (p-value? QV?) for called SVs. All three developers said their tools provide some confidence value in the form of the number of supporting reads, alignment confidence, etc.

· Zev asked if there’s any advantage to integrating over different aligners. So far, all SV tools are exclusively married to a single aligner. Fritz and David favor NGMLR while Armin favors minimap2 for its speed.

· What is the minimum coverage required for SV calling? Fritz puts it at 10–15-fold for a diploid genome. Armin recommends 5-fold for parents and 10–15-fold for the child; joint calling would then get you 80–90% sensitivity (he showed a fold-coverage evaluation in yesterday’s talk).

· Yahya pointed out that SV tools need to annotate their results. For example, if a CNV has 20 copies but the 18th copy has a sequence mutation, it could be related to disease.
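
To get a feel for why coverage in the 10–15-fold range comes up for diploid genomes, here is a back-of-the-envelope sketch. It is not how the speakers derived their numbers. It assumes read depth over a heterozygous SV follows a Poisson distribution with half the total coverage on each haplotype, and that a caller needs a hypothetical minimum of 3 supporting reads; real sensitivity also depends on read length, alignment, and SV type.

```python
import math


def prob_at_least(k: int, mean_depth: float) -> float:
    """P(X >= k) for X ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


def het_sv_detection_prob(total_coverage: float, min_support: int = 3) -> float:
    """Chance that a heterozygous SV gets at least `min_support` reads.

    Assumption: only the haplotype carrying the SV contributes evidence,
    and it receives half of the total coverage on average.
    """
    per_haplotype = total_coverage / 2.0
    return prob_at_least(min_support, per_haplotype)


if __name__ == "__main__":
    for coverage in (5, 10, 15):
        print(coverage, round(het_sv_detection_prob(coverage), 3))
```

Under this toy model the detection probability rises steeply between 5-fold and 15-fold, which is one intuition for why pooling evidence across a family in joint calling can help when individual samples are sequenced at low coverage.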

Adam Ameur (Uppsala, SciLifeLab) presented many approaches for targeted SMRT Sequencing, each with different applications. The first example was the BCR-ABL1 fusion gene, which is a drug target for Chronic Myeloid Leukemia (CML). Traditional mutational screening for BCR-ABL1 uses nested PCR and Sanger sequencing; this approach cannot identify low-frequency mutations, the PCR step introduces biases, and it gives no information on isoforms. Instead, they sequence the BCR-ABL1 gene using the Iso-Seq method, which has replaced the Sanger screening. Another example was sequencing TP53 to identify mutations. He then briefly mentioned using hybridization probes for targeted capture. The next application was using rolling circle amplification (RCA) to sequence HPV genomes. Moving away from PCR, he has been collaborating with PacBio on using CRISPR/Cas9 for amplification-free sequencing. Lastly, he described the Xdrop from Samplix, a microfluidic droplet enrichment method. They have used Xdrop to target HPV18 integration sites in the human genome. And just to join the whole genome sequencing club, Adam mentioned that they recently sequenced two Swedish genomes.

Publications mentioned in this talk:

Cavelier et al. “Clonal distribution of BCR-ABL1 mutations and splice isoforms by single-molecule long-read RNA sequencing”. BMC Cancer (2015)
Lodé et al. “Single-molecule DNA sequencing of acute myeloid leukemia and myelodysplastic syndromes with multiple TP53 alterations”. Haematologica (2017)
Tsai et al. “Amplification-free, CRISPR-Cas9 Targeted Enrichment and SMRT Sequencing of Repeat-Expansion Disease Causative Genomic Regions”. bioRxiv (2017)
Ameur et al. “De novo assembly of two Swedish genomes reveals missing segments from the human GRCh38 reference and improves variant calling of population-scale sequencing data”. bioRxiv (2017)

Figure from Armin’s talk where he showed PPV for the demultiplexing algorithm LIMA at different score cutoffs. Even at relatively low scores (20–40), the barcodes are still correctly called.

Armin Toepfer (PacBio) returned to talk about LIMA, the demultiplexing tool he designed. He outlined 4 different barcoding strategies. There are 384 validated barcodes available, which allow indexing of 384 samples in symmetric mode or 384x384 combinations in asymmetric mode. He showed the improved sample setup in SMRT Link version 5.1. Armin’s full presentation can be found here.
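
The arithmetic behind the symmetric and asymmetric numbers can be sketched in a few lines. This is only an illustration of the counting, not of how LIMA works; the function name and the toy barcode names are made up for the example.

```python
from itertools import product


def barcode_capacity(n_barcodes: int = 384) -> dict:
    """Count distinguishable samples for the two designs mentioned in the talk.

    symmetric:  the same barcode on both ends, so one sample per barcode
    asymmetric: any (left, right) pairing, so n * n combinations
    """
    return {
        "symmetric": n_barcodes,
        "asymmetric": n_barcodes * n_barcodes,
    }


def asymmetric_pairs(barcodes):
    """Enumerate every (left, right) barcode pairing for a small barcode set."""
    return list(product(barcodes, repeat=2))


if __name__ == "__main__":
    print(barcode_capacity(384))
    # Toy set with hypothetical barcode names, to show what the pairs look like.
    print(asymmetric_pairs(["bc1", "bc2", "bc3"]))
```

With 384 barcodes this gives 384 symmetric samples and 147,456 asymmetric pairings, which is why asymmetric designs scale to far more samples from the same barcode set.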

The Targeted Sequencing session then opened for discussion. Adam Ameur moderated with Armin in attendance:

· Should targeted analysis be performed on CCS or subreads? Adam prefers to use CCS reads.

· How can chimeras in amplicon sequencing be reduced? Adam thinks the Xdrop is promising. For cDNA amplicons, it’s possible to incorporate UMIs.

· Adam pointed out that as we move towards PCR-free amplicon sequencing, we should start thinking about utilizing epigenetic signals from the sequencing data.

· Phasing on amplicons? Adam may try WhatsHap. Fritz says his tools should work too. It should be noted that PacBio’s GitHub currently has an approach as well.

The final session of the day was the Iso-Seq Analysis session.

I gave a brief presentation on Iso-Seq3, the list of current Iso-Seq community tools, and a recent comparison between GMAP and minimap2. My full presentation is here.

Richard Kuo (Roslin) then took the stage. I’ve known Richard since he was working on the chicken Iso-Seq project. For sequencing the chicken brain, he went beyond the regular Iso-Seq recommendations (Clontech SMARTer) and did 5’ cap trap and normalization. He showed that normalization can yield close to 5 times as many genes, most of which are rare lncRNAs. And by using 5’ cap trap, he can capture the true 5’ end of the RNA. He is currently working with Lexogen on optimizing the normalization protocol. He presented TAMA, his own set of scripts for post-processing Iso-Seq data that gives users advanced control over how to collapse transcript alignments, merge Iso-Seq data with RNA-seq data, and predict ORFs and NMDs. Richard’s full presentation is here.

Slide from Richard’s talk showing library normalization recovers more genes with fewer reads.

Discussions with the audience happened throughout the talks and I summarize them here:

· Pooling different tissues can also be seen as a form of normalization and is common for plant and animal genome annotation projects. Richard remains skeptical that ultra-low-abundance transcripts would be recovered this way. Right now, there are no direct comparisons.

· There is interest in doing single cell Iso-Seq analysis. Some groups have made preliminary attempts with success, though library optimization is still in progress.

· For certain applications, incorporating UMIs would help identify artifacts and aid in quantification. It should be noted that Karlsson & Linnarsson already did a proof of concept study on mouse brain single cell Iso-Seq in 2016. Theoretically it’s entirely possible to do both single cell and UMIs using the Iso-Seq method. Bioinformatically, I think it will take a bit, but not a lot, of work to create or modify existing tools to process the data. It is definitely an exciting new area to explore!
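
As a rough idea of what the UMI part could look like, here is a minimal sketch that collapses reads sharing the same cell barcode, UMI, and isoform into one molecule and then counts molecules. The input tuples and isoform labels are hypothetical, and the sketch assumes exact UMI matches; real tools would also need to tolerate sequencing errors in the UMI and barcode.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count unique molecules per (cell, isoform).

    `reads` is an iterable of (cell_barcode, umi, isoform) tuples.
    Reads with the same cell barcode, UMI, and isoform are treated as
    PCR duplicates of one original molecule and counted once.
    """
    umis_seen = defaultdict(set)
    for cell, umi, isoform in reads:
        umis_seen[(cell, isoform)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


if __name__ == "__main__":
    # Toy input: the first two reads are duplicates of the same molecule.
    toy_reads = [
        ("cellA", "AAAC", "isoform1"),
        ("cellA", "AAAC", "isoform1"),
        ("cellA", "GGTT", "isoform1"),
        ("cellB", "AAAC", "isoform2"),
    ]
    print(count_molecules(toy_reads))
```

Counting molecules rather than reads is what makes the quantification less sensitive to amplification bias; a UMI that appears with two different isoforms in the same cell could also be flagged as a possible chimera, though that check is not shown here.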

This concludes the entire SMRT Leiden conference. I hope you’ve enjoyed it! You can find tweets about the conference on Twitter using the hashtags #SMRTLeiden and #SMRTBFX.

[2018/07/11 Update: Speaker presentations for the event are now online]

Written by Liz T