SMRT Leiden 2019 Day 3: Bioinformatics for Transcriptome, Genome Analysis, and Targeted Sequencing
Gene Myers (@TheGeneMyers, MPI-CBG, Germany) delivered the keynote for the last day of SMRT Leiden 2019, which is dedicated to bioinformatics.
Having published two challenging genomes using PacBio long reads, the planarian genome (70% AT content) and the axolotl genome (32 Gbp, the largest genome sequenced to date), Gene thinks we are not far from obtaining high-quality, telomere-to-telomere assemblies at a reasonable cost (~1000 EUR).
Gene is involved in the VGP genome assembly efforts, where, despite combining multiple sequencing technologies (PacBio, BioNano, Hi-C), they are finding that highly repetitive genomes remain the hardest to assemble with satisfactory contiguity. He thinks there are two ways to improve assembly:
(1) Longer or more accurate reads
(2) Better algorithms
a. ex: scrubbing to remove artifacts
b. repeat/haplotype separation based on heterogeneity
c. repeat detection and modeling
With the recent release of PacBio HiFi (CCS) reads, Gene pondered which is better for de novo assembly — long reads or shorter, but more accurate reads?
“I am the champion of HiFi,” Gene claimed. He considers the relatively higher cost of CCS a short-term problem and stresses that extracting DNA for a 15 kb HiFi library is easier than for a 50 kb CLR library.
He is in favor of the HiFi approach for two reasons:
· More accurate reads, such as reads with 0.5% error, make alignment much easier and more precise.
· Even though more accurate reads may be shorter and not span through longer repeats, he is confident that micro-heterogeneity within the repeats can be used to separate them — though this part is still to be developed!
Gene also presented DALIGNER 2.0, an updated version of his aligner.
Importantly, k-mer sizes could be increased when using HiFi data due to its higher accuracy. Similar to the minimizer approach, Gene proposed a “modimer” approach (courtesy of R. Durbin) that uses only a subset of the k-mers. He showed how k-mers could be classified by creating a count histogram: unusually abundant k-mers are repeats, rare k-mers are errors, while haplos and diplos could be identified based on expected read coverage. For assembly, he is only using haplo-mers and diplo-mers (implicit repeat masking!).
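To make the histogram idea concrete, here is a minimal Python sketch. It is not Gene's implementation: the function names, the threshold multipliers, and the use of a CRC32 hash are all assumptions chosen for illustration. The `is_modimer` function shows one reading of the modimer idea, keeping only k-mers whose hash is divisible by a chosen modulus.

```python
from collections import Counter
import zlib


def count_kmers(reads, k):
    """Count every k-mer occurrence across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def classify_kmer(count, haploid_cov):
    """Label a k-mer from its count, given the expected haploid coverage.

    The 0.5 / 1.5 / 2.5 multipliers are illustrative assumptions; in
    practice the cut-offs would be read off the count histogram.
    """
    if count < 0.5 * haploid_cov:
        return "error"    # rare k-mers: likely sequencing errors
    if count < 1.5 * haploid_cov:
        return "haplo"    # seen on one haplotype only
    if count < 2.5 * haploid_cov:
        return "diplo"    # shared by both haplotypes
    return "repeat"       # unusually abundant


def is_modimer(kmer, modulus):
    """Keep a k-mer only if its hash is divisible by `modulus` (assumed scheme)."""
    return zlib.crc32(kmer.encode()) % modulus == 0
```

Keeping only k-mers labelled "haplo" or "diplo" is what gives the implicit repeat masking mentioned above.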
He then switched gears to focus on an important, but often neglected step of achieving good assembly results: scrubbing to remove chimeras, adaptamers, and low-quality dropouts. He walked us through his DASCRUBBER approach which consists of the following steps:
· DAScover: Computing coverage from read pileup
· DASqv: Computing intrinsic QV values
· DAShq: Identifying high-quality segments
· DASgap: Identifying gaps between HQ segments that indicate artifacts
· DASpeel: Removing all local alignments (LAs) that can’t be explained as overlaps
· DASvote: Detect variants and correct reads
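The "high-quality segments" and "gaps" steps can be pictured with a toy Python sketch. This is not DASCRUBBER code: it assumes each read has already been cut into fixed windows with one quality score per window, that lower scores are better, and that a single hypothetical threshold `max_qv` separates good windows from bad ones.

```python
def hq_segments(window_qv, max_qv):
    """Return [start, end) runs of consecutive windows with score <= max_qv."""
    segments = []
    start = None
    for i, qv in enumerate(window_qv):
        if qv <= max_qv:
            if start is None:
                start = i            # a new high-quality run begins
        elif start is not None:
            segments.append((start, i))
            start = None
    if start is not None:            # run extends to the end of the read
        segments.append((start, len(window_qv)))
    return segments


def gaps_between(segments):
    """Return the [start, end) window ranges lying between adjacent HQ segments."""
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(segments, segments[1:])]
```

For example, scores `[5, 6, 30, 40, 7, 8, 9]` with `max_qv=10` give two high-quality segments separated by one gap covering windows 2 and 3, which is the kind of region a scrubber would then inspect for a chimera or a low-quality dropout.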
The concept of scrubbing is summarized in Gene’s own blogpost.
Gene concludes that PacBio HiFi (CCS) reads have the potential to improve genome assembly and are likely to be effective at haplotype phasing.
· TAMA Collapse: Mapped reads to transcript annotation
· TAMA Merge: Merge annotations (can combine PacBio with references ex: ENSEMBL)
· ORF/NMD Predictor: Identify coding regions and associate with known genes
· TAMA-GO: Helpful tools
TAMA documentation is available on the TAMA GitHub wiki.
Ana Conesa (@anaconesa, UFL, USA) presented three Iso-Seq downstream tools.
SQANTI is a quality control pipeline that can categorize Iso-Seq data against a reference annotation. It allows users to see which genes/transcripts are novel/known and offers detailed annotations on canonical/non-canonical junctions.
(NOTE: I have implemented an augmented version of SQANTI called SQANTI2.)
IsoAnnot offers functional annotation at the isoform level. It annotates the isoforms with protein domain information.
TAPPAS is a Java-based application that creates beautiful visualizations utilizing information at both the transcript and protein level. It can identify differential expression at both the isoform level and the gene level! Ana showed some amazing examples of genes that had differential isoform expression at different stages of neural development in mouse.
Richard, Ana, and I led the open discussion session. Key issues discussed were:
(1) How do we detect and remove cDNA artifacts? Ana has looked at this extensively in their SQANTI study, identifying non-canonical junctions arising from RT template switching as one of the major artifacts. Proper filtering is necessary.
(2) Do we need to cluster the full-length reads? Richard’s TAMA pipeline skips the Iso-Seq clustering step and only uses the classified full-length non-concatemer (FLNC) reads. I agree that we are reaching the point where single-read CCS accuracy for FLNC reads is likely removing the need for clustering. In the future, we may be able to recommend directly mapping FLNC reads to the genome for annotation. However, I warn against trusting all mapped FLNCs, as many of the rare/solo FLNCs may be artifacts.
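As an illustration of the junction filtering raised in point (1), here is a minimal Python sketch that flags whether an intron has canonical splice-site dinucleotides. It is not SQANTI code: the function name and coordinate convention are assumptions, only the forward strand is handled, and GT-AG, GC-AG, and AT-AC are used as the set commonly treated as canonical.

```python
# Donor/acceptor dinucleotide pairs commonly treated as canonical.
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_canonical_junction(genome_seq, intron_start, intron_end):
    """Check the first and last two bases of an intron.

    intron_start and intron_end are 0-based, half-open coordinates of the
    intron on the forward strand (an assumption of this sketch).
    """
    donor = genome_seq[intron_start:intron_start + 2].upper()
    acceptor = genome_seq[intron_end - 2:intron_end].upper()
    return (donor, acceptor) in CANONICAL_PAIRS
```

A non-canonical junction is not automatically an artifact, so a check like this is best used as one filter among several rather than as a hard rule.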
Whole Genome Analysis Session
The CCS process consists of the following steps:
· Creating a draft consensus sequence using POA (Partial Order Alignment)
· Mapping the ordered subreads to the POA consensus for consensus calling
Both steps are quadratic in length, which becomes a computational bottleneck with increasing insert lengths.
The new version splits the input into windows for both the POA and consensus stages, so that separate POA and consensus processes can be run in parallel. The result? An 8–10X reduction in runtime! Furthermore, the new version has equivalent or better accuracy compared to the old version.
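A back-of-the-envelope Python sketch of why windowing helps, assuming a step whose cost grows with the square of the sequence length. The function names and window size are hypothetical, and this toy model ignores parallelism and any overhead at window boundaries, so it is not meant to reproduce the runtime figures reported in the talk.

```python
def quadratic_cost(length):
    """Cost of one quadratic-time pass over a sequence of the given length."""
    return length ** 2


def windowed_cost(length, window):
    """Total cost when the sequence is cut into windows handled independently."""
    full_windows, remainder = divmod(length, window)
    return full_windows * window ** 2 + remainder ** 2
```

When the window size divides the length evenly, the total work drops by a factor of length / window even before the windows are spread across threads, which is why longer inserts benefit the most.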
The latest CCS algorithm is available via BioConda and is part of the recently released SMRT Link 7.0.
Jana Ebler (@Saar_Uni, Germany) presented a new way of genotyping noisy long-read data. They defined the problem of haplotyping as a bipartitioning of the input reads into the two alleles using a Hidden Markov Model (HMM) framework. Each position is represented by all possible assignments of alleles into two partitions and an additional encoding to reject incompatible bipartitions. Every path in the HMM corresponds to a bipartition of all reads and a sequence of allele assignments. They evaluated their method on the NA12878 PacBio and ONT data using the GIAB high-confidence SNP call set and achieved 99.79% and 98.02% genotype concordance for PacBio and ONT, respectively.
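To show what "bipartitioning the reads" means, here is a small brute-force Python sketch. It is deliberately not the HMM from the talk: it simply tries every split of a handful of reads into two groups and keeps the split with the fewest allele observations that disagree with their group's majority. The data layout (one dict per read, mapping a site index to allele 0 or 1) and the function names are assumptions.

```python
from itertools import product


def bipartition_cost(reads, assignment):
    """Count allele observations that disagree with their partition's majority."""
    cost = 0
    for part in (0, 1):
        tallies = {}  # site -> [count of allele 0, count of allele 1]
        for read, side in zip(reads, assignment):
            if side != part:
                continue
            for site, allele in read.items():
                tallies.setdefault(site, [0, 0])[allele] += 1
        # The minority allele at each site counts as an error.
        cost += sum(min(zeros, ones) for zeros, ones in tallies.values())
    return cost


def best_bipartition(reads):
    """Exhaustively search all bipartitions (feasible only for a few reads)."""
    if not reads:
        return ()
    # Fix the first read in partition 0 to avoid scoring mirror images twice.
    candidates = ((0,) + rest
                  for rest in product((0, 1), repeat=len(reads) - 1))
    return min(candidates, key=lambda a: bipartition_cost(reads, a))
```

The exhaustive search grows exponentially with the number of reads, which is one reason a structured model such as an HMM is attractive for real data.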
Jana’s work will be published in Genome Biology soon. The preprint is available.
Fritz Sedlazeck (@sedlazeck, Baylor College of Medicine, USA) presented his new software tool called Princess. Princess is a Snakemake pipeline that does the following:
· SNP calling using Clair (successor of Clairvoyante).
· SV calling using Sniffles.
· Phasing SNPs + SVs
· Methylation (planned)
Armin Toepfer (@XLR, PacBio) provided an update on PacBio’s pbsv application for structural variation detection. The latest version supports all major SV types (deletion, insertion, duplication, inversion, translocation, copy number variation) with joint-calling capabilities. Using the HG002 dataset as a benchmark, pbsv outperformed other existing SV callers in both precision and recall. Importantly, even with 5-fold HiFi (CCS) coverage, pbsv achieved >90% recall! pbsv is also accurate at predicting breakpoints: it first creates a draft consensus, then re-aligns reads against a reference using a breakpoint-aware aligner.
pbsv is available via BioConda and SMRT Link.
Targeted Sequencing Session
Sven Warris (@swarris, Wageningen University, Netherlands) presented his group’s use of long-read amplicon sequencing for screening fungal genes of interest in a high-throughput fashion. They used a dual barcoding strategy to multiplex samples and analyzed the data using HiFi (CCS) reads. The amplicons ranged from 500–2,000 bp and ~83% of the reads had >10 passes, generating highly accurate reads. To be able to identify barcodes on the lower-quality CCS reads, they implemented a Smith-Waterman (SW) aligner that worked at subread-level accuracy (pyPaSWAS) and could align more reads than BWA.
Yahya Anvar (@anvarak, LUMC, Netherlands) showed how artifacts (PCR errors, non-full-length reads, and most importantly, PCR chimeras) make de novo allele identification difficult. In some cases, the chimeras can be as abundant as the true alleles! Yahya presented a clinical pipeline (to be published) that filters chimeras. The pipeline uses accurate CCS reads, filters away low-quality reads, clusters reads based on similarity, phases them, identifies variants, and finally assesses allele quality and integrity before passing the data to the clinicians.
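For readers unfamiliar with Smith-Waterman, here is a plain-Python sketch of local alignment scoring used to assign a barcode to a read. It is not pyPaSWAS and does not use its API: the scoring values, the `min_fraction` cut-off, and both function names are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def assign_barcode(read, barcodes, min_fraction=0.8, match=2):
    """Pick the barcode with the highest local score, or None if too weak.

    barcodes maps a barcode name to its sequence. A hit must reach
    min_fraction of the score of a perfect full-length match.
    """
    best_name, best_score = None, 0
    for name, seq in barcodes.items():
        score = smith_waterman_score(read, seq, match=match)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None:
        return None
    perfect = match * len(barcodes[best_name])
    return best_name if best_score >= min_fraction * perfect else None
```

Because local alignment tolerates mismatches and indels, a barcode can still be recognised in a read that is too noisy for exact matching, at the cost of quadratic time per read and barcode pair.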
Day 3 concludes SMRT Leiden 2019! If you are interested in the previous two days, please go to: