SMRT Leiden Bioinformatics: De Novo Assembly

Published in PacBio · 8 min read · Jun 14, 2018

We are on the last day of SMRT Leiden, a three-day conference bringing the SMRT Community together to share their scientific discoveries and novel analytical achievements using PacBio sequencing. You can read about day 1 and day 2 here.

Today is the SMRT Informatics Developers Conference. Whereas the previous two days focused on scientific research using PacBio, today’s conference is a mixture of bioinformatics talks and open discussion. I did my best to keep notes during the discussion sessions, but inevitably not everything that was discussed could be captured.

This is part 1 of the Developers Conference covering de novo assembly. The rest of the topics: structural variation, amplicon sequencing, and Iso-Seq, will be covered in the next post.

Sergey delivered the keynote on trio binning, while Jim Drake represented the PacBio engineering team to bring you the latest news on software development.

Sergey Koren (NHGRI, GIS, @sergekoren) kicked off the keynote on the topic of TrioBinning (preprint), a novel approach for complete haplotype reconstruction. He stated that the common approach of sequencing inbred organisms is likely not the best one: inbreeding is not perfect, and the two haplotypes composing the individual are rarely identical. This makes assembly harder and can cause downstream issues. Sergey and his group proposed a mindset shift: sequence outbred individuals instead.

The TrioBinning approach works by first sequencing the parents using short reads and then the offspring using PacBio. PacBio reads are then classified into two parental clusters based on exact k-mer matching and assembled independently. This approach has several advantages including phase-specific Arrow-polishing using only the classified reads, gap filling with haplotype specific reads, and easier addressing of different repeat content.
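The classification step described above can be illustrated with a minimal Python sketch. This is not the TrioCanu implementation: the function names, the toy sequences, and the choice of k are all made up for illustration, reverse complements are ignored, and raw hit counts are compared without any normalization.

```python
from typing import Iterable, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(
    maternal_reads: Iterable[str], paternal_reads: Iterable[str], k: int
) -> Tuple[Set[str], Set[str]]:
    """Collect k-mers seen in only one parent's short reads (hypothetical helper)."""
    maternal = set().union(*(kmers(r, k) for r in maternal_reads))
    paternal = set().union(*(kmers(r, k) for r in paternal_reads))
    return maternal - paternal, paternal - maternal


def classify_read(read: str, maternal_only: Set[str], paternal_only: Set[str], k: int) -> str:
    """Assign a long read to the parent whose specific k-mers it matches most often."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy example: the parents differ only at the end of the sequence.
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], k=4)
for read in ["TACGTAAG", "ACGTCCA", "ACGTACG"]:
    print(read, classify_read(read, mat_only, pat_only, k=4))
```

Each bin of reads would then be assembled on its own; reads with no parent-specific k-mers stay unassigned in this sketch.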

Sergey and his team tested this approach on a number of different genomes, including Arabidopsis, human (NA12878), and an Angus x Brahman F1 bull, with success. On the NA12878 dataset, they compared the 10X Supernova (short read only) assembly and TrioCanu (PacBio only) assembly, and found that the PacBio assembly assembled the Alu and LINE repeats while the short reads couldn’t.

Sergey briefly mentioned quickmerge, which creates “pseudo-haplotypes” by merging haplotypes to create longer (but not always correctly phased) contigs for those who wish to see a longer N50 stat on their table… I mean, to span repeats that may break contigs in only one of the haplotypes.

Software mentioned in this talk:

· trioBinning

· Canu

· quickmerge

James Drake (PacBio) talked about the recent and upcoming SMRT Link/SMRT Analysis releases. These include Iso-Seq3 (see my SMRTLeiden presentation) and pbsv 2.0 (see Armin’s SMRTLeiden presentation). A lot of work also went into making SMRT Link more user friendly, with more data information and figures, better barcoding support, and, finally, security. He encouraged developers to look into the SMRT Tools Reference Guide, which gives command line-savvy users a lot more control.

Jim addressed a common naming confusion regarding PacBio’s consensus tools. “Quiver” and “Arrow” are both consensus algorithms. The command line tool is called variantCaller which employs both.

Jim briefly mentioned pbbamify, which can convert any arbitrary BAM file to a PacBio-compatible BAM file. Why do you need this? Because users may use aligners other than BLASR to generate alignments for consensus calling.

To unpack that confusion: why do we have algorithms called Quiver and Arrow, which are used to “polish” data into consensus sequences, and what are the actual command line names for these tools? “Consensus” is a concept, the goal of which is “to fix noisy reads by piling them up”. To reach that goal, different algorithms have been invented over the years: Quiver for PacBio RS II data and Arrow for Sequel System data. variantCaller is the command line tool that knows when to call Quiver or Arrow, depending on which instrument the data was generated on. This is reflected in the SMRT Tools documentation, where the `--algorithm` parameter can accept `arrow` or `quiver`.

CCS (circular consensus sequence), which also calls consensus sequences, is a single-molecule consensus: one CCS read is produced per ZMW (i.e., from one SMRTbell template with a single insert). Quiver and Arrow are multi-molecule consensus algorithms: they call a consensus given multiple ZMWs (i.e., from multiple SMRTbell templates with overlapping inserts).
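To make the “piling them up” idea concrete, here is a toy Python sketch of a per-column majority vote over reads that are assumed to be already aligned and gap-padded to the same length. It is only an illustration of the consensus concept; Quiver and Arrow are model-based algorithms and do considerably more than a simple vote. The function name and the example reads are invented.

```python
from collections import Counter
from typing import List


def pileup_consensus(aligned_reads: List[str], gap: str = "-") -> str:
    """Majority vote per alignment column; columns won by the gap symbol are dropped."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


# Five noisy reads of the same region: two carry a spurious insertion (column 4)
# and one carries a substitution (column 3).
reads = ["ACGTTA", "ACG-TA", "ACCTTA", "ACG-TA", "ACG-TA"]
print(pileup_consensus(reads))
```

The same vote applies whether the rows come from repeated passes of one molecule (the CCS case) or from overlapping reads of many molecules (the Quiver/Arrow case); only the source of the pile differs.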

Program usage description for variantCaller in SMRT Tools Reference Guide.

Jim acknowledged two remaining challenges in consensus calling: long homopolymers and diploid consensus calling. Headway has been made in both of these areas.

Jim spent the final part of his talk giving pitchfork, the beloved package manager tool that PacBio employee MJ maintained, a well-deserved farewell. We are now ready to put everything in… BioConda!!!! Yes, no more installation! Now installing BLASR is literally just one line:

conda install -c bioconda blasr

You’re welcome.

Software mentioned in this talk:

· SMRT Link and SMRT Tools

· FALCON and FALCON-Unzip

· pbsv 2.0 (presentation)

· Iso-Seq3

· pbbamify

· BioConda everything! (BLASR, LIMA, etc)

Brett Hannigan (DNAnexus) evaluated haplotype phasing with FALCON-Unzip. They mixed the CHM1 and CHM13 cell lines into an “in silico child”. The FALCON assembly alone aligned nicely to the GRCh37 reference. Applying FALCON-Unzip out of the box enabled complete and accurate separation of 72.4% of the assembled haplotigs, while the remaining ones contained a mixture of haplotypes due to haplotype switching.

Sarah Kingan (PacBio, @drsarahdoom) presented FALCON-Phase, the brainchild of Zev Kronenberg (Phase Genomics) and Sarah. The idea came out of this year’s PAG conference when, after Sergey and Adam presented their trio binning work, Zev and Sarah talked about what to do when parental data is not available.

Here’s the background: FALCON-Unzip outputs “pseudo-haplotypes”, where a single contig can contain a mixture of the two parental alleles. Sergey and Adam’s solution (Trio Binning) was to use parental data to pre-bin the F1 child data into two sets and assemble them independently. In the absence of parental data, Zev and Sarah proposed using Hi-C data to correct the FALCON-Unzip phase switches. The workflow is as follows: “mince” contigs into pieces so that each piece is pure (from only one parent), then use Hi-C data to identify the best order in which the minced pieces go. Minced pieces that are from the same parental allele will have a lot more Hi-C contacts with each other than pieces from separate parental alleles.

It is important to note that FALCON-Phase does not correct assembly errors or rescue collapsed regions from FALCON-Unzip. For example, in the Angus x Brahman F1 cattle dataset, 90% of the genome is Unzipped, meaning 10% is collapsed and would not be separable (Sarah says the collapsed regions tend to be shorter). More than 80% of the Unzipped region was successfully assigned by FALCON-Phase. Using the TrioCanu result as ground truth, they determined that 96% was accurately assigned by FALCON-Phase. Finally, they scaffolded the data with Hi-C data using Proximo from Phase Genomics and ran FALCON-Phase again on the scaffolds. The result: a diploid-phased, scaffolded, chromosome-scale assembly! Sarah’s full presentation is here.
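The intuition that same-haplotype pieces share more Hi-C contacts can be sketched in a few lines of Python. This is a toy greedy procedure written for this post, not the FALCON-Phase algorithm: it assumes each minced block comes as a pair of alleles (0 and 1), that Hi-C contact counts between pieces are already tabulated, and the names and numbers are invented.

```python
from typing import Dict, List, Tuple

Piece = Tuple[int, int]  # (block index, allele 0 or 1)


def phase_blocks(n_blocks: int, contacts: Dict[Tuple[Piece, Piece], int]) -> List[int]:
    """For each block, return which allele (0 or 1) is placed on haplotype A.

    Blocks are processed in order; each one is oriented so that it maximizes
    Hi-C contacts with the pieces already placed on the same haplotype.
    """
    def count(a: Piece, b: Piece) -> int:
        # Contacts are symmetric, so look the pair up in both orders.
        return contacts.get((a, b), 0) + contacts.get((b, a), 0)

    assignment: List[int] = []
    for i in range(n_blocks):
        keep = flip = 0
        for j, on_a in enumerate(assignment):
            on_b = 1 - on_a
            # keep: allele 0 of block i joins haplotype A, allele 1 joins B.
            keep += count((i, 0), (j, on_a)) + count((i, 1), (j, on_b))
            # flip: allele 1 of block i joins haplotype A, allele 0 joins B.
            flip += count((i, 1), (j, on_a)) + count((i, 0), (j, on_b))
        assignment.append(0 if keep >= flip else 1)
    return assignment


# Toy contact counts: block 1 is "switched" relative to blocks 0 and 2.
contacts = {
    ((0, 0), (1, 1)): 10, ((0, 1), (1, 0)): 9, ((0, 0), (1, 0)): 1,
    ((1, 1), (2, 0)): 8, ((1, 0), (2, 1)): 7, ((0, 0), (2, 1)): 2,
}
print(phase_blocks(3, contacts))
```

A greedy pass like this can get stuck on an early wrong choice; it is meant only to show why contact counts carry phase information.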

Software mentioned in this talk:

· FALCON-Phase

Georg Papoutsoglou (Bionano Genomics) gave a talk titled “Beyond NGS: Bionano Genome Mapping for Genome Assembly & SV detection”. Bionano Genomics focuses on developing their technology using nanochannel arrays on silicon, which is suitable for structural variation detection and de novo genome assembly (hybrid scaffolding), with several exciting applications coming in the near future: whole genome methylation, origins of replication, and CRISPR/Cas9 targeted approaches.

The de novo assembly session then opened up for discussion. The panelists included Arang Rhie, Sergey Koren, Zev Kronenberg, and Brett Hannigan, with Sarah Kingan as moderator. Here is some of the discussion I was able to capture:

· Jim Drake asked about the utility of incorporating transcriptome data (Iso-Seq data) as part of the assembly. I answered that I had actually looked into that for the Angus x Brahman F1 cattle and did not find Iso-Seq data to be particularly useful. The reason is that the genes were at most 1 Mb long, and FALCON-Unzip had already gotten the phasing correct within 1 Mb. Doreen Ware then pointed out that cattle are “simple” compared to plants. So, maybe, in the future, there will be a role for transcriptome data as part of the assembly process.

· I raised the question of assembling higher-ploidy genomes. Sergey thinks that getting diploid assembly right, which he says we are only narrowly achieving with the trio binning approach, needs to come before tackling anything harder.

· Zev mentioned that the 1000 Genomes Project took an orthogonal approach to phasing. Because human has a very good reference genome, they first call SNPs, then segregate long reads based on those SNPs, and then perform local assembly (preprint). Arang pointed out that this approach will probably only work for species, like human, that have very good references.

· Arang and Sergey stated that the Vertebrate Genomes Project (VGP) will output genome assemblies in both FASTA and GFA (Graphical Fragment Assembly) format. Raw data will be publicly available.

· What happens after assembly? Martin Pollard pointed out that most people want reliable genome annotation. Zev referred to Ian Fiddes’ CAT tool as a promising start for integrating genome and transcriptome (Iso-Seq and RNA-seq) data for genome annotation.

· Sergey: Many tools developed by companies do not consider other technologies and don’t take an integrative approach. Better tools could be developed that integrate technologies together rather than applying each technology step by step.

· Brett asked if using non-BLASR aligners for polishing would reduce consensus performance. Ivan Sovic (PacBio) had done limited testing using minimap2 on yeast and rice and did not observe a difference; however, the tool still needs to be tested on additional samples.

· They discussed polishing strategies. Arang recommended polishing only with PacBio reads using 40-fold coverage, with haplotype-phased reads if possible. She found best results with two rounds of Arrow. If you are coverage-limited and need to polish with Pilon, Sergey recommends fixing only indels (not SNPs) or using 10X read clouds.
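The reference-based phasing Zev described in the discussion above (call SNPs, then segregate long reads by SNP allele before local assembly) can be sketched roughly as follows. This toy Python example assumes the heterozygous SNPs are already phased into two haplotypes and that each read has been reduced to the bases it shows at SNP positions; the positions, alleles, and function name are all hypothetical.

```python
from typing import Dict


def assign_haplotype(
    read_alleles: Dict[int, str], hap1: Dict[int, str], hap2: Dict[int, str]
) -> str:
    """Assign a read to the haplotype whose SNP alleles it matches more often.

    read_alleles maps reference position -> base observed in the read;
    hap1/hap2 map reference position -> allele carried by each haplotype.
    """
    hap1_hits = sum(1 for pos, base in read_alleles.items() if hap1.get(pos) == base)
    hap2_hits = sum(1 for pos, base in read_alleles.items() if hap2.get(pos) == base)
    if hap1_hits > hap2_hits:
        return "hap1"
    if hap2_hits > hap1_hits:
        return "hap2"
    return "unassigned"  # read covers no informative SNPs, or the evidence is tied


# Three phased heterozygous SNPs (made-up positions and alleles).
hap1 = {100: "A", 250: "G", 400: "T"}
hap2 = {100: "C", 250: "T", 400: "G"}

print(assign_haplotype({100: "A", 250: "G"}, hap1, hap2))
print(assign_haplotype({250: "T", 400: "G"}, hap1, hap2))
print(assign_haplotype({100: "A", 250: "T"}, hap1, hap2))
```

Because every step leans on positions in a reference, the sketch also shows why Arang expected this route to work mainly where a very good reference exists.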

You may continue on to part 2: structural variation, amplicon sequencing, and Iso-Seq transcriptome sequencing.


Written by Liz T