SMRT Leiden 2019 Day 1: Illuminating Life with Better Sequencing

“Technology changes the way we perceive our universe.”

— Paul Hebert, keynote speaker for SMRT Leiden 2019

Paul Hebert (University of Guelph, Canada) kicked off SMRT Leiden 2019 with an inspiring keynote talk on the international Barcode of Life (iBOL) project.

“What do we risk losing with the extinction of species?” Paul began. “Every loss of species is an irrevocable loss of genomic knowledge.”

And we are racing against time.

The May 2019 Science cover: Betting on Biodiversity.

The iBOL consortium, launched in 2007 out of Leiden, had the original mission to develop and deploy a DNA-based identification system for animals, fungi, and plants. Phase 1 of the project was to “barcode” life. The Barcode 500k project, which used 648 bp of the mitochondrial cytochrome oxidase I gene, was completed in 2015. In addition to delivering DNA barcodes for 500,000 species, it developed an automated species recognition and discovery system called the Barcode of Life Data System (BOLD), which forms the foundation of the next two phases.
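To make the barcoding idea concrete, here is a minimal sketch of how a query barcode might be matched against a reference library by sequence divergence. This is a toy illustration, not how BOLD works internally: the function names, the pre-aligned equal-length inputs, and the 2% divergence cutoff are all assumptions made for the example.

```python
# Toy sketch of barcode-based identification. All names and the default
# threshold are hypothetical; this is not the BOLD identification engine.

def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites between two aligned, equal-length sequences.

    Sites where either sequence has a non-ACGT character (gap, N) are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [
        (x, y) for x, y in zip(a.upper(), b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def identify(query: str, references: dict, max_divergence: float = 0.02):
    """Return (species, distance) for the closest reference barcode.

    If even the closest reference is more divergent than max_divergence
    (an assumed cutoff), the species is reported as None, which a discovery
    pipeline could treat as a candidate new species.
    """
    best_name, best_dist = min(
        ((name, p_distance(query, seq)) for name, seq in references.items()),
        key=lambda item: item[1],
    )
    if best_dist <= max_divergence:
        return best_name, best_dist
    return None, best_dist
```

In practice the barcodes would be full 648 bp COI sequences aligned beforehand; the sketch only shows the compare-and-threshold step.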

Phase 2 of iBOL is BIOSCAN. Whereas Barcode 500k registered known species, BIOSCAN focuses on scanning for the unknown. Paul gave an example of just how much we don’t know. Previously, there were 29,985 known Canadian insect species. Using the DNA barcode approach, they identified 64,400 barcodes. This puts the global insect species diversity closer to 10 million compared to a prior estimate of 6.5 million…and that’s just insects!

Insects alone are hugely diverse. A recent DNA sequence-based estimate puts the number at 10 million species! Photo by Егор Камелев on Unsplash

“We want to know every species,” Paul says of the BIOSCAN project. But he doesn’t stop there: “We also want to know the intersection between every species. Finally, we need a global biosurveillance system.”

Here is where technology, specifically PacBio’s accurate long reads, comes into play. Originally based on Sanger sequencing, iBOL recently moved to the PacBio Sequel System for species discovery. They found that the Sequel System was not only cost-effective, but also more accurate at delivering species-level information. The long read lengths and lack of bias meant it was capable of blasting through homopolymer regions. According to Paul, a single Sequel System can barcode 5 million species per year and run species discovery for 500,000 species per year.

Moving to the second goal of BIOSCAN, species interaction, long reads again show their advantage. Environmental samples are messy! Paul discovered that some DNA mixtures from insects contain DNA from the animals they feed on (elephants, kangaroos, even nematodes). So, to characterize a sample comprehensively, they extract the DNA of whole species, amplify with primers from different taxa, sequence deeply (>3000-fold), and voilà: you have a species symbiome!

For example, they sequenced lice species from humans, monkeys, and chimps, and showed that there are six lineages of lice on humans with a high sequence divergence of up to 8%! What caused this accelerated evolutionary race in lice? One hypothesis is that different lice species intermingled along with the early hominin migration.

The last piece of the puzzle, biosurveillance, will be done at large scale by grinding up bulk, mixed-species samples, fragmenting the DNA, and sequencing short barcodes for each sample.

Paul envisions that, just like the Svalbard Global Seed Vault, they may eventually be able to have a Global “Bank of Life”, where bits of DNA from every (eukaryotic) species on the planet are stored. He is optimistic that we will get there soon.

And what about the cost?

“We know we can’t afford to not do it,” Paul said. “And we know we can afford to do it.”

Herve Vanderschuren (University of Liège, Belgium) spoke about circular DNA enrichment for full-length sequencing. His lab developed a new approach called CIDER-Seq for unbiased sequencing of small circular genomes and applied it to the geminiviruses that infect cassava, a very important crop grown in the tropics. This enabled them to engineer virus-resistant cassava, which has already been field tested. Because the approach is general, CIDER-Seq can also be applied to sequencing extra-chromosomal DNA.

bitter gourd

Henri Van de Geest (@geesthc, Genetwister Technologies, Netherlands) used PacBio sequencing to assemble the genome of the bitter gourd (Momordica charantia), an increasingly important crop in Southeast Asia.

Using the PacBio Sequel System to generate >100-fold coverage, together with BioNano data, they resolved the 11 chromosomes in a ~300 Mbp genome assembly. Comparing it with other cucurbit genomes (e.g., squash, cucumber) revealed evolution at the chromosome level. In addition, he took the longest reads from the PacBio data set, comprising 66-fold coverage of the genome, and assessed three genome assemblers: Canu, Wtdbg2, and Flye. Wtdbg2 was fastest but performed the worst on BUSCO analysis and had the most mis-assemblies; Canu produced the fewest contigs and the fewest mis-assemblies but took the longest to run. Flye’s performance was in the middle.

The Cannabis sativa L genome was sequenced using PacBio, funded using cryptocurrency.

Kevin McKernan (Medicinal Genomics, USA) sequenced the Cannabis sativa L genome. Weed is cool, but what’s cooler is they used cryptocurrency to fund the sequencing! Combining the PacBio Sequel System with Hi-C from Phase Genomics, they achieved nearly chromosomal resolution. The new reference genome enabled better RNA-seq analysis and after they added full-length transcriptome (Iso-Seq) data, they were further able to see interesting alternative splicing in genes such as CsPT1.

Fritz Sedlazeck (@sedlazeck from @BCM_HGSC) sequenced the genome and transcriptome of a European cave fish (in collaboration with @UniOldenburg and @unikonstanz) as the winner of the 2018 Plant and Animal Sciences SMRT Grant. What is unique about this cave fish, other than the fact that it lives in essentially complete darkness, is that while most cave fishes diverged a few million years ago, this one diverged only 12,000 years ago, giving a unique look into evolution in action. Furthermore, there are currently no reference genomes for any loach species. FALCON-Unzip successfully phased ~70% of the genome, and using Iso-Seq for genome annotation, they want to look at which genes are impacted by the fish’s unique living environment. Iso-Seq obtained 16k candidate genes (close to the estimated number of genes) and 58k candidate transcripts. Lastly, Fritz showed that they have successfully cross-bred the cave fish with a surface-dwelling species (F1) and have plans to create F2s as well.

Barbatula (stone loach), the first European cave fish to be sequenced. Photo provided by Fritz Sedlazeck.

Kateryna Makova (Penn State, USA), an early adopter of PacBio sequencing and a prolific researcher (her work ranges from assembling the gorilla Y chromosome to targeted Iso-Seq), presented the second keynote talk of the day on how non-canonical DNA affects polymerase speed and error rate.

She explained that 13% of human DNA, at certain motifs, can form non-B (non-canonical) DNA. These motifs have important functions in the genome, including protection of telomeres, regulation of gene expression, and regulation of origins of replication. But they are also associated with diseases such as cancer, neurological and muscular degenerative diseases, and can serve as anti-cancer drug targets.

3D structure of G-quadruplex. Figure from Wikipedia.

Her study asks: how do non-B DNA motifs affect polymerase speed and error rate?

Using an existing non-B DNA database, annotated short tandem repeats (STRs), and the Genome In A Bottle Consortium Ashkenazim PacBio WGS data, they observed elevated interpulse durations (IPDs) where G-quadruplex motifs are located. Further, G-quadruplex motifs show a strand-specific slowdown in polymerization. Interestingly for the technical geeks, Makova noted that this IPD elevation is stable even after multiple sequencing passes around the circular template. Similarly, STRs result in periodic, motif-specific IPD elevation. Makova concluded that non-B DNA motifs almost always affect polymerization rate compared to motif-free B-DNA. But what is the explanation for this? Neither sequence composition nor sequence modifications explain the IPD changes. Makova hypothesizes that alternative DNA formations are likely the strongest influencers. She showed that the G4 structure is formed within the PacBio machine; in fact, even the (GGT)n motif can form a G4-like structure, resulting in a similar IPD profile!
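For readers who want to picture the kind of comparison involved, here is a small sketch that flags G-quadruplex-like motifs on one strand and compares the mean IPD inside them to the mean IPD elsewhere. It is only an illustration under stated assumptions: the regular expression is the commonly used G4 consensus (four runs of three or more Gs separated by loops of 1 to 7 bases), not necessarily the annotation used in the study, and `g4_mask` and `ipd_ratio` are hypothetical helper names operating on a made-up per-base IPD list.

```python
import re
from statistics import mean

# Commonly used G-quadruplex consensus: four runs of >=3 G's separated by
# loops of 1-7 bases. This pattern is an assumption for illustration and
# only scans the strand it is given (C-rich matches on the opposite strand
# would need the reverse complement).
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def g4_mask(seq: str) -> list:
    """Return one boolean per base: True if the base lies inside a G4 match."""
    mask = [False] * len(seq)
    for match in G4_PATTERN.finditer(seq.upper()):
        for i in range(match.start(), match.end()):
            mask[i] = True
    return mask


def ipd_ratio(seq: str, ipds: list) -> float:
    """Mean IPD inside G4 matches divided by mean IPD outside them."""
    if len(seq) != len(ipds):
        raise ValueError("need exactly one IPD value per base")
    mask = g4_mask(seq)
    inside = [value for value, flagged in zip(ipds, mask) if flagged]
    outside = [value for value, flagged in zip(ipds, mask) if not flagged]
    if not inside or not outside:
        raise ValueError("need bases both inside and outside a motif")
    return mean(inside) / mean(outside)
```

A ratio above 1 would correspond to the polymerase slowing down within the motif; scanning each strand separately is one way to look for the strand-specific effect described above.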

For the second part of her talk, Makova turned her attention to how non-B DNA motifs may affect polymerase error rate. She showed that different non-B DNAs have different effects. For example, G4+ motifs increase mismatch rates, while A-phased repeats reduce deletions. They showed that these error changes are not explained by base composition. So, what could be underlying this? Makova makes the link that the alternative DNA structures, which slow down polymerase speed, are associated with increased error rates, particularly at G4+ motifs.

To conclude, Makova showed us that non-B DNA structures affect polymerase speed and error rate, possibly not just in a sequencer, but also in vivo.

This work was published in Genome Research in 2018.

Oliver Duss (Scripps Research Institute, USA) used PacBio sequencing to directly monitor transcription, nascent RNA folding, and ribonucleoprotein assembly in real time. Ribosomes are the protein production machines of cells and are composed of 50 proteins, 3 rRNAs, and >30 assembly factors. Nascent RNA is first transcribed and then translated, folded, modified, and assembled. Single molecule fluorescence microscopy (SMFM) offers a unique opportunity to monitor thousands of individual molecules at once and to observe transcription and protein binding simultaneously. Transcription complexes are immobilized in Zero Mode Waveguides (ZMWs) on a SMRT Cell to observe protein binding to single RNAs. Oliver and colleagues tracked transcription of single RNA molecules ranging from <100 to >500 nucleotides, with average transcription rates of 0.4 to 60 nt/second. They also detected RNA polymerase stalling for a few seconds at transcription terminators before dissociating from the DNA template.

Oliver’s work was published in Nature Communications.

Then we had two Iso-Seq talks.

I gave an update on two interesting aspects of Iso-Seq. First, I shared a preprint showing a novel method for synthesizing probes for target enrichment of full-length transcripts. The OCS (ORF Capture Seq) probes were compared to commercial probes and were shown to be just as effective at capturing low-abundance transcription factor genes, sometimes enriching them by more than 7000-fold! Probe-based enrichment achieves the best of both worlds: less sequencing is needed to achieve higher resolution for genes of interest, while the use of probes (instead of gene-specific primers) allows capturing alternative starts and ends. I then showed two cases of single-cell Iso-Seq. In the first case, Russell et al. enriched for viral transcripts in H1N1-infected cells and showed that viral genetic mutation is a significant (but not sole) contributor to heterogeneity in viral burden and contributes to immune response. In the second case, PacBio’s own Jason Underwood recently unveiled, at AGBT 2019, sequencing of primate brain organoid cells using a Drop-Seq platform.

Iain Macaulay (Earlham, UK) followed up with a second single-cell talk. As early as 2015, his lab looked at using long-read data to confirm novel fusion transcripts in single HCC38 cells. Recently, he applied PacBio sequencing to two ends of the single-cell spectrum using mouse models during hematopoiesis. On one end is the droplet-based 10X platform, where thousands to tens of thousands of cells can be queried in a single run. Using 1 SMRT Cell on the Sequel System, he detected ~25,000 transcripts (or ~14,000 genes) per library. Preliminary analysis pointed to potential isoforms that may be key regulators in hematopoiesis. On the other end, the plate-based (96-well) sequencing achieved much higher coverage of 300–3000 genes (or 20,000–30,000 reads) per SMRT Cell. The long-read data is accurate and can be combined with matching short-read data for cell type information. “The era of single-cell isoform sequencing has begun,” concluded Macaulay.

“God has an inordinate fondness for beetles.” — J.B.S. Haldane. Drawer image from Wallace and Banks collection, Natural History Museum, UK, obtained under CC License.

Mara Lawniczak (@marakat, Wellcome Sanger Institute, UK) rounded out day 1 of SMRT Leiden with the third keynote talk.

Continuing the biodiversity theme, Mara described how global momentum toward sequencing, understanding, and protecting Earth’s biodiversity led to the Earth BioGenome Project. It is a moonshot for biology that aims to sequence, catalog, and characterize the genomes of all of Earth’s eukaryotic species. Phase 1 includes the Darwin Tree of Life Project, which aims to collect 8000 species and generate reference genomes for 2000 of them. The reference genomes will be sequenced using a mixture of PacBio, Hi-C, optical mapping, linked reads, and full-length transcriptome sequencing. For 5% of the species, they will further collect population data.

Mara points out one key challenge in genome sequencing: many of the species are small and heterozygous.

The mosquito genome endeavor highlights this challenge. Mosquito control is a huge health issue. Many Anopheline species are major malaria vectors. The Anopheles gambiae 1000 Genomes Project aimed to create a public database of mosquito genomes for better malaria control.

Mara points out just how much insight could be gained with population-level mosquito genomes. Having sequenced ~800 mosquitoes from 8 different countries, they found huge heterozygosity — an estimate of 2.5 million SNPs per mosquito!

Further, the population-level information revealed that insecticide resistance can have multiple origins. For example, the VGSC gene is an essential component of the insect nervous system, and the L995S/F mutation reduces the binding efficiency of DDT and pyrethroids (the only insecticide class approved for bed nets). WGS data revealed at least 10 separate outbreaks of resistance at the VGSC gene. In another example, they found that metabolic genes were enriched for copy number variants, also often with multiple origins of duplication.

Anopheles gambiae, a mosquito that is a known malaria vector. Photo from Wikipedia.

Mara then turned her attention to what she dubbed “Neandersquitos”: mosquitoes from museum collections dating back more than 100 years. She wants to see what the mosquito genomes looked like before the introduction of insecticides. Leveraging the vast mosquito collection from the Natural History Museum and with a lot of technical development, they’ve successfully extracted DNA from 10 ancient mosquitoes dating 1933–1988, though the sequencing data has shown clear evidence of water damage to the DNA. The goal of looking at Neandersquitos is ultimately to see whether loci related to insecticide resistance have evolved.

“How can we outpace the evolution of resistance?” Mara explained that all analysis will depend on having high-quality reference genomes. And we are not quite there yet.

For the second part of her talk, Mara spoke about sequencing the A. coluzzii genome from a single mosquito using PacBio’s new low DNA input workflow. The new workflow requires only 150 ng of DNA for up to 300 Mb of genome and can be scaled linearly with genome size; no DNA shearing or size selection is needed. This workflow resulted in a high-quality A. coluzzii genome with a contig N50 of 3.5 Mbp that required a single animal and 3 Sequel SMRT Cells. Compare this to the Aedes aegypti genome, which required 80 animals and 177 RS II SMRT Cells, and the difference is huge! The low DNA input workflow used for the mosquito assembly was published in Genes.
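Two of the numbers above are easy to make concrete. The sketch below shows the input scaling as I read it (150 ng covers genomes up to 300 Mb, then scales linearly; treating 150 ng as a floor for smaller genomes is my assumption) and the standard definition of contig N50. The function names are hypothetical and the contig lengths in any usage would be your own data, not values from the talk.

```python
def required_input_ng(genome_size_mb: float, ng_per_300_mb: float = 150.0) -> float:
    """Estimate DNA input for the low-input workflow.

    Assumes 150 ng is enough for any genome up to 300 Mb and that the
    requirement grows linearly with genome size beyond that.
    """
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return ng_per_300_mb * max(genome_size_mb, 300.0) / 300.0


def n50(contig_lengths: list) -> int:
    """Contig N50: the length L such that contigs of length >= L together
    contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

A higher N50 for the same total assembly size means the genome is captured in fewer, longer pieces, which is why it is a common headline metric when comparing assemblies.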

Looking towards the future, Mara is working on barcoding for low-input samples because the Sequel II SMRT Cell 8M is actually overkill for some of the smaller species. She’s also working with the Bill & Melinda Gates Foundation to sequence >10 South American and African malaria vector species.

Finally, some robots! Mara described Project Premonition at Microsoft Research, where they developed a “robotic field biologist” which records and tracks mosquito movements to identify and trap mosquitoes of interest. This could be used to monitor disease outbreaks and also capture mosquitoes that could be used for sequencing.

And that concludes Day 1 of SMRT Leiden 2019! We began the day with the initiative to use new technology to identify, register, and monitor life. Throughout the day, we heard how long read technology is revealing new insights into the diverse species that are relevant to our nutrition (bitter gourd, cassava), medicine (cannabis), and evolution (cave fish). We saw how technology can be developed to query chemical processes (transcription) and reveal new information (single cell). We ended our day with the grand vision to sequence all life on Earth, both past and present. It is the foresight of those who preserved the past and the visions of those who look into the future, combined with the creative use of emerging technologies, that will determine how successful we will be in carrying on life on this Earth (and possibly beyond)!

Photo by veeterzy on Unsplash