SMRTLeiden 2018, Day 1: This Revolution is Data Driven

Published in PacBio, Jun 12, 2018

Summary of the first day of SMRT Leiden, a three-day conference bringing the SMRT Community together to share their scientific discoveries and novel analytical achievements using PacBio sequencing.

The Bat1k consortium aims to sequence all extant bat species in the world. Photo by Eric Pedersen Torales on Unsplash

Sonja Vernes (Max Planck Institute for Psycholinguistics), director of the Bat1k consortium, gave the first keynote of the day. The Bat1k project is dedicated to sequencing all existing bat species, with the goal of constructing chromosome-level genome assemblies, performing annotations, and cataloging the unique genetic endowment and diversity of bats. Why do we care about bats? There are ~1300 species of bats with extreme phenotypic diversity (ranging from giant golden-capped fruit bats with 1-meter wingspans to the tiny x bats that wrap around your fingers). They are social animals capable of vocal learning. And they live for a very long time and are very resistant to viral infections, cellular damage, telomere shortening, and cancer. Previously assembled bat genomes are of low quality (typically ~0.02 Mb contig sizes, thousands of contigs, QV≤40). With PacBio sequencing, the contig N50 of newly assembled bat genomes is on the order of several Mb. Complementary technologies (10X, Bionano and Hi-C) are used to achieve chromosome-level scaffolding. Of particular interest to Sonja are the genes involved in the language pathways, such as FOXP2. She plans to use PacBio Iso-Seq analysis to obtain full-length transcripts and look for splicing differences in vocal genes between bat species. Finally, Sonja pointed out that Bat1k.com is open to all volunteers!
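Since contig N50 comes up in several talks, here is a minimal sketch of how the statistic is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy lengths are made up for illustration and are not data from any of the talks.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length L such that contigs of length >= L
    together cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assembly (hypothetical numbers): four contigs, 10 Mb in total.
toy_lengths = [4_000_000, 3_000_000, 2_000_000, 1_000_000]
print(contig_n50(toy_lengths))  # prints 3000000
```

A fragmented assembly with thousands of short contigs gives a small N50, while a few Mb-scale contigs push the value up, which is why the jump from ~0.02 Mb contigs to an N50 of several Mb is such a big deal.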

Olga Pettersson (Uppsala University, National Genomics Infrastructure, SciLifeLab) gave a very informative and enlightening talk about the importance of chemical purity for the quality of long-read DNA sequencing. Long DNA molecules are not enough: low chemical purity will cause worse loading, shorter reads, and higher overall project cost. Dr. Pettersson described the “DNA-spa” approach: letting the molecules relax for 3–7 days at room temperature allows them to untangle and helps achieve better sequencing output.

By the way, Olga is apparently also an amazing artist. Here’s a bracelet she made from used SMRT Cells! (Upcycling at its best!)

Inge Kjaerbolling (Technical University of Denmark) talked about using the new Aspergillus genomes for linking compounds to metabolite clusters. With whole genome sequences, it is possible to identify secondary metabolite gene clusters. Secondary metabolites are bioactive compounds responsible for various important functions, such as the virulence and pathogenicity of a fungus. This talk demonstrated the enormous potential for applying this research. Her work has been published in PNAS.

John Hammond (Pirbright Institute) talked about targeted sequencing of immune-related genes. His group is focused exclusively on the interaction of the leukocyte receptor complex (LRC), the natural killer complex (NKC), and the MHC in cattle. These immune genes are highly repetitive, difficult to sequence, and poorly annotated. He revealed astonishing diversity in the MHC and KIR regions between different cattle species (unpublished work). Short reads were not able to span the repeats, hence the need for long reads. He went through four rounds of probe design; off-target enrichment is a major issue, which could be caused by multi-mapping and polymorphism. Based on the current probe set, he is already able to do de novo assembly of several immune genes.

Jaanus Suurvali (University of Cologne) won the best poster award for SMRT Leiden and gave his talk on targeted sequencing of wild zebrafish immune gene families. Despite being a model organism that has been studied for decades, the zebrafish has a poorly understood immune repertoire. Jaanus uses probes from Arbor Biosciences to sequence the FISNA-NACHT-helixes, MHC genes, and the B30.2 region. He achieved a ~60% on-target rate and was able to demultiplex almost all of the data using LIMA.

Ben Matthews’ arm (from the NYT article). He sacrificed some arm blood to a female A. aegypti mosquito for science.

The afternoon started with Ben Matthews (Rockefeller) talking about the genome sequencing of the Zika carrier, the Aedes aegypti mosquito. They sequenced Aedes aegypti with PacBio SMRT Sequencing using 177 PacBio RS II SMRT Cells, generating 140 Gb of data. Assembling this highly complex genome (65% repetitive) with FALCON and FALCON-Unzip yielded a contig N50 of 1.43 Mb. Further methods included Hi-C for de-duplication and Iso-Seq and RNA-seq data for annotation. They were able to characterize the elusive male-determining M-locus and validate its structure using Bionano data. The new genome assembly was fully re-annotated, fixing 481 fragmented gene models and collapsing 1463 false paralogs in the old Sanger-based assembly. One of the most exciting avenues of research is dissecting the neurological and chemosensory basis of female attraction to humans. Matthews described a gene called orco, which is co-expressed with odorant receptors. CRISPR-Cas9 genetic constructs allowed them to visualize expression in the proboscis and the brain. Importantly for mosquito control, shutting off the gene results in females laying eggs in salt water, with lethal consequences for offspring, which are normally deposited in fresh water. In the future, Matthews is hoping to characterize the dramatic variation present in mosquito populations through more datasets, including additional de novo assemblies of divergent individuals. A preprint of the paper is available. And if you want to look at some absolutely stunning images of A. aegypti in adult and pupal stages, as shown by Ben during his talk, visit Alex Wild’s photography site.

Mohamed Zouine (INRA/INP Toulouse) presented the tomato genome project. Started in 2006, this project aimed to provide a good basis for studying genotype-phenotype relationships, comparative genomics and natural diversity. After a first version in 2012 with 110,872 contigs (using 454, SOLiD and other data), they obtained a strikingly improved assembly with SMRT Sequencing on 55 PacBio RS II SMRT Cells. The new assembly has 543 contigs, a contig N50 of 3.4 Mb, and a total genome size of 800 Mb. His group is now applying a similar approach to sequencing and assembling the melon genome.

The great ape paper gracing the latest cover of Science.

Zev Kronenberg (Phase Genomics) next talked about using Hi-C to complement PacBio sequencing. I had met Zev when he was at the Eichler lab at the University of Washington, working on the great apes comparative genome + transcriptome sequencing project (published in Science last week). Today he spoke in his new incarnation as a Phase Genomics scientist. Zev described how the ultra-long-range information in Hi-C data can be used to order and orient PacBio contigs into chromosomes. In addition to de novo scaffolding, Hi-C can be used to validate scaffolding by other technologies. He warned that chimeric or mis-assembled contigs can cause problems for scaffolding efforts and recommended cutting contigs in regions of aberrant coverage (high and low) prior to scaffolding. Only half a year into his new job, he has already contributed three tools: Polar Star for breaking chimeric PacBio contigs using Hi-C, Matlock for Hi-C data pre-processing, and FALCON-Phase. FALCON-Phase is the brainchild of Sarah Kingan and Zev, conceived when they chatted at the PAG conference this year, and is a method for using Hi-C to scaffold FALCON-Unzipped PacBio genomes. Sarah Kingan will be presenting the details of FALCON-Phase on Thursday. Finally, Zev described how Hi-C can be mapped to metagenomic assemblies in order to cluster contigs into complete bacterial genomes. His full presentation is here.
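To make the "cut contigs in regions of aberrant coverage" advice a little more concrete, here is a rough sketch of the idea. It is not how Polar Star or any Phase Genomics tool works. It assumes you already have per-window read depth along a contig, flags windows that are far below or above the median, and returns the stretches that would remain after cutting. The function names, the 0.5x/2x thresholds, and the toy depths are all assumptions made up for illustration.

```python
from statistics import median


def aberrant_windows(coverage, low_factor=0.5, high_factor=2.0):
    """Return indices of windows whose depth is unusually low or high.

    A window is flagged when its depth is below low_factor * median
    or above high_factor * median (both thresholds are illustrative).
    """
    if not coverage:
        return []
    mid = median(coverage)
    return [
        i
        for i, depth in enumerate(coverage)
        if depth < low_factor * mid or depth > high_factor * mid
    ]


def keep_segments(n_windows, flagged):
    """Return half-open (start, end) window ranges left after cutting out flagged windows."""
    flagged = set(flagged)
    segments = []
    start = None
    for i in range(n_windows):
        if i in flagged:
            if start is not None:
                segments.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        segments.append((start, n_windows))
    return segments


# Hypothetical per-window read depth along one contig.
toy_coverage = [30, 32, 29, 31, 5, 30, 33, 90, 31, 30]
flagged = aberrant_windows(toy_coverage)
print(flagged)                                     # prints [4, 7]
print(keep_segments(len(toy_coverage), flagged))   # prints [(0, 4), (5, 7), (8, 10)]
```

In this toy case the contig would be split into three pieces, dropping the window with a coverage dip (a possible chimeric join) and the window with a coverage spike (a possible collapsed repeat).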

Figure from the great ape Science paper. A 62-kb deletion in human brought the promoters for FADS1 and FADS2 together and also shortened the first intron. Potentially, this resulted in the long isoforms (L1, L2) having higher expression in human (14% and 3%) than in chimp (1.25% and 0.08%). There’s also a novel exon in chimp that was not observed in human.

After Zev, I talked on the general topic of Iso-Seq (“The Iso-Seq Method for Human Diseases and Genome Annotation”), highlighting several recent publications including the great ape Science paper (of which Zev is the first author) and the solving of a rare disease, X-linked Dystonia-Parkinsonism. In the great ape paper, they sequenced two humans, one chimp, and one orangutan using PacBio for both genome assembly and transcriptome sequencing with the Iso-Seq method. To me, this paper is the perfect demonstration of what can be done when you have drastically improved genome assemblies and full-length isoforms. You begin to identify what makes us human.

“The next green revolution will be data driven”, says Doreen Ware.

Doreen Ware (USDA, CSHL) gave the closing keynote. I first collaborated with Doreen and Bo Wang in 2015 for their maize Iso-Seq study and later on the 2018 maize & sorghum comparative Iso-Seq study. The study of maize is not about just another plant. It’s about global sustainability while feeding the human population. Maize is also a model organism: Barbara McClintock first showed the “jumping genes” in maize in the 1950s, the work that would earn her a Nobel Prize in 1983. Maize went through a whole genome duplication but then lost many of the genes. Looking at the genes that survived was telling: transcription factors, kinases, chromatin modifiers. Two maize individuals can differ more in their genome sequence than humans do from chimps! In 2017, they published the B73 v4 genome using PacBio data assembled with FALCON, which dramatically closed the gene gaps. Individual maize lines can vary between 5–20% in their gene content! This is a case in which “one genome to rule them all” does not apply. She plans to sequence 26 maize lines using PacBio, Illumina, RNA-seq, and optical mapping. The data will be hosted in a “Pan Genome Browser” accessible via Gramene. Doreen then talked about the 2015 Iso-Seq study and the 2018 maize & sorghum Iso-Seq study. In the 2018 study they were able to show that “young” transcripts have shorter open reading frames and fewer isoforms, and that reproductive tissues (e.g., pollen) have a younger transcriptome age. She described the current collaboration I have with her in using Iso-Seq to phase a new set of F1 hybrid maize species. She concluded the talk with the prediction that “The next green revolution will be data driven”.

This concludes day 1 of the three-day SMRT Leiden event. You can follow organic tweets about this on Twitter using #SMRTLeiden.

[2018/07/11 Update: Speaker presentations for the event are now online]

Written by Liz T