Shinichi Morishita (University of Tokyo) gave a densely packed and awe-inspiring keynote talk covering topics with implications for human disease, speciation, structural variants, haplotype phasing, and metagenomics.
“Can you rely on a single genome assembler?” Morishita asked. He sequenced VC2010, a non-mutagenized clonal derivative of the N2 C. elegans worm, using PacBio. Using ONT reads as scaffolding and combining results from different assemblers, he was able to close gaps in the reference genome. “PacBio and nanopore data are complementary,” he concluded.
When you see a difference between your assembly and the reference genome, “Is it structural variation or reference error?” Using an outgroup, he verified that the original N2 Sanger reference contained a gap that was recovered in VC2010. In this case, it was a reference error rather than an SV. Moral of the story: don’t always trust your (old) references!
Repeat expansions are known to be related to neuronal dysfunctions. In the case of benign adult familial myoclonic epilepsy (BAFME), Morishita discovered that a 5 kb repeat expansion in the intron of SAMD12 caused transcription to abort.
Completely switching gears, we then learned about the evolution of centromeres from the sequencing of three medaka (Japanese rice fish) that are 18 and 43 million years apart. He showed that centromeres do not evolve at the same pace. Rather, non-acrocentric centromeres evolve significantly faster than acrocentric centromeres. It seems that most centromeres are hypermethylated but there are also hypomethylated sub-regions that evolve independently.
In a topic that is near and dear to my heart, Morishita showed how you can combine a diploid methylome with a diploid transcriptome. Sequencing the ZNF311 gene using the Iso-Seq method, he found all transcripts were transcribed from haplotype B, which was corroborated by methylation evidence that hapB was hypomethylated!
Finally, Morishita discussed a metagenomics project using PacBio sequencing. Not only did they obtain complete circular genomes for many bacteria, they also found many new plasmids that could provide information on antibiotic resistance. They also found good results using methylation-guided alignments, similar to what Beaulaurier et al. reported in 2017.
Laurence Ettwiller (New England Biolabs) presented an exciting new full-length transcriptome protocol for bacteria. In 1961, Jacob & Monod first discovered that bacteria transcribe functionally related genes together using a common promoter. Targeting the 5’ triphosphate end of bacterial primary transcripts, NEB developed the SMRT-Cappable-Seq protocol to capture full-length bacterial transcripts using the Iso-Seq method. Sequencing the same E. coli transcriptome, they found good quantitative correlation between the RNA-seq and Iso-Seq data. Importantly, long reads revealed the complexity with which the same bacterial gene can be transcribed in multiple operons. They estimate that ~50% of E. coli genes are present in two or more different operons. Comparing different growth conditions shows that read-through is condition-dependent. This work shows an astonishing amount of transcriptional diversity in a species that’s been so extensively studied. One can only imagine just how much more could be discovered if this method were applied to other bacteria!
Stuart Scott (Icahn School of Medicine at Mount Sinai, New York) used SMRT Sequencing to identify and phase variants important in human disease. Using bisulfite-treated DNA followed by PCR amplification and SMRT Sequencing, he was able to characterize methylation patterns in targeted regions. Using amplicon sequencing, he confirmed homozygous Bardet–Biedl syndrome copy number aberrations. In applying PacBio to pharmacogenomics, he highlighted the importance of obtaining patient genetic information on drug-metabolizing genes such as CYP2D6 and CYP2C19. Another gene of interest was SLC6A4, which mediates serotonin reuptake; PacBio sequencing was able to haplotype the promoter region of this gene. Finally, he presented recent work on Gaucher disease, which is caused by mutations in the GBA gene. This Mendelian disease is particularly difficult to characterize due to the presence of a segmental duplication and corresponding pseudogene. His lab designed a 5 kb PCR-based assay that allowed phasing of three known pathogenic mutations using PacBio sequencing.
Marjolein Weerts presented her work on inferring cancer signatures on the basis of low-frequency mitochondrial DNA (mtDNA) variants circulating in the bloodstream. Using circulating free DNA for early-stage detection of cancer is a hot topic. She showed how heteroplasmy of mtDNA can yield spurious results and false positives. Marjolein showed her custom pipeline to detect low-frequency variants using CCS reads. While the field is developing rapidly, Marjolein warned that she rarely detected free mtDNA specific to tumors.
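To make the idea of a low-frequency variant concrete, here is a minimal Python sketch of how minor alleles at a single mtDNA position might be flagged from the bases observed across CCS reads. This is not Marjolein's pipeline, which was not described in detail; the function name `minor_alleles` and the `min_fraction` and `min_reads` thresholds are hypothetical choices for illustration only.

```python
from collections import Counter

def minor_alleles(bases, min_fraction=0.01, min_reads=5):
    """Return {base: fraction} for non-major alleles at one position.

    `bases` is an iterable of the base calls observed at this position,
    one per read. Both thresholds are illustrative assumptions, not
    values taken from the talk.
    """
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth == 0:
        return {}
    # The most frequent base is treated as the major allele.
    major_base, _ = counts.most_common(1)[0]
    return {
        base: count / depth
        for base, count in counts.items()
        if base != major_base
        and count >= min_reads
        and count / depth >= min_fraction
    }

# Hypothetical pileup: 1000 reads, 25 carry G and 5 carry T.
example = minor_alleles("A" * 970 + "G" * 25 + "T" * 5)
```

With these made-up counts, G (2.5% of reads) is reported while T (0.5%) falls below the fraction threshold. A real pipeline would also have to separate genuine heteroplasmy from sequencing and PCR errors, which is exactly the source of false positives the talk warned about.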
Armin Töpfer (PacBio) unveiled the forthcoming version of structural variation calling in PacBio’s official SMRT Link/SMRT Analysis software suite. pbsv 2.0 improves upon the original pbsv by adding support for calling inversions, translocations, and insertions/deletions under 50 bp, as well as breakpoint polishing and, importantly, improved runtime. The new pbsv 2.0 can call insertions/deletions down to 20 bp at 10-fold coverage with 90% sensitivity. Combining all existing PacBio SV datasets, he created a mock human dataset equivalent to 460-fold human coverage; pbsv 2.0 completed within an hour. Using joint calling (calling variants using both parents and proband), one could call a lot more variants with confidence. Even at 5-fold coverage, there is still discovery power (you can play with this SV calculator). His full presentation is here.
Alex Hoischen (Radboud UMC) presented their latest findings on structural variant detection with SMRT Sequencing in human genomes. He emphasized the importance of detecting all variants in a human genome, as one single variant can cause disease. Greater than 60% of severe intellectual disability (ID) is caused by de novo mutations. While the mutation rate for SNVs is largely known, the mutation rate for large indels is largely unknown. He highly recommends trio sequencing to reduce the search space and increase detection of mutations. In a collaboration with PacBio, he sequenced five trios: four with ~15-fold coverage and one with 40-fold coverage. These data revealed that a striking 28 Mb of the human reference genome is only covered by long reads, of which 12 Mb are genic regions and 757 kb are coding sequence. Per genome, they found around 25,000 insertions and deletions larger than 50 bp, and an additional 33,000 indels between 20 and 50 bp! Around 70% of them were found only with PacBio long reads and not with short reads. New tools, among them the next version of the PacBio pbsv tool, allow looking at other types of SVs as well, such as inversions and translocations. He also talked about the SolveRD initiative that would sequence >500 genomic and 100 transcriptome samples using PacBio sequencing.
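The size classes quoted above (indels between 20 and 50 bp versus insertions and deletions of 50 bp and up) can be illustrated with a small Python sketch that bins variants by length. It is only an illustration of the thresholds and says nothing about how pbsv or any other caller works internally; the function names, the class labels, the choice to put exactly 50 bp in the larger class, and the example lengths are all assumptions.

```python
from collections import Counter

SV_MIN_BP = 50     # assumption: a 50 bp event counts as a structural variant
INDEL_MIN_BP = 20  # lower bound of the 20-50 bp indel class in the text

def size_class(length_bp):
    """Classify one insertion (positive) or deletion (negative) by size."""
    size = abs(length_bp)
    if size >= SV_MIN_BP:
        return "sv_50bp_plus"
    if size >= INDEL_MIN_BP:
        return "indel_20_49bp"
    return "below_20bp"

def tally_size_classes(lengths_bp):
    """Count how many variants fall into each size class."""
    return Counter(size_class(length) for length in lengths_bp)

# Hypothetical variant lengths in bp; negative values are deletions.
example_counts = tally_size_classes([-312, 75, 24, -49, 50, 3])
```

For the made-up list above, three events land in the 50 bp and larger class, two in the 20 to 49 bp class, and one below 20 bp.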
Martin Pollard (Sanger Institute) described an effort to generate an expanded reference panel of MHC haplotypes from African populations. MHC is a highly complex region that stymies short-read-based characterization due to its high SNV and SV diversity and the presence of homopolymer repeats. Pollard and his colleagues designed a multiplexed PCR amplicon-based assay that relied on PacBio’s lima tool to demultiplex, LAA to cluster and call consensus, and additional custom methods to accurately resolve phased haplotypes. They sequenced more than 2000 samples and identified between 20 and 150 new alleles per HLA gene. Using the new expanded reference panel, imputation of haplotypes from short-read markers showed higher concordance. Pollard anticipates eventually abandoning imputation from reference panels in favor of direct sequencing of individuals as the cost per base improves on the PacBio platform.
Birgitt Schuele (Parkinson’s Institute) discussed her latest publication, “Parkinson’s disease associated with pure ATXN10 repeat expansion,” which applied PacBio’s No-Amp method (preprint) to sequence repeat expansions in the ATXN10 gene. Repeat expansions of ATTCT in intron 9 of the ATXN10 gene often cause progressive spinocerebellar ataxia. Using CRISPR/Cas9, they captured and sequenced the entire 5.3–7.0 kb repeat expansion. They found new pentanucleotide repeats in the ATXN10 region that may explain the unusual phenotypes. ATXN10 repeat expansion has now been associated with Parkinson’s disease, though the pathology is not yet understood.
Yahya Anvar (LUMC) was the co-host of the event and is also a collaborator. LUMC has a long history of storing pharmacogenomics data in electronic form for patients, which allows it to offer them personalized drug dosing recommendations! (The US needs to catch up to the Netherlands!) The pharmacogenomics data show that 95% of patients have at least one actionable drug dosage, i.e. pretty much all of us should be taking at least one drug in a dosage that differs from the “one size fits all” recommendations. By sequencing the CYP2D6 region in a cohort of breast cancer patients, he was able to predict their drug responses with 90% concordance. This is the future of personalized medicine, except that future seems incredibly close to now.
Charles Lee (Jackson Laboratory) gave the closing keynote. He talked about the importance of detecting structural variants and described a current study in which his group compared different technologies for structural variation detection. Structural variants below and above 50 kb are not entirely detected by either short-read technologies or PacBio sequencing, so a joint detection strategy using different technologies and algorithms is necessary for comprehensive structural variation detection. Combining these technologies, they were able to detect seven times more structural variations per person than are typically found in WGS analysis. This approach shows great promise for disease association studies and clinical interpretation of genomes.
[2018/07/11 Update: Speaker presentations for the event are now online]