SMRT Leiden 2019 Day 2: Evolving Towards Precision Medicine with Longer, Accurate Reads

“I want the country that eliminated polio and mapped the human genome to lead a new era of medicine, one that delivers the right treatment at the right time.”

Barack Obama, State of the Union 2015

Euan Ashley (@euanashley, Stanford) began the second day of SMRT Leiden with a keynote presentation on precision medicine.

What is the definition of precision medicine?

Before defining precision medicine (or personalized medicine), the more important question to ask should be: why is precision medicine needed?

The answer could be found in Ashley’s disclosure of personal interest: “I want to help patients”.

Or, if one recalls a later line in President Obama’s State of the Union address, “To give all of us access to the personalized information we need to keep ourselves and our families healthier.”

Delivery of such information must be not only accurate but also timely. One such case was the diagnosis of a newborn with long QT syndrome, a condition that causes irregular heartbeats: whole genome sequencing confirmed a pathogenic variant in the KCNH2 gene 10 days after birth. At that time, standard genetic testing would have taken 4 to 16 weeks.

And for many, though urgent delivery of information is not a matter of life or death, the eventual diagnosis means the world to them. The Undiagnosed Diseases Network (UDN) was established in 2014 to solve rare, previously undiagnosed diseases. Within a 20-month period, it delivered diagnoses for 132 out of 382 cases, a 35% solve rate. More importantly, 79% of the solved cases were actionable, one-third of which led to changes in therapy. Finally, there’s cost! For patients who received a diagnosis, the average cost of care before coming to the UDN was ~$300,000, whereas the UDN diagnosis cost only ~$18,000, or 6% of the previous cost. The value of an accurate genomic diagnosis was not just medical and emotional, but also financial.

So that’s the good news. But the flip side of a 35% solve rate is that about two-thirds of the patients still have no answer for their illness.
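The percentages quoted above follow directly from the reported counts. As a quick sanity check, here is a minimal Python sketch that recomputes them; the variable names are ours, and the dollar figures are the approximate values cited in the talk.

```python
# Figures quoted in the talk summary above (UDN, 20-month period).
solved_cases = 132
evaluated_cases = 382
prior_cost_usd = 300_000  # approximate average cost before coming to the UDN
udn_cost_usd = 18_000     # approximate cost of the UDN diagnosis

solve_rate = solved_cases / evaluated_cases
unsolved_rate = 1 - solve_rate
cost_ratio = udn_cost_usd / prior_cost_usd

print(f"solve rate: {solve_rate:.1%}")
print(f"still unsolved: {unsolved_rate:.1%}")
print(f"UDN cost as a share of prior cost: {cost_ratio:.0%}")
```

The solve rate rounds to 35%, which leaves roughly two-thirds of the evaluated patients without a diagnosis.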

What are we missing?

Precision medicine, as Ashley defined it, is the treatment or prevention of disease through more precise measurement. But the human genome is complex. It is filled with repeat sequences, paralogous genes, and mosaicism. For many of the repeat expansion diseases (Fragile X, myotonic dystrophy, and Huntington’s, to name a few), “short reads just are not able to characterize these highly polymorphic regions,” Ashley said. Computationally, these regions also present challenges.

Keynote by Euan Ashley on the progression of precision medicine. Sketch provided by @ATJCagan.

“Precision medicine has to be accurate medicine. If we are not measuring the whole genome, we will fail to find the genetic causes for our patient’s diseases.”

The solution? Better algorithms, graph reference genomes, and long reads.

Ashley presented a case in which PacBio sequencing solved the diagnosis of a patient with recurrent myxomas. The patient was suspected of having Carney complex, yet prior genetic testing, including WES and WGS, had failed to identify the causal variant. Using low-fold sequencing on the PacBio Sequel System, they identified, from just 4 long reads, a heterozygous 2 kb deletion in the PRKAR1A gene. Sequencing of the parents confirmed this to be a de novo variant.

Ashley then shared another case in which a causal variant was identified in the FAM177A1 gene; they are still working to link the variant to the disease phenotype.

For some of the unsolved diseases, RNA may play a role as well. Ashley pointed to a recent project examining MYBPC3 isoform diversity using targeted Iso-Seq (paper to appear soon in Circulation: Genomic and Precision Medicine), in which PacBio long reads enabled accurate haplotyping at the isoform level.

Ashley is optimistic that with improved sequencing technology, better reference genomes, and faster algorithms, we will be able to solve more of the rare diseases.

Kornelia Neveling (Nijmegen, Netherlands) showed several use cases for long-read amplicon sequencing. In the first example, they sequenced the VWF gene, which is implicated in von Willebrand disease, a bleeding disorder. Using two overlapping 8 kb amplicons to cover this region, they confirmed that the PacBio data recovered all previously known pathogenic variants. Then they tackled the PMS2 gene, which is involved in hereditary colorectal cancer. PMS2 is challenging to sequence because it has many pseudogenes. They sequenced the gene in eight fragments ranging in size from 3 to 8 kb. Though the pseudogene PMS2CL was still amplified, informative SNPs allowed clean separation from PMS2. She then showed how long-read sequencing can characterize the opsin genes, which share high (~98%) sequence similarity and are duplicated in tandem. Neveling concluded that long-read amplicon sequencing has several advantages in clinical applications, including less PCR, improved breakpoint detection and haplotyping, and the ability to separate paralogous genes.

Melissa Laird Smith (@lissagoingviral, Mt. Sinai, USA) spoke about how she is using long-read sequencing to characterize the full diversity of the immunoglobulin heavy chain (IGH) locus. Historically, genetic variation at the IG loci has been poorly characterized, even though the locus is known to fall within a segmental duplication and to harbor copy number variation. A look at the number of known SNPs shows just how little is known: the HLA locus had 9,319 known SNPs, the KIR locus had 818, and the IGH locus had only 5! Since 2012, ~120 new IGHV gene alleles have been discovered, 90% of which have come from non-Caucasian samples, underscoring the underrepresentation of diversity at this locus. To create more and better references of the IG loci, Smith and her team sequenced fosmids provided by the Eichler lab from two African, two Asian, and two European individuals. The amount of diversity revealed was immense: for some V genes, a single individual was found to have up to 14 alleles! To scale this pipeline, they have developed a targeted IG genotyping assay, and the pilot data show that it can achieve accurate haplotyping.

Figure from Watson et al. showing the diversity of the immunoglobulin (IG) loci.

Ming-Hsiang Lee (SBP Medical Discovery Institute) presented work on somatic mosaicism in the APP gene in relation to sporadic Alzheimer’s disease (SAD). His talk was divided into three parts.

First, he showed that somatic mosaicism exists in neuronal cells. These mosaic variants are termed “gencDNAs” (genomic cDNAs) because they resemble spliced versions of the APP gene. The most abundant gencDNA contained an intra-exonic junction between exons 3 and 16 (denoted 3/16), and another common intra-exonic junction was 16/17. They validated the existence of both 3/16 and 16/17 using PCR, DNA in situ hybridization (DISH), and targeted sequencing.

Figure 3 from Lee et al. showing sporadic Alzheimer’s disease (SAD) is enriched for APP “gencDNAs” (retro-insertion of spliced out APP mRNAs via RT activity) compared to non-disease samples. Furthermore, only the APP gencDNAs in SAD contain pathogenic variants.

Having established that APP gencDNAs exist in the genome, they then used PacBio sequencing to examine whether the diversity of gencDNAs differed between sporadic Alzheimer’s disease (SAD) and non-disease (ND) samples. Accurate, long reads were necessary to characterize distinct gencDNAs. They found that SAD samples had more distinct gencDNAs (6,299 unique gencDNAs) than ND samples (1,084 unique gencDNAs). Furthermore, only the gencDNAs from SAD contained pathogenic variants; the gencDNAs from ND did not.

What, then, is the mechanism by which APP gencDNAs arise? Using in vitro as well as in vivo mouse models, they showed that three components were required for the APP mosaicism: APP mRNAs (spliced out APP recombination), reverse transcriptase, and DNA strand breaks. The proposed mechanism is that the RT “retro-inserts” APP mRNAs back into the genome at DNA break sites. Finally, they showed that APP gencDNAs accumulate with aging! The largest risk factor for SAD is age, and their findings are consistent with previous findings that increased APP transcription is linked to SAD incidence.

Figure 4 from Lee et al. showing the hypothesized mechanism for APP gencDNAs. Spliced out mRNAs are retro-inserted into DNA break sites by reverse transcriptase activity, resulting in intra-exonic junctions (IEJs) in the genome.

The clinical implication of this study is significant: since APP gencDNAs arise due to RT activity (and they showed in vitro that RT inhibitors reduce gencDNA occurrence), existing drugs that contain RT inhibitors may be used for treating SAD. Furthermore, the theory of insertions into DNA break sites may explain why processes that induce DNA breaks (such as head injuries) have been linked to the disease.

Lee’s work was published in Nature.

PacBio HiFi reads improve mappability of medically relevant genes. Figure taken from the Wenger et al. preprint.

Billy Rowell (@nothingclever, PacBio) discussed the advantages of HiFi reads for structural variant calling and de novo genome assembly. The HiFi (CCS) reads from the PacBio Sequel II System are long (10–15 kb) and highly accurate (>99%), which allows for improved detection of both small and large variants. Even at low (15-fold) coverage, the HiFi reads achieved accurate variant detection using the DeepVariant and pbsv callers. For de novo assembly, Rowell reported that PacBio is working on two approaches that would enable fully phased, highly concordant assemblies without the need for parental data. A preprint for this work is on bioRxiv.

Janet Song (Stanford, USA) delivered the second keynote of the day on the use of long-read sequencing to study tandem repeats in the CACNA1C intronic region. The CACNA1C gene was identified through GWAS as being involved in multiple psychiatric diseases, including bipolar disorder and schizophrenia.

The GWAS locus lies in a ~300 kb region of the third intron of CACNA1C. Based on existing data, they identified a tandem repeat region consisting of consecutive 30-mers. Interestingly, chimpanzees have only 1 copy in this region, whereas humans have a variable number of copies (100 to 1,000+ copies, or 3,000 to 30,000+ bp), far more than the 10 copies present in the human reference genome.

Using PacBio long reads, Song and team were able to fully characterize the repeat length and the repeat content (30-mer sequence compositions) of the 16 sequenced individuals. They showed that certain sequence compositions, but not the tandem repeat length, were associated with protective/risk GWAS SNPs.

PacBio long reads revealed the 30-mer sequence composition of the tandem repeat region in intron 3 of CACNA1C. Each row is an allele and each color codes for a different 30-mer. The size of the repeats ranged from 100 to 200 copies (3 to 6 kb). Figure obtained from speaker with permission to use.
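To make the idea of "repeat length" versus "repeat content" concrete, here is a minimal Python sketch that cuts a repeat tract into consecutive 30-mers and tallies which units it contains. This is only an illustration, not the speaker's pipeline: the function names and the toy sequences are made up, and it assumes the tract tiles cleanly into 30 bp units, whereas real alleles would likely need alignment to handle indels and sequencing errors.

```python
from collections import Counter

UNIT_LENGTH = 30  # repeat unit size reported in the talk


def split_into_units(repeat_sequence, unit_length=UNIT_LENGTH):
    """Cut a repeat tract into consecutive, non-overlapping units."""
    if len(repeat_sequence) % unit_length != 0:
        raise ValueError("sequence length is not a multiple of the unit length")
    return [
        repeat_sequence[i:i + unit_length]
        for i in range(0, len(repeat_sequence), unit_length)
    ]


def summarize_allele(repeat_sequence):
    """Report repeat length (copies, bp) and content (count of each distinct unit)."""
    units = split_into_units(repeat_sequence)
    return {
        "copies": len(units),
        "length_bp": len(repeat_sequence),
        "composition": Counter(units),
    }


# Toy allele built from two made-up 30-mers (not real CACNA1C sequence).
unit_a = "ACGT" * 7 + "AC"  # 30 bp
unit_b = "TTGCA" * 6        # 30 bp
toy_allele = unit_a * 3 + unit_b * 2

summary = summarize_allele(toy_allele)
print(summary["copies"], summary["length_bp"])
print(summary["composition"].most_common())
```

Two alleles can have the same number of copies but different compositions, which is the distinction the association result above turns on. The same arithmetic links the figures in the caption: 100 copies of a 30-mer span 3,000 bp.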

But what about the functionality of the repeat element?

Using luciferase assays and (some) in vivo data, they showed that this tandem repeat enhances gene expression. In addition, protective SNPs are associated with increased CACNA1C expression. While the repeats seem to have some effect on splicing, there appears to be no difference with respect to protective vs risk alleles.

In summary, they discovered a previously uncharacterized tandem repeat in intron 3 of the psychiatrically relevant CACNA1C gene. This repeat consists of variable 30-mers that have unique compositions associated with protective vs risk SNPs, and they hypothesize that the repeat increases the expression of CACNA1C isoforms.

This part of the work was published in the American Journal of Human Genetics.

In the second part of her talk, Janet focused on what she called repeat outliers: individuals with increased repeat length, a high proportion of risk variants, and a low proportion of protective variants. It is not clear what the functional implications of these outliers are, but they hypothesize that the extreme length might compensate for reduced activity of risk variants or serve as ultra-protective alleles.

Janet showed that some of the risk — but not protective — variants are better predictors for CACNA1C expression compared to the nearby GWAS SNPs in human adult cerebellum using data from the GTEx consortium!

Looking into the future, they are moving into in vivo mouse models, targeted Iso-Seq, and creating isogenic lines to look at the effects on gene expression and calcium signaling.

Janet closed the talk by highlighting the evolutionary aspect of this work. Larger brain size has been associated with the rise of psychiatric disorders. For example, recently evolved genomic regions in humans were shown to be enriched for schizophrenia GWAS loci. They hypothesize that this newly characterized CACNA1C tandem repeat, which seems to have increased gene function during evolution, may have also contributed to psychiatric disease risk.

Sketch of Janet Song’s talk on tandem repeat in the CACNA1C intron. Drawing provided by @ATJCagan.

Alexander Mellmann (University Hospital Münster, Germany) spoke about tracking bacterial infections in hospitals using whole genome sequencing (WGS). The rise of antibiotic-resistant microbes is a global threat, and dealing with it requires high-quality genomic information. They developed a novel pipeline that combines WGS with genome-wide gene-by-gene typing (cgMLST) to get better information about transmission rates and to evaluate hygiene control methods. Using an Illumina-based pipeline, he showed that real-time carrier screening combined with preventive patient isolation is a cost-effective way to prevent nosocomial infections. A new pipeline in development, based on the PacBio Sequel System, aims to improve sequencing turnaround time and plasmid-based resistance detection, and early results have been very promising.

Olga Vinnere Pettersson (@OlgaVPettersson, Uppsala, Sweden) updated us on the latest goings-on at SciLifeLab, and there is a lot! Olga and her team ran more than 150 long-read projects in 2018. She also presented two alternatives to the classical bait-based targeted enrichment methods: the No-Amp (Cas9) method and the emulsion-based Xdrop method. Additionally, the lab is a beta testing site for the upcoming Iso-Seq Express kit, which reduces the sample prep time from 2–3 days down to less than 1 day and the (total) RNA input amount down to 100–300 ng. Finally, she elaborated on the difficulties of project planning due to the output variability from application to application and sample to sample, giving us all something to think about when planning our next projects.

Yahya Anvar (LUMC, Netherlands) reported the use of PacBio HiFi reads for resolving complex pharmacogenes. Differential drug response is common (in 50% of the population) and dangerous (4th leading cause of death). It turns out a major part of this can be explained by genetics. Previously, many of the PGx genes could not be characterized by short-read sequencing due to the genes’ highly repetitive nature. Yahya showed that, using HiFi reads at 30-fold coverage, 77% of the important PGx genes could be fully phased, while another 19% could be partially resolved. Yahya evaluated the long reads on the CYP2D6 gene and, using a machine learning approach, greatly increased classification efficiency. This work is now being extended to a new cohort and to new drugs to confirm results.

The streets of Leiden, where SMRT Leiden is taking place.

Evan Eichler (University of Washington, USA) closed day 2 of SMRT Leiden with a discussion on the use of PacBio to characterize structural variation in the human genome.

“Our ability to see these variations is wholly dependent on the sequencing technology.” Evan explained that genomes are complex, pointing to duplications in particular. Approximately 4% of the human genome is duplicated, and many of these regions encode genes! Even in healthy individuals, these regions contribute to copy number variation (CNV).

An early adopter of PacBio, Evan said the technology has three distinct advantages that make it ideal for structural variation detection: long reads, lack of bias, and near random error profiles.

His lab’s first study using PacBio for SV calling was eye-opening because it found ~22,000 novel genetic variants corresponding to 11 Mbp of sequence. They had not expected to see so much novelty.

However, even with the long reads, segmental duplications (SD) remain challenging! Looking at the FALCON assembly of a human genome, they found that 75% of the SDs were not assembled. These SD regions encompass nearly 500 genes and are the most copy number polymorphic regions — they’re important!

Their solution? Segmental Duplication Assembly (SDA) by phasing through variants. SDA begins by mapping reads to assembled contigs to identify variants, which can then be used to separate reads into paralogous sequences. A paralogous sequence variant (PSV) graph is then constructed, connecting reads based on their shared variants. Clusters of reads corresponding to individual segmental duplication copies are then extracted from the graph as connected components.

Figure 1 from Vollger et al. (2019) describing the Segmental Duplication Assembly (SDA) approach.
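The graph step described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the SDA software itself: the function names, the `min_shared` threshold, and the read and PSV labels are all hypothetical, and the sketch ignores the sequencing errors and noisy variant calls that a real tool has to handle.

```python
from collections import defaultdict
from itertools import combinations


def build_read_graph(read_psvs, min_shared=2):
    """Connect two reads when they share at least `min_shared` PSV alleles."""
    graph = defaultdict(set)
    for read in read_psvs:
        graph[read]  # make sure isolated reads still appear as nodes
    for a, b in combinations(read_psvs, 2):
        if len(read_psvs[a] & read_psvs[b]) >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return each group of reads reachable from one another (one group per paralog)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Toy input: each read lists the PSV alleles it carries (made-up labels).
read_psvs = {
    "read1": {"pos100:A", "pos250:T", "pos900:G"},
    "read2": {"pos100:A", "pos250:T"},
    "read3": {"pos250:T", "pos900:G"},
    "read4": {"pos100:C", "pos250:G", "pos900:A"},
    "read5": {"pos250:G", "pos900:A"},
}

clusters = sorted(sorted(c) for c in connected_components(build_read_graph(read_psvs)))
print(clusters)
```

In this toy input, read2 and read3 share only one PSV allele and are not linked directly, but both are linked to read1, so all three end up in the same cluster. Each resulting cluster stands in for the reads of one paralogous copy, which could then be assembled separately.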

Using the SDA approach, they resolved 428 duplicated genes collapsed in the CHM1 assembly. One remarkable example was the resolution of the five paralogs of NOTCH2NL, which shared sequence identities between 99.98% and 100%, with lengths varying from 49 to 69 kb! Interestingly, the SDA approach also identified new divergent sequences, which turned out to be new human copy number polymorphic sequences.

Eichler then revealed their recent foray into using the long and accurate HiFi reads for de novo assembly. While HiFi reads are on average shorter (10–15 kb) than CLR reads, he showed that the Arrow-polished assembly is, in fact, more accurate and contains fewer gene disruptions. The HiFi data resolved more of the segmental duplications than the CLR and ONT approaches.

But the real strength of HiFi, Eichler contends, is in improving variable number tandem repeat (VNTR) assembly. VNTRs are tandem repeats whose repeat unit is 6 bp or longer. The occurrence of VNTRs in the human genome is highly non-random: 45% are within 5 Mbp of chromosomal ends. He showed that HiFi reads resolved an OR2T1 repeat that is 381 bp with 53 copies.

With excitement, Eichler showed how they first used short reads to identify, then used long reads to resolve, a 400 kb duplication event found in the present-day Melanesian population. Comparing it with archaic hominin genomes, they found that this duplication was present in the Denisovan population ~400,000 years ago and was then introduced into the Melanesian population ~40,000 years ago via introgression.

On day 2 of SMRT Leiden, we heard about the use of long reads to resolve complex regions in the human genome that have both evolutionary and clinical significance. We heard how accurate, long reads are being adopted for testing medically important genes that can have therapeutic actionability. We learned how the same evolution that drove us to have larger brains, and likely higher intelligence, may have also led to susceptibility to psychiatric disorders.

As Eichler said in his talk: “We are all here to do one thing: identify genetic variation in the human genome.” Only with an accurate understanding of the full spectrum of the human population, encompassing people of all ancestries, can we begin to move towards a future of true precision medicine.