Coordination and Chaos in the making of mRNA

Liz T
PacBio
May 1, 2018

In June 2016, I flew to Leiden in the Netherlands to attend the first SMRT Leiden conference, a scientific meeting for researchers who use PacBio’s long-read sequencing technology. Gene Myers talked about the algorithmic ideas that would eventually lead to the MARVEL assembler, which assembled the 32 Gb axolotl genome. Steve Marsh talked about improving HLA typing. Henk Buermans and Bobby Sebra both emphasized the importance of using long reads to resolve complex drug-metabolism genes such as CYP2D6 as a path toward personalized medicine. There were many, many other interesting talks, but one took me by surprise, because it used a dataset I had helped make public.

In 2013, I helped release the MCF-7 breast cancer cell line transcriptome data. The sample was sequenced routinely for R&D purposes. For our molecular biologists, who were working to improve the library protocol for sequencing full-length cDNA on PacBio machines (what we call the Iso-Seq method), interest in the data faded as soon as they confirmed that the transcripts were indeed full-length and the data yields were good. To me, the bioinformatician developing algorithms to process Iso-Seq data, it seemed there were more treasures to be found, so I asked for the data to be released. Later, in 2015, when even more MCF-7 data was generated, I released the extended dataset as well.

Leiden, Netherlands.

On the second day of SMRT Leiden, right between Hagen Tilgner’s keynote and my own talk, was a talk with the poetic title “Coordination or chaos in the making of mRNA”. My jaw dropped as Yahya Anvar, the co-host of the event, described how he had analyzed the MCF-7 dataset, as well as another human Iso-Seq dataset I had helped release, to look for systematic coupling events between transcript features (start sites, exons, and polyadenylation sites). The hypothesis made sense: alternative splicing means that a gene with N exons can have up to 2^N isoforms, yet in reality no gene seems to transcribe all 2^N combinations. Even before long reads came along, researchers studying splicing with short reads, which could only look at individual junctions and not distal coupling events, did not report seeing promiscuous use of all possible splicing combinations. Now that full-length transcript sequences are here, the questions have become tangible: How prevalent is preferential coupling in gene splicing? Is it sample specific? What regulates this preference?
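To make the 2^N figure concrete, here is a minimal sketch that enumerates every include/skip combination for a toy gene. The exon names and the `enumerate_isoforms` helper are made up for illustration, and the count includes the degenerate case where every exon is skipped.

```python
from itertools import product

def enumerate_isoforms(exons):
    """Return every include/skip combination of the given exons.

    Each exon is independently kept or skipped, so N exons yield 2**N
    combinations (including the empty one, which is not a real transcript).
    """
    combos = []
    for mask in product([False, True], repeat=len(exons)):
        combos.append(tuple(e for e, keep in zip(exons, mask) if keep))
    return combos

# Hypothetical four-exon gene.
exons = ["E1", "E2", "E3", "E4"]
all_combos = enumerate_isoforms(exons)
print(len(all_combos), 2 ** len(exons))
```

Even a modest 20-exon gene would have over a million theoretical combinations, which is why seeing only a small, structured subset in real data points to some form of coordination.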

For the MCF-7 dataset, Yahya found that >60% of the multi-exonic genes (6825 out of 11350) had at least one coupling event. Of the 6825 genes, 2700 have coupling events for all feature types (TSS-exon, exon-exon, exon-PAS, and TSS-PAS, where TSS is a transcription start site and PAS a polyadenylation site). Previous studies have shown that polyadenylation is coupled with the last intron; this study shows that it’s not just the last intron: the whole gene, from the start site to the internal exons to the polyadenylation site, can be preferentially coupled. Analysis of polyadenylation signals revealed that, while most 3’ ends use known PAS motifs, a novel motif (AKCCTGG) was elevated in PASs coupled with TSSs. This novel motif is associated with the muscleblind-like (MBNL) protein, which is known to play a role in splicing and polyadenylation. For the Human Brain/Heart/Liver Iso-Seq dataset, which was not as deeply sequenced, the same coupling analysis was performed, and it concluded that coupling events are largely condition- or tissue-specific. The small portion of coupling events that are conserved across the samples, however, is enriched for mutually exclusive and mutually inclusive events.
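One way to picture what a “coupling event” means is to ask whether two features co-occur in full-length reads more often than chance would predict. The sketch below is only an illustration of that idea using a two-sided Fisher’s exact test on a 2x2 table; it is not necessarily the statistical procedure used in the paper, and the read data, feature names, and helper functions are all hypothetical.

```python
from collections import Counter
from math import comb

def contingency(reads, tss, pas):
    """Build a 2x2 table of read counts for one TSS and one PAS.

    reads: iterable of (tss_id, pas_id) pairs, one per full-length read.
    Returns (both, tss_only, pas_only, neither).
    """
    counts = Counter((t == tss, p == pas) for t, p in reads)
    return (counts[(True, True)], counts[(True, False)],
            counts[(False, True)], counts[(False, False)])

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x reads in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Toy gene: TSS1 is always paired with PAS1, TSS2 always with PAS2.
reads = [("TSS1", "PAS1")] * 10 + [("TSS2", "PAS2")] * 10
table = contingency(reads, "TSS1", "PAS1")
p_value = fisher_two_sided(*table)
print(table, p_value)
```

A real analysis would also need multiple-testing correction across thousands of genes and feature pairs, which this sketch leaves out.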

Percentage of genes and features that were significantly coupled in the MCF-7 data. Taken from Figure 2 of the paper.

Hearing the talk was surreal: I had absolutely no idea someone had been analyzing the MCF-7 and human brain/heart/liver data to this degree. My reaction went from “OMG someone is using the data! It was not a waste of time!” to “This is so cool, I need to ask for his slides later because he is talking too fast” to “Wait…I must tell him about Gloria and the mass spec data!”

A year earlier, Gloria Sheynkman, then a PhD student at the University of Wisconsin-Madison, was trying to find full-length sequencing data to match mass spectrometry data. Using the MCF-7 Iso-Seq dataset, along with publicly available MCF-7 mass spec data, she showed that there were novel peptides, absent from the UniProt/SwissProt databases, that were unique to MCF-7. She graduated and went to work at the Dana-Farber Cancer Institute (she would actually attend SMRT Leiden 2017 to talk about her work at the Vidal lab), and her MCF-7 mass spec work was never published.

I went to Yahya after his talk and told him about Gloria’s work. This turned into a year-long three-way collaboration between Yahya, Gloria, and me, in which we would analyze, re-analyze, and re-re-analyze the MCF-7 Iso-Seq and mass spec data until we were confident in our findings. The challenge was that the mass spec data was low throughput and the peptides were short; the average peptide was ~15 aa long. This meant we could not unambiguously assign a peptide to a unique transcript. Instead, we classified each peptide match according to whether it matched a single isoform of a gene, a group of isoforms of a gene, or all isoforms of a gene. The goal was to see if we could validate particular alternative splice patterns with unique peptide matches. We found 38k peptide hits to the Iso-Seq dataset, of which ~10k were associated with mutually inclusive exons; few hit mutually exclusive exons. Similar to Gloria’s original analysis, we found 358 novel peptides that were present in the PacBio Iso-Seq ORF predictions but not in Gencode. Most of these novel peptides are single amino acid substitutions.
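The three-way classification described above can be sketched in a few lines. This is a simplified illustration, not the pipeline we actually ran: the `classify_peptide` helper, the category labels, and the toy protein sequences are all hypothetical, and matching is done by exact substring search against each isoform’s predicted protein.

```python
def classify_peptide(peptide, isoform_proteins):
    """Classify a peptide by how many isoforms of one gene it matches.

    isoform_proteins: dict mapping isoform ID -> predicted protein sequence,
    all belonging to the same gene.
    """
    hits = [iso for iso, seq in isoform_proteins.items() if peptide in seq]
    if not hits:
        return "no_match"
    # Checked first, so a gene with a single isoform reports "all_isoforms".
    if len(hits) == len(isoform_proteins):
        return "all_isoforms"
    if len(hits) == 1:
        return "single_isoform"
    return "isoform_group"

# Toy gene with three isoforms that share an N-terminus but differ downstream.
isoforms = {
    "iso1": "MKTAYIAKQRLLEE",
    "iso2": "MKTAYIAKQRGGSW",
    "iso3": "MKTAYPPPQRGGSW",
}
for pep in ["MKTAY", "QRLLEE", "QRGGSW"]:
    print(pep, classify_peptide(pep, isoforms))
```

Only the `single_isoform` and `isoform_group` categories can support a specific splice pattern; a peptide that hits every isoform says the gene is translated but nothing about which isoform produced it.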

Example of multiple isoforms with peptides that matched all transcripts (black) or a subset of transcripts (yellow). Taken from Figure 4 of the paper. The exons shown in red are coupled together and are mutually exclusive with the alternative end shown in light blue.

While the mass spec findings weren’t world-changing, I was glad they made it into the manuscript. Transcription is only the first part of expression. For coding genes, post-transcriptional regulation and translation determine the ultimate cell fate. The mass spec analysis prompted me to think that perhaps, instead of treating individual isoforms as independent units, one should treat groups of isoforms that encode the same protein as a functional unit.

There seems to be both coordination and chaos in the making of mRNA. On one hand, the MCF-7 work shows that splicing is coordinated. On the other hand, certain genes seem very willing to transcribe an inordinate number of isoforms. An early Iso-Seq paper found 247 unique isoforms for the neurexin 1alpha gene. My collaboration with Flora Tassone forced me to think about the meaning of the 47 FMR1 isoforms detected mostly in premutation carriers of Fragile X-associated Tremor/Ataxia syndrome (FXTAS) but not in the control group. Did the disease state create transcriptional chaos that resulted in an elevation of the previously undocumented splicing patterns we observed? Do all of these isoforms get translated into proteins? For those that share the same open reading frame, does it matter which isoform is expressed, or does the total amount of expressed protein matter more?

Gloria and I are not alone in seeing the need to connect isoforms to proteins and to use sample-specific databases, instead of generic databases like Gencode, to validate mass spec data. At SMRT Leiden 2017, Gosia Komor in Fijneman’s lab at the Netherlands Cancer Institute was already plowing through their SW480 colorectal cancer cell line data to identify differentially expressed isoforms by combining mass spec, RNA-seq, and Iso-Seq data. They found that the Iso-Seq data not only increased the number of isoform-specific peptide matches but also identified peptides supporting intron retention events that their pipeline had previously missed. The proteomics community is transitioning out of the “short peptide” era as well, with Gloria’s former advisor Lloyd Smith recently penning a piece in Science calling for the top-down “Proteoform” approach.

Walking around Leiden after the conference was over.

Greater biological discovery will likely come from combining existing databases and multiple sequencing technologies. This increases the burden on bioinformatics analysis. Since the Iso-Seq method was developed, the list of community tools has been growing steadily. The latest of these tools, like SQANTI and TAPPAS, are trying to address the issue of combining heterogeneous data, functional annotation, and visualization.

I missed out on SMRT Leiden 2017, but plan to be in Leiden again this June. I am looking forward to another jaw-dropping experience.

Publication:

Anvar, S. Y. et al. Full-length mRNA sequencing uncovers a widespread coupling between transcription initiation and mRNA processing. Genome Biol. 19, 1–18 (2018).
