Finding Differentiating Isoforms in Bone Marrow Subpopulations: an interview with Anne
I still have the original email from Anne Deslattes Mays, sent to me on a Saturday in January of 2014, when I was working in a coffee shop in the Bay Area.
The email was a complaint about having issues installing and running the first iteration of the Iso-Seq software I had released on GitHub (called ToFU at the time). At that time, the Iso-Seq software was not yet available through PacBio’s official SMRT Analysis suite.
I remember sitting in the coffee shop thinking, “Oh, someone actually is using my code!” In case it’s not clear, having someone else use your code is a badge of honor for a developer.
Five years later, the project Anne was working on for her PhD at the time — using full-length transcriptome sequencing to analyze human bone marrow cell populations — is finally published in Genes this month. Good things come to those who wait.
Though one may think the data is now old (it was sequenced on the PacBio RS II platform), the study remains one of the few human Iso-Seq papers that look beyond the complex splicing happening at the transcription level and ask what is, in my mind, a much tougher question: Does any of the complex splicing matter at the functional level? Anne’s study may have only scratched the surface, but I hope that by reading my Q&A interview with Anne (with input from Anton) below, readers may be inspired to tackle this challenging problem.
What was the motivation for using Iso-Seq for your PhD thesis?
ANNE: Let me first provide the backdrop for doing full-length transcriptome sequencing.
At that time, I had been working with Keygene to create a bioinformatics group in the US. Keygene is a plant genomics company based in the Netherlands. Because Keygene relies on traditional breeding methods, improving plants involves searching for negative regulators. I created a discovery platform for them, KeySeeQ™, based on time-course data from plants subjected to different stress conditions. At first, we worked with microarray data, where there was no ambiguity as to which probes mapped to which genes. Then we moved on to short-read RNA-seq data and suffered multi-mapping issues. The fragmented nature of the short-read data meant we could not obtain full open reading frame (ORF) information, and this was particularly devastating when working on organisms lacking reference genomes.
I arrived at the idea to use long read RNA-seq — the PacBio Iso-Seq method — which was new and mostly unheard of at that time, to get unfragmented full-length transcripts. I was also a PhD student in Anton Wellstein’s lab at Georgetown University at that time, so I decided to apply Iso-Seq to human bone marrow cells.
Why choose human bone marrow cells?
Anton had been working with human bone marrow cells for some time and had written a protocol permitting us to collect left-over bone marrow samples that would have otherwise been discarded in the waste at the hospital. Thus, sample collection was easy. Further, with bone marrow, we had the advantage of extracting phenotypically distinct cell populations using antibodies as markers. Though we had previously obtained gene expression that separated the cell populations using short read RNA-seq data, we were dissatisfied with the lack of clear biological differences. So, we decided to see what the new Iso-Seq method would tell us.
You took human bone marrow samples and segregated them into lineage-positive and lineage-negative — did you already know that there would be certain genes/isoforms that segregate the populations?
Yes! At a minimum, we expected most pulled-down cells would have surface markers matching the antibodies we used. The lineage-positive cells were targeted by dextran-coated magnetic particles and tetrameric antibody complexes recognizing CD2 (T-cell surface antigen), CD3 (T-cell surface glycoprotein), CD5 (another T-cell surface glycoprotein), CD11b (Integrin alpha-M, ITGAM), CD11c (Integrin alpha-X, ITGAX), CD14 (Monocyte differentiation antigen), CD16 (low affinity immunoglobulin gamma Fc region receptor III, FCGR3A or FCGR3B), CD19 (B-lymphocyte antigen), CD24 (signal transducer with a pivotal role in cell differentiation), CD61 (Integrin beta-3, platelet membrane glycoprotein IIIa), CD66b (Carcinoembryonic antigen-related cell adhesion molecule 8) and Glycophorin A (CD235a, the major intrinsic membrane protein of the erythrocyte).
We ran this several times to get our lineage negative cell population as clean as possible and increase our cell count for sequencing.
The study focuses on two genes to demonstrate the difference between full-length RNA-seq (Iso-Seq) and short-read RNA-seq. For the EEF1A1 gene, you found only 4 isoforms using short reads but more than 40 isoforms using Iso-Seq. Why do you think short reads found only 4 while Iso-Seq found so many?
It is unlikely that short read sequencing depth is the issue, since we had 100 million reads for the lineage negative population and 20 million reads for the total population. We believe that there is an inherent threshold for the number of isoforms detectable by short read RNA-seq despite the possible increase in the number of exons/junctions detected.
This question was addressed in the paper’s Figure 5: The number of transcript isoforms detected by short read RNA-seq data remained much below 10 regardless of the number of exons for that transcript, whereas the number of detected isoforms increased with the number of exons when using PacBio Iso-Seq (FL RNA-seq in the figure). More than half of our transcripts in the Iso-Seq data mapped to loci with four or more exons and nearly a third of them mapped to loci with eight or more exons. In the short-read RNA-seq data, only 13% mapped to loci with more than 4 exons and only 5% to loci with more than 8 exons (Table S5c).
Another issue with short read data is with highly paralogous genes (Figure 5c). A good example is the CFD gene, which is co-located with ELANE; their proteins are 78% homologous (Figure S2). Another example is the HLA genes, which are >80% identical. Paralogs add complexity and prevent correct transcript assembly from fragmented short reads.
Why is it important to do mass spectrometry validation?
Mass spectrometry is an orthogonal technology that confirms that these novel transcript isoforms are in fact getting translated! Proteomics is still a challenging field and only produces fragmented data (through tryptic digestion). With the full-length transcriptome data, we are now beginning to peer into the proteomics space that was not previously achievable using short read transcript assemblies. The ability to augment an ORF database using long read RNA-seq data is huge for discovery work.
For example, for the EEF1A1 gene, the previously unknown N7 protein isoform contained a unique tryptic peptide fragment that was distinct from the canonical protein and was validated by mass spectrometry (see Figure 3 in paper).
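To make the idea of a "unique tryptic peptide" concrete, here is a minimal sketch of an in silico tryptic digest. It is not the pipeline used in the paper: the function names, the minimum peptide length, and the toy sequences are all made up for illustration, and it assumes the simple rule that trypsin cleaves after K or R unless the next residue is P, with no missed cleavages.

```python
import re

def tryptic_digest(protein, min_len=6):
    """Return the set of peptides from a simplified tryptic digest.

    Assumed rule: cleave after K or R, except when followed by P.
    Missed cleavages are ignored. Requires Python 3.7+ (re.split on
    zero-width matches).
    """
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in peptides if len(p) >= min_len}

def unique_peptides(isoform, other_isoforms, min_len=6):
    """Peptides produced by `isoform` but by none of `other_isoforms`."""
    seen_elsewhere = set()
    for other in other_isoforms:
        seen_elsewhere |= tryptic_digest(other, min_len)
    return tryptic_digest(isoform, min_len) - seen_elsewhere

# Toy sequences (hypothetical, not real EEF1A1 isoforms).
canonical = "MAAAKDDDDDRGGGGGK"
novel = "MAAAKDDEDDRGGGGGK"
print(unique_peptides(novel, [canonical], min_len=4))
```

A peptide returned by `unique_peptides` is one that, if observed in the mass spectrometry data, could only have come from the novel isoform in this toy setting.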
Another example shows the proteins for HLA-A, HLA-B, and HLA-C confirmed by non-targeted shotgun proteomics data extracted from the lineage-negative cell population. We translated the Iso-Seq transcripts into ORFs to make a database for mass spec matching and were able to confirm the expression of the HLA genes.
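Translating transcripts into ORFs to build a search database can be sketched roughly as follows. This is an illustrative simplification, not the method from the paper: it assumes transcripts are already in sense orientation, takes the longest ATG-to-stop ORF across the three forward frames, and uses hypothetical transcript IDs and a hypothetical `min_aa` cutoff.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def longest_orf(transcript):
    """Longest complete ORF (ATG to stop) in the three forward frames."""
    seq = transcript.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        protein, in_orf = "", False
        for i in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
            if not in_orf:
                if aa == "M":
                    in_orf, protein = True, "M"
            elif aa == "*":
                if len(protein) > len(best):
                    best = protein
                in_orf = False
            else:
                protein += aa
    return best

def build_orf_database(transcripts, min_aa=50):
    """Map transcript id -> predicted protein, dropping short ORFs."""
    database = {}
    for name, seq in transcripts.items():
        protein = longest_orf(seq)
        if len(protein) >= min_aa:
            database[name] = protein
    return database

# Toy input with made-up IDs; real transcripts would be far longer.
toy = {"PB.1.1": "CCATGGCTAAATAGGG", "PB.1.2": "CCCCCC"}
print(build_orf_database(toy, min_aa=3))
```

The resulting dictionary could be written out as FASTA and handed to a proteomics search engine; real ORF predictors handle incomplete ORFs and alternative start codons, which this sketch deliberately skips.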
What did the Iso-Seq data reveal about the differentiated bone marrow cell populations?
Hematopoiesis has been the paradigm for the evolution of differentiated cells from progenitor cells. That was one motivation to study human hematopoietic progenitor and differentiated cell transcriptomes. Quite strikingly, the Iso-Seq data showed that we can distinguish differentiated from progenitor cells by just looking at individual gene isoforms. The gene examples picked in the paper show that very well (Fig. 2 and 4): There is no need to use gene expression levels; the qualitative difference alone is informative about cell subpopulations.
Where do you see the potential for Iso-Seq?
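The qualitative comparison described here, presence or absence of an isoform rather than its expression level, amounts to simple set arithmetic. The sketch below is only an illustration under that assumption; the population labels, gene names, and isoform IDs are hypothetical.

```python
def population_specific_isoforms(detected):
    """Find isoforms seen in exactly one population, per gene.

    `detected` maps population -> {gene -> set of isoform ids}.
    Returns gene -> {population -> isoforms found only there}.
    """
    genes = set()
    for by_gene in detected.values():
        genes |= set(by_gene)

    result = {}
    for gene in sorted(genes):
        per_pop = {pop: by_gene.get(gene, set())
                   for pop, by_gene in detected.items()}
        specific = {}
        for pop, isoforms in per_pop.items():
            others = set()
            for other_pop, other_isoforms in per_pop.items():
                if other_pop != pop:
                    others |= other_isoforms
            only_here = isoforms - others
            if only_here:
                specific[pop] = only_here
        if specific:
            result[gene] = specific
    return result

# Made-up detections for two populations.
detected = {
    "lin_neg": {"GENE_A": {"iso1", "iso2"}, "GENE_B": {"iso1"}},
    "lin_pos": {"GENE_A": {"iso1", "iso3"}, "GENE_B": {"iso1"}},
}
print(population_specific_isoforms(detected))
```

In this toy input, `GENE_B` tells the populations apart in no way, while `GENE_A` has one isoform private to each, which is the kind of qualitative signal the answer refers to.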
People should be using Iso-Seq for a lot more discovery work! Making Iso-Seq quantitative would be an important step. We would also need more proteomics data, for sure.
A lesson I learned from doing plant work, which I am now bringing to the human and mouse world, is that while it is nice to have a reference genome (or reference genomes), it is not essential. If it were up to me, I would want to start with a baseline transcriptome, then do a time-course experiment with contrasting phenotypes and identify causal factors.
And instead of doing gene-level enrichment analysis, we should start doing isoform-level enrichment analysis. Traditionally, you would do a gene set enrichment analysis to get Gene Ontology (GO) terms that tell you what the differentiating features are at the gene level. Now that we have isoform-level information, the differentiating features may be the gain or loss of protein domains. How would that affect the protein networks? The tools are already there; we just need to expand them to understand isoforms instead of genes, and protein domains instead of proteins.
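One way to picture isoform-level enrichment is to swap genes for isoforms and GO terms for protein domains in a standard over-representation test. The sketch below uses a hypergeometric upper-tail test; it is an assumption about how such an analysis might look, not an existing tool, and the isoform IDs and domain names are invented. It applies no multiple-testing correction.

```python
from math import comb  # Python 3.8+

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when drawing n items from N, of which K are successes."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return hits / total

def domain_enrichment(foreground, domains_by_isoform):
    """Per-domain p-value for over-representation in `foreground`.

    `domains_by_isoform` maps every background isoform id to the set of
    protein domains it encodes; `foreground` is a subset of those ids
    (for example, isoforms specific to one cell population).
    """
    background = set(domains_by_isoform)
    fg = set(foreground) & background
    N, n = len(background), len(fg)
    all_domains = set()
    for doms in domains_by_isoform.values():
        all_domains |= doms

    pvalues = {}
    for domain in sorted(all_domains):
        K = sum(domain in domains_by_isoform[i] for i in background)
        k = sum(domain in domains_by_isoform[i] for i in fg)
        pvalues[domain] = hypergeom_upper_tail(k, K, n, N)
    return pvalues

# Ten made-up isoforms; three carry a hypothetical "kinase" domain.
annotations = {f"iso{i}": set() for i in range(10)}
for name in ("iso0", "iso1", "iso2"):
    annotations[name] = {"kinase"}
print(domain_enrichment(["iso0", "iso1", "iso2"], annotations))
```

A low p-value for a domain suggests that the foreground isoforms gained it more often than chance would predict; the same function run on domains absent from the foreground would address domain loss.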