Decoding A Coastal Giant: Full-Length Transcriptome of Sequoia sempervirens

Jan 18, 2021

*Visiting the Humboldt Redwoods State Park with my children*

Ancient Giants of The Land

One of the great things about living in California is being able to experience the wonders of nature up close and personal.

The coastal redwood, Sequoia sempervirens, is one of the tallest living trees on Earth. It grows only in a narrow strip of land along the Pacific coast, mostly in California. The trees are magnificently tall, often reaching up to 300 feet in height.

I first learned about the coastal redwoods of Humboldt Redwoods State Park when I took my children to the California Academy of Sciences in 2018. For the Giants of Land and Sea exhibit, they created a 360-degree interactive display of the state park’s redwood forest, showing how the ecosystem varied from the forest floor to the canopy. I was immediately mesmerized by these giant, beautiful beings. As soon as I had the opportunity, I took my children camping and hiking in Humboldt Redwoods State Park.

Walking through the redwoods was awe-inspiring. I distinctly remember how quiet it was. If there were birds, they were so high up in the canopy, I could not hear them. Our smallness was contrasted against the enormity of these silent giants. If there ever was a place to be humbled, this would be it.

It is even more humbling to have the opportunity to work on the transcriptome for this giant genome.

*Fallen redwood tree at Humboldt Redwoods State Park*

A Genome Fit For A Giant

While much of the genomics research in the spotlight focuses on human health, sequencing also has an important role in preserving biodiversity. Efforts such as the Darwin Tree of Life and the International Barcode of Life are racing against time to capture the uniqueness of many species on Earth. Not all species are as hardy as redwoods or have the same potential to regrow.

In 2020, my colleague Michelle Vierra, inspired by what she heard at the Plant & Animal Genome Conference, led the effort at PacBio to collect and sequence S. sempervirens. It was a daunting task at the time, because the coastal redwood is a hexaploid with a 26 Gb genome! Using HiFi reads and Hifiasm for assembly, the team achieved a genome assembly size of 48.5 Gb, a contig N50 of 3.8 Mb, and BUSCO completeness of 61%. At first glance, the BUSCO score seems low and may lead one to think that the genome is incomplete. However, the BUSCO scores are based on the embryophyta database, which is quite far removed from conifers, so it’s not shocking that the number is so low. A closer database would be ideal. In fact, 61% represents a nearly 10% increase over other recent conifer genomes such as the loblolly pine. A more accurate assessment, as I will argue in the following sections, is to use a matching full-length transcriptome (Iso-Seq) library.
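For readers less familiar with the contig N50 statistic quoted above, here is a minimal sketch of the standard definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are made up for illustration and are not the redwood assembly data.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): total is 300, so half is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_contigs))  # 80 + 70 = 150 reaches half, so N50 is 70
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is commonly reported alongside total assembly size.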

Redwood Iso-Seq Reveals Complex Alternative Splicing

Michelle later went back to collect needles from the same redwood tree. Iso-Seq libraries were made and sequenced by the University of Delaware DNA Sequencing and Genotyping Center. Two Sequel II SMRT Cells yielded a total of 5.3 million full-length reads. After running the standard Iso-Seq analysis, I mapped the resulting transcripts to the hifiasm v12 assembly of the PacBio redwood genome.

Of the 336,853 high-quality (HQ) Iso-Seq transcripts, 323,720 (96.1%) mapped to the PacBio v12 genome with greater than 95% alignment coverage and 90% alignment identity. We obtained a total of 69,198 mapped loci and 205,792 unique, full-length transcripts. (This is done without any filtering, so the final number of genes is likely lower.)
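The coverage and identity cutoffs can be pictured as a simple filter over alignment records. The sketch below is only an illustration under my own assumptions: the `Alignment` record, its field names, and the toy values are hypothetical, and the strict "greater than" comparison is my reading of the thresholds rather than the exact behavior of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    transcript_id: str
    coverage: float  # fraction of the transcript that aligned, 0..1
    identity: float  # fraction of matching bases within the alignment, 0..1


def passes_filter(aln, min_coverage=0.95, min_identity=0.90):
    """Keep alignments above both the coverage and identity thresholds."""
    return aln.coverage > min_coverage and aln.identity > min_identity


# Toy records (hypothetical values, not the redwood data).
alignments = [
    Alignment("t1", 0.99, 0.98),  # passes both thresholds
    Alignment("t2", 0.97, 0.85),  # identity too low
    Alignment("t3", 0.60, 0.99),  # coverage too low (e.g. a partial mapping)
    Alignment("t4", 0.96, 0.91),  # passes both thresholds
]

kept = [a for a in alignments if passes_filter(a)]
mapped_fraction = len(kept) / len(alignments)
print([a.transcript_id for a in kept], f"{mapped_fraction:.1%}")
```

Transcripts that fail the coverage cutoff are often the interesting ones, since a partial mapping can point to a missing or misassembled region.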

*Figure 1. Mapped redwood Iso-Seq transcript length distribution and isoform complexity.*

The mapped transcripts ranged from 50 bp to 14.2 kb with a mean length of 2.9 kb. While most of the loci had only 1–5 isoforms, there were many that displayed complex alternative splicing patterns, highlighting the power of full-length transcript sequencing (see: Figure 2a).
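To make the "isoforms per locus" idea concrete, here is a small sketch that tallies isoforms for each locus from transcript IDs. It assumes IDs of the form `PB.<locus>.<isoform>` (for example `PB.209.1`), which is my assumption about the naming scheme. The ID list itself is a toy example.

```python
from collections import Counter


def locus_of(transcript_id):
    """Strip the trailing isoform index: 'PB.209.1' -> 'PB.209'."""
    return transcript_id.rsplit(".", 1)[0]


def isoforms_per_locus(transcript_ids):
    """Count how many isoforms were observed at each locus."""
    return Counter(locus_of(t) for t in transcript_ids)


# Toy IDs (assumed naming scheme, not the released dataset).
toy_ids = ["PB.1.1", "PB.1.2", "PB.2.1", "PB.209.1", "PB.209.2", "PB.209.3"]
counts = isoforms_per_locus(toy_ids)
for locus, n in sorted(counts.items()):
    print(locus, n)
```

A histogram of these counts gives the kind of isoform complexity distribution described above.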

Figure 2. Iso-Seq mapping example. (a) BLASTN hits a predicted DEAD-box ATP-dependent RNA helicase 20-like transcript. (b) Alternative splicing at a putative zinc finger locus resulted in predicted ORF changes.

I found several aspects of the Iso-Seq data exciting. One was the ability to see alternative splicing. Another was the ability to predict ORFs directly from the sequences. And lastly, when I BLASTed the Iso-Seq transcripts back to the NR database, many of the hits were to other plants with annotations such as “hypothetical protein” and “unknown mRNA”. An example is shown in Figure 2b, where the gene PB.209 showed three isoforms, two of which result in the same predicted ORF. The BLASTN hit was to “unknown mRNA”, but when I did a BLASTP using the predicted ORFs, it hit a zinc finger protein with ~50% similarity.
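Because full-length transcripts need no assembly, a first-pass ORF call can be made straight from the sequence. The following is a deliberately naive sketch, not the ORF predictor used for this dataset: it scans only the three forward frames for the longest ATG-to-stop stretch, and it assumes the transcript is already in sense orientation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, orf_seq) for the longest forward-strand ORF, or None.

    An ORF here runs from an ATG through the first in-frame stop codon.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    if best is None:
        return None
    return best[0], best[1], seq[best[0]:best[1]]


# Toy transcript: the ORF starts at position 2 and ends with TAA.
print(longest_orf("CCATGAAATTTGGGTAACC"))
```

Running this on each isoform of a locus is one simple way to see whether a splicing difference changes the predicted protein or leaves it untouched.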

Isoform Phasing For A Hexaploid Genome? Yes, Please!

I had developed IsoPhase for maize (see Wang et al. 2020), which has a diploid genome. I did not know if IsoPhase was going to work for a complex hexaploid tree genome…and it appears that it does!

Figure 3 shows an example of the power of combining a phased genome with a phased transcriptome. There are six genome haplotigs at this locus, but two of them diverged only at non-coding regions, so IsoPhase only reported 5 alleles.

Figure 3. IsoPhase showing 5 distinct alleles. (top track) genome haplotigs mapped back to the contig; (middle) Iso-Seq full-length reads grouped and colored by IsoPhase-inferred alleles; (bottom) Iso-Seq transcripts. Note that, while there are 6 genomic haplotigs, two alleles diverge only in SNPs in the non-coding region of this locus.
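Why can six haplotigs yield only five transcript alleles? The toy sketch below is not IsoPhase's algorithm. It only illustrates the counting: haplotigs that are identical at every transcribed SNP site cannot be told apart from transcript reads. The haplotig names, SNP strings, and site indices are all hypothetical.

```python
def transcribed_alleles(haplotigs, transcribed_sites):
    """Group haplotigs that carry the same bases at all transcribed SNP sites."""
    groups = {}
    for name, bases in haplotigs.items():
        key = "".join(bases[i] for i in transcribed_sites)
        groups.setdefault(key, []).append(name)
    return groups


# Each string lists the base at four SNP sites; site 3 is assumed non-coding.
toy_haplotigs = {
    "h1": "ACGT",
    "h2": "ACGA",  # differs from h1 only at the non-coding site
    "h3": "TCGT",
    "h4": "AGGT",
    "h5": "ACTT",
    "h6": "TGTT",
}

alleles = transcribed_alleles(toy_haplotigs, transcribed_sites=[0, 1, 2])
print(len(toy_haplotigs), "haplotigs ->", len(alleles), "transcript alleles")
for key, members in alleles.items():
    print(key, members)
```

In this toy case h1 and h2 collapse into one group, so six genomic haplotypes produce five distinguishable alleles at the transcript level.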

Not all IsoPhase loci looked this clear-cut, of course. In some cases, IsoPhase flagged loci that were likely homologous genes grouped together by the Iso-Seq cluster algorithm. This isn’t necessarily a bad thing, as Iso-Seq cluster was designed to identify differences at the splicing (exonic) level. In fact, given that the PacBio redwood genome has not been annotated, IsoPhase (and Cogent, as shown in the next section) can be used as a tool to identify either homologous gene families or genome assembly issues.

Using Iso-Seq To Assess Genome Quality

I’ve always thought using Iso-Seq to assess genome assembly was an interesting and important idea. In Warr et al., we used Iso-Seq analysis to identify 5 missing genes in the assembly.

The first question I looked into after mapping the Iso-Seq transcripts to the genome was — what were the genes that were completely unmapped? Were they missing genes?

To answer this question, I used Cogent, another tool I developed, to identify gene families based on the k-mer similarities of the Iso-Seq transcripts. Once gene families were identified, I ran BLASTN to see if the unmapped or poorly mapped Iso-Seq gene families had any hits to the NR database.
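The idea of grouping transcripts by k-mer similarity can be sketched in a few lines. This is not Cogent's actual implementation: it uses plain Jaccard similarity over k-mer sets with single-linkage grouping, and the value of `k`, the 0.5 threshold, and the toy sequences are all illustrative assumptions.

```python
def kmer_set(seq, k=5):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared k-mers divided by total distinct k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_families(seqs, k=5, min_similarity=0.5):
    """Single-linkage grouping: link any two transcripts above the threshold."""
    names = list(seqs)
    kmers = {n: kmer_set(seqs[n], k) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(kmers[a], kmers[b]) >= min_similarity:
                parent[find(a)] = find(b)

    families = {}
    for n in names:
        families.setdefault(find(n), []).append(n)
    return sorted(sorted(f) for f in families.values())


# Toy transcripts: t1 and t2 differ by one base, t3 is unrelated.
toy = {
    "t1": "ACGTACGTTGCAAGCT",
    "t2": "ACGTACGTTGCAAGCA",
    "t3": "TTTTGGGGCCCCAAAA",
}
print(group_families(toy))
```

Because the grouping needs no reference genome, it works equally well on transcripts that failed to map, which is what makes this kind of approach useful for hunting missing genes.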

A substantial number of the Iso-Seq gene families from the Cogent result had BLASTN hits to chloroplast. The chloroplast is not expected to be part of the PacBio redwood genome assembly. Meanwhile, some gene families mapped to ribosomal proteins. More work needs to be done to look into whether these are truly missing genes.

Figure 4 shows an example of an Iso-Seq gene family that contained two sets of transcripts, mapped approximately 300 kb apart. Neither set of transcripts mapped fully (both were soft-clipped at the end), and both had the same BLASTN hits to an ATP-dependent helicase gene. The genomic read coverage was uneven across the second half of this region, indicating a potential assembly issue.

Figure 4. Using Cogent to identify partially mapped Iso-Seq gene families that could identify genome assembly issues. (top) Mapped coverage and alignment of genomic HiFi reads, which were input to the Hifiasm assembly. (bottom) Mapping of Iso-Seq transcripts.

Summarizing Redwood Iso-Seq Analysis Results

The high mappability of the Iso-Seq data to the PacBio v12 genome has shown that the genome assembly is quite complete in terms of coding regions. The Iso-Seq data can be phased to identify allele-specific isoforms. Missing genes or difficult-to-assemble gene regions can be assessed using Iso-Seq transcripts. Finally, the Iso-Seq transcripts can be directly used for ORF prediction.

Want To See More Redwood Iso-Seq? Dig In!

There’s a lot more to be analyzed for this dataset! We’ve released the Iso-Seq dataset, including the transcript sequences, GFF files, BLASTN hits, and IsoPhase and Cogent results. We welcome the community to use this dataset for research and tool development, and to give us feedback.

The redwood Iso-Seq data has been released at: https://downloads.pacbcloud.com/public/dataset/redwood2020/isoseq/

Acknowledgement: Thanks to my coworker Greg who assembled the redwood genome and helped me with the assembly assessment! And of course Michelle for spearheading the project and doing the sample collection!

Written by Liz T