Single-cell sequencing on DNAnexus: Smart-Seq2 pipeline

Published in

DNAnexus Science Frontiers

9 min readOct 12, 2021

Single-cell sequencing on DNAnexus: Smart-Seq2 pipeline

Prepared by: Anna Vavrikova and Janka Puterova

It did not take long after the first sequencing of the human genome for next-generation sequencing (NGS) technologies to become a standard in biotechnology, and the pharmaceutical industry. For transcriptomic studies, the routine microarray method was replaced by RNA sequencing (RNA-Seq), which allowed for the study of other phenomena for which microarrays were not sufficient, such as alternative splicing and isoform usage, with greater accuracy and resolution. RNA-Seq is based on bulk, the result is average of all cells from the studied tissue.

In 2009, Tang and his colleagues performed the first mRNA-Seq based on a single cell, the blastomere. Their success led to an explosion of other single-cell sequencing protocols, enabling experiments involving more than one cell, using a small amount of starting material, with deeper sequencing, in a more automated way. Nowadays, as many as one million cells can be studied in one single-cell study.

Single-cell sequencing provides the same information as RNA-Seq, but with resolution at the single-cell level. This gives single-cell experiments some advantages, including the ability to study cell subpopulation within tissues, trajectory analysis, and so on.

However, even single-cell sequencing has its limitations. Data produced have a higher level of noise. Experiments also have low capture ability and therefore have inherently higher amounts of dropouts, producing zero-inflated data requiring special statistical approaches. Moreover, different single-cell protocols, such as Drop-Seq, STRT-Seq, MATQ-Seq, require different data processing and analysis. In many cases, the analyses require special tools, and only a few tools used in bulk RNA-Seq data processing can be applied on single-cell data.

As my internship project, I migrated the publicly available Smart-Seq2 multi-sample workflow written in workflow description language (WDL) from the Broad Institute’s project WARP (https://broadinstitute.github.io/warp/) to the DNAnexus Platform. This workflow will allow users to process data from increasingly popular Smart-Seq2 experiments on the platform. We have added an aligner selection option to the original workflow, and we also ported another workflow which builds all the necessary indexes and reference files for the Smart-Seq2 multi-sample workflow. We believe this approach will provide the user with a friendly secondary analysis environment.

Single-Cell Sequencing: Current Approaches

There are several single-cell sequencing protocols that have emerged since Tang’s 2009 publication. In general, they follow one of two approaches (summarized in Table 1).

The first is based on shallow sequencing of a large number of cells, in which only the 3' or 5' ends of the transcripts are sequenced. Examples of protocols belonging to this category are Drop-Seq, STRT-Seq or 10x Genomics Chromium. If you are interested in tools for analyzing 10x Chromium data, reach out to our support team at support@dnanexus.com.

In the second approach, the entire transcript of a smaller number of cells is sequenced. In contrast to the previous approach, the sequence libraries provide more depth. An example of such a method is SmartSeq.

The different approaches have the potential to answer diverse questions. The high throughput, shallowly sequenced libraries are suitable for discovering cell subpopulations in tissues. The low throughput, deeply sequenced libraries can be advantageous for the detection of genes with low expression levels, and for providing a broader spectrum of possible downstream analyses like isoform usage or allelic expression detection.

Table 1: Comparisons of two approaches used in single-cell experiments.

The SMART-SEQ2 Approach

As noted, Smart-Seq2 (SS2) is among the methods producing deep libraries with a smaller number of sequenced samples/cells. How does it work? Library preparation starts with sorting and isolating single cells, followed by cell lysis. To produce a full-length cDNA from an isolated mRNA, the template-switching mechanism is utilized (thus the origin of its name, Switching mechanism at the 5´ end of the RNA transcript). During cDNA synthesis, a special reverse transcriptase is used. It adds a few (2–5) nucleotides at the 3' end of a newly synthesized strand. These nucleotides then hybridize with template-switching oligonucleotide (TSO), which allows template switching for reverse transcriptase. Full-length cDNA is then amplified. In the next step, cDNA is ‘tagmented’ (addition of adapters and fragmentation by transposase). Fragments that arise from this procedure are then sequenced.

Subsequent (secondary) data analysis is partially similar to analysis performed when processing bulk RNA-Seq data. Inherent quality control and read trimming is the first step. The next step is almost always mapping to a reference sequence (either genome or transcriptome), and mappers/aligners commonly used for bulk RNA-Seq, such as STAR, HISAT2, or Bowtie2, can often be used. Note, however, that the Smart-Seq2 protocol does not support the usage of unique molecular identifiers (UMIs), which are typically used in protocols that are based on sequencing at either 5' or 3' end. Alignment can be further used for read quantification/expression estimation (FeatureCounts, RSEM, HT-Seq, pseudo-aligners) in order to obtain an expression matrix, which is an important input for tertiary analysis.

The Workflow

The ported WDL workflow was designed by Data Sciences Platform et al. from the Broad Institute. As a portable workflow language, WDL has the advantage that it can be run on a variety of systems that have a WDL execution engine. My task was to port it onto the DNAnexus Platform.

The workflow takes a set of files as an input. Each input file represents one cell/sample. For each cell, the same procedure is applied. It consists of two main processing branches, whose results are later combined into a loom file. Multi-sample pipeline overview is demonstrated in Figure 1.

**Figure 1: Multi-sample Smart-Seq2 pipeline overview.**

The first branch of the analysis focuses on quality control. The raw reads are aligned to the genome. The default aligner is HISAT2, a fast and sensitive aligner based on the graph version of BWT (Burrows-Wheeler transform) and FM index. The resulting alignment (in BAM format) is then subjected to analysis by Picard tools. Picard consists of a set of command-line tools for the analysis of high-throughput sequencing data. The workflow uses three Picard tools: CollectMultipleMetrics (calculates alignment metrics based on bam file), CollectRNASeqMetrics (looks at the distribution of aligned bases among transcripts, median coverage, or 3'/5' end coverage) and MarkDuplicates (identification of reads coming from the same molecule/fragment).

The second branch of the workflow is responsible for the quantification of expression. The quantification begins with alignment to the transcriptome by default performed by HISAT2. Parameters of HISAT2 are set so that the resulting BAM file is suitable for the RSEM model. RSEM-calculate-expression then estimates gene and isoform expression.

Most of the files generated by either branch are then grouped using SCTools. The result is a CSV file that contains various statistics and metrics. A reader can view the single-sample pipeline in Figure 2.

**Figure 2: Detailed workflow for a single sample.**

For convenience, the loom file is generated and contains all suitable information for tertiary analysis: RSEM count matrices (TPM and expected count), pipeline metadata, cell metadata (sample/cell type, name, sample metrics from alignment logs and Picard analyses, RSEM.cnt) and gene metadata (name(s) and identifiers based on GTF). Loom files from individual cells/samples are then combined into one final output loom file.

**Figure 3: Loom file structure, adopted and modified from:** **http://linnarssonlab.org/loompy/index.html**

Table 2: Overview of the software used in the Smart-Seq2 workflow.

Optimization

Building applets in WDL comes with several benefits that make running WDL workflows smooth and efficient. We were able to take advantage of the option to dynamically set runtime parameters on WDL applets.

We optimized the runtime parameters so that the execution price would be as low as possible for the pipeline version with HISAT2. We started by digging through past execution logs to see if the selected instance types were suitable, and that resources were not being underutilized. We were able to lower many requested resources, such as disk size and machine memory. For almost all tasks, we found that moderately reducing disk size can decrease costs significantly, and that increasing the number of CPUs can lead to reduced runtime. In the case of aligning and quantifying, the saved time can even compensate the price for a more powerful instance; see Figures 4 and 5 for graphs demonstrating price in relation to runtime parameters of HISAT2 genome alignment task and RSEM quantification task. Some tasks were already well-optimized by the authors of the workflow. All of my optimization attempts were always tested for just one cell, and the savings from parallel runs of multiple cells will add up.

An important note is that the pipeline was optimized on mouse and human data (reference genome size: 2.7 Gb, 3.1 Gb respectively). When using this pipeline for species with larger genomes, we recommend a more careful approach. But for all cases, all runtime parameters (CPUs, machine memory, disk storage, docker images) are accessible and adjustable.

**Figure 4: Price of HISAT2 genome runtime step for different runtime parameters. Disk size was computed dynamically; users can adjust the multiplier and buffer in this equation.**

**Figure 5: Price for RSEM runtime for different runtime parameters. Disk size was computed dynamically; users can adjust the storage buffer in this equation.**

Options to Customize the Workflow

In the course of porting the pipeline, we came up with an idea to extend the workflow by allowing a user to choose which aligner to use: STAR, Bowtie2, or the default HISAT2.

All of these aligners are compatible with RSEM. Aligning to the transcriptome is implemented in an RSEM-compatible manner. Both analysis branches — aligning to the genome and aligning to the transcriptome — are always executed by the same aligner/mapper. The only deviation from default style execution is when STAR is used for aligning to the transcriptome. It uses STAR genome index with set option, quantMode TranscriptomeSAM, which will make STAR output a BAM in transcriptome coordinates. That is also the reason why the reference annotation GTF file is compulsory input for a STAR-transcriptome task.

From a performance perspective, both HISAT2 and STAR look like solid choices. STAR has larger index files, which leads to its higher memory demands. Also, more care was given to HISAT2 when runtime parameters were optimized. Bowtie2 comes out of this comparison as a more expensive and time-consuming option (See Table 3).

Table 3: Aligner comparison. Building index is a one-time thing. For STAR, only the genome index is built. Mapping was performed using FASTQ with 34.5 million reads.

The workflow needs three types of indices and various other annotation files as inputs. To improve user experience and spare some pain, we also ported a workflow for building and obtaining all of the required files. The workflow starts with downloading a user-defined version of the reference sequence and annotation from the GENCODE database. Based on these files, it builds the genome index, RSEM index, and transcriptome index for the user’s choice of the aligner (it can build indices for all of them in one execution). In case you are not working with human and mouse data, or you do not want to work with GENCODE, the workflow also provides an option to download genome references (FASTA and annotation GTF) from user-provided URLs, and then proceeds in the same manner as described previously.

The workflow is currently available upon request; please contact support@dnanexus.com to obtain access.
This research was performed by Anna Vavrikova as part of her internship with DNAnexus. The project was supervised by Janka Puterova and assisted by Allison Regier, Ph.D., David Stanek, Ph.D., Jananan Pathmanathan, Ph.D., and Cecilie Boysen, Ph.D.

RESOURCES

Literature

Ding, J., Adiconis, X., Simmons, S.K. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 38, 737–746 (2020). https://doi.org/10.1038/s41587-020-0465-8

Andrews, T.S., Kiselev, V.Y., McCarthy, D. et al. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc 16, 1–9 (2021). https://doi.org/10.1038/s41596-020-00409-w

Picelli, S., Björklund, Å., Faridani, O. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat Methods 10, 1096–1098 (2013). https://doi.org/10.1038/nmeth.2639

Chen Geng, Ning Baitang, Shi Tieliu Single-Cell RNA-Seq Technologies and Related Computational Data Analysis. Frontiers in Genetics 10, 317 (2019). https://doi.org/10.3389/fgene.2019.00317

Data

Psarras A, Alase A, Antanaviciute A, et al. Functionally impaired plasmacytoid dendritic cells and non-haematopoietic sources of type I interferon characterize human autoimmunity. Nat Commun. 2020;11(1):6149. Published 2020 Dec 1. doi:10.1038/s41467–020–19918-z

Sloan SA, Darmanis S, Huber N, et al. Human Astrocyte Maturation Captured in 3D Cerebral Cortical Spheroids Derived from Pluripotent Stem Cells. Neuron. 2017;95(4):779–790.e6. doi:10.1016/j.neuron.2017.07.035