RNA Velocity Analysis Part 1 — Generating Spliced and Unspliced counts

7 min readMar 13, 2024

In this tutorial, we’ll be looking into how spliced and unspliced counts can be generated to carry out RNA Velocity analysis using scVelo.

Let us first understand what spliced and unspliced counts/reads mean before proceeding with the tools to generate them.

Imagine that we are baking cookies. Our recipe is like a gene in a cell, and the cookies we bake are like the proteins that the gene makes.
- Spliced Counts: Spliced counts are like counting how many fully baked cookies you have on your cooling rack. These cookies are ready to eat because they’re fully baked, just like spliced RNA molecules are ready to be used by the cell to make proteins.
- Unspliced Counts: Unspliced counts are like counting how many cookies are still in the oven or are only halfway baked. They’re not fully ready yet, just like unspliced RNA molecules still need some processing before they can be used by the cell.
So, by counting both spliced and unspliced RNA molecules, scientists can understand how many “fully baked cookies” (spliced RNA) and how many “halfway baked cookies” (unspliced RNA) are in each cell. This helps them understand how active each gene is in making proteins, which is important for understanding how cells function and respond to different conditions.

Jumping back into generating the spliced and unspliced counts, the Requirements for the process are as follows:
- 10x Genomics cloud account or Cell Ranger CLI setup in your system.
- Access to a HPC environment (with conda or Python support)
- velocyto CLI tool

Cell Ranger, developed by 10x Genomics, is a popular software package commonly used for the analysis of single-cell RNA sequencing (scRNA-seq) data generated from their Chromium platform. This raw data typically consists of sequencing reads generated from individual cells, which need to be processed, aligned to a reference genome, and quantified to obtain gene expression profiles for each cell. Cell Ranger automates these steps and provides various output files containing processed data, quality control metrics, analysis results, and visualizations to help researchers interpret and analyze their scRNA-seq experiments. When we run Cell Ranger on our scRNA-seq data, we typically obtain several output files containing various types of information about our experiment.

A typical cellranger output directory looks like the below hierarchy:

Sample_Name/outs contains the results of the Cell Ranger analysis and they are as follows:

(NOTE: Sample_Name is the directory named after our sample.)

analysis/: This directory contains results from clustering and differential expression analysis.
cloupe.cloupe: This file is used for visualization in the 10x Loupe browser.
filtered_feature_bc_matrix/: This directory contains the filtered count matrix, including barcodes, features, and counts.
- barcodes.tsv.gz: File containing cell barcodes.
- features.tsv.gz: File containing gene IDs.
- matrix.mtx.gz: File containing the count of unique UMIs for each gene in each cell.
filtered_feature_bc_matrix.h5: HDF5 file format of the filtered count matrix.
metrics_summary.csv: Summary metrics from the analysis.
molecule_info.h5: HDF5 file containing per-molecule read information.
possorted_genome_bam.bam: Aligned reads in BAM format.
possorted_genome_bam.bam.bai: BAM index.
raw_feature_bc_matrix/: This directory contains the raw count matrix.
- barcodes.tsv.gz: File containing cell barcodes.
- features.tsv.gz: File containing gene IDs.
- matrix.mtx.gz: File containing the raw count of unique UMIs for each.
raw_feature_bc_matrix.h5: HDF5 file format of the raw count matrix.
web_summary.html: Summary report of the analysis.

In this tutorial, we assume that the cellranger pipeline has been utilized and the output files have been generated beforehand. If the files have not been generated, here is a short to-do on how that can be done.

Setup cellranger pipeline in your system by following the steps mentioned in the Installation and CLI tool documentation with the respective configurations.
If you do not want to install and setup things in your system, you can make use of 10x Genomics Cloud Analysis provided by 10x Genomics.

NOTE: For both the options, ensure that the below files are present:

The raw sequencing data (FASTQ files) generated from the scRNA-seq experiment. These files contain the reads from individual cells that need to be processed.
The reference transcriptome that is to be used for alignment and quantification. This typically involves selecting a pre-built reference from the 10x Genomics reference library or uploading a custom reference, depending on the experimental setup and species.

Up next, let’s gain access to a HPC(High Performance Computing) environment, and follow the below steps to install and setup velocyto!

Firstly, ensure that the cluster allows loading Anaconda/Python modules. Here, we will be using conda(which can also be replicated using pip).
Once the modules have been loaded, we have to create a new conda environment as:

conda create -n environment_name python=3.9

Now, let’s activate the environment by running:

conda activate environment_name

source activate environment_name

Now, let’s install the required packages by:

conda install numpy scipy cython numba matplotlib scikit-learn h5py click

and

conda install -c bioconda samtools

Finally, let’s install velocyto by running:

pip install velocyto

We should now be able to check if velocyto was successfully installed. We can do this by running:

velocyto --help

velocyto --version

which will provide us with the options available with velocyto or the version number of the package.

If you wish to install from source and not using pip, please refer to the commands provided here.

AWESOME! We are almost there now :)

Now comes the most important part where we have to use velocyto to generate the spliced and unspliced counts. But wait, what exactly is velocyto and what does it do?

Velocyto, is a software tool used for studying how gene expression changes over time in single-cell RNA sequencing data. It helps us figure out the speed and direction of RNA movement in individual cells by separating unspliced and spliced RNA in standard single-cell RNA sequencing methods.

Jumping back to the tool and setup, velocyto provides us with subcommands for running the tool on 10x Chromium Samples, SmartSeq2 Samples, Dropseq samples, etc., It is recommended to use the provided subcommands from the documentation for the easiest way to run Velocyto. These commands simplify the process by handling many configuration details automatically.

Advanced users seeking more flexibility should utilize the “velocyto run” command directly. This allows customization of the analysis pipeline to match specific experiment parameters. Additionally, users can adapt Velocyto to custom or new techniques by modifying the counting pipeline without a complete rewrite. Detailed guidance on creating new logic for analysis needs can be found in the API section.

For simplicity, we run velocyto on 10x Chromium subcommand as the sample that is going to be used for RNA Velocity Analysis was generated using the Cell Rangerpipeline. We will be providing examples for both— “velocyto run” and “velocyto run10x”

i) velocyto run

Please ensure that the paths for the below files are available:

barcodes.tsv,
output directory (where you want the .loom file to be saved),
the appropriate expressed repeat annotation,
the bam file, and
the genome annotation (.gtf file).

ii) velocyto run10x

Please ensure that the paths for the below files are available:

the appropriate expressed repeat annotation,
the cellranger sample output directory, and
the genome annotation file (.gtf file)

Please note the difference between the repeat annotation file and genome annotation file. They both are not the same and ensure that you use the appropriate paths. The genome annotation file(genes.gtf) is the one that is provided with the cellranger pipeline, and the repeat_msk.gtf is the repeat masker file that can be downloaded from UCSC Genome Browser.

If using the first option, use the below command:

velocyto run -b barcodes.tsv -o scvelo_out -m hg38_rmsk.gtf possorted_genome_bam.bam genes.gtf

If using the second option, use the below command:

velocyto run10x -m repeat_msk.gtf mypath/sample01 somepath/refdata-cellranger-mm10–1.2.0/genes/genes.gtf

For wider options and command line arguments, please refer to the official documentation.

Now that the commands are ready, we are good to submit a batch script to the server. Since we are making use of a HPC environment, we can also optionally add arguments to the run command(s) like the ones below to parallelize the process.

-@, — — samtools-threads: The number of threads to use to sort the BAM by cell ID file using samtools
— — samtools-memory: The number of MB used for every thread by samtools to sort the BAM file

Upon successful submission of the batch script to the server, we should now be able to see the below output being generated from the log file:

The process takes close to 2–3 hours and might vary depending on the file size and the computational power, but once done we get the below message and a .loom file generated!

Congratulations for successfully coming to the end of this Part 1 of the RNA Velocity Analysis tutorial and generating the counts successfully! If you encountered issues and were not able to generate the file successfully, please go through the common issues from the GitHub repository of velocyto:

Now that the .loom file has been generated, let us look at how this can be utilized by scVelo to add the spliced and unspliced slots to the adata object and carry out RNA Velocity analysis in the next part of this tutorial!

I extend my sincere gratitude to Professor Dr. James Cai, Dr. Selim Romero Gonzalez, and Shreyan Gupta for their invaluable assistance, encouragement, and profound motivation throughout the execution of this tutorial.

Stay tuned and thanks for reading!

RNA Velocity Analysis Part 1 — Generating Spliced and Unspliced counts

Written by Vishnuvasan Raghuraman