Understanding Single Cell RNA Sequencing

Part 1: Theory and Concepts

Deepti Saravanan
The Research Nest
8 min read · Jun 6, 2021

The most fascinating aspect of Computer Science is how well it gels with other disciplines and supports research in each of them. One such broad field is Computational Biology.

Computational Biology is the science of developing algorithms and training mathematical models to capture and analyze interconnections in biological data that the human eye might easily miss.

Applications range from training neural networks to understand protein patterns, to automatically constructing the trajectory along which stem cells develop into terminal cells, using RNA data from different stages of cell development. The latter builds on single-cell RNA sequencing (scRNA-seq), which is discussed in detail in this two-part blog. Part 1 briefly introduces the basic concepts, while Part 2 walks through the code in detail, implemented on COVID-19 data. Let’s dive into the theory now!

APPLICATIONS

Single-cell sequencing is an ongoing research topic with many exciting results. A few potential applications include:

  • Understanding the development of hematopoietic cells (from stem cells to various terminal cells such as white blood cell types).
  • Modeling the erratic behavior of cancer cells by looking for peaks or other notable changes in their RNA patterns relative to healthy cells.
  • Analyzing the genetic pattern and behavior of newly emerging diseases during pandemics, like the ongoing one.

Isn’t this amazing? Without further ado, let’s learn some basics!

DATA PREPROCESSING

Data preprocessing is an essential step in any machine learning application. Since we are dealing with biological data here, the preprocessing steps are specific to this kind of data. Not to worry, they are pretty interesting and easy to understand.

1. FastQC

A typical single-cell sequencing dataset comes as a CSV (comma-separated values) file. The row headers are the cell names and the column headers are the gene labels. Each value denotes the expression level of that gene in the respective cell, as shown in the figure below.
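
To make the layout concrete, here is a minimal sketch that reads such a cells-by-genes table using only the Python standard library. The cell names, gene names, values, and the helper `load_expression_matrix` are all made up for illustration; a real dataset would be far larger and may use a different orientation or delimiter.

```python
import csv
import io

# Hypothetical toy data: rows are cells, columns are genes (names are made up).
CSV_TEXT = """cell,GeneA,GeneB,GeneC
cell_1,0,5,12
cell_2,3,0,7
cell_3,0,0,1
"""

def load_expression_matrix(handle):
    """Read a cells-by-genes CSV into (cell_names, gene_names, matrix)."""
    reader = csv.reader(handle)
    header = next(reader)
    gene_names = header[1:]          # the first column holds the cell names
    cell_names, matrix = [], []
    for row in reader:
        if not row:                  # skip blank lines
            continue
        cell_names.append(row[0])
        matrix.append([float(value) for value in row[1:]])
    return cell_names, gene_names, matrix

cells, genes, counts = load_expression_matrix(io.StringIO(CSV_TEXT))
print(cells)       # names of the cells (row headers)
print(genes)       # names of the genes (column headers)
print(counts[0])   # expression values of every gene in the first cell
```

For a file on disk, the same function should work with `open(path, newline="")` in place of the `io.StringIO` wrapper.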

These expression values are derived from sequencing reads generated in the lab. The result of any model is highly affected by the quality of the input data, so it is imperative to run quality checks on the data before using it. Several tools are available for this.

The most widely used tool is FastQC (QC stands for Quality Control, while Fast may not imply that the process is quicker xP). It takes sequencing data as input and returns a report on read quality. More information on the factors whose quality is checked and the different versions available is on its website: FastQC A Quality Control tool for High Throughput Sequence Data. Based on this report, we can decide to eliminate the part of the data corresponding to the cells that lower the overall quality.
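
To get a feel for what a read-quality check looks at, here is a small sketch that computes the mean Phred quality score of each read in a FASTQ snippet. This is not FastQC and does not reproduce its report; it only illustrates one kind of per-read quality measure. The reads, the helper `mean_phred_per_read`, and the cutoff of 30 are assumptions for illustration, and the snippet assumes the common Phred+33 encoding.

```python
import io

# Hypothetical two-read FASTQ snippet; real files hold millions of reads.
FASTQ_TEXT = """@read_1
ACGT
+
IIII
@read_2
ACGT
+
!!II
"""

def mean_phred_per_read(handle, offset=33):
    """Yield (read_id, mean Phred score) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:                          # end of input
            break
        handle.readline()                       # sequence line (unused here)
        handle.readline()                       # '+' separator line
        quality = handle.readline().strip()
        scores = [ord(ch) - offset for ch in quality]
        yield header[1:], sum(scores) / len(scores)

qualities = dict(mean_phred_per_read(io.StringIO(FASTQ_TEXT)))

# Assumed cutoff: flag reads whose mean quality falls below 30.
low_quality = [read for read, q in qualities.items() if q < 30]
print(qualities)
print(low_quality)
```

Here `read_1` scores 40 on every base, while `read_2` has two bases with score 0 and ends up flagged.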

2. Data Visualization

After step 1, we have good-quality data. Great! The next step is to understand what the data represents and the relationships within it. The best way to do this is with visualization tools that bring out the patterns in the data. A few preprocessing steps involved are:

  • Filter cells: Filter out outlier cells based on counts and the number of genes expressed.
  • Filter genes: Filter out genes based on the number of cells expressing them or on counts.
  • Principal Component Analysis (PCA): Calculates the PCA coordinates based on the variance in the gene expression data.
  • Normalization: Normalize counts per cell.
  • Recipe_weinreb17: Helps in the normalization and visualization of high-dimensional expression data.
  • Neighbors: Compute a neighborhood graph of the observations (cells).
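
Several of these steps are simple enough to sketch in plain Python. The snippet below is a toy illustration, not a library implementation: the counts, the thresholds, and the helper names (`filter_cells`, `filter_genes`, `normalize_per_cell`, `knn_graph`) are assumptions chosen for this example. PCA is left out, so the neighborhood graph here is built directly on the normalized counts rather than on PCA coordinates.

```python
import math

# Toy counts: rows are cells, columns are genes (values are made up).
counts = [
    [10, 0, 5, 0],   # cell 0
    [0, 0, 1, 0],    # cell 1: only one gene expressed
    [4, 0, 6, 10],   # cell 2
    [8, 0, 2, 10],   # cell 3
]

def filter_cells(matrix, min_genes):
    """Keep cells that express (count > 0) at least `min_genes` genes."""
    return [row for row in matrix if sum(v > 0 for v in row) >= min_genes]

def filter_genes(matrix, min_cells):
    """Keep genes (columns) expressed in at least `min_cells` cells."""
    n_genes = len(matrix[0])
    keep = [g for g in range(n_genes)
            if sum(row[g] > 0 for row in matrix) >= min_cells]
    return [[row[g] for g in keep] for row in matrix], keep

def normalize_per_cell(matrix, target_sum):
    """Scale each cell so that its counts add up to `target_sum`."""
    return [[v * target_sum / sum(row) for v in row] for row in matrix]

def knn_graph(matrix, k):
    """For each cell, list the indices of its k nearest cells (Euclidean)."""
    neighbors = []
    for i, row in enumerate(matrix):
        dists = sorted((math.dist(row, other), j)
                       for j, other in enumerate(matrix) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

cells_kept = filter_cells(counts, min_genes=2)             # drops cell 1
genes_kept, kept_gene_idx = filter_genes(cells_kept, min_cells=1)
normalized = normalize_per_cell(genes_kept, target_sum=100)
neighbors = knn_graph(normalized, k=1)
print(kept_gene_idx)
print(normalized)
print(neighbors)
```

The order matters: cells and genes are filtered first so that empty rows and columns do not distort the per-cell normalization.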

3. Downstream Analysis

We have now reached the most interesting part of the field! With the preprocessed data from the previous step, several biological analyses can be performed, depending on the problem statement. For the topic under discussion, recently proposed models use statistical concepts to piece together the pattern puzzle. Models with state-of-the-art (SOTA) performance include PAGA (partition-based graph abstraction) and TotalVI (total Variational Inference).

PAGA treats every cluster of cells as a node and estimates the edge weights between these nodes (clusters), which can be used to analyze relationships and shared characteristics within and across clusters. TotalVI, on the other hand, estimates the posterior distribution of the given data using the prior distribution followed by the expression data (both gene and protein data). Here, the posterior distribution represents ‘the biological state’ of the cells in a latent space, which can be modeled to understand the characteristics these cells exhibit. The novelty of TotalVI is that it combines RNA and protein expression data via joint probabilistic modeling, widening the scope to connect transcriptional variation with cell phenotypes and functions. Here, we will discuss the working mechanism behind PAGA in detail.

PAGA

Consider a supermarket where items are grouped at different places: canned food, baking ingredients, chocolates, vegetables, and so on. Are these groups arranged randomly, or in an order that reflects how similar they are, say in usage, like chocolates and cookies? Is it possible to automatically identify the relationships between groups and place similar ones next to each other, so that a customer buying from one group is more likely to also buy from the group adjacent to it? Identifying relationships between clusters in this way is exactly what PAGA explores in a biological setting.

Partition-based graph abstraction (PAGA) generates a topology-preserving map of cells at multiple resolutions. It preserves both the continuous and the disconnected structure of the data. After clustering, each group (partition) of cells is represented as a node, and the weighted edges between nodes represent a statistical measure of connectivity between partitions.
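
The idea of a cluster-level graph with weighted edges can be sketched as follows. This is a simplified stand-in and not the exact statistic from the PAGA paper: for every pair of clusters it divides the observed number of cell-to-cell edges between them by the number expected if edge endpoints were paired at random. The graph, the labels, and the helper `cluster_connectivity` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical undirected cell-cell graph (e.g. a kNN graph) and cluster labels.
edges = [(0, 1), (1, 2), (0, 2),      # cluster "A" is densely connected
         (3, 4), (4, 5), (3, 5),      # cluster "B" is densely connected
         (6, 7),                      # cluster "C"
         (2, 3), (1, 4)]              # two edges bridge A and B
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "C", 7: "C"}

def cluster_connectivity(edges, labels):
    """Observed/expected inter-cluster edge ratio for every cluster pair."""
    degree_sum = Counter()            # total degree of the cells in a cluster
    observed = Counter()              # edge count per unordered cluster pair
    for u, v in edges:
        degree_sum[labels[u]] += 1
        degree_sum[labels[v]] += 1
        if labels[u] != labels[v]:
            observed[frozenset((labels[u], labels[v]))] += 1
    total_degree = 2 * len(edges)
    scores = {}
    for a, b in combinations(sorted(degree_sum), 2):
        # Expected number of edges if edge endpoints were paired at random.
        expected = degree_sum[a] * degree_sum[b] / total_degree
        scores[(a, b)] = observed[frozenset((a, b))] / expected
    return scores

scores = cluster_connectivity(edges, labels)
for pair, score in scores.items():
    print(pair, round(score, 4))
```

Clusters A and B get a nonzero weight because two edges bridge them, while C stays disconnected from both, which mirrors how the abstracted graph keeps both connected and disconnected structure.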

Combining high-confidence paths in the PAGA graph with a random-walk-based distance measure on the single-cell graph, the cells are ordered within each partition according to their distance from a root/stem cell. A PAGA path then averages all single-cell paths that pass through the corresponding groups of cells. This allows tracing gene expression changes along complex trajectories at single-cell resolution.

PAGA is useful for analyzing hematopoiesis, the production of all of the cellular components of blood and blood plasma. Among these components, monocytes are a type of leukocyte or white blood cell; neutrophils are white blood cells that help heal damaged tissues and resolve infections; and basophils are white blood cells containing granules full of enzymes. PAGA captures features such as the proximity of, and connections between, the roots of two similar blood cell types.

Connectivity is measured using test statistics for two different cases: graphs with constant and with arbitrary degree distributions. The concept of feature-space-based connectivity is used to determine how thick the connected region is. Random walks on the graph are described using a Markov process (this method is undirected and independent of specific gene expressions) and the normalized Laplacian (this method uses a reference distribution of single-cell paths to calculate eigenvectors). I have attached the corresponding link in the reference section for more detailed mathematical proofs and analysis.
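
For readers who have not met these two objects before, here is a minimal sketch of the textbook definitions on a tiny graph: the random-walk transition matrix T = D^(-1) A and the symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2), where A is the adjacency matrix and D the diagonal matrix of degrees. The three-cell graph and the helper names are assumptions for illustration, and the code assumes every cell has at least one neighbor. How exactly PAGA uses these objects is covered in the paper, not here.

```python
import math

# Hypothetical symmetric adjacency matrix of a 3-cell graph (path 0 - 1 - 2).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

def transition_matrix(adj):
    """Row-normalize the adjacency matrix: T[i][j] = A[i][j] / degree(i)."""
    return [[w / sum(row) for w in row] for row in adj]

def normalized_laplacian(adj):
    """Symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2)."""
    deg = [sum(row) for row in adj]   # assumes no isolated cells (degree > 0)
    n = len(adj)
    return [[(1.0 if i == j else 0.0) - adj[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]

def step(distribution, T):
    """Advance a probability distribution over cells by one random-walk step."""
    n = len(T)
    return [sum(distribution[i] * T[i][j] for i in range(n)) for j in range(n)]

T = transition_matrix(A)
L = normalized_laplacian(A)
after_one_step = step([1.0, 0.0, 0.0], T)   # the walker starts on cell 0
print(T)
print(after_one_step)
```

Each row of `T` sums to 1, so it can be read as the probabilities of hopping from one cell to each of its neighbors.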

The execution steps for PAGA involve preprocessing the data, clustering, graph construction, and analysis. The steps with code will be discussed in detail in Part 2. I have attached the following figures for hematopoiesis to give a brief idea of the process as a whole.

The figure illustrates the embedding of the hematopoiesis cell types after preprocessing and clustering. In the legend, stem denotes the root cell cluster, no_gate denotes the transitional cells, and the rest denote the terminal blood cell types.

The adjacent figures represent the different clusters identified and their corresponding cell type annotations if the clusters represent terminal cells. The first part shows different clusters in different colors while the second one color-codes only the terminal cell type clusters.

The next step is the PAGA graph construction, as shown in the figures. The graph on the right is constructed from the cluster embedding on the left. The edges connecting the clusters define the various paths of stem cell development and the corresponding gene changes and characteristics.

With the PAGA graph constructed, we can calculate the pseudotime of different cells from the root cell cluster which could be used to construct the trajectory from the stem cell to the terminal cells. The figure shows the heatmap of the calculated pseudotime ordering of the cells with respect to cluster 6.
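
To show what "ordering cells from a root" means, here is a deliberately crude sketch. It is not the random-walk-based distance that PAGA relies on: it swaps in plain breadth-first hop distance from a root cell and rescales it to the range 0 to 1. The graph, the choice of root, and the helper `hop_pseudotime` are assumptions for illustration.

```python
from collections import deque

# Hypothetical cell-cell graph as an adjacency list; cell 0 is the chosen root.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

def hop_pseudotime(graph, root):
    """Breadth-first hop distance from `root`, scaled to the range [0, 1]."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:        # first visit is the shortest
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    farthest = max(distance.values()) or 1      # avoid dividing by zero
    return {node: d / farthest for node, d in distance.items()}

pseudotime = hop_pseudotime(graph, root=0)
ordering = sorted(pseudotime, key=pseudotime.get)   # root first, farthest last
print(pseudotime)
print(ordering)
```

Cells close to the root get values near 0 and the most distant cells get values near 1, which is the same reading as the pseudotime heatmap, only with a much simpler distance.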

Another kind of analysis involves visualizing the changes in the expression of various genes along the development path of a terminal cell type from a stem cell. The figure shows a heatmap representation capturing the same along the development paths of three important blood cell types. This aids in understanding the behavior and characteristics of cell types (like cancer cells) and how the root cells branch out into unique terminal cell types.
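
A rough sketch of the idea behind such a heatmap: pick a path through the clusters and summarize one gene's expression in each cluster along the way. The expression values, the cluster names other than stem, the path, and the helper `expression_along_path` are made up for illustration, and averaging per cluster is coarser than the per-cell view in the figure.

```python
from statistics import mean

# Hypothetical expression of one gene per cell, plus each cell's cluster label.
expression = [1.0, 3.0, 4.0, 6.0, 10.0, 12.0]
clusters = ["stem", "stem", "progenitor", "progenitor", "monocyte", "monocyte"]

# Assumed path through the cluster graph, from the root to one terminal type.
path = ["stem", "progenitor", "monocyte"]

def expression_along_path(expression, clusters, path):
    """Mean expression of the gene in each cluster, in path order."""
    profile = []
    for cluster in path:
        values = [e for e, c in zip(expression, clusters) if c == cluster]
        profile.append(mean(values))
    return profile

profile = expression_along_path(expression, clusters, path)
print(profile)
```

With these toy numbers the profile rises from 2.0 to 5.0 to 11.0, the kind of trend a heatmap row would show as a gene switching on along the path.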

MORE TO EXPLORE

Computational Biology is an emerging field full of exciting opportunities and topics yet to be explored. The topic we discussed today is pretty simple yet powerful. One amazing characteristic of this field is that, given the variety of patterns we can extract from such a wide range of analyses, there is no fixed scope of research. It encourages out-of-the-box, wacky ideas, many of which might lead to a big boom in rightly questioning and understanding natural processes. So yeah, if you folks are intrigued by the topic we discussed in this part, feel free to explore more. A few topics to give you a head start:

Part 2 will discuss the coding version in detail, implemented using the COVID RNA dataset.
