Implementing PAGA- A single-cell RNA sequencing technique on COVID-19 data

Part 2- Practical approach to RNA sequencing

Published in

The Research Nest

8 min readJun 13, 2021

Earlier in the first part, we discussed what single-cell sequencing is and what are the basic steps performed to make the biological data ready for downstream analysis. We also discussed PAGA — concept, steps, and application. In this part, let’s try implementing the same on the COVID-19 dataset and analyze and comprehend the results we get.

The link to the first part:

Understanding Single Cell RNA Sequencing

Part 1- Theory and the concepts

medium.com

The very first step is to import all the required libraries. The most widely used Python-based package that has all the tools for preprocessing, visualization, clustering, trajectory inference, and differential expression testing is scanpy. The link for the complete documentation page of the package could be found in the reference section.

Importing required packages

After the required packages are installed and imported, the next step is to read and pre-process data. Data is read in the form of anndata. Anndata stands for ‘Annotated Data’, which saves all the results obtained from tools used on the data under four attributes —

obs (annotations of observations)
uns (unstructured annotation)
var (annotations of variables/features)
X (data matrix containing the expression values from given input data).

All the annotations from tools majorly get stored in the obs and uns attributes. For instance, after clustering, the cluster numbers to which each cell belongs get stored in the obs attribute under the clustering header. And the data points computed using PCA, diffusion maps, etc. get stored in uns attribute under the respective headers. The uns data is retrieved whenever we plot graphs for downstream analysis.

It is a simple step. Say we need PCA coordinates, it could be accessed by the snippet adata.uns[‘pca'] . The figures below show a glimpse of how the anndata structure looks like.

Once the data is read into anndata format, we run a few pre-processing steps as shown in the below code.

Pre-processing data

The Weinreb tool normalizes the data which is then used to calculate the PCA coordinates that get saved in uns attribute of the anndata. Principal Component Analysis is a dimensionality reduction technique that allows us to visualize high-dimensional data along real axes called principal components. These components are calculated using the concept of eigenvectors (in simple words, the principal components are a linear combination of variables in the data). This helps to visualize the placement of cells in a 2-dimensional place and hence easily identify the nearest neighbors aka similar cells.

The next step is the neighborhood graph calculation, which would be used later when performing downstream tasks like clustering.

Using this information extracted and calculated, a simple layout of force-directed graph drawings is visualized to understand the topology of the given data. Then diffusion map coordinates (another dimensionality reduction technique) are calculated for the data and used to recompute and plot the graphical positions of the cells. Truncating at a few diffusion components amounts to denoising the graph — we just take a few of the first spectral components. We can even use PCA or tSNE (another tool to visualize high-dimensional data) coordinates instead.

The figure below shows the basic force-directed graph obtained. This could be further improved by coloring the cell clusters post clustering.

After pre-processing data, it’s time to implement PAGA. And before that, clustering is to be performed to define the nodes of the PAGA graph. There are several clustering algorithms available. I have used the Leiden clustering algorithm. The cluster information is saved in the obs attribute of the anndata under the ‘clusters’ header. It is a list of cell names and the corresponding cluster numbers. The sc.tl.paga code uses the PAGA tool that quantifies the connectivity of clusters (nodes) of the single-cell graph and generates a much simpler abstracted graph (PAGA graph) of clusters, in which edge weights represent confidence in the presence of connections.

The sc.pl.paga_compare line uses the full adjacency matrix of the abstracted graph and the weights correspond to confidence in the connectivities of clusters returned by the PAGA tool line. It presents the PAGA graph adjacent to the force-directed graph plotted earlier.

Using the PAGA coordinates calculated as additional feature information, the force-directed graph is plotted to improve the topology.

Clustering and PAGA implementation

The following figures represent the plot results obtained after running the above code.

It is possible to profile these clusters identified using the highly variant genes representing them. Here, profiling of clusters is the process of identifying the terminal cell types that the clusters represent. This could be achieved by analyzing the high presence of the marker genes of the cell types in each of the clusters. The highly variant genes that typically represent the characteristics of a cell type are called its marker genes. This information could be extracted from various existing biological resources and stored as the ground truth. By identifying the marker genes of all the clusters, we can compare them with the ground truth and annotate the corresponding clusters.

There are two widely used methods used for cluster-wise marker gene identification. They are the t-test and the Wilcoxon test. The main difference is that while the t-test tests whether the average difference between two gene expressions in a cluster is 0, the Wilcoxon test tests whether the difference between the two has a mean signed-rank of 0.

Cluster-wise Marker Genes Identification

The following figure shows the result of running the above code to extract marker genes cluster-wise. The list obtained from both the tests are stored in the obs attribute of anndata under ‘rank_genes_groups_ttest’ and ‘rank_genes_groups_wilcoxon’ respectively.

It is also possible to analyze the above results visually using heatmaps. Let’s use the umap tool for dimensionality reduction of data. Using the list of the most highly variant gene in each cluster, we can visualize the umap plot individually for each gene. If the cluster is represented by the color in the higher end of the heatmap, then the marker gene is significantly present in the corresponding cluster when compared to the rest. In addition, we could even analyze the similarities in cluster characteristics if the marker gene is significantly present in multiple clusters according to the heatmap.

Another tool to visualize gene expression per cluster is the stacked violin plot. Violin plots are a method of plotting numeric data and we can find the same information as in the box plots — median, interquartile range (describes the middle 50% of values when ordered from lowest to highest), and the lower/upper adjacent values (the smallest/largest observation that is greater/less than or equal to the lower/upper limit). One significant advantage of violin plots is that it shows the entire distribution of the data in addition to showing the abovementioned statistics.

umap plots of extracted Marker Genes

The following figures show the results obtained after running the above snippet.

Umap plots for the marker genes identified by the Wilcoxon test.

As can be seen in the umap plots, the marker genes are expressed significantly higher in one cluster compared to others, though all the cells are covid infected. Each of these marker genes code proteins that perform unique functions. For example, the ‘ENS00000125538’ gene codes protein that tackles inflammations while the ‘ENSG00000163884’ gene codes protein that regulates glucose and amino acid metabolism.

The reason behind why certain infected cells highly express one kind of gene when compared to others will be of greater significance for the pathologists to research more and better understand the characteristics of the viral infection. This is just one small example of how it is possible to accurately automate the extraction of key characteristics of the data in question to aid health researchers in solving the puzzle. One interesting idea that could be implemented is to run the same code in the gene expression data from healthy individuals if available and compare the results to identify the differences.

The above figure represents the stacked violin plot obtained that shows the marker gene expression changes across clusters.

Conclusion

Using COVID-19 data, we tried to implement PAGA, extract the information and visualize it. These visualizations ease the analysis of results when compared to raw data. After importing the required packages, we read the data in the form of anndata, a widely used data format for various biological implementations. Following this, we normalized the data and computed PCA to extract the neighborhood information from the two-dimensional representation obtained. We then visualized the data via a force-directed graph that was later enhanced using the diffusion map coordinates calculated.

Using these graphical points, clustering was performed. PAGA was implemented on these clusters, quantifying the connectivity (edges) between these clusters (nodes). To help profile these clusters to understand the connecting relationship between them, we extracted the marker genes from each cluster using two methods, t-test and Wilcoxon. These marker genes were visualized using heatmaps after dimensionality reduction using the umap method. The trend of these marker gene expressions in different clusters was also captured using violin plots.

Since the COVID-19 dataset is inclined more towards genetic characteristics of infected cells, the concept of trajectory inference of development of stem cells to terminal cell types is not applicable here. With the help of the scanpy documentation, I highly encourage you to use hematopoietic cell data and try to plot the different trajectories and analyze the marker gene expression changes along these paths. I have provided the link to the documentation and dataset in the reference section.

Mask yourself, get vaccinated, and stay safe!

References:

scanpy documentation — Scanpy — Single-Cell Analysis in Python — Scanpy 1.7.2 documentation
PAGA source code — theislab/paga: Mapping out the coarse-grained connectivity structures of complex manifolds.
COVID-19 dataset — swab_gene_counts_final.txt
Hematopoietic cells dataset — raw_data
Anndata documentation — anndata — Annotated Data — anndata 0.7.7.dev2+ge7a3a3e documentation

Implementing PAGA- A single-cell RNA sequencing technique on COVID-19 data

Part 2- Practical approach to RNA sequencing

Understanding Single Cell RNA Sequencing

Part 1- Theory and the concepts

Conclusion

Written by Deepti Saravanan