Capture Cell Heterogeneity in Single Cell RNA-seq by Topic Modeling (Part One)

Yang Xu
3 min read · Oct 6, 2019


Motivation

A few days ago, I came across a preprint on bioRxiv (https://www.biorxiv.org/content/biorxiv/early/2019/01/30/534800.full.pdf) that applied topic modeling to single-cell chromatin contact data. That study was inspired by earlier work published in Nature Methods (https://www.nature.com/articles/s41592-019-0367-1). Both studies binarized the sequencing signal (0 or 1) and used Latent Dirichlet Allocation (LDA) to extract latent topics. Instead of analyzing text documents, they treated each cell as a “document” and each chromatin site or chromatin contact as a “word”. This made me think: if topic modeling works for single-cell ATAC-seq and Hi-C, it could be just as useful for single-cell RNA-seq data.

So I set out to implement topic modeling for single-cell RNA-seq. I first downloaded the example data from the Seurat tutorial at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html. This dataset contains around 2,700 single cells, and Seurat is a well-known R package for single-cell RNA-seq analysis. Because computational analysis of scRNA-seq is well studied and well developed, this example dataset lets me better evaluate how well or how poorly topic modeling works for scRNA-seq.

Preprocessing

Compared with scATAC-seq and scHi-C, scRNA-seq tends to produce a less sparse dataset, so I didn’t binarize gene expression as the two studies did. However, I would argue that some preprocessing is still necessary. I noticed this when I fed raw UMI counts directly into an LDA model: all topics consisted of almost the same gene set, differing only in the gene weights. I suspect this is because these genes are 1) commonly expressed across cells, or 2) well captured by sequencing. I therefore turned to non-negative matrix factorization (NMF), which takes tf-idf input. Here the raw UMI counts are converted into a tf-idf matrix, which accounts for word (gene) frequency relative to document (cell) size and down-weights genes that appear in most cells.
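As a rough illustration of this preprocessing step, here is a minimal sketch using scikit-learn. It is not the code from the repository: the toy count matrix, the default `TfidfTransformer` settings, and the NMF parameters (`init`, `max_iter`, three topics) are all assumptions chosen for the example. Note also that scikit-learn's naming (rows of the input are cells, so the first factor is cells x topics) may not match the w and h naming used in this post.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer

rng = np.random.default_rng(0)

# Toy stand-in for a cells x genes UMI count matrix (NOT the PBMC data).
n_cells, n_genes, n_topics = 60, 40, 3
counts = rng.poisson(lam=2.0, size=(n_cells, n_genes))

# Treat each cell as a "document" and each gene as a "word".
# Default settings are an assumption; the post does not state its tf-idf options.
tfidf = TfidfTransformer().fit_transform(counts)

# Factorize the tf-idf matrix into two non-negative factors.
model = NMF(n_components=n_topics, init="nndsvd", random_state=0, max_iter=500)
cell_topic = model.fit_transform(tfidf)  # cells x topics
topic_gene = model.components_           # topics x genes

print(cell_topic.shape, topic_gene.shape)
```

With real data, `counts` would be the cells x genes UMI matrix loaded from the 10x files, ideally kept sparse.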

Choice of the Number of Topics

To test how many topics are enough to represent the data, I ran the factorization with the number of topics ranging from 1 to 50. As the number of topics increases, the reconstructed matrix w*h gets closer and closer to the original data.
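The scan over the number of topics can be sketched as follows. This is an illustrative example on synthetic counts with a few planted gene programs, not the analysis from the post: the data generator, the short list of k values, and the NMF settings are assumptions. It records scikit-learn's `reconstruction_err_`, the Frobenius norm of the difference between the input and the product of the two factors.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer

rng = np.random.default_rng(0)

# Synthetic cells x genes counts built from 5 planted gene programs (toy data).
n_cells, n_genes, n_planted = 80, 50, 5
usage = rng.gamma(shape=0.5, scale=1.0, size=(n_cells, n_planted))
programs = rng.gamma(shape=0.5, scale=4.0, size=(n_planted, n_genes))
counts = rng.poisson(usage @ programs)
tfidf = TfidfTransformer().fit_transform(counts)

# Fit NMF for several topic numbers and record the reconstruction error.
errors = {}
for k in (1, 2, 5, 10):
    model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
    model.fit(tfidf)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"k={k:2d}  reconstruction error={err:.3f}")
```

Because the error generally keeps shrinking as more topics are added, this curve alone does not pick a number of topics; looking for where the improvement flattens is one common heuristic.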

I arbitrarily chose 20 topics and visualized the 2,700 single cells with three different dimension reduction methods.
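The post does not name the three methods, so the sketch below is only an assumption about what such a step might look like: it projects a hypothetical cells x topics matrix to two dimensions with PCA and t-SNE from scikit-learn. The random matrix stands in for real NMF output, and the `perplexity` value is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical cells x topics matrix standing in for NMF output with 20 topics.
cell_topic = rng.random((90, 20))

# Linear projection onto the first two principal components.
pca_xy = PCA(n_components=2, random_state=0).fit_transform(cell_topic)

# Non-linear embedding; perplexity must be smaller than the number of cells.
tsne_xy = TSNE(n_components=2, perplexity=20, init="pca",
               random_state=0).fit_transform(cell_topic)

print(pca_xy.shape, tsne_xy.shape)
```

Each row of `pca_xy` or `tsne_xy` gives the 2D coordinates of one cell, ready for a scatter plot.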

Clustering

Based on the dimension reduction, I conclude that topic modeling is able to capture cell heterogeneity in this scRNA-seq dataset. I then ran graph-based clustering to group the 2,700 single cells into 15 clusters.
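The post does not say which graph-based algorithm was used. Community detection such as Louvain or Leiden on a nearest-neighbor graph is the more common choice in single-cell toolkits; as a stand-in that needs only scikit-learn, the sketch below uses spectral clustering on a k-nearest-neighbor graph instead. The synthetic cells x topics matrix, the three planted groups, and `n_neighbors=10` are all assumptions for the example.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Synthetic cells x topics weights: each planted group favors one topic.
n_groups, cells_per_group, n_topics = 3, 40, 6
blocks = []
for g in range(n_groups):
    alpha = np.full(n_topics, 0.3)
    alpha[g] = 5.0  # dominant topic for this group
    blocks.append(rng.dirichlet(alpha, size=cells_per_group))
cell_topic = np.vstack(blocks)

# Build a k-nearest-neighbor graph over cells and partition it.
# Spectral clustering is a swapped-in technique, not the post's method.
clusterer = SpectralClustering(n_clusters=n_groups,
                               affinity="nearest_neighbors",
                               n_neighbors=10,
                               random_state=0)
labels = clusterer.fit_predict(cell_topic)

print(np.bincount(labels))
```

Unlike Louvain-style methods, spectral clustering needs the number of clusters up front, so it would not by itself arrive at a number like 15.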

The results also show that each cell cluster is associated with different topics, and each topic is driven by a different gene set.

For example, topic 3 is strongly enriched in Cluster 2 but not in other cell clusters. It will be interesting to examine the w matrix, which reflects the relationship between genes and topics (to be updated soon).
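One simple way to check which topic a cluster is enriched for is to average the per-cell topic weights within each cluster. The helper below, `mean_topic_weight_per_cluster`, is a hypothetical function written for this illustration, and the tiny matrix is made up; it assumes a cells x topics weight matrix and one cluster label per cell.

```python
import numpy as np

def mean_topic_weight_per_cluster(cell_topic, labels):
    """Return cluster ids and a clusters x topics matrix of mean topic weights."""
    clusters = np.unique(labels)
    means = np.vstack([cell_topic[labels == c].mean(axis=0) for c in clusters])
    return clusters, means

# Made-up example: 4 cells, 3 topics, 2 clusters.
cell_topic = np.array([[0.9, 0.1, 0.0],
                       [0.8, 0.2, 0.0],
                       [0.1, 0.1, 0.8],
                       [0.0, 0.2, 0.8]])
labels = np.array([0, 0, 1, 1])

clusters, means = mean_topic_weight_per_cluster(cell_topic, labels)
top_topic = means.argmax(axis=1)  # index of the strongest topic per cluster

for c, t in zip(clusters, top_topic):
    print(f"cluster {c}: strongest topic index {t}")
```

Here cluster 0 is dominated by topic index 0 and cluster 1 by topic index 2. A heatmap of `means` is a natural way to show the same information for many clusters and topics.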

Code Accessibility

https://github.com/ImXman/scRNA-seq_topic_modeling
