Filling in the blanks of sc-Seq analysis with scCoGAPS

Shashank Jatav · Elucidata
Jan 11, 2021 · 4 min read

Let’s talk about the philosophy of genes. I am calling it philosophy because, by definition, philosophy is the study of beliefs about the meaning of life. There was an early belief that genes give rise to traits, or even behavior. Thomas Hunt Morgan demonstrated that this belief is true by linking a single gene to eye color in fruit flies. That should make diseases, their causes, and other complicated traits easy to figure out, right? It turns out, however, that this is not a one-to-one relationship. Instead, many genes work together, sometimes differently in different biological contexts. High-throughput technologies generate tons of data nowadays; wouldn’t that help solve the conundrum? Not really, because researchers tend to look at each dataset individually and derive insights only from that particular dataset. Comparing across datasets means dealing with technical batch effects, which quickly becomes prohibitive. So how do researchers get around this? They curate their insights and put their knowledge into databases that other researchers can refer to. But those curators are human too, and they can only look at a few datasets.

Let us step back a bit and consider this problem in the context of computer vision. Computer vision has accomplished something that, until only a few years ago, humans alone could do: machines can now develop a high-level understanding of images. If you’re shown a picture of a cat and asked what the subject is, you recognize features such as the face, nose, and whiskers based on prior knowledge and say it’s a cat. For a machine, though, each picture is just a huge pile of pixel data, and separating the cat from everything else used to be almost impossible. Advances in computer vision and the development of deep neural nets have changed that. Neural nets can generalize the features of a face (such as a nose) or any other object and apply those features to understand a new, similar object. The image below illustrates how features look at each stage of a CNN (convolutional neural network). From the input, the network learns filters that optimally explain the image; as we move from left to right, facial features emerge until a fully reconstructed face is finally obtained.
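Neither the paper nor this post ships any vision code, but if you want to poke at the layer-by-layer idea yourself, here is a toy PyTorch sketch (a made-up network, not anything from the paper) that lets you watch the feature maps change shape as they pass through successive stages:

```python
# Toy sketch: inspect intermediate feature maps of a tiny CNN.
# Early layers tend to capture edges/textures; later layers capture
# more abstract, object-level features.
import torch
import torch.nn as nn

layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # low-level filters
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # mid-level parts
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # higher-level features
)

x = torch.randn(1, 3, 64, 64)  # stand-in "image"
for i, layer in enumerate(layers):
    x = layer(x)
    print(f"after layer {i} ({layer.__class__.__name__}): {tuple(x.shape)}")
```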

This approach works for natural language as well. Neural nets can pick up patterns embedded in the way we speak or write. The method is the same: find features that represent these patterns and then use them to understand new inputs (in this case, sentences).

Coming back to biology, isn’t gene expression the language spoken by our genes? Yes, it is. In this case, the features that neural nets could learn are pathways or genesets. Unfortunately, these genesets are rarely updated as new datasets come in, often lack biological context, and there is no way we can keep up if we curate them manually.

Now let’s get to the paper. This paper, published in Cell Systems, solves some of the key issues we discussed earlier. Firstly, it removes or reduces technical noise by summarizing each dataset into a small number of features, which makes datasets comparable. Secondly, it makes single-cell datasets comparable across omics, which makes integration of datasets possible. The authors do not use neural nets, but they apply the same principle of reducing thousands of genes to a small number of relevant features. The techniques used here, called scCoGAPS and projectR, are in spirit similar to principal component analysis (PCA) and other dimensionality reduction techniques; scCoGAPS is built on non-negative matrix factorization, so the learned features stay non-negative and are easier to read as sets of co-expressed genes.
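scCoGAPS itself is an R/Bioconductor package built on a Bayesian flavour of non-negative matrix factorization, so the snippet below is only a rough Python sketch of the underlying idea, using scikit-learn’s ordinary NMF on a made-up expression matrix: a genes-by-cells matrix D is decomposed into an amplitude matrix A (genes × patterns) and a pattern matrix P (patterns × cells).

```python
# Minimal sketch of the factorization idea behind scCoGAPS (not the actual
# Bayesian NMF it uses): decompose a genes-by-cells expression matrix D into
# D ≈ A @ P, where A holds gene weights per pattern and P holds pattern
# activity per cell.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_genes, n_cells, n_patterns = 2000, 500, 10

# Stand-in for a normalized, non-negative single-cell expression matrix.
D = rng.poisson(1.0, size=(n_genes, n_cells)).astype(float)

model = NMF(n_components=n_patterns, init="nndsvd", max_iter=500, random_state=0)
A = model.fit_transform(D)   # amplitude matrix: genes x patterns
P = model.components_        # pattern matrix:   patterns x cells

print(A.shape, P.shape)      # (2000, 10) (10, 500)
# The top-weighted genes in a pattern hint at the biological process it captures.
top_genes_pattern0 = np.argsort(A[:, 0])[::-1][:20]
```

The pattern matrix (and the gene weights behind it) is the small, dataset-independent representation that makes datasets comparable.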

One advantage is that scCoGAPS and projectR are scalable across multiple datasets and data types. Once you have a pattern matrix, you can project your own data onto it and interpret your data in the context of the dataset the patterns were learned from (see the sketch below). An example mentioned in the paper: when you project data from the developing brain onto patterns learned from the developing retina, you can observe related processes in the neuronal data; stem cell-related processes in the retina map to similar stem cell processes in neurons, even though the two cell types are different. While this idea is not new, there was no single, seamless published method for it. You can test the concept in genequery, where you can find similar datasets using results from a specific biological context. In the genequery example, genes from a dataset related to hypoxia bring up more hypoxic datasets, as well as the datasets in which hypoxia is commonly found (such as cancer and immune datasets). All you need to create these models is curated data.
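projectR is also an R package; purely as an illustration of the projection step (and continuing the hypothetical A matrix from the sketch above), one can estimate how strongly each learned pattern is used in a new dataset by fitting each new cell’s expression profile against the gene-by-pattern weights with non-negative least squares:

```python
# Illustrative projection step (a stand-in for what projectR does in R):
# given the genes-by-patterns amplitude matrix A learned on a reference
# dataset, estimate pattern usage in a *new* dataset by solving a
# non-negative least squares problem per cell.
import numpy as np
from scipy.optimize import nnls

def project_onto_patterns(A, D_new):
    """A: (genes x patterns) amplitude matrix from the reference dataset.
    D_new: (genes x cells) new expression matrix, with genes already
    matched to the same order as the rows of A."""
    n_patterns = A.shape[1]
    n_cells = D_new.shape[1]
    P_new = np.zeros((n_patterns, n_cells))
    for j in range(n_cells):
        P_new[:, j], _ = nnls(A, D_new[:, j])
    return P_new  # pattern usage of each new cell

# Hypothetical new dataset measured on the same genes as the reference.
rng = np.random.default_rng(1)
D_new = rng.poisson(1.0, size=(A.shape[0], 50)).astype(float)
P_new = project_onto_patterns(A, D_new)
print(P_new.shape)  # (n_patterns, 50)
```

The point of the projection is that the new data never has to be re-factorized: it is simply read through the lens of patterns learned elsewhere, which is what makes cross-dataset and cross-omics comparison cheap.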

Contributors: Manmeet S. Dayal, Malavika
