The need for dimension reduction & high-dimensional data visualization

Vivek Das
Musing’s of a Data Scientist in Medicine
7 min read · Jun 15, 2019

Today’s post is a follow-up on #SingleCell high dimensionality visualization.

Last week I had a very engaging discussion with a few fellow researchers and scientists about the need for t-SNE plots. The query, posted by @venkmurthy, was whether t-SNE plots are the new bar graphs or box plots.

The question can be found here: https://twitter.com/venkmurthy/status/1135243047980003329

Some amazing concept clarifications are posted in the thread, both for the general audience and for users working with high-dimensional datasets.

My replies are mostly in the context of single-cell dataset visualization, since I came to learn about other dimension reduction methods once I was exposed to single-cell data for exploratory analysis. Below are some of my thoughts, summarized for future reference.

1. What is dimension reduction?

In layman's terms, it refers to viewing or projecting high-dimensional data in a lower-dimensional space, and many methods can do this. A lucid explanation can also be found on the Wikipedia page. Quoting it as is: “In statistics, machine learning, and information theory, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration[1] by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.” One such method is t-SNE, which came out in 2008 from Geoffrey Hinton’s group. Some of the best definitions, in my view, are captured here: The Ultimate Guide to 12 Dimensionality Reduction Techniques (with Python codes)
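To make the “projecting into a lower-dimensional space” idea concrete, here is a minimal sketch of the feature-extraction flavor, using only base R and the built-in iris dataset (my own toy illustration, not from any of the linked posts): PCA compresses four measured variables into two derived ones.

# Feature extraction via PCA: 4 measured variables -> 2 derived variables
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
embedding <- pca$x[, 1:2]   # each flower now lives in 2 dimensions, not 4
plot(embedding, col = iris$Species, pch = 19,
     main = "iris: 4 dimensions projected onto 2 principal components")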

Another blog post that I enjoyed reading in this space, mentioned below, gives a flavor of various ways to play around with t-SNE in R, with added cluster models (e.g. k-means & hierarchical) that can be used in conjunction for better visualization of t-SNE plots.

“Playing with dimensions: from Clustering, PCA, t-SNE… to Carl Sagan!” by Pablo Casas

Dimension reduction can be unsupervised, supervised, or semi-supervised. I will not cover the details here but will point to a presentation that covers some of the rationale w.r.t. gene expression studies. All three have pros and cons, and the choice is often problem-based. Unsupervised methods are often preferred to avoid over-fitting/over-interpretation, but having said that, the data in hand and the problem in question are also important attributes of such a selection. A minimal sketch contrasting the two flavors follows the source link below.

Source: http://rakaposhi.eas.asu.edu/ai/AI-lunch-ye.pdf (author: Jieping Ye)
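To illustrate the unsupervised vs. supervised distinction, here is a minimal sketch in R, reusing the iris data from the sketch above (my own toy example, not from the slides): PCA never sees the class labels, while LDA from the MASS package explicitly uses them to find separating directions.

library(MASS)  # for lda()
# Unsupervised: PCA finds directions of maximal variance; labels unseen
pca <- prcomp(iris[, 1:4], scale. = TRUE)
# Supervised: LDA finds directions that best separate the known classes
lda_fit <- lda(Species ~ ., data = iris)
lda_proj <- predict(lda_fit)$x   # samples projected onto discriminant axes
par(mfrow = c(1, 2))
plot(pca$x[, 1:2], col = iris$Species, pch = 19, main = "Unsupervised: PCA")
plot(lda_proj[, 1:2], col = iris$Species, pch = 19, main = "Supervised: LDA")

The supervised projection will usually separate the classes more cleanly, which is exactly why it is also more prone to over-fitting when the labels are noisy.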

2. What is t-SNE & why do we need it?

For any reference to t-SNE, I would ask everyone to use this link, “t-SNE decoded”, as it is a pretty lucid and clear decoding of the method, much better than I can give. In the context of single-cell data, my understanding is that t-SNE should help us understand the non-linearity encoded in the data. This non-linearity can be very contextual, based on the type of single-cell data being interrogated and the transcriptional dynamics or kinetics it conveys. It is most often used for cell-type cluster assignment. However, depending on the dataset queried, t-SNE often produces a blob (non-informative clusters), as cluster assignments on their own may lead to no meaningful inference nor convey any data properties. Having said that, there were some interesting queries in the Twitter thread about whether or not t-SNE is always the best method for viewing high-dimensional data in a low-dimensional space.

Source link: https://blog.datascienceheroes.com/playing-with-dimensions-from-clustering-pca-t-sne-to-carl-sagan/

Code snippet available here (the logic is unchanged from the link, with comments lightly edited for clarity; the data to run the code is also available at the link):

library(readr)
library(Rtsne)

# Read the Kaggle digit-recognizer data files (in the directory ../input)
train <- read_csv("../input/train.csv")
test <- read_csv("../input/test.csv")
train$label <- as.factor(train$label)

# Subsample 10,000 rows to stay within the platform's time limit
numTrain <- 10000
set.seed(1)
rows <- sample(1:nrow(train), numTrain)
train <- train[rows, ]

# Run t-SNE on the pixel columns (column 1 is the digit label)
set.seed(1) # for reproducibility
tsne <- Rtsne(train[, -1], dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500)

# Plot the 2-D t-SNE embedding, marking each point with its digit label
colors <- rainbow(length(unique(train$label)))
names(colors) <- unique(train$label)
plot(tsne$Y, t = "n", main = "tsne")
text(tsne$Y, labels = train$label, col = colors[train$label])

# Compare with the first two principal components
pca <- princomp(train[, -1])$scores[, 1:2]
plot(pca, t = "n", main = "pca")
text(pca, labels = train$label, col = colors[train$label])

(Figure: the resulting t-SNE and PCA projections of the sampled digits, side by side.)

For more references on the method, its evolution & real-world applications, I would encourage the two links below:

3. Is t-SNE the best dimension reduction method for projection?

There is no straightforward answer to that. It is certainly one of the ways, but not the only one. My understanding stems from the fact that such visualization depends on the data in hand and the underlying data properties one wants to understand. It also depends on what queries one has w.r.t. the data, how they are addressed based on a knowledge-driven hypothesis, and how that query cross-talks with the data-driven one. In layman's terms: does our data answer our knowledge-driven hypothesis, or does it convey something new? t-SNE is definitely one of the most sought-after methods in the #SingleCell RNA-Seq space for identifying cell-type clusters. However, such visualizations are often misleading and misinterpreted, which we should keep in mind while using them. Some interesting information about this can be found in “How to Use t-SNE Effectively”, posted by @randyboyes in the discussion thread; a minimal sketch of the perplexity sensitivity it describes follows below.
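As a quick illustration of the “misleading” point, the sketch below (my own, with arbitrary parameter choices, not taken from that article) embeds the same toy data at three perplexity values; the resulting pictures can differ enough that cluster sizes and between-cluster distances should not be read literally.

library(Rtsne)
set.seed(42)
# Toy data: three well-separated Gaussian clusters in 50 dimensions
X <- rbind(matrix(rnorm(100 * 50, mean = 0), ncol = 50),
           matrix(rnorm(100 * 50, mean = 4), ncol = 50),
           matrix(rnorm(100 * 50, mean = 8), ncol = 50))
labels <- rep(1:3, each = 100)
par(mfrow = c(1, 3))
for (p in c(5, 30, 90)) {
  emb <- Rtsne(X, perplexity = p)$Y   # same data, different perplexity
  plot(emb, col = labels, pch = 19, main = paste("perplexity =", p))
}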

4. Why do we need such methods for data visualization?

Our human mind and our visualizations are often restricted to 2D representations; however, data properties/features are not restricted to only the X & Y axes when the sample size is lower than the number of features contained in the data. Here, features == dimensions. Features can also be referred to as the intrinsic properties of the data. These are basic foundations that I learned while interrogating high-dimensional data in data science. Data visualizations are concepts or methods by which we try to extract these features to better understand the properties encoded in a given dataset. It is more than just computing basic statistics, as the underlying premise is distance-based mathematical calculations & probabilities. In the single-cell context, usually our Number of Samples (Ns) << Number of cells sequenced (Nc) << Dimensions (d). So we try to understand the embedded data properties that inform us of underlying biological properties, e.g. transcriptional stochasticity. To view such data we need dimension reduction techniques like PCA, t-SNE, UMAP, etc. (More detailed information can be found here.) A toy illustration of the d >> n situation follows.
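The sketch below uses simulated numbers, chosen arbitrarily: even when an “expression matrix” has far more features (genes) than observations, it can still be reduced to two viewable coordinates.

set.seed(7)
n_cells <- 60      # observations
n_genes <- 2000    # dimensions, so d >> n
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells)
pc <- prcomp(expr)   # prcomp() handles d > n; princomp() would refuse
plot(pc$x[, 1:2], pch = 19,
     main = "60 cells x 2000 genes, projected onto 2 PCs")

Note the choice of prcomp() over princomp() here: the latter errors out when there are more variables than observations, which is the norm in single-cell data.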

5. What cluster detection approaches are there?

I will point to this link, which compares various clustering approaches other than the k-means & hierarchical ones I mentioned earlier. The other two in the link are HDBSCAN & graph-based community detection. This should give some understanding of the cluster models and their performance. However, more systematic evaluations can be found in the two papers below, w.r.t. single-cell RNA-Seq data.

Paper 1: SC3 — consensus clustering of single-cell RNA-Seq data

Paper 2: A systematic performance evaluation of clustering methods for single-cell RNA-seq data

Clustering has always been a long-standing issue in data visualization and can often lead to misinterpretations or misrepresentations. Clustering methods can be supervised, unsupervised & semi-supervised. Hence, the underlying concepts need to be taken into account before any projection of a particular high-dimensional dataset. This also helps clarify, to an extent, the fitness of the query one is trying to address. A minimal sketch pairing a t-SNE embedding with the two cluster models mentioned earlier follows.
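Here is that sketch (my own toy example; the choice of k = 3 is an assumption, not something derived from the data, and clustering on t-SNE coordinates is itself debatable):

library(Rtsne)
set.seed(1)
X <- unique(as.matrix(iris[, 1:4]))      # Rtsne() rejects duplicate rows
emb <- Rtsne(X, perplexity = 30)$Y
km <- kmeans(emb, centers = 3)           # k-means on the 2-D embedding
hc <- cutree(hclust(dist(emb)), k = 3)   # hierarchical clustering, cut at 3
par(mfrow = c(1, 2))
plot(emb, col = km$cluster, pch = 19, main = "k-means on t-SNE")
plot(emb, col = hc, pch = 19, main = "hclust on t-SNE")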

6. Finally, I end with the pros & cons of using t-SNE.

Pros:

  • It is one of the methods that can capture non-linear dependencies, which are often missed by PCA.
  • It copes well with large datasets for dimension reduction.
  • t-SNE is based on the use of local structure: it maps points from the high-dimensional space to a lower dimension while preserving the neighborhood relationships between nearby points.
  • Other methods in this family include Local Linear Embeddings, Kernel PCA, etc.
  • It is one of the best methods at mitigating the crowding problem, which basically stems from the “curse of dimensionality” (Ref: t-SNE & crowding problem).
  • The “stochastic neighbor” in its name means neighborhoods are probabilistic: there is no clear line between which points are neighbors of other points.
  • It helps in the identification of both global & local structure.
  • t-SNE uses “early compression”, which simply adds an L2 penalty to the cost function at the early stages of optimization. This is useful for keeping the distances between embedded points small.
  • It also uses “early exaggeration”, which multiplies the input similarities early in the optimization so that well-separated clusters can form in the embedding.

Cons:

  • It is non-convex, which means its cost function has multiple local minima and can be fairly difficult to optimize (see the seed-sensitivity sketch after this list).
  • It is not a linear projection, but it still assumes that the local structure of the data manifold is linear. This means distances between neighboring points are still measured with Euclidean distance (an assumption of local linearity). This is a problem for complex manifolds, where methods like Autoencoders might seemingly work better (untested at my end).
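To see the non-convexity con in practice, the short sketch below (my own) runs the same data through Rtsne with two different seeds; the two runs land in different local minima and can produce visibly different layouts.

library(Rtsne)
X <- unique(as.matrix(iris[, 1:4]))
par(mfrow = c(1, 2))
for (s in c(1, 2)) {
  set.seed(s)                            # only the seed changes
  emb <- Rtsne(X, perplexity = 30)$Y
  plot(emb, pch = 19, main = paste("seed =", s))
}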

Edit 1: Added a blog post that I missed in the first draft of the commentary. I will periodically add interesting articles and blog posts in this space for more clarification. Please provide your feedback/comments so as to make this more relevant and lucid.

Edit 2: Added a code snippet & an image for comparison, along with some new sources of t-SNE examples & applications with real-world data.

Edit 3: Frank Harrell suggested on Twitter that I add some information w.r.t. unsupervised vs. supervised dimension reduction, hence I added a bit of it under the definition of dimension reduction, though it is not heavily detailed in that aspect. I will mature it over time based on suggestions & the need for clarification.
