Dimensionality Reduction and Visualization - Part-4 - t-SNE

Deepak Jain
Applied Machine Learning
4 min read · Jun 15, 2020

This is the last part of the 4-part series on Dimensionality Reduction & Visualization.

Here are the links to Part-1, Part-2, and Part-3 of the series. In case you haven’t read them, I would suggest going through them first.

In this article, we will understand one of the most popular dimensionality reduction techniques, called t-Distributed Stochastic Neighbourhood Embedding (t-SNE). Now don’t be scared by the name 😊. Trust me, the technique is exactly what the name suggests, and we’ll understand each and every word of it.

Now you might wonder: why do we need to study another dimensionality reduction technique when we just finished understanding PCA? The reason is that PCA comes with its own set of limitations, and t-SNE overcomes them. In fact, t-SNE is today considered a “state-of-the-art” technique in the field of dimensionality reduction, and one of the best available for visualization.

In this article, we will cover the limitations of PCA, followed by how t-SNE works (geometric intuition), and lastly the limitations of t-SNE.

Limitations of PCA:

If you remember the core concept of PCA, we want to reduce the number of dimensions while retaining the direction of maximum variance. However, in doing so, we tend to lose information. Consider the example below:

Image by author

In this case, our data points form a circle. Since the spread is the same in every direction, it doesn’t matter where we place our new axes: converting this 2-D representation to 1-D loses information. In the example above, if we take x1 and x2 as our axes (red lines) and project onto either one, we lose about 50% of the information, and no matter how you redraw the axes to retain maximum variance, you will still lose that 50%.
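To see this concretely, here is a minimal sketch using scikit-learn’s PCA (the circular toy data is my illustrative choice, not from the original article). For points on a circle, each principal component explains roughly half of the variance, so dropping either one discards about 50% of the information:

    import numpy as np
    from sklearn.decomposition import PCA

    # Points spread uniformly around a circle: the variance is the
    # same in every direction, so no single axis is "best"
    theta = np.random.uniform(0, 2 * np.pi, 500)
    X = np.column_stack([np.cos(theta), np.sin(theta)])

    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)  # roughly [0.5, 0.5]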

Consider another example:

Image by author

In this case, we have more spread along v1 than along v2. If you observe, we have 4 clusters of points. When the points of clusters A & B are projected onto v1, they get highly concentrated in a small region, and we lose the separation between the two clusters.

PCA: Tries to preserve the global shape/structure of the data. It doesn’t care about the distances between individual points; it only cares about the directions that maximize variance.

t-SNE: Tries to preserve the local structure, i.e. the neighbourhood of each point.

How does t-SNE work?

Now that we have understood the limitations of PCA, let’s dive into t-SNE and understand how it works.

As mentioned earlier, t-SNE stands for “t-Distributed Stochastic Neighbourhood Embedding”. We will first understand the terms Neighbourhood and Embedding.

Image by author

In the above image, the “neighbours” of the point xi (marked in red) are the points surrounding it (circled). We define the neighbourhood of a point based on how geometrically close the other points are to it. The distance between the point xi and a neighbour xj is given as:

Dist(xi, xj) = ||xi - xj||
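As a quick illustration of this distance-based notion of neighbourhood, here is a minimal numpy sketch (the random data and the choice of k are purely illustrative):

    import numpy as np

    X = np.random.rand(100, 2)  # 100 points in 2-D
    xi = X[0]                   # the point whose neighbourhood we want

    # Dist = ||xi - xj|| for every other point
    dists = np.linalg.norm(X - xi, axis=1)

    # The k closest points (excluding xi itself) form its neighbourhood
    k = 5
    neighbours = np.argsort(dists)[1:k + 1]
    print(neighbours, dists[neighbours])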

That’s the concept of neighbourhood. Next, we need to represent this 2-D data into 1-D.

As seen in the above diagram, when we project the data points from 2-D to 1-D using t-SNE, points that are grouped together, i.e. points that form a neighbourhood, are kept together. However, t-SNE does not guarantee that the distance between two different clusters is preserved. While reducing to the lower dimension (1-D in this case), t-SNE “embeds” the points into the lower-dimensional space while preserving the distances within each neighbourhood as they were in the higher-dimensional space.
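Here is a hedged sketch of that 2-D to 1-D embedding using scikit-learn’s TSNE (the make_blobs toy data and the parameter values are my illustrative choices):

    from sklearn.datasets import make_blobs
    from sklearn.manifold import TSNE

    # Two well-separated clusters in 2-D
    X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=42)

    # Embed into 1-D: points that were neighbours in 2-D
    # should stay grouped together on the line
    X_1d = TSNE(n_components=1, perplexity=30, random_state=42).fit_transform(X)

    # The two clusters land in separate regions of the line
    print(X_1d[y == 0].mean(), X_1d[y == 1].mean())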

t-SNE is not a deterministic algorithm; it’s a probabilistic, or stochastic, algorithm. Now, what do deterministic and probabilistic mean here?

When we give an algorithm some data and it returns the same output every time, it’s called a deterministic algorithm. If, instead, we get a slightly different output on every run, it’s called a probabilistic or stochastic algorithm.
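You can observe this stochasticity directly: running scikit-learn’s TSNE on the same data with different seeds gives different coordinates, even though the neighbourhood structure it captures is similar (a small sketch; the Iris data is just a convenient example):

    from sklearn.datasets import load_iris
    from sklearn.manifold import TSNE

    X = load_iris().data

    # Same data, different seeds: the exact coordinates differ
    emb_a = TSNE(n_components=2, random_state=0).fit_transform(X)
    emb_b = TSNE(n_components=2, random_state=1).fit_transform(X)

    print((emb_a - emb_b).std())  # non-zero: the two runs are not identical

Fixing random_state makes a run reproducible, which is why most tutorials set it explicitly.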

In simple terms, t-SNE uses a probabilistic approach in both the higher and lower dimensions to preserve the neighbourhood. However, because of the curse of dimensionality, points that are spread out in the high-dimensional space tend to get crowded together in the low-dimensional space, causing the “crowding problem”. To solve the crowding problem, t-SNE uses Student’s t-distribution in the low-dimensional map. Because the t-distribution has heavier tails than the Gaussian, moderately distant points can be placed farther apart in the map, and the neighbourhood is preserved by means of probabilities (stochasticity). That’s why it is t-Distributed and Stochastic.
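The difference between the two kernels is easy to see numerically. Below is a minimal numpy sketch of the two similarity functions t-SNE uses: a Gaussian in the high-dimensional space and a Student’s t-distribution with one degree of freedom in the low-dimensional map (sigma here is a stand-in for the per-point bandwidth that t-SNE actually tunes via perplexity):

    import numpy as np

    def gaussian_similarity(d, sigma=1.0):
        # High-dimensional affinity: falls off quickly with distance
        return np.exp(-d**2 / (2 * sigma**2))

    def student_t_similarity(d):
        # Low-dimensional affinity: heavy tails (1 degree of freedom)
        return 1.0 / (1.0 + d**2)

    d = np.array([0.5, 2.0, 4.0])
    print(gaussian_similarity(d))   # ~[0.88, 0.135, 0.0003]
    print(student_t_similarity(d))  # ~[0.80, 0.20, 0.06]

Because the t-distribution’s tail decays much more slowly, moderately distant points can sit farther apart in the map without a heavy penalty, which is exactly what relieves the crowding.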

So we have understood t-SNE, each and every term of it, from a geometric point of view. Here is a link to a brilliant site which allows us to perform t-SNE in a very interactive fashion. Its creators are considered some of the best minds in the field of machine learning and deep learning. I would suggest you go to this link and play around with the visualizations and the tuning.
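If you would rather experiment in code than in the browser, here is a small sketch of the same kind of tuning with scikit-learn (the digits data and the perplexity values are illustrative; kl_divergence_ is the objective value TSNE reports after fitting):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X = load_digits().data[:500]

    # Perplexity roughly controls the effective neighbourhood size;
    # it is the main knob worth experimenting with
    for perplexity in (5, 30, 50):
        tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
        emb = tsne.fit_transform(X)
        print(perplexity, tsne.kl_divergence_)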

Conclusion:

In conclusion, we have covered the following concepts in this series of articles:

  • In Part-1, we understood the concept of Dimensionality Reduction & Visualization - link
  • In Part-2, we learnt Principal Component Analysis (PCA) in great detail - link
  • In Part-3, we saw an alternate formulation of PCA (distance minimization) - link
  • In Part-4, we came across one of the state-of-the-art techniques, “t-Distributed Stochastic Neighbourhood Embedding” (t-SNE)

If you have any queries/suggestions, feel free to respond below in the comments section.
