t-SNE: t-distributed stochastic neighbor embedding

An overview of t-SNE as a dimensionality reduction technique

Amy Pajak
11 min read · Jun 27, 2023

Summary

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction tool used to help visualize high dimensional data.

It’s not typically used as the primary method for training a model. Instead, it’s often the first step for exploratory data analysis, visualization, and clustering. t-SNE provides an intuitive way to understand high-dimensional data by reducing its complexity, which can guide the selection and application of subsequent techniques for more detailed and focused analysis.

The 2D/3D mapping created by t-SNE allows us (as humans) to see whether strong relationships exist in the data and, from there, decide for ourselves the best machine learning algorithm to apply, such as clustering, classification, or deep learning, in order to learn those relationships within the mapping and identify outliers. This can help improve the performance of our ML algorithm and reduce overfitting.

Introduction — High Dimensional Data

t-SNE was introduced because practitioners have ever more high-dimensional data that they want to visualize. For example:

  1. Financial data: stock prices, trading volumes, and economic indicators, can be represented as high-dimensional data sets.
  2. Medical imaging: technologies, such as MRI and CT scans, generate high-dimensional data with intensity values representing different features.
  3. Genomics: DNA sequences of organisms are represented as high-dimensional data sets with each gene or nucleotide base representing a dimension. The human genome has approximately 3 billion base pairs, which correspond to 3 billion features in the DNA.
  4. Image and video data where each pixel represents a dimension.
  5. Social media platforms with user profiles, posts, likes, comments, and other interactions.
  6. Text data, such as news articles, tweets, and customer reviews, can be represented with each word or token as a dimension.
  7. Robotics: robots generate data from sensory inputs such as cameras, microphones, and other sensors used in their control systems.

High dimensional data is everywhere. We need to visualize and explore this complex data in a more intuitive and understandable way. t-SNE serves as a powerful tool for this, transforming high-dimensional data into a low-dimensional representation while preserving as much of its meaningful structure as possible.

MNIST Digits

A great way to explain t-SNE is to show how it works on the MNIST Digits dataset.

A Survey of Handwritten Character Recognition with MNIST and EMNIST (2019)

This is a publicly available labeled dataset of 60,000 28x28 grayscale training images of handwritten digits from 0–9, along with a test set of 10,000 images.

We will use this dataset to demonstrate different dimensionality reduction techniques.

Introducing Principal Component Analysis (PCA)

One of the first, and most popular, dimensionality reduction techniques is Principal Component Analysis (PCA), published in 1901 by Karl Pearson. It is a linear technique which finds a linear projection, or new representation, of the original high-dimensional data points onto a lower-dimensional subspace in a way that maximizes the variance of the data, i.e. preserves as much information as possible. The projected axes/directions are referred to as the principal components of the data.
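As a quick, hedged illustration of what this looks like in practice, here is a minimal PCA sketch using scikit-learn. It uses sklearn's small built-in 8x8 digits dataset as a stand-in for the full MNIST set, so the exact picture will differ from the figures discussed below.

```python
# Minimal PCA sketch: project digit images onto their first two principal components.
# Uses scikit-learn's small 8x8 digits dataset as a stand-in for MNIST.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

digits = load_digits()                      # ~1,800 images, 64 pixel features each
pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)  # shape: (n_samples, 2)

print("Variance explained by the first two components:", pca.explained_variance_ratio_)

plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, cmap="tab10", s=5)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.colorbar(label="digit")
plt.show()
```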


If we visualize PCA on MNIST Digits, the result will look similar to what we see above, a visualization of around 5,000 images. They have been laid out in two dimensions, where each point corresponds to a digit in the dataset and its color labels which digit the point represents. What we see here is that PCA captures some of the structure of the data: for example, the red points on the right form a cluster of 0's and the orange points on the left form a cluster of 1's. This is the first principal component, so the main variation between digits is between the 0's and 1's. That makes sense in terms of pixel values, since 0's and 1's have very few overlapping pixels.

The second principal component runs from the top of the visualization, where we see 4's, 7's and 9's clustered (digits that are slightly more similar in terms of pixel values), to the bottom, where we've got 3's, 5's and 8's clustered, which are also similar to one another, e.g. a 3 will have many overlapping pixels with an 8. So that's our second source of maximum variation in the data.

We can see 4, 7, 9 and 3, 5, 8 have similar overlapping pixel structure

This is great, but a problem arises when the data is unlabelled, as we can see on the right. The colors/labels tell us something about the relationships in the data; without them we see no clear clusters, just many points in a 2D space. So with unlabelled data we run into a problem: we are unable to interpret these results.

Can we do better?

Linear vs Non-Linear Data

PCA is good, but it’s a linear algorithm.

  • It cannot represent complex relationships between features
  • It is concerned with preserving large distances in the map (to capture the maximum amount of variance). But are these distances reliable?

Linear techniques focus on keeping the low-dimensional representations of dissimilar data points far apart (e.g. 0’s and 1’s we’ve just seen). But is that what we want in a visual representation? And how reliable is it?

Dimensionality reduction with t-SNE (2018)

If we look at the swiss-roll nonlinear manifold above (a), we can see that a Euclidean (straight-line) distance between two points in the red and blue clusters would suggest that the data points are quite close.

If we consider the entire structure to be represented in a 2D plane (i.e. rolling it out into a flat 2D shape), the red points would actually be on the opposite end to the blue points — one would have to traverse the entire length of the roll to get from one point to the other. PCA would attempt to capture the variance along the longest axis of the roll, essentially flattening it out. This would fail to preserve the spiral structure inherent to the Swiss roll data, where points that are close together in the spiral (and thus should be close together in a good 2D representation) end up being placed far apart.

So PCA doesn't work very well for visualizing nonlinear data, because it preserves these large straight-line distances; we need to consider not only the straight-line distance but also the surrounding structure of each data point.

Introducing t-SNE

Stochastic Neighbor Embedding was first developed and published in 2002 by Hinton and Roweis, and was then modified in 2008 by van der Maaten and Hinton into what we're looking at today: t-SNE (t-Distributed Stochastic Neighbor Embedding).

For anyone curious, another variation named Barnes-Hut t-SNE was published in 2014, which improves the efficiency of the algorithm via a tree-based implementation.

How t-SNE works

Dimensionality reduction with t-SNE (2018)

In a high dimensional space (left) we measure similarities between points. We do this in a way so that we only look at local similarities, i.e. nearby points.

The red dot in the high-dimensional space is xi. We first center a Gaussian over this point (shown as the purple circle) and then measure the density of all the other points (e.g. xj) under this Gaussian. We then renormalise over all pairs of points that involve the point xi (the denominator of the fraction). This gives us the conditional probability pj|i, which measures the similarity between the pair of points i and j. We can think of this as a probability distribution over pairs of points, where the probability of picking a particular pair is proportional to their similarity (i.e. how close they are).

t-SNE (t-Distributed Stochastic Neighbor Embedding) Algorithm

This can be visualized as follows. If two points are close together in the original high-dimensional space, we will get a large value for pj|i. If two points are dissimilar (far apart) in the high-dimensional space, we will get a small pj|i.
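Concretely, the conditional similarity is defined by a Gaussian centered on xi with bandwidth sigma_i (this is the standard form from the original paper):

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$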

Perplexity

Looking at the same equation, perplexity reflects the density of points around a particular point; it can be thought of as the effective number of neighbors a point has. If four points with similar characteristics are clustered together, the region around them is denser than around points that are not clustered together. Points with less density around them are given flatter (wider) Gaussian curves than points with more density around them. In the figure below, the purple points are sparse.

t-SNE (t-Distributed Stochastic Neighbor Embedding) Algorithm

We compute a conditional distribution for each point because this allows us to set a different bandwidth (sigma i) per point, such that each conditional distribution has a fixed perplexity. This essentially scales the bandwidth of the Gaussian so that a roughly fixed number of points fall within its range. We do this because different parts of the space may have different densities, and this trick allows us to adapt to those different densities.
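To make that concrete, here is a minimal sketch (not the reference implementation) of how the bandwidth sigma_i for a single point can be found by binary search so that its conditional distribution reaches a target perplexity, where perplexity is 2 raised to the Shannon entropy of the distribution. The function names and the example distances are illustrative only.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """Gaussian similarities pj|i from point i to every other point, given the
    squared distances to those points (point i itself excluded) and a bandwidth sigma."""
    p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** Shannon entropy of the conditional distribution."""
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return 2.0 ** entropy

def find_sigma(sq_dists_i, target_perplexity, n_steps=50, tol=1e-4):
    """Binary-search the bandwidth so point i's conditional distribution
    has (approximately) the requested perplexity."""
    lo, hi = 1e-10, 1e4
    sigma = 1.0
    for _ in range(n_steps):
        sigma = 0.5 * (lo + hi)
        perp = perplexity(conditional_probs(sq_dists_i, sigma))
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma   # distribution too flat: shrink the bandwidth
        else:
            lo = sigma   # distribution too peaked: widen the bandwidth
    return sigma

# Illustrative squared distances from one point to six neighbours
sq_dists_i = np.array([0.5, 1.2, 2.0, 4.5, 9.0, 16.0])
print(find_sigma(sq_dists_i, target_perplexity=3.0))
```

In practice, implementations typically search over the precision 1/(2 sigma^2) rather than sigma itself, but the idea is the same.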

Mapping the lower dimensional space

Next we’re going to look at the low dimensional space which will be our final map.

[2]

We start by laying the points out randomly on this map. Each high dimensional object will be represented by a point here.

We then repeat the same process as before: center a kernel over the point yi and measure the density of all the other points yj under that distribution. We then renormalise by dividing over all pairs of points. This gives us a probability qij, which measures the similarity of two points in the low-dimensional map.
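In t-SNE this low-dimensional kernel is a Student-t distribution with one degree of freedom (a Cauchy distribution), whose heavier tails let dissimilar points sit further apart in the map:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$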

Now, we want these probabilities qij to reflect the similarities pij which we computed in the high-dimensional space as closely as possible. If the qij values are identical to the pij values, then the structure of the map closely matches the structure of the data in the original high-dimensional space.

We will measure the difference between these pij values in the high dimensional space and the qij values in the low dimensional map by using Kullback–Leibler divergence.

Stochastic Neighbor Embedding

KL divergence is a standard measure of how different two probability distributions are. It's shown below in the cost function as the sum over all pairs of points of pj|i times log(pj|i / qj|i).

  • Similarity of data points in High Dimension
  • Similarity of data points in Low Dimension
  • Cost function
[1]
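The first two of these quantities were given above. Written out, the cost function is the Kullback-Leibler divergence between the high-dimensional and low-dimensional distributions, where t-SNE uses the symmetrized joint probabilities:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$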

Our goal now is to lay out the points in the low-dimensional space such that the KL divergence is minimized, i.e. the map is as similar as possible to the high-dimensional structure. To do that, we run gradient descent on this KL divergence, which essentially moves the points around in such a way that the KL divergence becomes small.
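The gradient of this cost has a simple closed form (from the original paper): each map point yi is attracted to or repelled from every other point yj in proportion to the mismatch between pij and qij:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$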

Mapping the lower dimensional space

[2]

Here we can see the resulting mapping, which has been rearranged to match the high-dimensional structure as closely as possible.

KL divergence is useful as it measures how well one probability distribution matches another. Note that the pairwise similarities themselves are symmetric (the similarity from xi to xj is the same as from xj to xi), although KL divergence as a measure is not.

t-SNE Algorithm

Let’s take a final look at the overall algorithm.

[1]

Combining what we’ve seen:

1. Calculate the pairwise affinities (conditional probabilities) in the high-dimensional space, using a Gaussian distribution. The perplexity parameter defines the effective number of neighbors each point has and helps to balance the focus on local and global aspects of the data.

2. Symmetrize the probabilities. This means that each point considers the other point as its neighbor, and it is achieved by taking the average of the conditional probabilities for each pair of points. Then normalize by dividing by the total number of points.

3. Randomly initialize the position of each data point in the low-dimensional space, usually by drawing from a normal distribution with mean 0 and small variance.

4. Start a loop that will iterate for a fixed number of times T. Each iteration updates the position of the points:

4.1. Calculate the pairwise affinities (similarities) in the low-dimensional space using a Student-t distribution with one degree of freedom.

4.2. Calculate the gradient of the cost function with respect to the position of the points. The cost function is usually the Kullback-Leibler divergence between the high-dimensional and low-dimensional distributions.

4.3. Update the position of the points in the low-dimensional space. This update step is composed of three parts: the old position Y(t-1), a term proportional to the gradient that helps minimize the cost function, and a momentum term that helps accelerate convergence and avoid local minima (the update rule is written out after this list).

5. End of the t-SNE algorithm. The final positions of the points in the low-dimensional space should now provide a useful visualization of the high-dimensional data.
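Steps 4.2 and 4.3 together amount to the following update, written here as descent against the gradient with learning rate eta and momentum schedule alpha(t):

$$Y^{(t)} = Y^{(t-1)} - \eta \,\frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)$$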

Visual output of t-SNE applied to 1000 MNIST Digits images

If you want to see this algorithm implemented in detail as code, please take a look at the original author's GitHub [3] or this great step-by-step article.

Alternatively…

“To deal with hyperplanes in a 14-dimensional space, visualize a 3D space and say ‘fourteen’ to yourself very loudly. Everyone does it.” — Geoffrey Hinton, A geometrical view of perceptrons, 2018

Advantages

  1. Visualization: t-SNE can help visualize high-dimensional data that has non-linear relationships as well as outliers
  2. Good for clustering: t-SNE is often used for clustering and can help identify groups of similar data points within the data.

Limitations

  1. Computational Complexity: t-SNE involves complex calculations as it calculates the pairwise conditional probability for each point. Due to this, it takes more time as the number of data points increases. Barnes-Hut t-SNE was later developed to improve on this.
  2. Non-Deterministic: Due to the randomness in the algorithm, even though code and data points are the same in each iteration, we may get different results across runs.

Demo

In the following notebook I use Python to implement PCA and t-SNE on the MNIST Digits dataset via the sklearn library:

https://colab.research.google.com/drive/1znYpKviaBQ7h0HgfACcVxnP26Ud1cwKO?usp=sharing
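For a rough idea of what the notebook covers, here is a minimal t-SNE sketch (assumptions: it uses scikit-learn's small built-in 8x8 digits dataset rather than the full MNIST set, and illustrative parameter choices):

```python
# Minimal t-SNE sketch with scikit-learn, using the small 8x8 digits dataset
# as a quick stand-in for MNIST.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()

# perplexity balances local vs. global structure; values of roughly 5-50 are typical
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
embedding = tsne.fit_transform(digits.data)   # shape: (n_samples, 2)

plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap="tab10", s=5)
plt.title("t-SNE of the digits dataset")
plt.colorbar(label="digit")
plt.show()
```

Because of the non-determinism noted above, fixing random_state makes the run repeatable.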

Conclusion and Next Steps

To conclude, t-SNE visualization is just the first step in the data analysis process. The insights gained from the visualization need to be followed up with further analysis: a deeper exploratory look at the data, an appropriate ML algorithm to capture the relationships and build predictive models, or statistical methods to test specific hypotheses about the data.

Other popular dimensionality reduction techniques include:

  • Non-negative matrix factorization (NMF)
  • Kernel PCA
  • Linear discriminant analysis (LDA)
  • Autoencoders
  • Uniform manifold approximation and projection (UMAP)


Resources
