The Underlying Idea of t-SNE

Ruben Stefanus
Data Folks Indonesia
8 min read · Oct 30, 2020
t-SNE in 3D - [6]

Outline

  • A Brief Overview of Dimensionality Reduction
  • Explanation of SNE
  • SNE Problem
  • Explanation of t-SNE
  • t-SNE Strength
  • t-SNE Weakness
  • Conclusion
  • References

What is Dimensionality Reduction?

Dimensionality reduction aims to map data from its original high-dimensional space to a lower-dimensional space while minimizing information loss — [3] “Dimensionality reduction with t-SNE”, Zaur Fataliyev

Why do we need it?

  • Feature selection or feature engineering
  • Detecting intrinsic dimensionality
  • It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
  • Lower memory footprint
  • Reduces the time and storage space required

General Approaches

General Approaches — [3]

What is a manifold? Read this article to understand the concept.
Note: we live on a manifold!

  1. Projection: projecting high-dimensional data into a lower-dimensional space (linear mapping). Examples: PCA, LDA, NMF.
  2. Manifold Learning: modeling high-dimensional data in a lower-dimensional space (non-linear mapping). Examples: Isomap, Autoencoder, t-SNE.

Taxonomy of dimensionality reduction techniques

Dimensionality Reduction Techniques — [3]

History

History — [4]

Example of high dimensional data (MNIST image dataset)

MNIST image

Visualization of PCA on MNIST

PCA on MNIST dataset

Why does PCA fail to properly reduce the dimensions of MNIST? PCA is a linear algorithm, so it cannot represent complex, non-linear relationships between features. Moreover, PCA mainly focuses on placing dissimilar data points far apart in the lower-dimensional representation, rather than keeping similar points close together.
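As a concrete illustration, a 2D PCA projection like the one shown above takes only a few lines with scikit-learn. This is a minimal sketch; loading MNIST via fetch_openml is my own choice and not necessarily how the figure above was produced:

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load MNIST (784-dimensional pixel vectors) and project onto the top 2 principal components
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y.astype(int), cmap="tab10", s=1)
plt.title("PCA on MNIST")
plt.show()
```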

Before we learn t-SNE, we should first study SNE, the earlier work it builds on. SNE was created and published in 2003 by Geoffrey Hinton and Sam Roweis — [1].

Stochastic Neighbor Embedding (Underlying idea of SNE)

SNE measures pairwise similarities between objects in both the high-dimensional space and the low-dimensional space.

Illustration of high dimensional and low dimensional data points — [3]

SNE starts by converting the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.

Similarity of data points in High dimension:

p j|i equation — [2]
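Written out, the conditional similarity from [2] is

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

where σ_i is the variance of the Gaussian centered on data point x_i.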

Similarity of data points in Low dimension:

q j|i equation — [2]
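Written out, the low-dimensional similarity from [2] is

q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}

where y_i and y_j are the low-dimensional counterparts of x_i and x_j.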

Both equations look the same, right? The difference is that in the low dimension (q) we fix the variance σ to 1/√2, so that 2σ² = 1 and the denominator inside the exponential disappears.

Cost function (Kullback-Leibler Divergence)

KL divergence — [3]

KL divergence is used to measure how much one probability distribution differs from another, and it is asymmetric. The picture above visualizes the KL divergence between several PDFs.

In general, the cost function measures model performance; we want to minimize it, which corresponds to making p and q as similar as possible.

Cost function (based on KL divergence) — [2]
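Written out, the SNE cost function from [2] is

C = \sum_i \mathrm{KL}(P_i \parallel Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}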

If p = q, then log(p/q) = log(1) = 0, so the cost function becomes zero.
Let’s see what happens when p and q are not equal:

  • Large p value modeled by a small q value: big penalty in the cost function
  • Small p value modeled by a large q value: small penalty in the cost function

This uneven penalty is exactly why KL divergence is known as asymmetric.
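A quick numeric check (a toy example of my own, not from the original article) makes both the asymmetry and the penalty behavior visible:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl(p, q))  # ~0.368
print(kl(q, p))  # ~0.511 -> KL(p||q) != KL(q||p): the divergence is asymmetric

# The per-term penalty p*log(p/q) is big when a large p is modeled by a small q,
# and small (even negative) when a small p is modeled by a large q:
print(0.9 * np.log(0.9 / 0.1))  # ~1.98  (big penalty)
print(0.1 * np.log(0.1 / 0.9))  # ~-0.22 (small penalty)
```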

SNE Gradient Calculation

What is the gradient?
In machine learning we search for an optimal solution (ideally a global minimum) of the loss function. The gradient is the vector of partial derivatives of the loss with respect to the parameters, and it points in the direction of steepest ascent from a given point. Here it is expressed as the partial derivative of the cost function with respect to the map points y.

Gradient equation — [2]
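Written out, the SNE gradient from [2] is

\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)\left(y_i - y_j\right)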

SNE Problem

(1) The cost function is difficult to optimize

Cost function (based on KL divergence) — [2]

We already know that KL divergence is asymmetric, and we work with conditional probabilities for which pj|i is in general not equal to pi|j, which leads to a more complicated gradient calculation.

(2) Crowding problem

The crowding problem arises when we model high-dimensional data in a low-dimensional space (2D or 3D): it becomes difficult to separate nearby data points from moderately distant ones, and gaps cannot form between clusters.

Analogy: in a high-dimensional space there is plenty of room, so a point can have many different neighbors; in a low-dimensional space there is not enough room to accommodate all of those neighbors.

Illustration:
Suppose we want to map 4 equidistant points from 3D space into 2 dimensions. The first question is: can we have 4 equidistant points in 3 dimensions? Surely we can.

Consider the following set of points :

Let’s calculate the Euclidean distance between these 4 points,

Euclidean Distance

We will find that the distance between every pair of these 4 points is 6.
But can we have 4 equidistant points in 2 dimensions? The answer is NO.
In fact, at most 3 points can be mutually equidistant in 2-dimensional space,
ex: the equilateral triangle {(0,0), (6,0), (3, 3√3)}.

For n-dimensional space, the maximum number of equidistant points is n + 1.
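As a quick check of the 3D claim, the vertices of a regular tetrahedron with edge length 6 are mutually equidistant. The coordinates below are my own illustrative choice, not necessarily the points used in the figure:

```python
import numpy as np
from itertools import combinations

# Vertices of a regular tetrahedron with edge length 6 (illustrative example)
pts = np.array([
    [0.0, 0.0, 0.0],
    [6.0, 0.0, 0.0],
    [3.0, 3.0 * np.sqrt(3), 0.0],
    [3.0, np.sqrt(3), 2.0 * np.sqrt(6)],
])

for a, b in combinations(range(4), 2):
    print(a, b, np.linalg.norm(pts[a] - pts[b]))  # every pair prints 6.0
```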

t-SNE was created and published in 2008 by Laurens van der Maaten and Geoffrey Hinton. It is known as an improvement of SNE that solves the two problems described above.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

To address the cost-function problem of SNE, t-SNE uses “Symmetric SNE”.
The idea behind “Symmetric SNE” is to use a joint probability distribution instead of a conditional probability distribution.

SNE Cost Function — [3]
t-SNE Cost Function — [3]
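Written out, in the symmetric formulation from [2] the high-dimensional joint probabilities and the cost function are

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad C = \mathrm{KL}(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

where n is the number of data points.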

It is symmetric because pij = pji and qij = qji. t-SNE uses this formulation because it is more robust to outliers and has a simpler gradient expression.

To address the crowding problem, t-SNE uses the Student-t distribution. Instead of a Gaussian, t-SNE uses a heavy-tailed distribution (the Student-t distribution) to convert distances into probabilities in the low-dimensional map (q). This way, a moderate distance in high-dimensional space can be modeled by a larger distance in low-dimensional space, and the cost function becomes easier to optimize.
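Concretely, with one degree of freedom the low-dimensional joint probabilities from [2] become

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}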

Scenario:

  • For tasks in which the dimensionality of the data needs to be reduced to more than 3 dimensions, Student t-distributions with more than one degree of freedom are likely to be more appropriate.
  • For tasks in which the dimensionality of the data needs to be reduced to 2D or 3D, one degree of freedom is a good choice.

Gaussian vs Student
Student’s t-distributed SNE — [4]

The graph above shows that the Student-t distribution is well suited to representing dissimilar objects in high-dimensional space by a larger distance in low-dimensional space.

t-SNE Gradient Calculation

t-SNE Gradient — [2]
Gradient Descent Calculation — [2]
Parameter description — [2]
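Written out, the t-SNE gradient and the gradient-descent update with momentum from [2] are

\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}

\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} + \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right)

where η is the learning rate and α(t) is the momentum at iteration t.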

For the derivation of the t-SNE gradient, see Appendix A of the paper “Visualizing Data using t-SNE”, L. van der Maaten et al. — [2]

How do we calculate the variance (σ) in P?

First, we specify a value called perplexity (a measure of the effective number of neighbors). Then a binary search is performed for each data point to find the variance σ that produces a P with the same perplexity as specified by the user.

The perplexity is defined as:
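\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}

where H(P_i) is the Shannon entropy of P_i measured in bits [2]. Its effect on the bandwidth σ is: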

  • Low perplexity = small σ²
  • High perplexity = large σ²
  • Recommendation: perplexity values between 5 and 50 usually work well

Effect of the perplexity — [5]

Pseudocode for t-SNE algorithm

Pseudocode t-SNE — [2]
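For readers who prefer code to pseudocode, here is a rough, unoptimized Python sketch of the simple version of the algorithm described in [2]. It is my own illustration (it omits early exaggeration and the other optimization tricks from the paper), not the reference implementation:

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

def conditional_affinities(X, perplexity=30.0, tol=1e-4, n_steps=50):
    """p_{j|i}, with each sigma_i found by binary search so that Perp(P_i) matches `perplexity`."""
    n = X.shape[0]
    D = squared_distances(X)
    P = np.zeros((n, n))
    target_entropy = np.log2(perplexity)          # Perp(P_i) = 2^{H(P_i)}
    for i in range(n):
        lo, hi, beta = 0.0, np.inf, 1.0           # beta = 1 / (2 * sigma_i^2)
        for _ in range(n_steps):
            p = np.exp(-D[i] * beta)
            p[i] = 0.0
            p /= max(p.sum(), 1e-12)
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:          # too many effective neighbours -> shrink sigma_i
                lo = beta
                beta = beta * 2.0 if hi == np.inf else (beta + hi) / 2.0
            else:                                 # too few -> grow sigma_i
                hi = beta
                beta = (lo + beta) / 2.0
        P[i] = p
    return P

def tsne_sketch(X, n_components=2, n_iter=1000, perplexity=30.0, eta=100.0):
    """Rough sketch of the basic t-SNE loop; see the equations above."""
    n = X.shape[0]
    P = conditional_affinities(X, perplexity)
    P = (P + P.T) / (2.0 * n)                     # symmetrise: joint p_ij
    Y = 1e-4 * np.random.randn(n, n_components)   # small random initial map
    Y_prev = Y.copy()
    for t in range(n_iter):
        num = 1.0 / (1.0 + squared_distances(Y))  # Student-t (1 dof) kernel
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()                       # joint q_ij
        grad = np.zeros_like(Y)
        for i in range(n):
            grad[i] = 4.0 * ((P[i] - Q[i]) * num[i]) @ (Y[i] - Y)
        momentum = 0.5 if t < 250 else 0.8        # momentum schedule used in [2]
        Y, Y_prev = Y - eta * grad + momentum * (Y - Y_prev), Y  # move against the gradient
    return Y
```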

Visualization of t-SNE on MNIST

The graph below shows that high-dimensional data (the MNIST image dataset) can be visualized in 2D with the digit clusters well separated and clearly spaced apart.

2D visualization with t-SNE on MNIST dataset
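To produce a figure like this in practice, you don’t need to implement the algorithm yourself; scikit-learn ships an implementation. A minimal sketch (loading MNIST via fetch_openml and the subsample size are my own choices):

```python
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load MNIST (784-dimensional pixel vectors) and take a subsample so the run finishes quickly
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:5000], y[:5000].astype(int)

X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=2)
plt.title("t-SNE on MNIST")
plt.show()
```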

t-SNE Strength

  1. Works well for Non-Linear data
    It is able to capture complex relationships between features and represents data points that are similar in the high dimension as close together in the low dimension.
  2. Preserves Local and Global Structure
    t-SNE is capable of preserving both the local and the global structure of the data. This means that points which are close to one another in the high-dimensional dataset will tend to be close to one another in the low-dimensional map.

t-SNE Weakness

  1. Dimensionality reduction for other purposes
    It is a poor choice for feature selection or feature extraction, because it works with probability distributions rather than the original features -> use it only for visualization!
  2. Curse of intrinsic dimensionality (sensitive to the intrinsic dimension)
    It performs badly if the high-dimensional data actually has a high intrinsic dimension.
  3. Non-convexity of the t-SNE cost function (several optimization parameters need to be chosen)

Conclusion

The t-SNE algorithm provides an effective method to visualize a complex dataset. It successfully uncovers hidden structures in the data, exposing natural clusters and smooth nonlinear variations along the dimensions — [7]. The time complexity of t-SNE is O(N²); Barnes-Hut SNE, created and published in 2014, improves this to O(N log N).

References

About me: I’m also active as an AI Researcher at Jakarta Artificial Intelligence Research, a research community that aims to implement and develop AI in Indonesia. You can check us out and support us :D

Thank you for reading this article ~
If you like this article, just give claps and share!
Don’t forget to connect with me
