Review — SNE: Stochastic Neighbor Embedding (Data Visualization)
High-Dimensional Data Mapping to Low-Dimensional Space
In this story, Stochastic Neighbor Embedding (SNE), by the University of Toronto, is briefly reviewed. Obviously, this paper is by Prof. Hinton (together with Prof. Roweis). In this paper:
- A probabilistic approach, based on pairwise dissimilarities and Gaussian neighborhoods, is used to map the high-dimensional data distribution onto a low-dimensional space, e.g. a 2D space, for data visualization.
- The cost function, a sum of Kullback-Leibler divergences, is minimized using gradient descent.
This is a paper in 2002 NIPS with over 1400 citations. (Sik-Ho Tsang @ Medium)
- Basic Stochastic Neighbor Embedding (SNE)
- Experimental Results
1. Basic Stochastic Neighbor Embedding (SNE)
The aim of SNE is to map the high-dimensional data to low-dimensional space for data visualization.
A good mapping is one in which the distance between two mapped points in the low-dimensional space reflects the distance between the corresponding points in the high-dimensional space.
So, when people see the distance between two points in the low-dimensional space, they can get an idea of the distance between those two points in the high-dimensional space, i.e. data visualization.
- For each object, i, and each potential neighbor, j, we start by computing the asymmetric probability pij, that i would pick j as its neighbor:
- The dissimilarities, dij², can be computed as the scaled squared Euclidean distance between two high-dimensional points, xi, xj:
- where σi is either set by hand or found by a binary search so that each point has an appropriate effective number of local neighbors.
- In the low-dimensional space, Gaussian neighborhoods are also used but with a fixed variance:
- The calculation of qij is similar to that of pij but qij is in low-dimensional space, and yi, yj, yk are those data points in low-dimensional space.
- The aim of the embedding/mapping is to match these two distributions as well as possible.
- This is achieved by minimizing a cost function which is a sum of Kullback-Leibler divergences between the original (pij) and induced (qij) distributions over neighbors for each object:
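For reference, the quantities above can be written out following the paper's definitions:

```latex
% Dissimilarity: scaled squared Euclidean distance in the high-dimensional space
d_{ij}^2 = \frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}

% Probability that object i picks j as its neighbor
p_{ij} = \frac{\exp(-d_{ij}^2)}{\sum_{k \neq i} \exp(-d_{ik}^2)}

% Induced probability in the low-dimensional space (fixed variance)
q_{ij} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert y_i - y_k \rVert^2)}

% Cost: sum of KL divergences between the original and induced distributions
C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_i \mathrm{KL}\left(P_i \,\middle\|\, Q_i\right)
```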
Therefore, no matter whether the data points i and j are close to each other or far apart, when pij is close to qij, pij/qij is close to 1.
Taking the log, log(pij/qij) is close to 0, so the cost for that pair (i, j) is close to 0.
If all the points are mapped correctly, the total cost C will be small.
- So, there is a cost for modeling a big distance in the high-dimensional space with a small distance in the low-dimensional space, though it is smaller than the cost of modeling a small distance with a big one.
The intuition is that, while SNE emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.
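A quick numeric check of this asymmetry, using two hypothetical 3-neighbor distributions (my own toy numbers, not from the paper):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence: sum_j p_j * log(p_j / q_j)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.80, 0.10, 0.10]    # i has one clear neighbor in the high-dimensional space
q = [0.01, 0.495, 0.495]  # that neighbor is mapped far away in the low-dimensional space

# Mapping a close pair far apart (q tiny where p is large) is expensive...
print(kl(p, q))  # ≈ 3.19
# ...while mapping a distant pair close together (p tiny where q is large) costs less.
print(kl(q, p))  # ≈ 1.54
```

The KL divergence is not symmetric, which is exactly why nearby high-dimensional points are strongly kept together in the embedding.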
- Differentiating C, the result is:
- which has the nice interpretation of a sum of forces pulling yi toward yj or pushing it away depending on whether j is observed to be a neighbor more or less often than desired.
- The embedding is initialized by putting all the low-dimensional images in random locations very close to the origin.
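Putting the pieces together, here is a minimal NumPy sketch of SNE. This is my own simplification: a single hand-set σ shared by all points instead of the per-point binary search, and plain gradient descent without the momentum and annealed jitter used in the paper.

```python
import numpy as np

def sne_embed(X, dim=2, sigma=1.0, lr=0.1, n_iter=300, seed=0):
    """Minimal SNE sketch: one shared sigma (no per-point binary search),
    plain gradient descent (no momentum or annealed noise)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)

    # p_ij: probability that i picks j as its neighbor in the high-D space.
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)

    # Initialize the low-dimensional images randomly, very close to the origin.
    Y = rng.normal(scale=1e-4, size=(n, dim))

    for _ in range(n_iter):
        # q_ij: induced probabilities in the low-D space (fixed variance).
        Dy = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        Q = np.exp(-Dy)
        np.fill_diagonal(Q, 0.0)
        Q /= Q.sum(axis=1, keepdims=True)

        # Gradient: dC/dy_i = 2 * sum_j (p_ij - q_ij + p_ji - q_ji) (y_i - y_j)
        M = P - Q + P.T - Q.T
        grad = 2.0 * np.sum(M[:, :, None] * (Y[:, None, :] - Y[None, :, :]), axis=1)
        Y -= lr * grad
    return Y
```

On toy data with two well-separated clusters, this sketch pulls same-cluster points together and pushes the clusters apart in the 2D embedding.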
2. Experimental Results
2.1. Images of Digits
- A set of 3000 digit bitmaps from the USPS database, with 600 examples from each of the five classes 0, 1, 2, 3, 4, is used.
- It is important to note that SNE was given no information about class labels.
- As shown in the figure above, SNE quite cleanly separates the digit groups.
2.2. Word-Author Counts
- Each of the 676 authors who published more than one paper in NIPS vols. 0–12 is represented by a vector of word counts from their papers, and SNE is applied to these author vectors.
- We can see Prof. LeCun and Prof. Hinton at the bottom of the figure.
- SNE seems to have grouped authors by broad NIPS field: generative models, support vector machines, neuroscience, reinforcement learning and VLSI all have distinguishable localized regions.
The paper also mentions a full mixture version of SNE. The idea behind it is that each high-dimensional object can have several different versions of its low-dimensional image. I don’t go through it here; if interested, please feel free to read the paper.