I do not understand t-SNE — Part 1

or my quest to understand t-SNE

t-SNE (or t-distributed Stochastic Neighbor Embedding) is a data visualization technique that maps high-dimensional data down to two or three dimensions for us (puny) humans to visualize. While there are many guides and tutorials about t-SNE that illustrate its importance, provide information on how to use it, and offer some intuition about its working, very few actually attempt to explain the underlying nitty-gritty details of the algorithm. With this post, I am trying to aggregate information from the original paper and various resources, each of which explains a part of t-SNE, into one place for my future self and others.

This post is divided into two parts, partly due to the overwhelming amount of information in t-SNE and partly due to my inability to come up with an easy explanation for this algorithm right now. The first part explains the SNE algorithm (which was also co-authored by Geoffrey Hinton) and the second part explains the improvements that t-SNE introduces over SNE.

SNE - Explained (Cliché)

The basic idea behind SNE is to minimize a cost function that describes the divergence of our (low-dimensional) representation from the actual (high-dimensional) data, using good old-fashioned Gradient Descent. Or, simply stated, a cost function that represents the error in our representation. That cost function is the Kullback–Leibler divergence (KL Divergence).

The next question that comes to mind is: how do we represent some random distribution using a mathematical function? That question is answered using Shannon Entropy. Shannon Entropy is a way to encode information in the form of bits. So, if there is a difference in information between our representation and the actual data, plugging Shannon Entropy into the KL Divergence should tell us that. Minimizing the KL Divergence, therefore, reduces the error of our representation.

Okay, so how does Shannon Entropy encode a distribution into bits? Shannon Entropy takes a Probability Mass Function (the discrete counterpart of a PDF) and encodes the probability of each event into bits, which can then be averaged over the complete distribution. And it is easy to convert raw data, such as pairwise distances, into a probability distribution using well-known densities such as the Normal or the Cauchy distribution.

And that’s it, these are all the components required for understanding this algorithm. Let’s dive into the details of each component.

Probability Function

Dataset with 2 distinct groups

Let’s take the example of a 2-D distribution. Pick a point out of the blue category and call it point i. Now, using the Normal distribution, we can encode the distance of every other point j as the probability that the point belongs to the same category as point i. The farther a point is from i, the lower the probability that it belongs to the same group.

But, the actual probability function used in SNE (for the data that is to be represented) is

Conditional Probability in SNE

instead of

Normal Distribution Function
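Written out in symbols (the x’s are the high-dimensional data points and σ_i is the bandwidth chosen for point i; this is the definition given in the paper), the conditional probability is

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

rather than a plain (one-dimensional) Normal density of the form

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)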

Now, why is that? Notice in the dataset image above that one category has a much higher density than the other. The conditional probability function, which takes the ratio of the Gaussian similarity between points i and j to the sum of the similarities between i and every other point, makes sure that the local distribution around each point is represented equally, regardless of its density.
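To make this concrete, here is a minimal numpy sketch of how the conditional probabilities p_{j|i} for a single point i could be computed. The function name is my own, and it assumes the bandwidth sigma_i is already known (choosing it is exactly the perplexity business discussed below):

```python
import numpy as np

def conditional_probabilities(X, i, sigma_i):
    """Compute p_{j|i} for every point j, given point i and a bandwidth sigma_i."""
    # Squared Euclidean distances from point i to every point in the (n, d) array X.
    sq_dists = np.sum((X - X[i]) ** 2, axis=1)
    # Unnormalized Gaussian similarities; a point is never its own neighbor.
    weights = np.exp(-sq_dists / (2.0 * sigma_i ** 2))
    weights[i] = 0.0
    # Normalize so that the probabilities p_{j|i} sum to 1 over j.
    return weights / weights.sum()
```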

One missing piece of the puzzle is the choice of “σ”. It is calculated using something called perplexity, which will be detailed after Shannon Entropy.

Similarly, the conditional probability function chosen for the low-dimensional map (that we need to optimize) is

Conditional Probability low-dimensional map
The variance (σ²) is taken to be 0.5 here, the value used in the original paper.
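In symbols (the y’s are the low-dimensional map points we are trying to learn, and with 2σ² = 1 the exponent needs no denominator):

q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}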

Shannon Entropy

Let’s first quote what Wikipedia says about entropy

Information entropy is defined as the average amount of information produced by a stochastic source of data.

To understand the above statement completely, take the example of a message sent from a rover on the Moon -

“The composition of unearthly liquid on the moon’s surface contains at most 10% water.”

Now, a mathematical function can’t make much sense of that message. It needs the message in some numerical form, and that’s where information content enters: it encodes the message as the amount of information the message gives us.

Continuing the example, let’s suppose that the scientists already knew that the water composition on the moon’s surface is some integer value between 5–9%. Then the message transmitted by the rover is of zero use to them, a waste of precious interplanetary communication bandwidth. Stupid rover!

We can interpret this in terms of probability. Due to this prior knowledge, the probability of an answer of at most 10% is 1 and hence, the message gives the scientists no additional information at all. So, the higher the probability of an event occurring, the lower the amount of information its occurrence carries. Therefore, we can expect the information content to be inversely related to the probability of the event.

To clarify the previous point a little more, suppose the scientists knew that the water content lies between 7–11%; then the probability of the water content being at most 10% is 0.8. If they instead knew that it is somewhere between 9–13%, that probability becomes 0.4. And if the prior knowledge were that it lies between 10–14%, then combining it with the new message sent by the rover, the scientists would have narrowed the possibility down to a single number (10%).

Mathematically, all of the above looks like
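(writing I(x) for the information content, in bits, of an event that occurs with probability p(x))

I(x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x)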

Now, the log is just used to encode the probability into a number of bits (yeah, kind of like Huffman Encoding). The number of bits is just a convenient format for representing the amount of information contained. More information regarding the choice of the log function can be found here.

Now, Entropy is the expected value (or average) of the number of bits required to encode each piece of information being relayed, or, more formally, can be defined as
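(with the sum running over all possible outcomes x of the source X)

H(X) = \mathbb{E}\left[I(X)\right] = -\sum_{x} p(x) \log_2 p(x)

i.e., the probability-weighted average of the information content of every possible outcome.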

Note that while calculating the Shannon Entropy of the low-dimensional representation, the formula to use is the cross-entropy, a summation over −p(x) · log(q(x)). This is because the probability of each event occurring is still p(x) under the original distribution, while log(q(x)) can be thought of as the information content that our representation assigns to it.

Okay, we have successfully converted the information contained in the dataset’s distribution into bits. Now, let’s deal with the missing piece: the variance (or perplexity).

Perplexity — The density of the distribution around each point i differs, and hence a common “σ” can’t be chosen for calculating the probabilities of all points. “σ” should be smaller in denser regions and larger in sparser regions, but the total amount of information contained in the distribution around each point should stay the same. So, we could fix some number of bits (a Shannon Entropy value) with which to represent the neighborhood of every point. A more convenient way to do this is to choose a number called perplexity, which is defined as
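(P_i being the conditional distribution that point i defines over its neighbors)

\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}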

So, perplexity is nothing more than 2 raised to the Shannon Entropy (i.e., the number of bits), and it can be interpreted as the effective number of neighbors of each point. The SNE algorithm is quite robust to the value of perplexity, and typical values lie in the range 5 to 50, as mentioned in the paper.

My understanding of perplexity is by no means complete. So, take my word in this case, or in any other, with a grain of salt.

Now, once we have a perplexity value, we can find “σ” for each point by solving the perplexity equation, using Binary Search.
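Here is a rough sketch of how that binary search might look, reusing the conditional_probabilities helper from the earlier sketch; the search bracket, tolerance and iteration limit are arbitrary choices of mine, not values from the paper:

```python
import numpy as np

def find_sigma(X, i, target_perplexity, tol=1e-4, max_iter=50):
    """Binary-search sigma_i so that 2**H(P_i) matches the target perplexity."""
    lo, hi = 1e-10, 1e4  # arbitrary search bracket
    sigma = 1.0
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = conditional_probabilities(X, i, sigma)  # helper defined above
        # Shannon Entropy in bits; skip zero entries to avoid log2(0).
        nz = p[p > 0]
        perplexity = 2.0 ** (-np.sum(nz * np.log2(nz)))
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma  # too many effective neighbors, shrink sigma
        else:
            lo = sigma  # too few effective neighbors, grow sigma
    return sigma
```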

Kullback–Leibler divergence

We now have all the components that we need; what do we have to do to convert the Shannon Entropy into a cost function? One of the simplest things possible is to calculate the difference in information between our representation and the high-dimensional data, and that is exactly what KL Divergence does. So, for each point i, the KL divergence can be calculated as
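(P_i and Q_i being the conditional distributions defined earlier for the high-dimensional data and the low-dimensional map, respectively)

KL(P_i \,\|\, Q_i) = \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}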

(This is just the standard formula from Wikipedia’s page on KL Divergence, which I was lazy enough to borrow; the index is renamed to j to match our notation.)

So, summing over all points, the total KL Divergence (or cost) can be defined as

Total cost of low-dimensional representation
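In the notation used above, that total cost is

C = \sum_{i} KL(P_i \,\|\, Q_i) = \sum_{i} \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}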

Final Points

So, there we have it. We have our cost function, and we can minimize it using our favorite Gradient Descent algorithm. Although SNE works amazingly well in practice, there are still problems with it that were addressed by t-SNE. So, keep an eye out for Part 2 of this post, in which I’ll discuss the improvements made by t-SNE over the SNE algorithm.
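For completeness, the gradient of this cost with respect to a map point, as derived in the SNE paper, has a fairly simple form:

\frac{\partial C}{\partial y_i} = 2 \sum_{j} \left( p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j} \right) \left( y_i - y_j \right)

so every other point pushes or pulls y_i according to the mismatch between the two sets of conditional probabilities.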

Thank You for reading.