# t-SNE: The Bits that No One Learns

Machine learning has become a staple skill set for any data scientist or quantitative analyst. A big part of machine learning is feature engineering, which includes dimensionality reduction. This is often a necessary step because acquired data tends to be high dimensional, and high dimensional data has a few drawbacks:

- The computational power and time required to analyse high dimensional data are very high.
- The space of high dimensional data is ‘very large’, so the amount of data required to analyse it effectively grows exponentially with the dimension. This is known as the curse of dimensionality.
- Performing robust statistical inference on high dimensional data is therefore hard, because the available data is sparse relative to the size of the space.

This is why there is a need for dimensionality reduction that not only reduces the dimension of the dataset to make analysis easier, but also preserves the nature and structure of the dataset. A class of dimensionality reduction algorithms that exploits the geometry of the data is known as manifold learning. For an overview of manifold learning, do look into this article: Manifold Learning

In this article, I will be discussing in depth the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, which falls under manifold learning and is incredibly useful for visualizing high dimensional datasets. It is my humble opinion that before using an analysis tool, it is important to understand its theoretical background and the limitations that result from it. Therefore, I will delve into the workings of the algorithm and look briefly into how it is implemented.

## t-SNE

T-distributed Stochastic Neighbor Embedding, or t-SNE as it is normally called, is a manifold learning algorithm that in essence constructs a probability distribution over the dataset, then another probability distribution in a lower dimensional space, and makes the two distributions as ‘close’ as possible (we will make this notion more rigorous later).

However, before looking into t-SNE, we will investigate SNE, which uses a Gaussian distribution in the low dimensional space where t-SNE uses a Student-t distribution. The first step is constructing the probability distribution of the dataset in high dimensions. The probabilities are built from a Gaussian kernel over pairwise points, which amounts to

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

The intuition behind this construction is rather simple. The affinity of point $x_j$ to point $x_i$ falls off as a Gaussian centred at $x_i$ with bandwidth $\sigma_i$. In practice the user does not set $\sigma_i$ directly: each $\sigma_i$ is found by binary search so that the distribution $p_{\cdot|i}$ matches a user-specified perplexity, which can loosely be read as the effective number of neighbours of each point.
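The high dimensional probabilities can be sketched in a few lines of NumPy. This is a minimal version that assumes a single shared bandwidth `sigma` for all points, whereas real implementations tune a per-point bandwidth against the perplexity as described above:

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """Gaussian conditional probabilities p_{j|i} for SNE.

    Sketch only: one shared bandwidth `sigma`, whereas real
    implementations tune a per-point sigma_i to a target perplexity.
    """
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    # Unnormalised Gaussian affinities; a point is never its own neighbour.
    aff = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(aff, 0.0)
    # Normalise each row so p_{.|i} is a probability distribution.
    return aff / aff.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
P = conditional_p(rng.normal(size=(5, 3)))
```

Each row of `P` sums to one, so row $i$ really is a probability distribution over the neighbours of $x_i$.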

Once the probability distribution in the high dimension is constructed, we construct a probability distribution in the lower dimensional space (usually 2 or 3 for visualization purposes) using the following formula for the pairwise map points $y$:

$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}$$

Now the goal is to make sure the two probability distributions are as similar as possible. This is achieved by considering the Kullback-Leibler (KL) divergence.

KL divergence is a measure of how different one probability distribution is from another. It is defined as

$$\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

and the SNE cost function sums it over all points: $C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$.

In essence, the KL divergence is the expected value, under $P$, of the log difference between the probabilities of the data points. Note that it is not symmetric: $\mathrm{KL}(P\|Q) \neq \mathrm{KL}(Q\|P)$ in general, which means SNE penalises placing genuinely close points far apart in the map more heavily than the reverse. Now that we have a notion of how to measure the similarity/difference between distributions, we can optimize this with well-known gradient descent methods.
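The definition above translates directly into code. A small sketch, with an `eps` guard (my addition, not part of the formula) to avoid taking the log of zero entries:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    # eps guards against log(0) for zero-probability entries.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Identical distributions diverge by zero; differing ones by a positive amount.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
diff = kl_divergence([0.9, 0.1], [0.5, 0.5])
flip = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

Comparing `diff` and `flip` also shows the asymmetry in action: swapping the roles of $P$ and $Q$ gives a different value.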

However, there is a crucial disadvantage to using SNE. It is called the crowding problem, and it is a result of the curse of dimensionality. Consider data points that have ten intrinsic dimensions but are represented in a much higher dimensional space, and suppose you want to reduce them to two dimensions. If the number of data points is small, say 11, they can all be mutually equidistant in ten dimensions, and there is then no reliable way to place them in two dimensions, where at most three points can be mutually equidistant.

Furthermore, when you map the high dimensional data into two dimensions, the area available to accommodate moderately distant points is not sufficiently large relative to the area available for nearby points. This is because volume scales as a power of the dimension: the volume of a sphere of radius $r$ in $n$ dimensions grows as $r^n$. Doubling the radius multiplies the volume by 4 in two dimensions, but by roughly 1000 in ten dimensions, so a high dimensional space has vastly more room at moderate distances than a two dimensional map can offer. As a result, the moderately distant points tend to get ‘crowded’ together in the map.
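The volume scaling above is easy to check numerically with the standard formula for the volume of an $n$-ball, $V_n(r) = \pi^{n/2} r^n / \Gamma(n/2 + 1)$:

```python
import math

def ball_volume(n, r):
    # Volume of an n-ball: pi^(n/2) * r^n / Gamma(n/2 + 1).
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

# Doubling the radius multiplies the volume by 2^n:
# 4x in 2 dimensions, 1024x in 10 dimensions.
ratio_2d = ball_volume(2, 2.0) / ball_volume(2, 1.0)
ratio_10d = ball_volume(10, 2.0) / ball_volume(10, 1.0)
```

The gap between those two ratios is exactly the mismatch in "room at moderate distances" that crowds the 2-D map.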

Since we are comparing the probabilities of the data points and not the actual distances, one natural way to overcome this issue is to use a distribution with fatter tails than the Gaussian in the lower dimensional space. One such distribution is the Student-t distribution (t-SNE uses one degree of freedom, i.e. a Cauchy distribution). This works because when the tails are fatter, a moderate affinity is modelled by a larger distance in the low dimensional space than it would be under a Gaussian, which relieves the crowding.
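A quick numerical comparison of the two unnormalised affinity kernels shows just how much fatter the tail is:

```python
import math

# Unnormalised affinities at squared distance d^2 = 9 (i.e. d = 3):
gauss = math.exp(-9)    # the Gaussian kernel has all but vanished here
cauchy = 1 / (1 + 9)    # the Student-t (Cauchy) kernel is still 0.1
```

At distance 3 the Student-t affinity is hundreds of times larger than the Gaussian one, so moderately distant pairs still contribute meaningfully to the low dimensional distribution.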

To visualize it, consider cumulative probabilities: to cover the same cumulative probability, you must move further from the centre under a Student-t distribution than under a Gaussian, because the t-distribution puts more mass in its tails. The formula for the probabilities in the lower dimensional space now becomes:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

(t-SNE also symmetrises the high dimensional probabilities, setting $p_{ij} = (p_{j|i} + p_{i|j})/2n$, so that both distributions are joint rather than conditional.)
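The low dimensional probabilities can be sketched the same way as the high dimensional ones, swapping the Gaussian kernel for the heavy-tailed Student-t kernel and normalising over all pairs rather than per row:

```python
import numpy as np

def joint_q(Y):
    """Student-t (1 degree of freedom) joint probabilities q_{ij}
    over the low dimensional map points Y."""
    sq = np.sum(Y**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Y @ Y.T
    inv = 1.0 / (1.0 + d2)   # heavy Cauchy tail instead of exp(-d2)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()   # normalise over all pairs, not per row

rng = np.random.default_rng(0)
Q = joint_q(rng.normal(size=(6, 2)))
```

Unlike the per-point conditionals $q_{j|i}$, the whole matrix `Q` sums to one and is symmetric.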

The optimization of the KL divergence is done using gradient descent algorithms that every data science enthusiast should be familiar with. However, some refinements are made. One is early exaggeration: the probabilities $p_{ij}$ are multiplied by a constant factor (e.g. 4 or 12) during the early iterations, encouraging the optimisation to model large values of $p$ with fairly large values of $q$. This widens the space between clusters and gives better visualizations. Another is early compression: an L2 penalty on the squared distances of the map points from the origin is added early in the optimisation, keeping the points close together so that clusters can move through one another before settling.
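A single step of that gradient descent can be sketched as follows. This is a bare-bones illustration, assuming plain gradient descent with a fixed learning rate; real implementations add momentum, per-parameter gains, and run the first iterations with `exaggeration > 1`:

```python
import numpy as np

def joint_q(Y):
    # Student-t joint probabilities q_{ij} over the map points.
    sq = np.sum(Y**2, axis=1)
    inv = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2 * Y @ Y.T)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum(), inv

def tsne_step(Y, P, lr=100.0, exaggeration=1.0):
    """One plain gradient-descent step on the t-SNE KL objective.

    Sketch only: no momentum or adaptive gains. exaggeration > 1
    implements early exaggeration by scaling up P."""
    Q, inv = joint_q(Y)
    # Gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j) / (1 + ||y_i - y_j||^2)
    PQ = (exaggeration * P - Q) * inv
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return Y - lr * grad

def kl_loss(P, Y):
    Q, _ = joint_q(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Toy symmetric p_ij and a random 2-D map to step on.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()
Y = rng.normal(size=(6, 2))

before = kl_loss(P, Y)
Y2 = tsne_step(Y, P, lr=1e-2)   # a small step should lower the KL
after = kl_loss(P, Y2)
```

With a small enough learning rate each step moves against the gradient and reduces the divergence, which is exactly the optimisation the article describes.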

## Conclusion

To recap, we first learnt how the SNE algorithm works by constructing probability distributions over both the high and low dimensional representations and minimising the KL divergence between them to get the most similar dataset in lower dimensions. Then, we discussed the crowding problem that arises with the Gaussian and overcame it with the Student-t distribution. Lastly, we looked at how the classic gradient descent algorithm is improved using early exaggeration and early compression. Hopefully, this leaves you with an understanding of how t-SNE works. To look into how to set the parameters of t-SNE effectively, look at this article.