Understanding and applying t-SNE

Andrés Burruel
MCD-UNISON
7 min read · Mar 25, 2024

When working with dataframes, the number of variables or features that describe a row is usually known as the dimensionality. It is pretty common to encounter datasets with high dimensionality, and analyzing them poses several challenges: tasks like training machine learning models, or simply gaining insights, can quickly become complicated. In order to attack this problem, we have dimensionality reduction.

Dimensionality reduction techniques are methods that convert high-dimensional data into low-dimensional data while trying to preserve similar or equivalent information. Among the most popular techniques we can mention, for example, principal component analysis (also known as PCA) and singular value decomposition (SVD).

On this occasion, we will present the method known as t-SNE, which is based on another technique called Stochastic Neighbor Embedding (SNE). The objective here is to present the algorithm in a summarized and, hopefully, more understandable way, covering the fundamental ideas without going into too much mathematical detail. At the end, the method will be applied to two simple data sets using Python: the first one generated artificially, and the other one taken from the classic wine quality dataset of the UC Irvine Machine Learning Repository. Let's begin.

Stochastic Neighbor Embedding

First, suppose we have a set of high-dimensional data X, where each data point has a vector representation, and that we want to transform it into low-dimensional data Y. If we pick a point i and consider the remaining points, then, given i, the point j is taken as a neighbor of i with probability
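p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}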

where ||·|| denotes the Euclidean norm and σ_i is the variance of a Gaussian centered at that point. With this, if the distance between points i and j is large, or if the given variance is small, the probability will be small; if the distance between i and j is small and the variance is larger, the probability will be bigger. The variance can be interpreted as an acceptance radius that restricts which points we consider as neighbors: the smaller the variance, the smaller the "acceptance radius". In practice, σ_i is chosen for each point so that the resulting distribution has a user-specified perplexity, which can be roughly interpreted as an effective number of neighbors (this is the perplexity parameter we will pass to sklearn later).

Now, imagine that we have already embedded the high-dimensional data X into a low-dimensional space, transforming it into Y. Then, if we take a mapped element i and consider the remaining mapped points, assuming a certain fixed variance, the point j is a neighbor of i in this new low-dimensional space with probability
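q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}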

Here, y_i is the mapped datapoint corresponding to x_i. Because we want to preserve the same "shape" or the same "neighborhoods", we can say that the mapping was correct if
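p_{j|i} \approx q_{j|i}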

for almost every i and j. In order to measure the difference between the two ways of assigning probability, p and q, Stochastic Neighbor Embedding uses the Kullback-Leibler divergence, which is a type of statistical distance.

Given an element i, if we have similar ways of assigning probability to the remaining points, then
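\mathrm{KL}(P_i \,\|\, Q_i) = \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} \approx 0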

Thus, if both ways of assigning probability are similar, the total cost C, given by the sum of the Kullback-Leibler divergences over all points, will be relatively small. With the previous concepts, it is now possible to describe what SNE does in general terms: the objective of this first method is to generate a mapped dataset Y that minimizes the cost function C. One way to do this is through the gradient descent algorithm, where the cost function and its gradient are given by
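C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \qquad \frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)\left(y_i - y_j\right)

and each mapped point y_i is updated iteratively in the direction of the negative gradient.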

t-SNE method

With SNE in place, we can now build t-SNE. Before that, we introduce an adaptation of SNE called symmetric SNE. In this new version, working analogously with the elements of the dataset, we define the joint probabilities
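p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad q_{ij} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq l} \exp\left(-\lVert y_k - y_l \rVert^2\right)}

where n is the number of data points.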

Notice that
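p_{ij} = p_{ji}, \qquad q_{ij} = q_{ji}, \qquad \sum_j p_{ij} > \frac{1}{2n} \text{ for every } i

so both distributions are symmetric and every point makes a non-negligible contribution to the cost.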

Also, the cost function C transforms into
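C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}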

Finally, what makes t-SNE distinctive is a change in the function q_ij: it measures similarities in the low-dimensional space with a Student t-distribution with one degree of freedom, giving
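q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}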

All of these additional steps help avoid the so-called "crowding problem". In the end, in t-SNE we also want to obtain low-dimensional data that minimizes the cost C, and just as in SNE, we can use gradient descent to optimize it, where the gradient is
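\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}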

With all the above, the t-SNE algorithm can be summarized with the following image, as it is shown in the original paper (taken from https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf):

Simple version of t-SNE
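For readers who prefer code over pseudocode, the loop below is a minimal NumPy sketch of that simple version. It assumes the joint probabilities P have already been computed from the high-dimensional data, and it omits refinements from the paper such as early exaggeration and the adaptive momentum schedule; the function name tsne_sketch and the parameter values are just illustrative, and in practice we will simply use the sklearn implementation shown later.

import numpy as np

def tsne_sketch(P, n_components=2, n_iter=500, learning_rate=100.0, momentum=0.8, seed=0):
    # P: (n, n) symmetric joint probabilities computed from the high-dimensional data
    n = P.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(n, n_components))   # small random initial map
    velocity = np.zeros_like(Y)
    for _ in range(n_iter):
        # Student-t (1 degree of freedom) similarities in the low-dimensional space
        diff = Y[:, None, :] - Y[None, :, :]              # pairwise differences, shape (n, n, d)
        inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # (1 + ||y_i - y_j||^2)^-1
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()
        # Gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1
        grad = 4.0 * np.einsum('ij,ijk->ik', (P - Q) * inv, diff)
        # Gradient descent step with momentum
        velocity = momentum * velocity - learning_rate * grad
        Y += velocity
    return Y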

Some of the advantages of using t-SNE are:

  • Local neighborhoods or surrounding areas are very well preserved.
  • Outliers do not affect the result as much, thanks to the Kullback-Leibler divergence measure.
  • It is not necessary for the data to exhibit linear relationships, as is required by some other dimensionality reduction techniques.

Applying the method

Fortunately, implementing this method in Python is simple, since the sklearn library already contains the classes and methods necessary to apply it. As mentioned at the beginning, we will first use this technique on an artificially made data set. We now import the necessary libraries.

import pandas as pd
import numpy as np
from sklearn.manifold import TSNE

import matplotlib.pyplot as plt
import plotly.express as px

Then, we generate and plot the data set in the following way.

X_1 = [[np.random.uniform(1,2),np.random.uniform(1,2), np.random.uniform(1,2)] for i in range(1000)]
X_2 = [[np.random.uniform(-2,-1),np.random.uniform(1,2), np.random.uniform(1,2)] for i in range(1000)]
X_3 = [[np.random.uniform(-2,-1),np.random.uniform(-2,-1), np.random.uniform(1,2)] for i in range(1000)]
X_4 = [[np.random.uniform(1,2),np.random.uniform(-2,-1), np.random.uniform(1,2)] for i in range(1000)]

X = np.array(X_1 + X_2 + X_3 + X_4)

df = pd.DataFrame(X, columns = ['x','y','z'])

X_x = df['x']
X_y = df['y']
X_z = df['z']

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X_x, X_y, X_z)
ax.set_zlim(0,2)
plt.show()
“Pillar” data

We’ve basically generated four ‘pillars’ in a 3D space. The objective with this dataset is to first see how t-SNE embeds this in a two dimensional space. Now, we proceed to apply the method with sklearn and organize the new dataset in order to make a new 2D plot.

X_embedded = TSNE(n_components=2, learning_rate='auto',
                  init='random', perplexity=70, n_iter=1000).fit_transform(X)

DF = pd.DataFrame(X_embedded, columns = ['x','y'])

fig = px.scatter(DF, x = 'x', y = 'y')
fig.update_layout(width = 700, height = 700)
fig.show()
Embedded ‘pillars’ in a 2D space

Although it may seem intuitive that going from three to two dimensions is simply a projection, it is worth clarifying that this is not the case. Again, what the algorithm tries to do is make the probability distributions established in the original space and in the low-dimensional space as similar as possible, taking the Kullback-Leibler divergence as a "distance", or the main criterion.

Now, we will do the same but with the red wine quality data set from the UCI repository. You can download it from https://archive.ics.uci.edu/dataset/186/wine+quality. First, notice that this dataset has 11 physicochemical features plus a quality score, or, in other words, 12 dimensions.

# the UCI file uses ';' as the column separator
X_wine = pd.read_csv('winequality-red.csv', sep=';')
X_wine = X_wine.astype('float64')
X_wine.head()

Now, we are going to try to embed this high-dimensional data into a 2D space.

X_wine_embedded = TSNE(n_components=2, learning_rate='auto', init='random',
                       perplexity=70, n_iter=1200).fit_transform(X_wine)

df_wine = pd.DataFrame(X_wine_embedded, columns = ['x','y'])

fig = px.scatter(df_wine, x = 'x', y = 'y')
fig.update_layout(width = 600, height = 600)
fig.show()
Wine quality dataset embedded in a 2D space

Finally, let’s see what happens when embedding the data into a 3D space.

X_wine_embedded_3d = TSNE(n_components=3, learning_rate='auto', init='random',
                          perplexity=70, n_iter=1200).fit_transform(X_wine)

df_wine_3d = pd.DataFrame(X_wine_embedded_3d, columns = ['x','y','z'])

# Separating dimensions or coordinates
Wine_x = df_wine_3d['x']
Wine_y = df_wine_3d['y']
Wine_z = df_wine_3d['z']

# Making plot
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')
ax.scatter(Wine_x, Wine_y, Wine_z)
ax.set_xlim(-20, 20)
ax.set_ylim(-6, 4)
ax.set_zlim(-10, 12)
plt.show()
Wine quality dataset embedded in a 3D space

References:

  1. Hinton, G. & Roweis, S. (2002). Stochastic Neighbor Embedding. Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2002/file/6150ccc6069bea6b5716254057a194ef-Paper.pdf
  2. van der Maaten, L. & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
