t-SNE and word embeddings — Weekend of a Data Scientist

Alexander Osipenko · Published in Cindicator · 3 min read · Aug 17, 2018
Example of visualization with t-SNE and word2vec

Weekend of a Data Scientist is a series of articles about cool stuff I care about. The idea is to spend the weekend learning something new, reading, and coding.

This week I’ve been reading papers about t-SNE (t-distributed stochastic neighbor embedding). So here is what I understood from them.

What is t-distributed stochastic neighbor embedding?

t-SNE is a technique for non-linear dimensionality reduction and visualization of high-dimensional data. The original SNE came out in 2002, and in 2008 an improved version was proposed, replacing the Gaussian distribution in the low-dimensional map with a Student t-distribution and introducing a symmetrized cost function that is easier to optimize.

Where can it be used?

It’s a popular tool for visualizing word embeddings, but in general you can use t-SNE to visualize any high-dimensional data.

Some of the math behind it

We start with a dataset where each data point is a high-dimensional vector, and we need to produce a new dataset in 2D or 3D space in which each data point maintains the structure and patterns that existed in the original dataset.

  1. Transform the multi-dimensional Euclidean distances between data points into conditional probabilities that reflect the similarity between points (a compact summary of all the formulas appears after this list):

p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²)

p_{j|i} measures how close point x_j is to point x_i under a Gaussian distribution centered at x_i with standard deviation σ_i.

2. σ_i must be chosen for each data point individually; to do this, the authors use perplexity.

3. Perplexity can be interpreted as a smooth estimate of the number of neighbors that influence point x_i. It is a hyper-parameter of the method; the authors recommend values between 5 and 50.

4. So we have probabilities in the high-dimensional space; now we need to define corresponding pairwise similarities q_{ij} between the points in the low-dimensional map, this time using a Student t-distribution with one degree of freedom.

5. In the end, the algorithm minimizes the Kullback–Leibler divergence between the two distributions with respect to the map points using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs well.
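For reference, here is my summary of the key formulas as they appear in the t-SNE paper (reference 2 below), written in LaTeX; x denotes the original points, y the map points, and n the number of points:

```latex
% High-dimensional similarity: Gaussian centered at x_i with deviation sigma_i
p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
               {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}

% Perplexity is fixed by the user; each sigma_i is found by binary search
% so that Perp(P_i) matches the chosen value
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad
H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}

% Symmetrized joint probabilities in the high-dimensional space
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarity: Student t-distribution, one degree of freedom
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
              {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

% Cost minimized by gradient descent with respect to the map points y_i
C = \mathrm{KL}(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
```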

So how to use it?

Fortunately, t-SNE is already implemented in sklearn, so you don’t need to write it from scratch! I tried to visualize a 2D scatter of similar words using word2vec and t-SNE. I didn’t train the word2vec model myself, because pre-trained models are available online for a decent number of languages, for example here.
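Roughly, the 2D workflow looks like this. This is a minimal sketch; the model file name and the seed word are just placeholders, not the exact ones from my experiment:

```python
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

# Load a pre-trained word2vec model (placeholder file name).
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Take a seed word and its nearest neighbors in the embedding space.
words = ["guitar"] + [w for w, _ in model.most_similar("guitar", topn=20)]
vectors = model[words]  # one 300-dimensional vector per word

# Project to 2D. Perplexity must stay below the number of samples;
# the authors recommend values between 5 and 50.
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
coords = tsne.fit_transform(vectors)

# Scatter the 2D map and label each point with its word.
plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```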

For this example, I made a 3D scatter based on the lyrics of Radiohead’s OK Computer album. Similar words are located closer to each other. It was fun to play with!
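The 3D version is the same idea. Continuing from the snippet above (again a sketch, not my exact code), you only change n_components and switch to a 3D axis:

```python
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the "3d" projection)

# Same word vectors as before, now mapped to three dimensions.
coords3d = TSNE(n_components=3, perplexity=10, random_state=42).fit_transform(vectors)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(coords3d[:, 0], coords3d[:, 1], coords3d[:, 2])
for word, (x, y, z) in zip(words, coords3d):
    ax.text(x, y, z, word)  # label each point with its word
plt.show()
```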

References:

  1. Hinton, G., & Roweis, S. (2002). Stochastic Neighbor Embedding. NIPS 15.
  2. van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9.

Have you tried messing around with t-SNE and word embeddings?
