Dimensionality Reduction with t-SNE
Introduction
How do you imagine four dimensions? My best attempt is drawing a hypercube, but what about five, six, or more? We have to admit that our minds are limited to the 3D world; in data science, however, you constantly face data of all sorts of dimensions.
Dimensionality reduction comes to the rescue! This article is a brief overview of t-SNE, a popular algorithm for reducing the dimensionality of your data.
Word embeddings
In this post I will take a look at dimensionality reduction through the prism of word embeddings. Long story short, these are words encoded as multidimensional vectors. Common embedding sizes range from 50 to 300. Sometimes it is important to understand the relationships between particular word vectors, which is nearly impossible in their original multidimensional form. Once again, dimensionality reduction comes to the rescue!
If you are new to word embeddings, please take a look at my two-minute explanation of this term by example:
Riga is the capital of Latvia
For demonstration purposes we will use four GloVe word vectors: “Riga”, “Latvia”, “Capital” and “Country”. Looking at the original 300D word vectors gives no clue about the relationships between the words, but a short code snippet lets t-SNE reduce the word vectors to a size of 2:
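A minimal sketch of that reduction with scikit-learn's t-SNE. Random vectors stand in for the real 300-dimensional GloVe vectors here, so the snippet runs without downloading the GloVe file; in practice you would load the four vectors from a pre-trained GloVe model instead.

```python
import numpy as np
from sklearn.manifold import TSNE

words = ["Riga", "Latvia", "Capital", "Country"]

# Stand-in for real 300D GloVe vectors: random data of the same shape.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(len(words), 300))

# Perplexity must be smaller than the number of samples (here 4).
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=42)
vectors_2d = tsne.fit_transform(vectors)
print(vectors_2d.shape)  # (4, 2)
```

Note that t-SNE is stochastic: without a fixed `random_state`, each run produces a different 2D layout, though the relative distances between points stay informative.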
This trick allows us to make a scatter plot of the 2D vectors and build a deeper understanding of the relationships between the words: “Riga” is to “Latvia” as “Capital” is to “Country”. In other words, “Riga is the capital of Latvia”.
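The scatter plot itself might be sketched as follows. The 2D coordinates here are illustrative placeholders, not actual t-SNE output; in practice you would pass in the reduced vectors from the previous step.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripting
import matplotlib.pyplot as plt

# Illustrative 2D coordinates; real t-SNE output varies run to run.
points = {
    "Riga": (1.0, 2.0),
    "Latvia": (1.2, 4.1),
    "Capital": (5.0, 2.2),
    "Country": (5.3, 4.0),
}

fig, ax = plt.subplots()
for word, (x, y) in points.items():
    ax.scatter(x, y)              # one dot per word
    ax.annotate(word, (x, y))     # label the dot with the word
fig.savefig("tsne_words.png")
```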
Dimensionality reduction is a handy technique beyond word embedding visualization; you will also find it useful, for example, when dealing with multidimensional feature input.
Code
https://www.kaggle.com/dmitryyemelyanov/word-vector-dimensionality-reduction-with-t-sne