Understanding TagSpace — Supervised Learning of Word Embeddings

Vishal R
Developer Community SASTRA
Mar 25, 2020

I recently started exploring Machine Learning for text (so far, I had been working with images) and I was introduced to Facebook's StarSpace. For those who are unaware of StarSpace, this is how Facebook Research describes it:

StarSpace is a general-purpose neural model for efficient learning of entity embeddings for solving a wide variety of problems

One of the use cases mentioned in the repository is TagSpace, for generating word/tag embeddings, and that is what this article is about.

This article is not going to explain what embeddings are and why we need them, as I believe there are more than enough articles that discuss that already.

Earlier Approaches to Generating Word Embeddings

There are several approaches to generating word embeddings, the popular ones being word2vec and GloVe. To understand why TagSpace is different from these, we need an understanding of how they work. For this article, we'll just discuss word2vec.

Word2Vec

Word2Vec has a pretty straightforward way of learning word embeddings. Given a sentence, we take a word (say W) in the sentence and compute the embeddings of all the words around it. Since the model is not yet trained, these embeddings are going to be just random numbers. Then, we use these embeddings to predict the word W itself. Training the model with this approach over a huge corpus of sentences and a huge vocabulary should eventually result in the model generating embeddings that hold some information about the word itself. (This lecture by Prof. Christopher Manning should provide a better understanding of how it works.)
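As a rough illustration, here is a minimal sketch of training word2vec on a toy corpus with the gensim library. The corpus, hyperparameters and library choice are mine, not anything from the TagSpace paper:

```python
from gensim.models import Word2Vec

# A toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["machine", "learning", "for", "text"],
    ["word", "embeddings", "for", "text"],
    ["images", "and", "text", "need", "embeddings"],
]

# sg=0 selects the CBOW variant: the surrounding words are used to predict the centre word W.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["text"][:5])            # first few dimensions of the learned embedding for "text"
print(model.wv.most_similar("text"))   # words whose embeddings ended up closest to it
```

On a real corpus of millions of sentences, the nearest neighbours stop being random and start reflecting meaning, which is the behaviour described above.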

How is TagSpace different?

One major difference between TagSpace and methods like Word2Vec and GloVe is that TagSpace uses supervised learning while the others use unsupervised learning. Unsupervised learning algorithms train with an objective of reconstruction, i.e., the embeddings are used to predict the original text. In supervised learning, the model's objective is to label a given text.

The Data for TagSpace

The biggest problem anyone will face with supervised learning is getting labelled data. But there is a simple solution for this use case. In Facebook posts (and any social media posts, for that matter), the #hashtags generally act as labels for what the post is talking about. This is the data TagSpace uses: a huge corpus of posts, with the hashtags in those posts as labels.
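To make the idea concrete, here is a small, hypothetical sketch of turning raw posts into (text, labels) pairs by treating the hashtags as labels. The sample posts and the regex are my own, not something from the paper:

```python
import re

posts = [
    "had an amazing pasta tonight #food #cooking",
    "new benchmark results for our model #machinelearning",
]

def split_post(post):
    """Return (text without hashtags, list of hashtag labels)."""
    tags = re.findall(r"#(\w+)", post)          # the hashtags become the labels
    text = re.sub(r"#\w+", "", post).strip()    # the remaining words are the input text
    return text, tags

for post in posts:
    print(split_post(post))
# ('had an amazing pasta tonight', ['food', 'cooking'])
# ('new benchmark results for our model', ['machinelearning'])
```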

The TagSpace Model

The TagSpace model architecture. Source: the #TagSpace paper

Given a sentence of length l, we generate a d-dimensional vector for each word in the sentence, resulting in an l x d matrix. We then add a padding of K - 1 vectors, where K is the size of the filter in the convolution layer. We pass the padded matrix to a convolution layer with H filters, which produces a matrix of size l x H. Then come a max pooling layer, an activation layer (tanh) and a linear layer, which results in a vector of size d. We also generate a vector of size d for each tag. We then use these vectors to rank all the tags.
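Below is a minimal PyTorch sketch of this pipeline as I understand it from the description above. The class name, the hyperparameter values (d, K, H, vocabulary and tag counts) and the padding details are my own illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TagSpaceSketch(nn.Module):
    def __init__(self, vocab_size=10000, num_tags=500, d=64, K=5, H=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)   # one d-dimensional vector per word
        self.tag_emb = nn.Embedding(num_tags, d)      # one d-dimensional vector per tag
        # Convolution over word positions with H filters of width K. The padding keeps
        # roughly l output positions; the exact length does not matter because of the
        # max pooling that follows.
        self.conv = nn.Conv1d(d, H, kernel_size=K, padding=K - 1)
        self.linear = nn.Linear(H, d)                 # final linear layer back to d dimensions

    def embed_text(self, word_ids):
        x = self.word_emb(word_ids)        # (batch, l, d)
        x = x.transpose(1, 2)              # (batch, d, l), as expected by Conv1d
        x = self.conv(x)                   # (batch, H, positions)
        x = x.max(dim=2).values            # max pooling over positions -> (batch, H)
        x = torch.tanh(x)                  # tanh activation
        return self.linear(x)              # (batch, d): the document embedding e_conv(w)
```

Calling `embed_text` on a batch of token-id sequences gives one d-dimensional vector per post, which is what gets scored against the tag embeddings next.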

The scoring function used to rank the tags is f(w, t) = e_conv(w) · e_lt(t), where e_conv(w) is the embedding of the document (in this case, a post) and e_lt(t) is the embedding of the candidate tag.
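Continuing the sketch above, ranking every candidate tag for one post is just a set of dot products between the post embedding and each tag embedding (the names come from my sketch, not the paper's code):

```python
model = TagSpaceSketch()
word_ids = torch.randint(0, 10000, (1, 12))   # one post of 12 token ids

doc_vec = model.embed_text(word_ids)          # (1, d): e_conv(w)
tag_vecs = model.tag_emb.weight               # (num_tags, d): e_lt(t) for every tag
scores = doc_vec @ tag_vecs.T                 # (1, num_tags): f(w, t) for every tag
top5 = scores.topk(5, dim=1).indices          # indices of the 5 highest-scoring tags
```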

Training the model

For each training sample, we take its positive tag t⁺ and then sample random tags t⁻ (up to 1,000 times) until we find one that violates the margin, i.e. until

f(w, t⁻) > f(w, t⁺) − m

where m is the margin. A gradient step is then made to optimize the pairwise hinge loss.

The pairwise hinge loss is

max(0, m − f(w, t⁺) + f(w, t⁻))
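Putting the sampling scheme and the hinge loss together, a single training step might look something like the sketch below. It reuses `TagSpaceSketch` from earlier; the 1,000-sample cap follows the text above, while the margin value and everything else are illustrative assumptions on my part:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, word_ids, pos_tag_id, num_tags, m=0.1, max_samples=1000):
    doc = model.embed_text(word_ids)                         # (1, d)
    pos_score = (doc * model.tag_emb(pos_tag_id)).sum()      # f(w, t+)

    # Sample random tags until one violates the margin, up to 1,000 times.
    for _ in range(max_samples):
        neg_tag_id = torch.randint(0, num_tags, (1,))
        neg_score = (doc * model.tag_emb(neg_tag_id)).sum()  # f(w, t-)
        if neg_score > pos_score - m:                        # margin violated
            loss = F.relu(m - pos_score + neg_score)         # pairwise hinge loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                 # the gradient step
            return loss.item()
    return None  # no violating tag found; skip this example
```

Here `pos_tag_id` is a length-1 tensor holding the index of the positive tag, and `optimizer` is any standard PyTorch optimizer over the model's parameters.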

For the results of the paper and more detail on the data used, read the paper here.
