Learn Document Embedding using Facebook StarSpace

Louis Kit Lung Law
Published in Deep Learning HK
Jun 23, 2019



Recently I came across a tool called StarSpace, created by Facebook. StarSpace caught my eye because it learns to represent different objects in the same embedding space. In other words, it learns embeddings and uses them to complete many different tasks; below are some examples:

  • Word / Tag embeddings (map from a short text to relevant hashtags)
  • Document recommendation (embed and recommend documents for users based on their historical likes/click data)
  • Sentence Embeddings (given the embedding of one sentence, one can find semantically similar/relevant sentences)

StarSpace can achieve many more tasks, as long as you prepare the data in the right format. In this blog post, I am going to use StarSpace to learn document embeddings and use them to predict the category of each document; all the code is available here.

In my previous post, I showed how to compute document embeddings using an autoencoder with triplet loss, and the learned embeddings were able to preserve categorical similarity between documents. Now I am going to use StarSpace to do the same task one more time.

Tasks

  1. Using news content to predict its category
  2. Examine the learned embeddings

Prepare training data

Since StarSpace does not do tokenisation, I have to tokenise the news articles first.

Note that each line corresponds to one news article and its label, all separated by spaces.
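For reference, here is a minimal sketch of what this formatting step might look like in Python. The file names and the use of NLTK's word_tokenize are my own assumptions, not taken from the original code; the __label__ prefix is StarSpace's default label marker (configurable with the -label option).

    # Sketch: convert (text, category) pairs into StarSpace's input format.
    # "uci_train.csv" with "text" and "category" columns is illustrative,
    # not the actual file from the original post.
    import csv
    from nltk.tokenize import word_tokenize  # requires nltk's "punkt" data

    with open("uci_train.csv") as f_in, \
         open("uci_train_starspace_formatted.txt", "w") as f_out:
        for row in csv.DictReader(f_in):
            tokens = word_tokenize(row["text"].lower())
            # One article per line: tokens then the label, space-separated.
            # "__label__" is StarSpace's default label prefix.
            f_out.write(" ".join(tokens) + " __label__" + row["category"] + "\n")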

Training

The following command is used to train a model with StarSpace; the embed_doc utility is then used to get the embeddings back from the trained model.

> starspace train -trainFile uci_train_starspace_formatted.txt -model uci_starspace -trainMode 0 -validationFile uci_validate_starspace_formatted.txt -dim 50 -epoch 50 -negSearchLimit 1 -thread 20 -lr 0.001

> embed_doc uci_starspace uci_train_starspace_formatted.txt > uci_train_starspace_embed.txt
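To use the embeddings downstream, the embed_doc output has to be parsed. Below is a rough sketch; it assumes each document's embedding appears on the line after the document text as whitespace-separated values, which you should verify against your StarSpace build, since the output format is not formally documented.

    # Sketch: parse embed_doc output into a numpy matrix.
    # Assumes alternating lines of document text and embedding values;
    # verify against your build's actual output before relying on this.
    import numpy as np

    docs, vectors = [], []
    with open("uci_train_starspace_embed.txt") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    for doc_line, vec_line in zip(lines[0::2], lines[1::2]):
        docs.append(doc_line)
        vectors.append([float(x) for x in vec_line.split()])

    embeddings = np.array(vectors)  # shape: (num_docs, 50), matching -dim 50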

Results

As in the last blog post, I will evaluate the performance of the embeddings by checking the AUROC of the cosine similarity between similar articles (articles of the same category or the same story).
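Concretely, the evaluation can be sketched as follows: treat the pairwise cosine similarity as a score for the binary question "do these two articles share a category?" and compute the AUROC of that score. The variable names here are placeholders of mine, not from the original code.

    # Sketch: AUROC of cosine similarity as a same-category score.
    # `embeddings` is an (n, d) array, `categories` a length-n label list;
    # both names are illustrative.
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics.pairwise import cosine_similarity

    def category_auroc(embeddings, categories):
        cats = np.asarray(categories)
        sims = cosine_similarity(embeddings)      # (n, n) pairwise similarities
        labels = cats[:, None] == cats[None, :]   # True where categories match
        iu = np.triu_indices(len(cats), k=1)      # each pair once, no diagonal
        return roc_auc_score(labels[iu].astype(int), sims[iu])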

The graphs above compare the embeddings trained using StarSpace with simple TF-IDF; we see that the StarSpace embeddings are better at preserving categorical similarity.

AUROC results using Testing Data
+---------------------+--------------+-----------+
|                     | Category Sim | Story Sim |
+---------------------+--------------+-----------+
| Triplet Autoencoder | 0.78         | 0.92      |
| StarSpace           | 0.93         | 0.90      |
+---------------------+--------------+-----------+

The table above compares the StarSpace embeddings with the triplet embeddings (trained in the previous blog post); we see that StarSpace does a significantly better job of preserving categorical similarity without much loss in story similarity.

Conclusion

StarSpace seems to be a powerful tool for training document embeddings: it is very fast (training took just a few minutes on my laptop) and easy to use.
