Combining Word Embeddings to form Document Embeddings

Yatin Vij · Published in Analytics Vidhya · Sep 24, 2019

This article follows up on the introductory article about Word Embeddings which can be found here.

Now that we have explored the realm of Word Embeddings and are quite comfortable with it, let’s look at how we can put word embeddings to use for generating features that can be passed to traditional machine learning algorithms like random forests, XGBoost, etc.

Why should we investigate such an approach to represent documents?

The most widely used feature extraction technique for text is TF-IDF. It has been proven to work well with text data and traditional algorithms, and it is also quite explainable. However, TF-IDF generates a feature matrix of size N x V, where N is the number of observations and V is the vocabulary size, which can be very large. Combining word embeddings instead reduces the feature size to N x D, where D is the chosen embedding size.
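
To make the size difference concrete, here is a minimal sketch (not taken from the original notebook) comparing the TF-IDF feature matrix shape with the fixed shape of embedding-based features; the corpus and the embedding size of 20 are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (stand-in for the real review data).
docs = [
    "the product was great and arrived quickly",
    "terrible quality, would not buy again",
    "great value for the price",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(tfidf_matrix.shape)            # (N, V): grows with the vocabulary size
embedding_size = 20                  # D: fixed, chosen when training the embeddings
print((len(docs), embedding_size))   # (N, D): size of the combined-embedding features
```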

This article is an extension of the “Supercharging Word Vectors with TF-IDF” article, which combined FastText embeddings with TF-IDF. Here we explore different word embedding generation algorithms like Word2Vec (Skip-Gram and CBOW), FastText, etc., and different ways of combining these word embeddings into document embeddings.

Different ways of combining word embeddings explored are:

  • TF-IDF Weighted Word Embeddings: Each word’s embedding is multiplied by that word’s TF-IDF score for the sentence; these weighted embeddings are summed over all the words in the sentence, and the sum is divided by the sum of the TF-IDF scores of those words. The resulting vector is used as the sentence embedding (a sketch of this weighting appears after this list).
  • Doc2Vec: The doc2vec algorithm embeds documents in the vector space using the word2vec model while adding another feature, a Paragraph ID, which is unique to each document. Every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. This model is called the Distributed Memory version of Paragraph Vector (PV-DM). PV-DM is superior but slow to train. The other doc2vec method is the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), which is faster and consumes less memory, since there is no need to save the word vectors (see the gensim sketch after this list).
[Figure: PV-DM model and PV-DBOW model]
  • Averaging the Word Embeddings: Embeddings were also learnt as part of the model itself, and these embeddings were used to generate document embeddings by averaging the embeddings of all the words in a document (see the averaging sketch after this list).
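
A minimal sketch of the TF-IDF weighted averaging described above. It assumes a fitted scikit-learn TfidfVectorizer and a word-vector lookup such as a trained gensim KeyedVectors; the names word_vectors, vectorizer, and dim are illustrative, not the exact code from the notebook.

```python
import numpy as np


def tfidf_weighted_embedding(sentence, word_vectors, vectorizer, dim=20):
    """Weight each word's embedding by its TF-IDF score for the sentence,
    then divide the accumulated vector by the accumulated TF-IDF scores."""
    tfidf_row = vectorizer.transform([sentence])   # 1 x V sparse row of TF-IDF scores
    vocab = vectorizer.vocabulary_                 # word -> column index
    weighted_sum = np.zeros(dim)
    weight_total = 0.0
    for word in sentence.lower().split():
        if word in vocab and word in word_vectors:
            score = tfidf_row[0, vocab[word]]
            weighted_sum += score * word_vectors[word]
            weight_total += score
    return weighted_sum / weight_total if weight_total > 0 else weighted_sum
```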
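
For Doc2Vec, a hedged sketch using gensim’s Doc2Vec class is shown below; the toy reviews and the hyperparameters are illustrative (dm=1 selects PV-DM, dm=0 selects PV-DBOW), and the notebook’s actual training setup may differ.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-in for the review corpus; each document gets a unique tag (the "Paragraph ID").
reviews = [
    "great product works well",
    "poor quality broke after a week",
]
corpus = [
    TaggedDocument(words=review.lower().split(), tags=[i])
    for i, review in enumerate(reviews)
]

# dm=1 -> PV-DM (Distributed Memory); dm=0 -> PV-DBOW (Distributed Bag of Words)
model = Doc2Vec(corpus, vector_size=20, window=5, dm=1, min_count=1, epochs=40)

doc_embedding = model.dv[0]                                # learned vector for the first review
new_embedding = model.infer_vector("great value".split())  # vector inferred for unseen text
```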
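
The simplest combination, plain averaging, might look like the sketch below; again, word_vectors is assumed to be any word-to-vector lookup (a dict or a gensim KeyedVectors), not the notebook’s exact object.

```python
import numpy as np


def average_embedding(sentence, word_vectors, dim=20):
    """Average the embeddings of all in-vocabulary words in the sentence."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
```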

The data used was Amazon reviews labelled with the sentiment of the review as positive or negative. The embedding size was chosen as 20, the context size as 5, and the first thousand observations were used to conduct the analysis.
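
As a rough illustration of how those hyperparameters might translate into training code, here is a hedged gensim Word2Vec sketch; the toy corpus is a stand-in for the first thousand tokenized reviews, and the exact training code in the notebook may differ.

```python
from gensim.models import Word2Vec

# Toy stand-in for the first 1000 tokenized Amazon reviews.
tokenized_reviews = [
    "great product works well".split(),
    "poor quality broke after a week".split(),
]

w2v = Word2Vec(
    sentences=tokenized_reviews,
    vector_size=20,  # embedding size used in the experiment
    window=5,        # context size used in the experiment
    sg=1,            # sg=1 -> Skip-Gram, sg=0 -> CBOW
    min_count=1,
)
word_vectors = w2v.wv  # KeyedVectors lookup usable by the combining functions above
```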

The results of the experiment were the following:

Conclusion
The embeddings learnt jointly as part of training the network on the sentiment labels appear to be the most accurate when the word embeddings are averaged. Adding the TF-IDF information thus does not add much to the model accuracy.

The pretrained ELMo and BERT representations come close to the task-specific embeddings, demonstrating their applicability to different tasks out of the box.

The results table shows that averaging word embeddings to form document embeddings is superior to the other alternatives tried in this experiment.

I hope you enjoyed the article. Please leave a comment if you found it useful.

The notebook for this experiment can be found here.

References
“Supercharging Word Vectors”: https://towardsdatascience.com/supercharging-word-vectors-be80ee5513d
“A Gentle Introduction to Doc2Vec”: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e
