A simple explanation of document embeddings generated using Doc2Vec

Amar Budhiraja
3 min read · May 14, 2018


In recent years, word embeddings have gained a lot of popularity, and while there are plenty of tutorials and posts on Word2Vec and GloVe, there are barely any resources to help someone understand document-level embedding approaches. In this post, I will attempt to explain the two document embedding models from Paragraph Vector (more popularly known as Doc2Vec), Distributed Memory (PV-DM) and Distributed Bag of Words (DBOW), and point to some other document embedding approaches in the last part of this post.

Paragraph Vector (Doc2Vec) is an extension of Word2Vec: where Word2Vec learns to project words into a latent d-dimensional space, Doc2Vec learns to project an entire document into a latent d-dimensional space. In this post, I will first discuss PV-DM, followed by DBOW.

PV-DM

The basic idea behind PV-DM is inspired by Word2Vec. In the CBOW model of Word2Vec, the model learns to predict a center word based on its context. For example, given the sentence “The cat sat on the sofa”, the CBOW model would learn to predict the word “sat” given the context words (the, cat, on, the, sofa). Similarly, the central idea in PV-DM is to randomly sample a window of consecutive words from a paragraph and predict the center word of that window, taking as input both the context words and a paragraph id.
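To make this concrete, here is a minimal sketch of training a PV-DM model with the gensim library (the original paper does not use gensim; the toy corpus, tags, and hyperparameters below are purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets a tag, which plays the role of the paragraph id.
corpus = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "sofa"], tags=["doc_0"]),
    TaggedDocument(words=["the", "dog", "slept", "on", "the", "rug"], tags=["doc_1"]),
]

# dm=1 selects the Distributed Memory (PV-DM) architecture: the paragraph vector
# and the context word vectors are combined to predict the center word.
model = Doc2Vec(
    corpus,
    dm=1,            # PV-DM
    vector_size=50,  # dimensionality d of the latent space
    window=2,        # context words on each side of the center word
    min_count=1,     # keep every word in this tiny corpus
    epochs=40,
)

# The learned paragraph vector for a "seen" document (gensim >= 4.0 exposes these as model.dv):
print(model.dv["doc_0"])
```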

Let’s look at the model diagram for some more clarity. In the model, we see the Paragraph Matrix, Average/Concatenate, and Classifier sections. The paragraph matrix is the matrix where each column represents the vector of a paragraph. Average/Concatenate denotes whether the word vectors and the paragraph vector are averaged or concatenated. Lastly, the Classifier takes the resulting hidden-layer vector (the one that was concatenated or averaged) as input and predicts the center word.
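As a rough illustration of what that diagram computes, here is a minimal numpy sketch of one PV-DM forward pass using averaging; the matrix names follow the diagram, but the dimensions and variable names are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, num_paragraphs = 50, 10_000, 1_000

W = rng.normal(size=(vocab_size, d))      # word embedding matrix (one row per word)
D = rng.normal(size=(num_paragraphs, d))  # paragraph matrix (one row per paragraph)
U = rng.normal(size=(d, vocab_size))      # classifier (softmax) weights
b = np.zeros(vocab_size)                  # classifier bias

def pvdm_forward(paragraph_id, context_word_ids):
    # Average the paragraph vector with the context word vectors.
    # (Concatenation would stack them instead, so the classifier's input
    # dimension would grow with the number of context words.)
    h = np.mean([D[paragraph_id]] + [W[i] for i in context_word_ids], axis=0)
    logits = h @ U + b                    # a score for every word in the vocabulary
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()            # softmax: P(center word | context, paragraph id)

# Example: paragraph 0 with context word ids 1, 2, 4, 5
print(pvdm_forward(0, [1, 2, 4, 5]).shape)  # (10000,)
```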

Matrix D holds the embeddings of the “seen” paragraphs (i.e., arbitrary-length documents), in the same way that a Word2Vec model learns embeddings for words. For an unseen paragraph, gradient descent is run again (5 or so iterations) on a new paragraph vector, with the word vectors and classifier weights held fixed, to infer its document vector.
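In gensim, that inference step is exposed as infer_vector. A minimal sketch, assuming the model object trained in the PV-DM example above (the new document and epoch count are illustrative):

```python
# Infer a vector for a document that was not seen during training.
# gensim re-runs gradient descent on a fresh paragraph vector while the
# trained word vectors and classifier weights stay fixed.
new_doc = ["the", "cat", "slept", "on", "the", "rug"]
vector = model.infer_vector(new_doc, epochs=5)  # a handful of iterations, as in the paper
print(vector.shape)  # (50,)
```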

DBOW

The Distributed Bag of Words (DBOW) model is slightly different from the PV-DM model. The DBOW model “ignore[s] the context words in the input, but force[s] the model to predict words randomly sampled from the paragraph in the output.” For the above example, let’s say that the model learns by predicting 2 sampled words. So, in order to learn the document vector, two words are sampled from {the, cat, sat, on, the, sofa}, as shown in the diagram.
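In gensim, the same Doc2Vec class covers DBOW; setting dm=0 switches architectures. A minimal sketch, reusing the toy corpus from the PV-DM example above (the hyperparameters are again illustrative):

```python
# dm=0 selects the Distributed Bag of Words (DBOW) architecture: the paragraph
# vector alone is used to predict words sampled from that paragraph.
dbow = Doc2Vec(
    corpus,          # the TaggedDocument corpus from the PV-DM sketch
    dm=0,            # DBOW
    vector_size=50,
    min_count=1,
    epochs=40,
)
print(dbow.dv["doc_0"])
```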

DBOW model
