Doc2Vec tutorial using Gensim

The official Doc2Vec tutorial (http://rare-technologies.com/doc2vec-tutorial/) is great, but I ran into a few problems when trying it out, one of them being some confusion about how to handle full documents instead of just sentences.

At the end of this tutorial you will have a fixed-size vector for each full document. You can, for instance, feed this document vector into a machine learning classifier (an SVM or similar), or you can calculate the cosine distance between two different documents to determine how semantically similar they are. I will do a cosine similarity measure between two documents as an example at the end of the post.

Additional discussion can be found at https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/12287/using-doc2vec-from-gensim

Starting off

I will presume you have a folder with a range of documents you would like to compare and train on. There are some factors that will determine how accurate your model and results are in the end:

  • Size of the corpus, i.e. the number of documents. More documents → better.
  • Similarity of documents. More similar → better.

Doc2Vec uses two things when training your model: labels and the actual data. The labels can be anything, but to keep it simple each document's file name will be its label.

Load the labels and data

The first thing to do is put all your document file paths in a list; the file names will also serve as our labels:

from os import listdir
from os.path import isfile, join

docLabels = [f for f in listdir("myDirPath") if f.endswith('.txt')]

docLabels now contains only the file names of your documents; next you also need to load the contents of the files to use in your training.

data = []
for doc in docLabels:
    data.append(open("myDirPath/" + doc, 'r').read())

If your corpus is too large this might fail, since we are loading everything into memory.

Preparing the data for Gensim Doc2vec

Gensim Doc2Vec needs its training data in a LabeledSentence iterator object. See the original tutorial for more information about this. We will use a modified version of the LabeledLineSentence class from the original tutorial.

This is the original:

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])

The main difference between this tutorial and the original is that the original trains the model on a set of sentences, while we use full documents. The changes made to this class are the most important ones reflecting that:

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield LabeledSentence(words=doc.split(), labels=[self.labels_list[idx]])

First off, we change the constructor parameters to be able to supply both the raw data and the list of labels:

def __init__(self, doc_list, labels_list):

Next we change the iterator to loop through all the docs and use each document's file name as its label.

for idx, doc in enumerate(self.doc_list):
    yield LabeledSentence(words=doc.split(), labels=[self.labels_list[idx]])

In gensim the model is always trained word by word, regardless of whether you use sentences or full documents as your iterator object when you build the model. That is why we split each document into an array of words using

doc.split()

We could probably get slightly better results by using the NLTK tokenizer, but it probably won't matter much in the end.
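For illustration, here is what slightly smarter tokenization might look like. This regex-based sketch is a standard-library stand-in for something like NLTK's word_tokenize; the function name is my own:

```python
import re

def tokenize(text):
    # Lowercase and split on runs of non-word characters, so
    # punctuation does not end up glued to the tokens as it
    # would with a plain str.split().
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

tokenize("Hello, world!")  # -> ['hello', 'world']
```

Compare this with "Hello, world!".split(), which keeps the comma and exclamation mark attached, so the same word would get different vectors depending on punctuation.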

Training the model

First create the iterator object using the modified LabeledLineSentence class described above, passing it the data and your labels:

it = LabeledLineSentence(data, docLabels)

Next we will use it to start training our model:

model = gensim.models.Doc2Vec(size=300, window=10, min_count=5, workers=11, alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(it)
for epoch in range(10):
    model.train(it)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

Why we change the learning rate during training is best explained in the original tutorial: http://rare-technologies.com/doc2vec-tutorial/

Next, it is always nice to save your model, since training can take a while. I trained it on 100,000 documents, and it took about 1–2 hours on a MacBook Pro i5.

model.save("doc2vec.model")

Test your model

Try out your model to see if it makes sense; there are some built-in functions in Gensim which you can use for a quick check.

print model.most_similar("documentFileNameInYourDataFolder")

That is it! Now you can use the trained vectors to compute the distance between documents, or use them in an SVM for classification or similar. To get the raw fixed-size vector of a document, use

model["documentFileNameInYourDataFolder"]
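As promised, here is the cosine similarity measure between two document vectors. The vectors themselves would come from the model as shown above; the helper below is a plain NumPy sketch of the formula, and the document labels in the usage comment are hypothetical:

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Dot product divided by the product of the norms:
    # 1.0 = same direction (semantically similar),
    # 0.0 = orthogonal (unrelated).
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# With a trained model this would look something like:
# sim = cosine_similarity(model["docA.txt"], model["docB.txt"])
```

The closer the result is to 1.0, the more semantically similar the two documents are according to the model.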

I work at Meltwater, where we do all kinds of things in NLP, deep learning, and general machine learning to gain insights that help our customers. Check it out at www.meltwater.com