Doc2vec in a simple way

Today I am going to demonstrate a simple NLP pipeline and Doc2Vec implementation. The idea is to train a Doc2Vec model with gensim and Python on a set of text documents.
I had about 20 text files to start with. A 20-document corpus may seem small, but the perk is that training the model takes only around two minutes.

Let's start implementing

#Import all the dependencies
import gensim
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from os import listdir
from os.path import isfile, join
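
If you have never used NLTK's stopword list before, it has to be downloaded once; a one-time setup sketch, assuming the standard NLTK downloader:

import nltk
#one-time download of the stopword corpus (skip if already installed)
nltk.download('stopwords')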

Loading data and file names into memory

#now create a list that contains the names of all the text files in your data folder
docLabels = []
docLabels = [f for f in listdir('PATH TO YOUR DOCUMENT FOLDER') if f.endswith('.txt')]
#create a list data that stores the content of all text files in order of their names in docLabels
data = []
for doc in docLabels:
    data.append(open('PATH TO YOUR DOCUMENT FOLDER' + doc).read())

Cleaning text with NLTK

It's time for some simple text cleaning before training the model. We build a tokenizer object and a set of English stopwords from NLTK. These are basic NLTK operations done before training any model on a text corpus. There are many more, but which ones to use depends on the problem.

tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))
#This function does all the cleaning of data using the two objects above
def nlp_clean(data):
    new_data = []
    for d in data:
        new_str = d.lower()
        dlist = tokenizer.tokenize(new_str)
        #drop stopwords while preserving word order (a set difference would scramble it)
        dlist = [w for w in dlist if w not in stopword_set]
        new_data.append(dlist)
    return new_data

Creating a class to return an iterator object.

This class collects all documents from the passed doc_list and the corresponding labels_list, and returns an iterator over the tagged documents.

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            #TaggedDocument is the current name for the deprecated LabeledSentence
            yield gensim.models.doc2vec.TaggedDocument(doc, [self.labels_list[idx]])

Cleaning the data using the nlp_clean function

data = nlp_clean(data)

Now we have docLabels, which stores a unique label for each document, and data, which stores the corresponding content of each document.

#iterator returned over all documents
it = LabeledLineSentence(data, docLabels)

Creating the Doc2Vec model requires a few parameters: 'size' is the number of features, 'alpha' is the initial learning rate, and 'min_count' drops infrequent words.

model = gensim.models.Doc2Vec(size=300, min_count=0, alpha=0.025, min_alpha=0.025)
model.build_vocab(it)
#training of model
for epoch in range(100):
    print('iteration ' + str(epoch + 1))
    model.train(it, total_examples=model.corpus_count, epochs=1)
    #decay the learning rate a little each epoch (a step of 0.002 would drive alpha negative over 100 epochs)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha
#saving the created model (the load step below needs this)
model.save('doc2vec.model')
print('model saved')
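
As a side note, recent gensim releases (4.x) renamed some of these parameters and made the epoch arguments mandatory, so the block above targets older gensim; a minimal equivalent sketch, assuming gensim >= 4.0, where train() handles the epoch loop and learning-rate decay internally:

#equivalent training under gensim 4.x (assumption: gensim >= 4.0 installed)
#'size' was renamed to 'vector_size'; alpha decays from alpha to min_alpha automatically
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.0001, epochs=100)
model.build_vocab(it)
model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
model.save('doc2vec.model')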

Now that we have trained and saved the model, we can load it to look up the vector of any training document and run similarity queries against the corpus.

#loading the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('doc2vec.model')
#start testing
#printing the vector of the document at index 1 in docLabels
docvec = d2v_model.docvecs[1]
print(docvec)
#printing the vector of the file using its name
docvec = d2v_model.docvecs['1.txt'] #if string tags were used in training
print(docvec)
#to get the most similar documents with similarity scores using a document index
similar_doc = d2v_model.docvecs.most_similar(14)
print(similar_doc)
#to get the most similar documents with similarity scores using a document name
sims = d2v_model.docvecs.most_similar('1.txt')
print(sims)
#to get the vector of a document that is not present in the corpus
#infer_vector lives on the model and expects a list of tokens, not a file name
new_tokens = nlp_clean([open('war.txt').read()])[0]
docvec = d2v_model.infer_vector(new_tokens)
print(docvec)
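
The inferred vector can also be fed back into a similarity query to see which training documents the unseen document is closest to; a minimal sketch, reusing the docvec inferred above (most_similar accepts a raw vector through its positive argument):

#find the training documents closest to the inferred vector
print(d2v_model.docvecs.most_similar(positive=[docvec]))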

That's it. This is the simplest way to implement and test Doc2Vec. Hope you liked it. Feel free to comment.