Doc2vec in a simple way

Deepak Mishra
Feb 8, 2017 · 3 min read

Today I am going to demonstrate a simple implementation of NLP and doc2vec. The idea is to train a doc2vec model with gensim v2 and Python 2 on a folder of plain text documents.
I had about 20 text files to start with. A 20-document corpus may seem small, but the perk is that the model trains in around 2 minutes.

Let's start implementing.

#Import all the dependencies
import gensim
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from os import listdir
from os.path import isfile, join

Loading data and file names into memory

#now create a list that contains the names of all the text files in your data folder
docLabels = [f for f in listdir('PATH TO YOUR DOCUMENT FOLDER') if f.endswith('.txt')]

#create a list "data" that stores the content of all text files,
#in the order of their names in docLabels
data = []
for doc in docLabels:
    data.append(open('PATH TO YOUR DOCUMENT FOLDER' + doc).read())
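If your text files are not plain ASCII, reading them with an explicit encoding avoids UnicodeDecodeError later in the pipeline. A minimal sketch, assuming UTF-8 encoded files and the same docLabels list as above:

import io

data = []
for doc in docLabels:
    # io.open reads the file as unicode with a declared encoding in Python 2
    with io.open('PATH TO YOUR DOCUMENT FOLDER' + doc, encoding='utf-8') as f:
        data.append(f.read())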

Implementing NLTK

It's time for some simple text cleaning before training the model: building a tokenizer object and an English stopword set from NLTK. These are basic NLTK operations done before training any model on a text corpus. There are many more, but their use depends on the problem.

tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))

#This function does all cleaning of data using the two objects above
def nlp_clean(data):
    new_data = []
    for d in data:
        new_str = d.lower()
        dlist = tokenizer.tokenize(new_str)
        dlist = list(set(dlist).difference(stopword_set))
        new_data.append(dlist)
    return new_data
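To see what the cleaning step produces, here is a quick check on a made-up sentence (a hypothetical example, not part of the corpus). Note that the set difference removes stopwords but also drops word order and duplicates:

sample = ["The quick brown fox jumps over the lazy dog"]
print nlp_clean(sample)
# e.g. [['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']] -- order is not guaranteed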

Creating a class to return an iterator object.

This class collects all documents from the passed doc_list and the corresponding labels_list, and returns an iterator over those documents.

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield gensim.models.doc2vec.LabeledSentence(doc, [self.labels_list[idx]])
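A side note: in newer gensim versions LabeledSentence is a deprecated alias of TaggedDocument, so the same iterator can be written with TaggedDocument instead. A functionally equivalent sketch:

from gensim.models.doc2vec import TaggedDocument

class TaggedLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            # TaggedDocument takes the token list and a list of tags
            yield TaggedDocument(words=doc, tags=[self.labels_list[idx]])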

Cleaning the data using the nlp_clean method

data = nlp_clean(data)

Now we have “docLabels”, which stores a unique label for each document, and “data”, which stores the corresponding cleaned content of each document.

#iterator returned over all documents
it = LabeledLineSentence(data, docLabels)

Creating a Doc2Vec model requires a few parameters: ‘size’ is the number of features (vector dimensions), ‘alpha’ is the learning rate, and ‘min_count’ discards infrequent words.

model = gensim.models.Doc2Vec(size=300, min_count=0, alpha=0.025, min_alpha=0.025)
model.build_vocab(it)

#training of model
for epoch in range(100):
    print 'iteration ' + str(epoch + 1)
    # total_examples and epochs are required arguments in gensim 2.x
    model.train(it, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

#saving the created model
model.save('doc2vec.model')
print 'model saved'
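As a quick sanity check after training, the number of learned document vectors should match the number of input files. A small sketch using the objects defined above:

# one document vector is learned per input file
print len(model.docvecs), 'document vectors for', len(docLabels), 'files'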

Now that we have trained and saved the model, we can load the saved model to get the vector of any document and to run similarity queries against the corpus.

#loading the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('doc2vec.model')

#start testing
#printing the vector of the document at index 1 in docLabels
docvec = d2v_model.docvecs[1]
print docvec

#printing the vector of a file using its name
docvec = d2v_model.docvecs['1.txt']  # if string tags were used in training
print docvec

#to get the most similar documents with similarity scores, using a document index
similar_doc = d2v_model.docvecs.most_similar(14)
print similar_doc

#to get the most similar documents with similarity scores, using a document name
sims = d2v_model.docvecs.most_similar('1.txt')
print sims

#to get the vector of a document that is not present in the corpus,
#infer_vector is called on the model itself and expects a list of tokens
new_doc = nlp_clean([open('war.txt').read()])[0]
docvec = d2v_model.infer_vector(new_doc)
print docvec
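The similarity queries above work on documents the model has already seen. For an unseen document, one option is to infer its vector first and then ask for its nearest neighbours in the trained document space; most_similar also accepts a list of raw vectors as its positive argument. A hedged sketch, reusing new_doc from above:

# infer a vector for the unseen document and find the closest training documents
inferred = d2v_model.infer_vector(new_doc)
for label, score in d2v_model.docvecs.most_similar(positive=[inferred], topn=5):
    print label, score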

This is it. This is the simplest approach to implementing and testing doc2vec. Hope you liked it. Feel free to comment.

Also refer to my new blog about doc2vec with gensim 3.4 and python3 here https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5

