Today I am going to demonstrate a simple implementation of NLP and doc2vec. The idea is to train a doc2vec model with gensim v2 and Python 2 on a set of text documents.
I had about 20 text files to start with. A 20-document corpus is small, but the upside is that training the model takes only around 2 minutes.
Let's start implementing.
#Import all the dependencies
import gensim
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from os import listdir
from os.path import isfile, join
Loading data and file names into memory
#now create a list that contains the names of all the text files in your data folder
docLabels = [f for f in listdir('PATH TO YOUR DOCUMENT FOLDER') if
             f.endswith('.txt')]

#create a list data that stores the content of all text files in order of
#their names in docLabels
data = []
for doc in docLabels:
    data.append(open('PATH TO YOUR DOCUMENT FOLDER' + doc).read())
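As a quick, self-contained check of the loading step, here is a sketch using a throwaway temporary folder in place of your document folder. One caveat worth noting: listdir returns files in arbitrary order, so sorting keeps labels and contents aligned.

```python
import os
import tempfile
from os import listdir

# build a throwaway folder with two .txt files standing in for the real corpus
tmp = tempfile.mkdtemp()
for name, text in [('0.txt', 'first document'), ('1.txt', 'second document')]:
    with open(os.path.join(tmp, name), 'w') as f:
        f.write(text)

# sorted() because listdir order is not guaranteed across platforms
docLabels = sorted(f for f in listdir(tmp) if f.endswith('.txt'))
data = [open(os.path.join(tmp, doc)).read() for doc in docLabels]
print(docLabels)  # -> ['0.txt', '1.txt']
```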
It's time for some simple text cleaning before training the model. We build a tokenizer object and an English stopword set from NLTK. These are basic NLTK operations done before training any model on a text corpus. There are many more, but which ones to apply depends on the problem.
tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))

#this function does all the cleaning of data using the two objects above
def nlp_clean(data):
    new_data = []
    for d in data:
        new_str = d.lower()
        dlist = tokenizer.tokenize(new_str)
        dlist = list(set(dlist).difference(stopword_set))
        new_data.append(dlist)
    return new_data
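The same cleaning can be sketched with only the standard library, which is handy if NLTK's corpora are not downloaded. The stopword list here is a tiny hypothetical stand-in for NLTK's English set, and the list comprehension preserves token order (the set difference above does not):

```python
import re

# hypothetical mini stopword list standing in for stopwords.words('english')
stopword_set = set(['the', 'is', 'a', 'of'])

def nlp_clean_sketch(docs):
    cleaned = []
    for d in docs:
        # re.findall(r'\w+', ...) mimics RegexpTokenizer(r'\w+')
        tokens = re.findall(r'\w+', d.lower())
        cleaned.append([t for t in tokens if t not in stopword_set])
    return cleaned

print(nlp_clean_sketch(['The art of war is a classic.']))
# -> [['art', 'war', 'classic']]
```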
Creating a class to return an iterator object.
This class takes the documents in doc_list and the corresponding labels in labels_list and returns an iterator over labeled documents.
class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            #TaggedDocument pairs a word list with its tag(s)
            yield gensim.models.doc2vec.TaggedDocument(doc,
                                                       [self.labels_list[idx]])
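Because __iter__ uses a generator, the labeled corpus is produced lazily and can be iterated more than once. A stdlib-only sketch of the same pattern, yielding plain tuples instead of gensim objects:

```python
# stand-in for LabeledLineSentence: pair each document with its label
class LabeledLines(object):
    def __init__(self, doc_list, labels_list):
        self.doc_list = doc_list
        self.labels_list = labels_list

    def __iter__(self):
        # yield one (words, [tag]) pair per document, lazily
        for idx, doc in enumerate(self.doc_list):
            yield (doc, [self.labels_list[idx]])

it = LabeledLines([['hello', 'world'], ['foo', 'bar']], ['0.txt', '1.txt'])
print(list(it))
# -> [(['hello', 'world'], ['0.txt']), (['foo', 'bar'], ['1.txt'])]
```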
Cleaning the data using the nlp_clean method
data = nlp_clean(data)
Now we have "docLabels", which stores a unique label for each document, and "data", which stores the corresponding cleaned content of each document.
#iterator returned over all documents
it = LabeledLineSentence(data, docLabels)
Creating a Doc2Vec model requires a few parameters: 'size' is the number of feature dimensions, 'alpha' is the initial learning rate, and 'min_count' drops words that occur fewer times than the threshold.
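For intuition on what min_count does, here is a small stand-alone sketch of the vocabulary filtering it performs (the toy corpus is hypothetical):

```python
from collections import Counter

# toy corpus: three tokenized documents
corpus = [['cat', 'sat', 'mat'], ['cat', 'ran'], ['dog']]
counts = Counter(w for doc in corpus for w in doc)

# with min_count=2, only words seen at least twice survive
min_count = 2
vocab = sorted(w for w, c in counts.items() if c >= min_count)
print(vocab)  # -> ['cat']
```

With min_count=0, as in the model above, every word is kept, which suits a tiny 20-document corpus.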
model = gensim.models.Doc2Vec(size=300, min_count=0, alpha=0.025, min_alpha=0.025)
model.build_vocab(it)

#training of model
for epoch in range(100):
    print 'iteration ' + str(epoch + 1)
    model.train(it, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

#saving the created model
model.save('doc2vec.model')
print 'model saved'
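One thing to watch with this loop: subtracting 0.002 per epoch drives alpha below zero after 13 of the 100 epochs, so a smaller decrement may be preferable in practice. A quick stand-alone replay of the schedule:

```python
# replay the decay schedule from the training loop above
alpha = 0.025
alphas = []
for epoch in range(100):
    alphas.append(alpha)
    alpha -= 0.002

# alpha is already negative by the 14th epoch
print(round(alphas[12], 3), round(alphas[13], 3))
```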
Now that we have trained and saved the model, we can load the saved model to get the vector of any document and to run similarity queries.
#loading the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('doc2vec.model')

#start testing
#printing the vector of the document at index 1 in docLabels
docvec = d2v_model.docvecs[1]
print docvec

#printing the vector of the file using its name
docvec = d2v_model.docvecs['1.txt'] #if string tag used in training
print docvec

#to get the most similar documents with similarity scores using document index
similar_doc = d2v_model.docvecs.most_similar(14)
print similar_doc

#to get the most similar documents with similarity scores using document name
sims = d2v_model.docvecs.most_similar('1.txt')
print sims

#to get the vector of a document that is not present in the corpus;
#infer_vector expects a tokenized word list, not a file name
docvec = d2v_model.infer_vector(nlp_clean([open('war.txt').read()])[0])
print docvec
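Under the hood, most_similar ranks document vectors by cosine similarity. For intuition, here is a minimal stdlib version of that measure:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product over the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```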
This is it. This is the simplest approach to implement and test doc2vec. Hope you liked it. Feel free to comment.
Also refer to my new blog about doc2vec with gensim 3.4 and python3 here https://firstname.lastname@example.org/doc2vec-simple-implementation-example-df2afbbfbad5