Doc2Vec — Computing Similarity between Documents

Abdul Hafeez Fahad
Published in Red Buffer
May 18, 2021

This article provides an introduction to the Doc2Vec model and shows how it can be used to compute similarities between documents.

Many challenging tasks in Natural Language Processing become tractable once we convert text into low-dimensional vectors, and when it comes to converting a whole text document into a numerical representation, that is where the doc2vec model comes into play. Doc2vec can be applied to many tasks, but today we will focus on computing the similarity between documents, which lets you identify plagiarized documents, get recommendations for similar articles, and much more.

Introduction

Doc2vec is an unsupervised machine learning algorithm that converts a document into a vector. The concept was presented by Mikolov and Le in their 2014 paper, "Distributed Representations of Sentences and Documents". Now that you have had a gentle introduction to doc2vec, I would like to turn your attention to word2vec, because doc2vec is heavily dependent on word2vec, and describing doc2vec without word2vec would miss the point.

Word2Vec

As the name suggests, the word2vec model produces vectors for words. It is sometimes easy to build models by simply one-hot encoding the words, but with such methods the words in a sentence do not retain their meaning. For example, if we encode the word king as id_2, man as id_4, and France as id_6, then all these words have exactly the same relationship with each other. But what if we want to preserve their relationships, i.e. king should be more closely related to man than to France? This is where the word2vec model is really helpful, because it can maintain the relationships between words.

Relationships between different words in vector space
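To make the idea concrete, here is a minimal sketch using Gensim's downloader API with a small pretrained embedding (the choice of 'glove-wiki-gigaword-50' is just one readily available option); it shows that relationships such as king − man + woman ≈ queen survive in vector space.

import gensim.downloader as api

# load a small set of pretrained 50-dimensional word vectors (downloads on first use)
wv = api.load("glove-wiki-gigaword-50")

# vector arithmetic: 'king' - 'man' + 'woman' should land near 'queen'
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.85)]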

Word2vec representations are learned using one of two algorithms:

  1. Skip-Gram
  2. Continuous Bag-of-Words

Visual Representation of the main difference between CBOW and Skip-Gram

Skip-Gram

The skip-gram model tries to predict the surrounding context from a given target word. As shown in the following figure, when the target word “sat” is fed into the model, it tries to predict its surrounding context, i.e. the remaining words of “The cat sat on the mat”.

Visual Representation of Skip-Gram

Continuous Bag-of-Words (CBOW)

Keep in mind that CBOW does exactly the opposite of the skip-gram process defined above: the CBOW model tries to predict a target word from its surrounding context. As shown in the following figure, when the context “the cat sat” is fed into the model, it tries to predict the word that completes it, which in our example is “on”.

Visual Representation of Continuous Bag-of-words (CBOW)
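In Gensim, switching between the two architectures is a single parameter. Here is a minimal sketch on a toy corpus (the two sentences and all hyperparameters are purely illustrative):

from gensim.models import Word2Vec

# a toy corpus: one tokenized sentence per list entry
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects Skip-Gram; sg=0 (the default) selects CBOW
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# each model maps every vocabulary word to a dense 50-dimensional vector
print(skipgram_model.wv["cat"].shape)  # (50,)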

Doc2Vec

After having a brief introduction about word2vec, it will now be easier to understand how doc2vec works.

As I mentioned above, the goal of doc2vec is to compute a numeric representation of a document. Doc2vec works very much like word2vec, but unlike individual words, documents do not share a common logical structure, so an additional vector, named the Paragraph ID, is added to the model to represent each document.

Distributed Memory version of Paragraph Vector (PV-DM)

Distributed Memory version of Paragraph Vector (PV-DM)

Looking at the figure above, you may be thinking that it is almost identical to the visual representation of the CBOW model. You are right, but there is an additional feature vector through which the uniqueness of the document can be identified. While training such a model, the vectors labeled ‘W’ are the word vectors; each holds a numeric representation and represents the concept of a word. Similarly, the vector labeled ‘D’ is the document vector; it holds a numeric representation and represents the concept of the document.

Distributed Bag of Words version of Paragraph Vector (PV-DBOW)

Hang on! If there is an extension of the CBOW model in doc2vec, is there also an extension of the Skip-Gram model? Yes, there is a similar algorithm to skip-gram, named the Distributed Bag of Words version of Paragraph Vector (PV-DBOW).

Distributed Bag of Words version of Paragraph Vector (PV-DBOW)

Important Point

When using word2vec, CBOW is much faster than Skip-Gram. In doc2vec, however, PV-DM (the extension of CBOW) trains more slowly than PV-DBOW (the extension of Skip-Gram); PV-DBOW is faster and consumes less memory because it does not need to store word vectors.
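In Gensim, both variants are exposed through the dm parameter of the Doc2Vec class. A minimal sketch (the hyperparameters mirror the ones we will use below):

from gensim.models.doc2vec import Doc2Vec

# dm=1 selects PV-DM (the default); dm=0 selects PV-DBOW
pv_dm = Doc2Vec(vector_size=30, min_count=2, epochs=80, dm=1)
pv_dbow = Doc2Vec(vector_size=30, min_count=2, epochs=80, dm=0)

# with PV-DBOW, word vectors are only trained if dbow_words=1 is set as well
pv_dbow_words = Doc2Vec(vector_size=30, min_count=2, epochs=80, dm=0, dbow_words=1)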

Now that we understand how word2vec and doc2vec work, let's start implementing doc2vec.

Heading Over to the Code and Implementing it Ourselves

Computing the similarity between documents is a very challenging task in Natural Language Processing. Two documents are similar if their semantic context is similar, and manually identifying the similarity between a large number of documents can be really difficult. To make this task easy, we will have our machine figure out the similarity between documents using doc2vec.

The data

I will use only three short documents as an example to demonstrate doc2vec, so that we can easily verify the similarities between the texts ourselves; in practice, doc2vec models can be trained on huge datasets.

Installing Gensim

For the implementation of doc2vec, we will use a popular open-source natural language processing library known as Gensim ("Generate Similar"), which is used for unsupervised topic modeling. Since the code below also uses NLTK's tokenizer, install both packages with the following command.

pip install gensim nltk

Importing all the dependencies

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

import nltk
nltk.download('punkt')  # tokenizer data needed by word_tokenize (newer NLTK versions may ask for 'punkt_tab')

Preparation of data for training our doc2vec model

data = ["The process of searching for a job can be very stressful, but it doesn’t have to be. Start with a "
        "well-written resume that has appropriate keywords for your occupation. Next, conduct a targeted job search "
        "for positions that meet your needs.",
        "Gardening in mixed beds is a great way to get the most productivity from a small space. Some investment "
        "is required, to purchase materials for the beds themselves, as well as soil and compost. The "
        "investment will likely pay-off in terms of increased productivity.",
        "Looking for a job can be very stressful, but it doesn’t have to be. Begin by writing a good resume with "
        "appropriate keywords for your occupation. Second, target your job search for positions that match your "
        "needs."]

As shown in the snippet above, we will use three short documents as our training data. Before moving on to training, we will tag our data.

Tagging the data

# each document becomes a TaggedDocument: a list of lowercase tokens plus a unique string tag
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

The output after tagging our data is as follows.

print(tagged_data)

Output:

[TaggedDocument(words=['the', 'process', 'of', 'searching', 'for', 'a', 'job', 'can', 'be', 'very', 'stressful', ',', 'but', 'it', 'doesn', '’', 't', 'have', 'to', 'be', '.', 'start', 'with', 'a', 'well-written', 'resume', 'that', 'has', 'appropriate', 'keywords', 'for', 'your', 'occupation', '.', 'next', ',', 'conduct', 'a', 'targeted', 'job', 'search', 'for', 'positions', 'that', 'meet', 'your', 'needs', '.'], tags=['0']), TaggedDocument(words=['gardening', 'in', 'mixed', 'beds', 'is', 'a', 'great', 'way', 'to', 'get', 'the', 'most', 'productivity', 'from', 'a', 'small', 'space', '.', 'some', 'investment', 'is', 'required', ',', 'to', 'purchase', 'materials', 'for', 'the', 'beds', 'themselves', ',', 'as', 'well', 'as', 'soil', 'and', 'compost', '.', 'the', 'investment', 'will', 'likely', 'pay-off', 'in', 'terms', 'of', 'increased', 'productivity', '.'], tags=['1']), TaggedDocument(words=['looking', 'for', 'a', 'job', 'can', 'be', 'very', 'stressful', ',', 'but', 'it', 'doesn', '’', 't', 'have', 'to', 'be', '.', 'begin', 'by', 'writing', 'a', 'good', 'resume', 'with', 'appropriate', 'keywords', 'for', 'your', 'occupation', '.', 'second', ',', 'target', 'your', 'job', 'search', 'for', 'positions', 'that', 'match', 'your', 'needs', '.'], tags=['2'])]

Now that we have tagged our data, let's start training our model.

Initializing doc2vec

# vector_size: dimensionality of the document vectors
# min_count: ignore all words that appear fewer than 2 times
# epochs: number of training passes over the corpus
model = gensim.models.doc2vec.Doc2Vec(vector_size=30, min_count=2, epochs=80)

Building the vocabulary of tagged data

model.build_vocab(tagged_data)
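If you are curious which words survived the min_count=2 threshold on such a tiny corpus, you can peek at the vocabulary (the attribute names below are the gensim 4.x spellings):

# inspect the learned vocabulary
print(len(model.wv.index_to_key))  # number of distinct words kept
print(model.wv.index_to_key)       # the words themselves, most frequent first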

Training doc2vec

# train on the tagged corpus; epochs matches the value set at initialization
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

After doc2vec has been trained, save the model as follows.

model.save("d2v.model")

Now that the model is saved, it is ready to use. Load the model, and let's compute the similarity between the sentences.

model = Doc2Vec.load("d2v.model")

Finding the most similar sentence using tags

# in gensim 4.x the document vectors live under model.dv (model.docvecs in gensim < 4.0)
similar_doc = model.dv.most_similar('0')
print(similar_doc[0])
Output:
('2', 0.9393066167831421)

The most similar sentences computed using doc2vec are the ones with tags 0 and 2. Let's look at the sentences themselves.

# The most similar sentences computed by doc2vec

Sentence with tag '0': The process of searching for a job can be very stressful, but it doesn’t have to be. Start with a well-written resume that has appropriate keywords for your occupation. Next, conduct a targeted job search for positions that meet your needs.

Sentence with tag '2': Looking for a job can be very stressful, but it doesn’t have to be. Begin by writing a good resume with appropriate keywords for your occupation. Second, target your job search for positions that match your needs.

# The sentence with less similarity

Sentence with tag '1': Gardening in mixed beds is a great way to get the most productivity from a small space. Some investment is required, to purchase materials for the beds themselves, as well as soil and compost. The investment will likely pay-off in terms of increased productivity.
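We can also ask for the pairwise similarity between any two tags directly, which confirms the picture above; a minimal sketch:

# cosine similarity between individual document vectors
print(model.dv.similarity('0', '2'))  # high: both documents are about job searching
print(model.dv.similarity('0', '1'))  # low: job searching vs. gardening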

Inferring a vector

Inferring a vector means computing the vector of a document that was not part of our training data.

test_data = word_tokenize(("When your focus is to improve employee performance, it’s essential to encourage ongoing "
                           "dialogue between managers and their direct reports. Some companies encourage supervisors "
                           "to hold one-on-one meetings with employees as a way to facilitate "
                           "two-way communication.").lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)
Output:

V1_infer [-0.06755273 0.0633966 0.06744069 0.01091933 -0.01968639 -0.01889984
-0.04448636 -0.00854152 -0.25066498 -0.03219931 0.03350157 -0.02680573
-0.04993293 -0.2456862 -0.02887128 -0.12966427 0.04222799 -0.02136624
-0.10524843 -0.07345396 0.07305007 0.00686409 -0.09619413 0.06575447
0.15723655 0.05926161 0.06410413 0.00242155 0.01862393 -0.11729769]
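Once we have the inferred vector, we can feed it straight back into the similarity lookup to see which training documents the unseen text resembles most. A minimal sketch:

# find the training documents closest to the inferred vector
# (use model.docvecs.most_similar on gensim < 4.0)
print(model.dv.most_similar([v1], topn=3))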

Summary

We have seen how helpful the doc2vec model can be. It also shows how the numeric representation of text documents can be useful in web search, spam filtering, document retrieval, detecting plagiarized documents, and more.

That's all for now! Feel free to put a comment below if you have any suggestions or questions.


Abdul Hafeez Fahad · Red Buffer

Senior AI Engineer | Data Scientist | Generative AI | LLMs | NLP | ML/DL | Speaker @ GDG ISB | Computer Science Graduate