Unsupervised NLP task in Python with doc2vec
Introduction
Welcome to the third article of a series covering different NLP models used to solve an unsupervised sentence similarity task. The previous ones, about the bag-of-words, tf-idf and LSI techniques, are available in our publication.
In this article we will introduce the doc2vec embedding technique. It was presented in 2014 by Mikolov and Le in this paper and is based on the word2vec model.
The first section is an introduction to word2vec, which is necessary to understand doc2vec, covered in the second section. If you are already familiar with these methods, feel free to skip ahead to the problem description.
In the last two sections of the article, we will solve an unsupervised NLP task using the doc2vec model.
Word2vec
Word2vec is an embedding technique that uses a two-layer neural network to learn word associations from a large corpus of text.
As the name implies, in this model each distinct word is represented as a vector. The word vectors are chosen carefully such that words that share common contexts in the corpus are located close to one another in the vector space. Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words or continuous skip-gram.
The continuous bag-of-words architecture (CBOW) predicts the current word from a sliding window of surrounding context words. Since it is a BOW model, the order of context words does not influence the prediction.
In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words.
According to this post from Google Code Archive, the CBOW architecture is faster while the skip-gram is slower but does a better job for infrequent words.
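To make the two architectures concrete, here is a minimal gensim sketch (assuming gensim 4.x, where the sg flag selects the architecture: sg=0 for CBOW, sg=1 for skip-gram; the toy corpus and parameter values are only for illustration).
from gensim.models import Word2Vec
# Toy corpus: a list of tokenized sentences.
sentences = [["the", "movie", "was", "great"], ["the", "film", "was", "boring"]]
# sg=0 -> CBOW, sg=1 -> skip-gram.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
# Each word is now represented by a 50-dimensional vector.
print(cbow.wv["movie"].shape)
print(skipgram.wv.most_similar("movie", topn=2))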
Doc2vec
The doc2vec principle is to take the word2vec model and add another vector, called the Paragraph Vector. This means that after training the neural network, we will have (as the short sketch after this list illustrates):
- the word vectors, i.e. the vector representation of the words;
- a document vector, i.e. the vector representation of the document.
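As a minimal sketch of what that means in gensim (assuming gensim 4.x, where document vectors are stored in model.dv; the toy corpus is ours and only for illustration), both kinds of vectors can be accessed after training:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# A tiny tagged corpus, purely for illustration.
toy_docs = [
    TaggedDocument(words=["jedi", "knight", "lightsaber"], tags=[0]),
    TaggedDocument(words=["blonde", "bride", "revenge"], tags=[1]),
]
toy_model = Doc2Vec(toy_docs, vector_size=20, min_count=1, epochs=50)
# Word vectors learned by the model.
print(toy_model.wv["jedi"].shape)
# Document (paragraph) vectors, indexed by the tags assigned above.
print(toy_model.dv[0].shape)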
Some advantages of the paragraph vectors are the following:
- they are learned from unlabeled data and thus can work well for tasks that do not have enough labeled data;
- they inherit an important property of the word vectors: the semantics of the words (in this space, “powerful” is closer to “strong” than to “Paris”);
- they take into consideration the word order, at least in a small context.
The last two properties mean paragraph vectors address some of the key weaknesses of bag-of-words models.
There are two architectures for doc2vec as well, analogous to CBOW and continuous skip-gram respectively: the Distributed Memory version of Paragraph Vector and the Distributed Bag of Words version of Paragraph Vector. Let’s take a closer look at them.
Distributed Memory version of Paragraph Vector (PV-DM)
In the PV-DM architecture, the paragraph token is thought of as another word. It acts as a memory that remembers what is missing from the current context — or the topic of the paragraph.
The following figure shows how in this model the concatenation or average of the paragraph vector with a context of words is used to predict the next word. The paragraph vector represents the missing information from the current context and can act as a memory of the topic of the paragraph.
Distributed Bag of Words version of Paragraph Vector (PV-DBOW)
The PV-DBOW principle is to ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. In addition to being conceptually simple, this model requires storing less data.
The following figure shows that the paragraph vector is trained to predict the words in a small window.
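For reference, gensim exposes both architectures through the dm flag of its Doc2Vec class; the sketch below (toy corpus and parameter values are illustrative only) also shows the dm_mean option, which controls whether the context word vectors are averaged with the paragraph vector in PV-DM (dm_concat=1 would concatenate them instead).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Another tiny tagged corpus, purely for illustration.
docs = [
    TaggedDocument(words=["dream", "levels", "heist"], tags=[0]),
    TaggedDocument(words=["arcade", "game", "glitch"], tags=[1]),
]
# PV-DM: the paragraph vector is combined with the context word vectors (here averaged).
pv_dm = Doc2Vec(docs, dm=1, dm_mean=1, vector_size=20, window=2, min_count=1, epochs=40)
# PV-DBOW: the paragraph vector alone is trained to predict words sampled from the document.
pv_dbow = Doc2Vec(docs, dm=0, vector_size=20, window=2, min_count=1, epochs=40)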
Problem description
The task we will solve is the same as the one from the previous articles.
We will perform document similarity between movie plots and a given query.
We will use the Wikipedia Movie Plots Dataset, which is available on this page and consists of ~35,000 movies.
Our solution
Let’s start by installing the latest version of gensim and importing all the packages we need.
!pip install --upgrade gensim
import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_documents
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
We can now load the dataset and store the plots into the corpus variable.
In order to avoid RAM saturation, we will only use movies with release year ≥ 2000.
df = pd.read_csv('wiki_movie_plots_deduped.csv', sep=',')
df = df[df['Release Year'] >= 2000]
text_corpus = df['Plot'].values
The next step is to preprocess the corpus. Please refer to this article for a full explanation of this operation. We will also convert the corpus into an iterable of TaggedDocument objects, which will be fed into the doc2vec model.
processed_corpus = preprocess_documents(text_corpus)
tagged_corpus = [TaggedDocument(d, [i]) for i, d in enumerate(processed_corpus)]
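A quick look at the first element (purely a sanity check) shows the structure of a TaggedDocument: the preprocessed tokens plus the tag we assigned, i.e. the document’s position in the corpus.
# Each entry holds the preprocessed tokens and the tag (its index in the corpus).
print(tagged_corpus[0].words[:10])
print(tagged_corpus[0].tags)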
Let’s now train the model.
After performing hyperparameter tuning, we chose the following parameter values:
- dm = 0 means the PV-DBOW architecture will be used,
- vector_size = 200 means that the feature vectors will have 200 entries,
- window = 2 is the maximum distance between the current and predicted word,
- hs = 1 means that hierarchical softmax will be used for model training.
For a full explanation of the gensim Doc2Vec model, please refer to this documentation.
model = Doc2Vec(tagged_corpus, dm=0, vector_size=200, window=2, min_count=1, epochs=100, hs=1)
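Since training on this corpus can take a while, it may be convenient to save the model to disk; we can also run a quick sanity check by inferring the vector of a training document and verifying that it ranks itself among its own nearest neighbours (the filename below is arbitrary).
# Optionally persist the trained model for later reuse (the filename is arbitrary).
model.save("doc2vec_movie_plots.model")
# model = Doc2Vec.load("doc2vec_movie_plots.model")
# Sanity check: a training document should rank itself among its nearest neighbours.
inferred = model.infer_vector(processed_corpus[0])
print(model.docvecs.most_similar(positive=[inferred], topn=3))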
Let’s now write the plot of the movie we would like to retrieve with the similarity query. Suppose new_doc is the string containing this plot. The code to find the 10 movies most similar to it is the following:
new_doc = gensim.parsing.preprocessing.preprocess_string(new_doc)
test_doc_vector = model.infer_vector(new_doc)
sims = model.docvecs.most_similar(positive=[test_doc_vector])
for s in sims:
    print(f"{s[1]} | {df['Title'].iloc[s[0]]}")
We will go through some examples.
- Title: Inception.
Plot summary: “Infiltrate in minds and extract information through a shared dream world. Different levels of dreams. Build a team to implant idea for a very powerful Japanese industrialist”.
Results:
0.36223283410072 | Inception
0.35902470350265 | Flower and Snake: Zero
0.35267341136932 | Darwin
0.34725219011307 | Zero
0.34414345026016 | Puli Vesham
0.34225952625275 | Leftenan Adnan
0.33951455354690 | Big Fish & Begonia
0.33947885036469 | Assassination
0.33865314722061 | Peaceful Warrior
0.33549726009369 | Smiley Face
- Title: Wreck-It Ralph.
Plot summary: “In the arcade at night the videogame characters leave their games. The protagonist is a girl from a candy racing game who glitches”.
Results:
0.43112030625343 | Wreck-It Ralph
0.37918916344643 | Over the Hedge
0.37341749668121 | Dude, Where’s My Car?
0.34627544879913 | Fat Albert
0.33664789795876 | The Royal Bengal Tiger
0.33413895964622 | Mahasangram
0.33398652076721 | Rokto
0.33098044991493 | Tokyo Family
0.32873904705048 | Open Season
0.32647505402565 | Pannaiyarum Padminiyum
- Title: Kill Bill.
Plot summary: “A blonde bride goes to Japan in search of her former boss and the gang responsible for the ambush she fell into four years earlier.”.
Results:
0.36277258396149 | Kill Bill Volume 1
0.35308435559273 | Kill Bill Volume 2
0.34135621786117 | Aravaan
0.33946585655212 | Kingsman: The Secret Service
0.33734089136124 | The Grand Heist
0.33702278137207 | Side Effects
0.33538559079170 | Violent Cop
0.33469271659851 | Departures
0.33399999141693 | Rangi Taranga
0.33256906270981 | 19-Nineteen
- Title: Kill Bill.
Plot summary: “A blonde bride goes to Japan in search of her former boss and the gang responsible for the trap she fell into four years earlier.”.
Results:
0.35924845933914 | Violent Cop
0.35417461395264 | Kill Bill Volume 1
0.35321629047394 | Hanamizuki
0.34909152984619 | Rajathandhiram
0.34550577402115 | Kill Bill Volume 2
0.34409400820732 | Hum Kisise Kum Nahin
0.34320613741875 | Brawl in Cell Block 99
0.34149655699730 | Dagudumootha Dandakor
0.33917436003685 | Kollaikaran
0.33302384614944 | The Amazing Adventures of the Living Corpse
- Title: Star Wars.
Plot summary: “Jedi knight lightsaber starship”.
Results:
0.58338886499405 | Star Wars: Episode I — The Phantom Menace 3D
0.53465068340301 | Star Wars: The Last Jedi
0.53337401151657 | Star Wars: Episode II — Attack of the Clones
0.52835452556610 | Star Wars: The Force Unleashed
0.50489157438278 | Star Wars: Episode III — Revenge of the Sith
0.48146998882294 | Star Wars: The Clone Wars
0.46921348571777 | Rogue One: A Star Wars Story (film)
0.43271967768669 | Star Trek Nemesis
0.37957724928856 | Avatar
0.36706718802452 | Star Trek Beyond
Conclusions
The results obtained are really good: all the movies we were looking for appear in the top positions of the model outputs.
It is impressive how well the doc2vec model performs with the telegraphic query used for the Star Wars plot. However, it doesn’t always work so well: we had to change the Kill Bill query plot from “Blonde bride goes to Japan and kills many people” to “A blonde bride goes to Japan in search of her former boss and the gang responsible for the ambush she fell into four years earlier.” because the first input did not return any Kill Bill movie in the top ten matches.
Then why did the model perform so well with the Star Wars movies? The answer is that we used very specific words such as “Jedi”, “lightsaber” and “starship” which are very frequent in the plot of these movies.
In conclusion, the doc2vec model tends to need long, specific queries as input, and it does capture the semantics of the words: when we used “trap” instead of “ambush” in the Kill Bill query, the results were still pretty good.
In the next article we will try to define a method to evaluate the performance of the models used. The method will be as objective as possible and based on a choice of parameters, which will be discussed as well.