A Dummy’s Guide to Word2Vec

Manan Suri
10 min read · Jan 21, 2022


I have always been interested in learning different languages, though the only French the Duolingo owl has taught me is “Je m’appelle Manan”. My short stints at learning a third language helped me realise that vocabulary alone is not sufficient: context, semantics and syntactic features are also important to truly grasp meaning in a language. So when I started doing NLP tasks recently, I felt perplexed about how something like a bag-of-words, which considers each word in an entirely independent light, could really be effective, and I got answers to this question when I learnt what word embeddings are and how they work. In this article, I’m going to talk about word embeddings, specifically the word2vec model, and how to take advantage of it using the easy-to-use Gensim library.

Word embeddings

The emergence of language was a pivotal moment in the evolution of humanity. Although all species have their ways of communicating, we humans are the only ones who have mastered cognitive language communication. So while I know that “rat” refers to a small hairy rodent, my dog or my computer (at least in essence) doesn’t know that.

Therefore, any task aimed at processing language must first address how words are represented.

A preliminary method is a “bag of words” model which encodes words using a one-hot scheme.

If our dataset contains the sentences:

“I like the new movie!”, “I love the weather.”

Visualising the Bag-of-Words representation

Then we can have a vector representation of the words as:

I [1,0,0,0,0,0,0]

like [0,1,0,0,0,0,0]

the [0,0,1,0,0,0,0]

new [0,0,0,1,0,0,0]

movie [0,0,0,0,1,0,0]

love [0,0,0,0,0,1,0]

weather [0,0,0,0,0,0,1]

The sentences will then be represented as:

[1,1,1,1,1,0,0] and [1,0,1,0,0,1,1]
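For intuition, here is a minimal Python sketch of this one-hot / bag-of-words encoding. It is my own toy code (not from the notebook linked later), just to make the table above concrete:

```python
def tokenize(sentence):
    # Lower-case each word and strip trailing punctuation.
    return [w.strip("!.,").lower() for w in sentence.split()]

sentences = ["I like the new movie!", "I love the weather."]

# Build the vocabulary in order of first appearance.
vocab = []
for s in sentences:
    for w in tokenize(s):
        if w not in vocab:
            vocab.append(w)

def bag_of_words(sentence):
    # One slot per vocabulary word; 1 if the word occurs in the sentence.
    vec = [0] * len(vocab)
    for w in tokenize(sentence):
        vec[vocab.index(w)] = 1
    return vec

print(vocab)                       # ['i', 'like', 'the', 'new', 'movie', 'love', 'weather']
print(bag_of_words(sentences[0]))  # [1, 1, 1, 1, 1, 0, 0]
print(bag_of_words(sentences[1]))  # [1, 0, 1, 0, 0, 1, 1]
```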

However, as you might have noticed, this representation is not very effective at showing the semantic and syntactic relationships between words: each word is encoded as an independent unit in the vector space, and there is no way to tell that the words “love” and “like” have a similar connotation.

This is where word embeddings come in. Word embeddings are representations in which context and similarity are captured by encoding words in a vector space: similar words end up with similar representations. We’re going to discuss word2vec, an effective word embedding technique.

Word2Vec

Word2Vec maps each word in our vocabulary to a vector. Words used in similar contexts or sharing semantic relationships end up close together in the vector space; effectively speaking, similar words will have similar word vectors! Word2vec was created, patented, and published in 2013 by a team of researchers led by Tomas Mikolov at Google.

Let us consider a classic example: “king”, “queen”, “man”, “girl”, “prince”

Hypothetical features to understand word embeddings

In a hypothetical world, vectors could then define the weight of each criterion (for example royalty, masculinity, femininity, age, etc.) for each word in our vocabulary.

What we then observe is:

  • As expected, “king”, “queen” and “prince” have similar scores for “royalty”, while “girl” and “queen” have similar scores for “femininity”.
  • Removing “man” from “king” and adding a feminine counterpart yields a vector very close to “queen” (the classic king - man + woman ≈ queen); a toy numeric sketch of this follows the list.
  • Vectors “king” and “prince” have the same characteristics except for age, suggesting how they might be semantically related to each other.
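To make the arithmetic concrete, here is a toy sketch with made-up feature values (I’ve added a hypothetical “woman” vector purely to demonstrate the classic analogy). Real word2vec dimensions are not interpretable like this; the numbers below are invented for illustration only:

```python
import numpy as np

# Toy vectors over hypothetical features [royalty, masculinity, femininity, age].
# All values are made up; "woman" is included only to demonstrate the analogy.
words = {
    "king":   np.array([1.0, 1.0, 0.0, 0.7]),
    "queen":  np.array([1.0, 0.0, 1.0, 0.7]),
    "man":    np.array([0.1, 1.0, 0.0, 0.6]),
    "woman":  np.array([0.1, 0.0, 1.0, 0.6]),
    "girl":   np.array([0.0, 0.0, 1.0, 0.2]),
    "prince": np.array([1.0, 1.0, 0.0, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman lands (here, exactly) on queen.
result = words["king"] - words["man"] + words["woman"]
candidates = {w: v for w, v in words.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(result, candidates[w]))
print(best)  # queen
```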

Word2Vec forms word embeddings that work in a similar fashion, except that the features behind each dimension are not clearly interpretable. What matters to us is the semantic and syntactic relationship between words, which the model can still capture without explicitly defined features for each unit of the vector.

Word2Vec has also been shown to identify relations like country-capital over larger datasets, showing us how powerful word embeddings can be.

Embeddings generated by word2vec can further be used in NLP tasks, for example as the input to a CNN for text classification!

Model Architectures

Word2Vec is essentially a shallow, two-layer neural network.

  • The input contains all the documents/texts in our training set. For the network to process these texts, the words are represented using a one-hot encoding.
  • The number of neurons present in the hidden layer is equal to the length of the embedding we want. That is, if we want all our words to be vectors of length 300, then the hidden layer will contain 300 neurons.
Understanding the neural network training of Word2Vec model
  • The output layer contains, for a given input, a probability for each word in the vocabulary of being the target word (the word the model expects).
  • At the end of the training process, the hidden-layer weights are treated as the word embeddings. Intuitively, this means each word ends up with a set of n weights (300, in the example above) “weighing” its different characteristics (the analogy we used earlier).
The weight matrix of the hidden layer ends up becoming a lookup table for the given words and their vector representations!
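To see why the hidden weights act as a lookup table, here is a tiny numpy sketch with toy sizes and untrained random weights (my own illustration, not Word2Vec code): multiplying a one-hot input by the hidden weight matrix simply selects one row.

```python
import numpy as np

vocab_size, embedding_dim = 7, 300   # toy sizes; 300 matches the example above
rng = np.random.default_rng(0)

# Hidden-layer weights of an (untrained) network: one row per vocabulary word.
W_hidden = rng.normal(size=(vocab_size, embedding_dim))

# One-hot input for the word at index 4 ('movie' in the earlier toy vocabulary).
one_hot = np.zeros(vocab_size)
one_hot[4] = 1.0

# Multiplying the one-hot vector by W_hidden just picks out row 4,
# which is why the trained weight matrix doubles as an embedding lookup table.
assert np.allclose(one_hot @ W_hidden, W_hidden[4])
```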

There are two ways in which we can develop these embeddings:

  1. Continuous Bag-Of-Words (CBOW)

CBOW predicts the target-word based on its surrounding words.

For example, consider the sentence, “The cake was chocolate flavoured”.

The model will then iterate over this sentence for different target words, such as:

“The ____ was chocolate flavoured” being inputs and “cake” being the target word.

CBOW thus smoothes over the distribution of the information, since it treats the entire context as one observation. CBOW is faster than skip-gram and works well with frequent words.

2. Skipgram

Skipgram works in the exact opposite way to CBOW. Here, we take an input word and expect the model to tell us what words it is expected to be surrounded by.

Taking the same example, with “cake” we would expect the model to give us “The”, “was”, “chocolate”, “flavoured” for the given instance.

The statistical interpretation is that we treat each context-target pair as a new observation. Skip-gram works well with small datasets and represents less frequent words better. (A small sketch of how both kinds of training pairs are generated follows below.)

Training CBOW and Skipgram for word2vec
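Here is a rough sketch of how such training pairs could be generated with a context window of 2. This mimics the idea rather than Gensim’s actual internals:

```python
# Generate CBOW (context -> target) and skip-gram (target -> context word) pairs.
sentence = "the cake was chocolate flavoured".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    # Words within `window` positions of the target, excluding the target itself.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))                   # CBOW: whole context -> target
    skipgram_pairs.extend((target, c) for c in context)    # Skip-gram: target -> each context word

print(cbow_pairs[1])       # (['the', 'was', 'chocolate'], 'cake')
print(skipgram_pairs[:3])  # [('the', 'cake'), ('the', 'was'), ('cake', 'the')]
```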

Using Gensim to train our own embeddings

We can easily train word2vec word embeddings using Gensim, which is “a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.”

The dataset I used for this demo is the Coronavirus tweets NLP dataset from Kaggle. I am omitting the parts involving loading the dataset and preprocessing the text, but you can check out the complete implementation in this Colab notebook. I preferred it over larger datasets like IMDB because many of our applications involve similarly sized datasets, so it better represents the average performance of the model.

  1. Training the embeddings

We import Word2Vec from gensim.models. Each input to the model must be a list of words, so we generate the inputs by calling split() on each line in our corpus of texts.

Setting up the gensim word2vec model, training it

We then set up the model and specify its parameters. Briefly, they mean the following (a minimal setup sketch follows this list):

  • size refers to the length of the word vectors the model will output.
  • Window refers to the maximum distance between the current and predicted word within a sentence.
  • The min_count parameter is used to set a minimum frequency for the words to be a part of the model: i.e. it ignores all words with count less than min_count.
  • workers sets the number of worker threads used to train the model; this can be adjusted according to the number of cores your system has. In simple terms, it controls the parallelism of training.
  • iter refers to the number of iterations for training the model.
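A minimal setup sketch along these lines is below. The values are illustrative rather than the exact ones in my notebook, the placeholder texts stand in for the cleaned tweets, and note that Gensim 4.0+ renamed size to vector_size and iter to epochs (the names above reflect older Gensim versions):

```python
from gensim.models import Word2Vec

# Placeholder for the cleaned tweet strings produced in the preprocessing step.
texts = [
    "grocery stores are running out of supplies during covid",
    "paying bills is harder during the coronavirus crisis",
]
corpus = [line.split() for line in texts]   # each input is a list of words

model = Word2Vec(
    corpus,
    vector_size=100,   # "size" in Gensim < 4.0: length of each word vector
    window=5,          # max distance between the current and predicted word
    min_count=1,       # ignore words with a count below this (raise it for real corpora)
    workers=4,         # number of worker threads
    epochs=10,         # "iter" in Gensim < 4.0: training iterations over the corpus
)
```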

2. Using the word2vec model

  • Finding the vocabulary of the model is useful in many general applications, and in this case it gives us a list of words we can try with the other functions. (A consolidated sketch of the calls in this section appears after the list.)
List of words in the model’s vocabulary
  • Finding the embedding of a given word is useful when we want to represent sentences as a collection of word embeddings, for example when building the weight matrix for the embedding layer of a network. I included this to help build an intuition of what a word vector looks like.
The word embedding for ‘computer’ trained by the given model. As you can see, it is not possible to make sense of what these individual values mean, unlike the completely hypothetical example I gave above.
  • We can also find the similarity between given words (the cosine similarity between their vectors). Here we have compared ‘vladimir’ with ‘putin’ and with ‘modi’, and a stark distinction shows up.
Comparing ‘vladimir’ with ‘putin’ and ‘modi’
  • With gensim, we can also find the words most similar to a given word. This particularly shows the contextualising power of the model. Looking at the words similar to ‘covid’, we get ‘coronavirus’, ‘virus’, ‘coronacrisis’, ‘disease’ and ‘corona’ as the top results. Similarly, when we try ‘india’, we get a list of words that are also countries! When we try a verb such as ‘pay’, we get other forms of the same verb, ‘paid’ and ‘paying’, and associated terms like ‘raise’ and ‘bills’. This is exciting, considering our vocabulary is not very large and the dataset covers a very specific situation.
Using the most_similar function to find words similar to a given word.
  • Similarly, we can use the same function to find analogies of the form: if x:y, then z:?. With gensim, we pass the known pair’s second term together with the query term as positive, and the pair’s first term as negative. Here, our model seems to have learnt something about nationalities: given ‘russian’ -> ‘russia’, it maps ‘arab’ -> ‘saudi’, ‘arabia’ (taking the first two results, since our model does not treat multi-word phrases as single tokens for now).
Checking if our model has learnt certain contextual analogies
  • Gensim also has a method that works like an “odd one out” puzzle: here, our model identifies ‘grocery’ as different from ‘covid’ and ‘coronavirus’.
‘grocery’ seems like the odd one out here
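For reference, here is a consolidated sketch of the calls discussed above, assuming the model trained on the tweets corpus and Gensim 4.0+ attribute names (older versions expose the vocabulary via model.wv.vocab instead of index_to_key):

```python
# Words in the model's vocabulary (first 20 shown).
print(model.wv.index_to_key[:20])

# The word vector for 'computer'.
vec = model.wv['computer']

# Cosine similarity between pairs of words.
print(model.wv.similarity('vladimir', 'putin'))
print(model.wv.similarity('vladimir', 'modi'))

# Nearest neighbours of 'covid'.
print(model.wv.most_similar('covid', topn=5))

# Analogy: 'russian' is to 'russia' as 'arab' is to ?
print(model.wv.most_similar(positive=['russia', 'arab'], negative=['russian'], topn=2))

# Odd one out.
print(model.wv.doesnt_match(['covid', 'coronavirus', 'grocery']))
```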

3. Visualising word embeddings

Word2Vec embeddings usually have a size of 100 or 300, and it is practically impossible to visualise a 100- or 300-dimensional space in a meaningful way. I used a snippet from Stanford’s CS224N course site, which gives you the option of either passing a list of words or specifying the number of random samples you want displayed. In either case, it uses PCA to reduce the dimensionality and plot the word vectors on a 2-dimensional plane. The actual values on the axes are not of concern, as they hold no significance on their own; rather, we use the plot to see that similar vectors end up densely located with respect to each other.

On the graph we have visualised, you can see how ‘coronavirus’, ‘covid’ and ‘virus’ form one group, separate from the others, while ‘paying’, ‘paid’, ‘bills’ and ‘wages’ are in another group altogether. Similarly, the countries ‘saudiarabia’, ‘kenya’ and ‘pakistan’ form one very dense cluster.

Reducing dimensionality, and visualising the given words.

I’ve omitted the code for the graph from this article. You can check it out either on the CS224N website I linked above or in my Colab notebook.
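For intuition, the same idea can be sketched roughly as follows using scikit-learn’s PCA and matplotlib. This is my own minimal sketch, not the course snippet, and the word list is just illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project a few word vectors to 2-D with PCA and scatter-plot them.
words = ['coronavirus', 'covid', 'virus', 'paying', 'paid', 'bills',
         'wages', 'saudiarabia', 'kenya', 'pakistan']
labels = [w for w in words if w in model.wv]        # skip out-of-vocabulary words
vectors = [model.wv[w] for w in labels]

points = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1])
for (x, y), label in zip(points, labels):
    plt.annotate(label, (x, y))
plt.show()
```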

4. Saving models, and how to use pre-trained models

Gensim comes with several pre-trained models, available in the Gensim-data repository. We can import the downloader from the gensim library and use it to print the list of pre-trained models, trained on large datasets, that are available to us. This includes models other than word2vec, such as GloVe and fastText.

The gensim downloader! (Many model names cropped out due to space)

Here we have used ‘word2vec-google-news-300’ (trained on Google News, with vectors of size 300) and found words similar to ‘twitter’.

Using word2vec model trained from google-news dataset

We can save our existing models and load them again.

We can save, load and continue training our models!
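A rough sketch of this workflow is below; the file name, the `new_corpus` placeholder and the topn values are illustrative assumptions, not the exact code from my notebook:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# List the pre-trained models available through the downloader (GloVe, fastText, word2vec, ...).
print(list(api.info()['models'].keys()))

# Load the pre-trained Google News vectors (a large download) and query them.
google_wv = api.load('word2vec-google-news-300')
print(google_wv.most_similar('twitter', topn=5))

# Save our own model, load it back later, and continue training on new data.
model.save('covid_tweets.model')                      # example file name
model = Word2Vec.load('covid_tweets.model')
new_corpus = [["more", "cleaned", "tweets", "here"]]  # placeholder for additional tokenised texts
model.train(new_corpus, total_examples=len(new_corpus), epochs=5)
```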

Summary

  • Word embeddings are a better way to represent natural language as compared to a skeletal bag-of-words. They capture the semantic and syntactic relationships present in texts.
  • Word2Vec maps each word in our vocabulary to a vector; effectively speaking, similar words will have similar word vectors!
  • Word2Vec embeddings can be trained in two ways: CBOW predicts the target-word based on its surrounding words; Skipgram works in the exact opposite way to CBOW, predicting surrounding words for a given input word.
  • We can easily train word2vec word embeddings using Gensim, a free, open-source Python library. We can train the embeddings on a given corpus of texts, or use pre-trained embeddings.
  • Gensim provides us with different functions to help us work with word2vec embeddings, including finding similar vectors, calculating similarities, and working with analogies.
  • Gensim downloader can be used to easily access word embeddings trained on large datasets like google news. We can even save our trained models and continue training them later by loading them.

Code: https://colab.research.google.com/drive/1YkSrvfWR_EBFFrhV5E15Z6k5es4Kluom?usp=sharing

References / Further Reading

https://israelg99.github.io/2017-03-23-Word2Vec-Explained/

https://jalammar.github.io/illustrated-word2vec/

https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial

https://remykarem.github.io/word2vec-demo/


Manan Suri

Computer Science Undergrad at Netaji Subhas University of Technology, New Delhi