“you shall know a word by the company it keeps” — (Firth, J. R. 1957:11)
This is a famous quote by the British linguist John Rupert Firth, who is best known for drawing attention to the fact that you can often tell the meaning of a word by looking at the other words that appear alongside it in a sentence. In other words, words that can be used interchangeably in a sentence tend to share similar meanings.
The idea of the “context-dependent” meaning of words makes even more sense when you look at homonyms: words that are spelled the same but have different meanings in different contexts. An example is “bark” in the sentences below:
I hope her dog doesn’t bark when I knock on the door.
The tree bark is rough to the touch.
I love eating pretzels covered with almond bark.
In all three sentences, you can immediately tell the meaning of “bark” by looking at the words around it. In practice, you would usually select a window of surrounding words and try to infer the original word's meaning from the words in that window. In the sentences above, the meaning of “bark” differs from sentence to sentence because of the surrounding words, known as “context words,” and we refer to “bark” itself as the “center word.”
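To make the idea of a context window concrete, here is a minimal sketch of selecting the context words around a center word, assuming a (hypothetical) window of two words on each side:

```python
def context_window(tokens, center_idx, window=2):
    """Return the context words within `window` positions of the center word."""
    start = max(0, center_idx - window)
    left = tokens[start:center_idx]                       # words before the center
    right = tokens[center_idx + 1:center_idx + 1 + window]  # words after the center
    return left + right

sentence = "the tree bark is rough to the touch".split()
print(context_window(sentence, sentence.index("bark")))
# → ['the', 'tree', 'is', 'rough']
```

These four context words are what a model like Word2Vec actually sees when learning a representation for the center word “bark.”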
The thesaurus image below shows the synonyms of the word “bark” in different contexts. In the first sentence above, we can easily replace “bark” with “snarl” or “growl,” but not with “crust,” because “crust” means something completely different despite also being a synonym of “bark.”
Therefore, when modeling words into vectors, it is important to encode them in a manner that reflects their meaning in the contexts that they appear, and this is the intuition behind the Word2Vec algorithm.
Word2Vec is an NLP algorithm that encodes the meaning of words into short, dense vectors (word embeddings) that can be used for downstream NLP tasks such as question answering, information retrieval, machine translation, language modeling, etc. These vectors contextualize the meaning of each word in a corpus by looking at the words surrounding it. The algorithm was introduced in 2013 by Mikolov et al. in their paper “Efficient Estimation of Word Representations in Vector Space.”
Before word2vec was introduced, words were represented as long, sparse vectors with dimensions the size of the entire vocabulary (total number of distinct words) in the training corpus. Examples of these traditional representations include one-hot vectors, count vectors, TF-IDF vectors, etc. One of the major drawbacks of sparse vectors is that they cannot establish any form of relationship between words: they do not contain enough information to demonstrate syntactic or semantic relationships. For example, one-hot vectors are mutually orthogonal (perpendicular, with a dot product of 0) and thus cannot be used to measure any form of similarity.
As mentioned earlier, the intuition behind word2vec is that words appearing in similar contexts should be mapped to nearby points in the vector space. This means that words with similar neighboring/context words in a corpus have similar vectors (with high cosine similarity). Even more impressive is that the similarity of word embeddings goes beyond syntactic regularities; using simple algebraic operations, we can show more complex relationships between words. For example, the authors were able to establish that vector(‘King’) − vector(‘Man’) + vector(‘Woman’) results in a vector closest to the vector representation of “Queen” (vector(‘Queen’)).
from gensim.models import KeyedVectors

# load the pre-trained Google News Word2Vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

# vector algebra: king - man + woman
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

Result: [('queen', 0.7118192315101624)]
The authors of this paper proposed two architectures for learning word representations: Continuous Bag of Words (CBOW) and Skip-Gram, shown below.
Continuous Bag of Words learns word representations by predicting a center word from a window of selected context words. In CBOW, we sample a window of context words around a particular word, feed them to the model, and predict the center word. In this architecture, the weight matrix between the input and the projection layer is shared across all the words: we map the one-hot vectors of the input (context) words onto the projection layer (embedding layer). The n-dimensional embedding layer is then multiplied by a second weight matrix to get the output layer, and a softmax over the output layer gives a probability distribution over the words in the vocabulary.
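The CBOW architecture described above can be sketched in PyTorch as follows. The vocabulary size and embedding dimension here are illustrative choices, not values from the paper, and the class and variable names are mine:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Minimal CBOW sketch: context word ids -> averaged embedding -> vocab scores."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # shared input weight matrix (the embedding / projection layer)
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        # second weight matrix mapping the projection layer to vocabulary scores
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, window_size) indices of the context words
        projected = self.embeddings(context_ids).mean(dim=1)  # average the context vectors
        return self.output(projected)  # logits; softmax gives the distribution over the vocab

model = CBOW(vocab_size=5000, embed_dim=100)
context = torch.randint(0, 5000, (8, 4))  # batch of 8 examples, 4 context words each
print(model(context).shape)               # one score per vocabulary word: (8, 5000)
```

In training, the logits would be fed to a cross-entropy loss against the true center-word index, which applies the softmax described above.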
Skip-gram is a different variant of word2vec. Unlike CBOW, where we predict a center word from its context words, here we learn word vector representations by predicting the context words around a given center word. The model tries to maximize the classification of a word based on another word in the same sentence. Ideally, the larger the context window, the better the quality of the word vectors; however, the authors found that this also increases computational complexity, and distant words are often less related to the center word. They used a maximum window size of 10 for training in their original paper, and results showed that the skip-gram model outperformed CBOW in several experiments.
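Generating skip-gram training data amounts to pairing each center word with every word inside its window. A small sketch (with a window of 1 for readability, versus the paper's 10):

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for the skip-gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        # every word within `window` positions of the center is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the dog barks loudly".split(), window=1))
# → [('the', 'dog'), ('dog', 'the'), ('dog', 'barks'),
#    ('barks', 'dog'), ('barks', 'loudly'), ('loudly', 'barks')]
```

Each pair becomes one training example: given the center word, the model is asked to score the context word highly.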
In a different paper, titled “Distributed Representations of Words and Phrases and their Compositionality,” Mikolov et al. introduced a different approach to training the skip-gram model that significantly improves both the training speed and the quality of the word vector representations. It is popularly known as negative sampling. The original skip-gram model, trained on a corpus of 320 million words with an 82k-word vocabulary, took 8 weeks to train! With negative sampling, there is a significant reduction in training time and computational complexity.
So what is negative sampling?
In the original model, the training weights have dimensions (vocab_size x word_vector_dim), so the number of parameters can run into the billions depending on the vocabulary size, and we would be updating each of these parameters at every training step, which is computationally expensive. With negative sampling, for each training example (input word), we instead sample a small list of negative words, e.g., 5 vocabulary words that do not appear in the context of the input word.
We then use these 5 negative words alongside the true output word to train a binary classifier that outputs 1 for the output word and 0 for the negative words. This way, during each iteration we only update the output weights for these 6 words while learning the word vector for the input word.
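A single negative-sampling update can be sketched in PyTorch as below. The dimensions and word ids are hypothetical, and the negatives are drawn uniformly here for simplicity (the paper actually samples from a smoothed unigram distribution):

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim, k = 5000, 100, 5
in_embed = torch.nn.Embedding(vocab_size, embed_dim)   # word vectors being learned
out_embed = torch.nn.Embedding(vocab_size, embed_dim)  # "output" vectors

center = torch.tensor([42])                      # input word id (hypothetical)
positive = torch.tensor([7])                     # true context word id (hypothetical)
negatives = torch.randint(0, vocab_size, (k,))   # 5 sampled negative word ids

v = in_embed(center)                                # (1, embed_dim)
pos_score = (v * out_embed(positive)).sum(dim=1)    # dot product, target label 1
neg_score = (v * out_embed(negatives)).sum(dim=1)   # dot products, target label 0

# binary logistic loss: push the positive score up and the negative scores down
loss = F.binary_cross_entropy_with_logits(
    torch.cat([pos_score, neg_score]),
    torch.cat([torch.ones(1), torch.zeros(k)]))
loss.backward()
# Only the input row (42) and the 6 touched output rows get nonzero gradients,
# instead of the full vocab_size x embed_dim output matrix.
```

This is why negative sampling is so much cheaper: each step touches 6 output rows rather than all 5,000 (or, in practice, all several hundred thousand).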
Although Word2Vec has been largely successful in NLP, more recent architectures such as ELMo (“Deep Contextualized Word Representations”) and BERT (“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”) can create highly context-specific representations of words, making it possible to generate proper representations for words that appear in multiple contexts in the same corpus, rather than the static word vectors produced by word2vec.
Up next is a PyTorch implementation of skip-gram with negative sampling…