Word2Vec -Negative Sampling made easy
This is my second post on Word2Vec. The previous article was about the probabilistic model explaining the mechanics of embedding and appropriately using vector representation. You can find it here. In this part, we will review and implement skig -gram and negative sampling (SGNS) which is a more efficient algorithm for finding word vectors.
SGNS is one of the most popular approaches for word embedding. Dense and distributed representation of words as vectors leads to many applications in NLP. Mikolov and the team of researchers developed the technique in 2013 at Google. SGNS is a refined model that improves the quality of representation and speed of computation. The main difference from earlier CBOW (Continuous Bag of Words) model is that the skip-gram model is designed to predict the context given a word where as in CBOW learns to predict the word by the context. This work was published in the paper ‘Distributed Representations of Words and Phrases and their Compositionality’.
As in the first part consider a text consisting of characters from ‘a’ to ‘z’ sequentially [a, b,c, d, e, f, g, h, i, j, k,l, m, n, o, p, q, r,s, t, u, v, w, x, y, z], further let integers 0 to 25 represent the respective letters. Implementing skip-gram keeping the window size of 2, we can see that ‘a’ is related to ‘b’ and ‘c’, ‘b’ to ‘a’, ‘c’ and’ ‘d’ and so forth. It follows that one hot vector representing input words will be of dimension [26, 1]. With the help of the model, we will find the dense and distributed vector of size [10, 1]. The size of the embedding vector is selected arbitrarily.
The diagram below illustrates the context and target words.
It is easy to visualize the pattern of relationship. When they occur together within a span of 2 characters they are related or co-occurrence is true. The pattern is depicted in table and map below.
The key feature of negative sampling is 2 embedding weight matrices. The first fully connected layer (FC1 -below) transforms input words to the embedding vector and the second weight matrix (FC2) transforms the target words to context vectors. The vector product of a pair of vectors representing input and target words is a score which passes through sigmoid activation. Model learns by comparing the final output with the target labels.
Backpropagation and update optimize the parameters/weights in both embedding matrices FC1 and FC2. Once fully trained -the whole model learns to predict the possible target words. However, the real purpose of training is to extract the embedding that is learned by the model and these vectors represent the dense and distributed representation of each word. Finally, the embedding vectors are used to find the similarities and other applications in NLP. First weight matrix(FC1) embed words as input and FC2 embeds words as targets.
Unlike the probabilistic model of Word2Vec where for each input word probability is computed from all the target words in the vocabulary, here for each input word has only few target words (few true and rest randomly selected false targets). The burden of computation is much lower and the model learns the embedding vector by contrasting true class with few random false targets. Also, random targets are not exhaustive and the model might never see all the negative targets. The key difference compared to the probabilistic model is the use of sigmoid activation as final discriminator replacing softmax function in the probabilistic model. (Refer to Word2Vec made easy for details of the probabilistic model)
The model is illustrated below.
#Generating input and true target sub-lists
import numpy as np
import torch.nn as nn
import matplotlib.pyplot as plt
#Characters from 'a' to 'z' is represented by integers 0 to 25
#Pairing input words with true targets and labeling this pair as 1
true_list = 
for i in range(26):
#Get the target indices given each input index
temp = 
a = i- 2
b = i - 1
c = i + 1
d = i + 2
temp.extend([a, b, c, d])
#Sub-lists of input and target and 1 as label
for j in range(4):
if temp[j] >=0 and temp[j] <=25:
true_list.append([i, temp[j], 1])
#[[0, 1, 1], [0, 2, 1], [1, 0, 1], [1, 2, 1], [1, 3, 1]]
#Generating random(false) pairs and labeling them 0
#There are 98 true pairs, keeping size 400 selects more than 300 #random pairs or roughly 3 random targets for each input
def gen_rand_list(size = 400):
#true targets are filtered here as size of vocab is too small
false_list = 
for i in range(size):
frs = random.sample(range(26),1 )
sec = random.sample(range(26),1 )
if abs(frs - sec) > 2 or (frs == sec):
false_list.append([frs, sec, 0])
#Concatenating both lists (True and False pairs) followed by getting one-hot vectors of both input and targets
#Joining and shuffling true and random input/target pairs
joint_list = np.concatenate((np.array(true_list), np.array(gen_rand_list())), axis = 0)
inp_targ_labels = torch.Tensor(joint_list).long()
#Converting both inputs and targets to one-hot forms
#Two tensors are initialized as 0 tensors
#The item in i_th row whose index is equal to corresponding
#input/target is then replaced by 1
middle_word_arr = torch.zeros(inp_targ_labels.shape, 26)
sur_word_arr = torch.zeros(inp_targ_labels.shape, 26)
for i in range(len(inp_targ_labels)):
middle_word_arr[i, inp_targ_labels[i, 0]] = 1
sur_word_arr[i, inp_targ_labels[i, 1]] = 1
labels = inp_targ_labels[:, 2].float()
return (middle_word_arr, sur_word_arr, labels)
import torch.optim as optim
#2 fully connected layers with NO bias
#Embedding dimension is 10
#Rows of each weight matrix represents the embedding
#Sigmoid is implemented using loss criterion (nn.BCELoss())
fc_inp_word = nn.Linear(26, 10, bias = False)
fc_targ_word = nn.Linear(26, 10, bias = False)
LR = 0.001
criterion = nn.BCELoss()
params = list(fc_inp_word.parameters()) + list(fc_targ_word.parameters())
optimizer = optim.Adam(params, lr = LR)
epochs = 10000
print_every = 1000
#In skip-gram middle word becomes the input which predicts #surrounding words(targets)
#Every time one_hot_auto() is called fresh batch is generated
mid_hot, sur_hot, labels = one_hot_auto()
for i in range(epochs):
#Forward prop to get hidden layer
z_midl = fc_inp_word(torch.Tensor(mid_hot))
z_sur = fc_targ_word(torch.Tensor(sur_hot))
#Initialize a 1d matrix of 0s to store dot products between each
#row of first hidden matrix embedding input with second hidden
#matrix embedding target words
#This score forms the basis for optimization
dot_u_v = torch.zeros(mid_hot.shape, 1)
for j in range(len(z_midl)):
dot_u_v[j, :] = z_midl[j, :] @ z_sur[j, :]
#Sigmoid activation applied to dot products of vectors
desired_logits = dot_u_v
sig_logits = nn.Sigmoid()(desired_logits)
#Back prop and stepping
loss = criterion(sig_logits, torch.Tensor(labels).view(sig_logits.shape, 1))
if i % print_every == 0:
#Scheduled one_hot_auto() to generate fresh random pairs
mid_hot, sur_hot, labels = one_hot_auto()
Visualizing the outputs
Real world example
For experimentation, I selected the book ‘world order’ by Henry Kissinger and got the word embedding using the exact algorithm as above. Here is the pseudo-code for processing the book and finding vectors.
#Psuedo-code for implementation
#Download the book in pdf
#Convert it to csv(I find working with csv simpler)
#import csv --> read the book
#import re --> convert whole text to lower case, remove digits and punctuation
#from collections import Counter --> Count word frequency and find most common words
#stop_words = Most frequent words (for example: top 20), add or remove more words as necessary
#uniq_words --> From the corpus above find set of unique words
#Indices of uniq_words as used as look up from words to integers
#Build a word gram comprising input word and its true target and label it as 1
#Build another list consisting of input words, random indices as false targets and label 0
#Join the both lists and shuffle and feed in batches
#fc1 = nn.Linear(len(uniq_words), embed_size, bias = False)--> represents input words
#fc2 = nn.Linear(len(uniq_words), embed_size, bias = False)--> represents the target words or contexts
#criterion, optimizer and network as described in the post
#Finding distances --> find index of any word in uniq_words -->
nn.CosineSimilarity(dim = 0)(fc1.weight.t()[idx, :], fc1.weight.t()[i, :])
For quick implementation, only cpu was used and ran the network for a couple of hours. The window size was 5, meaning the target word for input are 5 words before and 5 words after the selected word. K or negative samples was just 2(2 random pairs were given for every true pair, and it was intentionally kept low to see if it is still able to extract the meaningful representation).
Find similar words
find_dist(‘nuclear’, 5): [‘nuclear’, ‘proliferation’, ‘age’, ‘challenge’, ‘khomeini’]
find_dist(‘mankind’, 5): [‘mankind’, ‘acting’, ‘perspective’, ‘longer’, ‘concept’]
find_dist(‘henry’, 5): [‘henry’, ‘kissinger’, ‘history’, ‘reflections’, ‘character’]
find_dist(‘khomeini’, 10): [‘khomeini’, ‘statecraft’, ‘iranian’, ‘tradition’, ‘approaches’, ‘iran’, ‘nuclear’, ‘vision’, ‘proliferation’, ‘revolution’]
The embedding outputs similar words that are quite consistent given the nature of the text.
Here is the github link to full code for implementation:https://github.com/LakheyM/word2vec/blob/master/word2vec_SGNS_git.ipynb
Adjusted sampling: here frequency of each word is adjusted by some power (generally 3/4). Drawing a random negative sample according to adjusted frequency instead of unigram distribution improves the quality of embedding probably due to increased probability for selecting less frequent words. To implement adjusted sampling — add or remove words from corpus as per the difference between two frequencies(unigram vs adjusted). Shuffle, index and finally take k random indices as negative samples. The equations are given below.
Hierarchical softmax: It is an alternative to negative sampling. Just like negative sampling it improves computational efficiency and the cost is only O(log(|V|)), which correspond to the depth of the path from the root to a leaf node. This binary tree representation of words in the vocabulary is the key feature of Hierarchical softmax.
Unique path links from root to leaf and each node is associated with a vector that is learned by the model. The vector product and sigmoid activation give balanced probabilities on either side of the node. Finally, likelihood is computed by multiplying all the probabilities in the path. In the picture below, the probability of a word w given a vector C, P(w|C), is equal to the probability of a random walk starting in the root and ending in the leaf node corresponding to w.
Word2Vec and similar algorithms are the key steps in implementing all sorts of NLP tasks. With the use of negative sampling it is possible to get the representation of whole data relatively quickly and/or using less computing power. Perhaps the biggest advantage of implementing embedding from the scratch is that it gives immense flexibility and control while developing end to end NLP applications.
 Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado and Jeffrey Dean. “Distributed Representations of Words and Phrases and their Compositionality.” NIPS(2013).
 CS224n: Natural Language Processing with Deep Learning, Winter 2017. ( https://tensorflowkorea.files.wordpress.com/2017/03/cs224n-2017winter-notes-all.pdf)
Also check out: Ali Ghodsi, Lec 13: Word2Vec Skip-Gram