How to use Pre-trained Word Embeddings in PyTorch
“For decades, machine learning approaches targeting Natural Language Processing problems have been based on shallow models (e.g., SVM and logistic regression) trained on very high dimensional and sparse features. In the last few years, neural networks based on dense vector representations have been producing superior results on various NLP tasks. This trend is sparked by the success of word embeddings and deep learning methods.” 
In this post we will learn how to use GloVe pre-trained vectors as inputs for neural networks in order to perform NLP tasks in PyTorch.
Rather than training our own word vectors from scratch, we will leverage on GloVe. Its authors have released four text files with word vectors trained on different massive web datasets. They are available for download here.
We will use “Wikipedia 2014 + Gigaword 5” which is the smallest file (“ glove.6B.zip”) with 822 MB. It was trained on a corpus of 6 billion tokens and contains a vocabulary of 400 thousand tokens.
After unzipping the downloaded file we find four txt files: glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt, glove.6B.300d.txt. As their filenames suggests, they have vectors with different dimensions. We pick the smallest one with words represented by vectors of dim 50 (“glove.6B.50d.txt”).
If we printed the content of the file on console, we could see that each line contain as first element a word followed by 50 real numbers. For instance these are the first two lines, corresponding to tokens “the” and “,”:
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
We need to parse the file to get as output: list of words, dictionary mapping each word to their id (position) and array of vectors.
Given that the vocabulary have 400k tokens, we will use bcolz to store the array of vectors. It provides columnar, chunked data containers that can be compressed either in-memory and on-disk. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects.
We then save the outputs to disk for future uses.
Using those objects we can now create a dictionary that given a word returns its vector.
For example, let’s get the vector for word “the”:
Comparing the numbers with the ones printed from the txt file we can verify that they are equals so the process has run properly.
What we need to do at this point is to create an embedding layer, that is a dictionary mapping integer indices (that represent words) to dense vectors. It takes as input integers, it looks up these integers into an internal dictionary, and it returns the associated vectors.
We have already built a Python dictionary with similar characteristics, but it does not support auto differentiation so can not be used as a neural network layer and was also built based on GloVe’s vocabulary, likely different from our dataset’s vocabulary. In PyTorch an embedding layer is available through torch.nn.Embedding class.
We must build a matrix of weights that will be loaded into the PyTorch embedding layer. Its shape will be equal to:
(dataset’s vocabulary length, word vectors dimension).
For each word in dataset’s vocabulary, we check if it is on GloVe’s vocabulary. If it do it, we load its pre-trained word vector. Otherwise, we initialize a random vector.
We now create a neural network with an embedding layer as first layer (we load into it the weights matrix) and a GRU layer. When doing a forward pass we must call first the embedding layer.