Basics of Using Pre-trained GloVe Vectors in Python

Downloading, loading, and using pre-trained GloVe vectors

Sebastian Theiler
Analytics Vidhya
7 min read · Sep 7, 2019


Content

This article will cover:
* Downloading and loading the pre-trained vectors
* Finding similar vectors to a given vector
* “Math with words”
* Visualizing the vectors

Further reading resources, including the original GloVe paper, are available at the end.

Brief Introduction to GloVe

Global Vectors for Word Representation, or GloVe, is an “unsupervised learning algorithm for obtaining vector representations for words.” Simply put, GloVe allows us to take a corpus of text, and intuitively transform each word in that corpus into a position in a high-dimensional space. This means that similar words will be placed together.

If you would like a detailed explanation of how GloVe works, linked articles are available at the end.

Downloading Pre-trained Vectors

Head over to https://nlp.stanford.edu/projects/glove/.
Then underneath “Download pre-trained word vectors,” you can choose any of the four options for different sizes or training datasets.

I have chosen the Wikipedia 2014 + Gigaword 5 vectors. You can download those exact vectors at http://nlp.stanford.edu/data/glove.6B.zip (WARNING: THIS IS AN 822 MB DOWNLOAD).

I cannot guarantee that the methods used below will work with all of the other pre-trained vectors, as they have not been tested.

Imports

We’re going to need NumPy, SciPy, Matplotlib, and scikit-learn for this project.
If you need to install any of these, you can run the following:

pip install numpy
pip install scipy
pip install matplotlib
pip install scikit-learn

Depending on your version of Python, you may need to use pip3 instead of pip.

Now we can import the parts we need from these modules with:
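
import numpy as np
from scipy import spatial
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE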

Loading the Vectors

Before we load the vectors in code, we have to understand how the text file is formatted.
Each line of the text file contains a word, followed by N numbers. The N numbers describe the vector of the word’s position. N may vary depending on which vectors you downloaded; for me, N is 50, since I am using glove.6B.50d.

Here is an example line from the text file, shortened to the first three dimensions:

business 0.023693 0.13316 0.023131 ...

To load the pre-trained vectors, we must first create a dictionary that will hold the mappings between words and their embedding vectors.

embeddings_dict = {}

Assuming that your Python file is in the same directory as the GloVe vectors, we can now open the text file containing the embeddings with:

with open("glove.6B.50d.txt", 'r', encoding="utf-8") as f:

Note: you will need to replace glove.6B.50d.txt with the name of the text file you have chosen for the vectors.

Once inside of the with statement, we need to loop through each line in the file, and split the line by every space, into each of its components.
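
For instance (the variable names below are my own choice):

for line in f:
    values = line.split()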

After splitting the line, we assume the word does not have any spaces in it and set it equal to the first (or zeroth) element of the split line.
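
word = values[0]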

Then we can take the rest of the line and convert it into a NumPy array. This is the vector of the word’s position.
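
vector = np.asarray(values[1:], "float32")  # cast the remaining values to floats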

Finally, we can update our dictionary with the new word and its corresponding vector.
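
embeddings_dict[word] = vector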

As a recap for our full code to load the vectors:
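
embeddings_dict = {}
with open("glove.6B.50d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector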

Keep in mind, you may need to edit the method for separating the word from the vectors if your vector text file includes words with spaces in them.

Finding Similar Vectors

Another thing we can do with GloVe vectors is find the most similar words to a given word. We can do this with a fancy one-liner function as follows:
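
def find_closest_embeddings(embedding):
    return sorted(embeddings_dict.keys(), key=lambda word: spatial.distance.euclidean(embeddings_dict[word], embedding))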

This one’s complicated, so let’s break it down.
sorted takes an iterable as input and sorts it using a key. In this case, the iterable that we are passing in is all possible words that we want to sort. We can get a list of such words by calling embeddings_dict.keys().

Now, since by default Python would sort the list alphabetically, we must specify a key to sort the list the way we want it sorted.
In our case, the key will be a lambda function that takes a word as input and returns the distance between that word’s embedding and the embedding we gave the function. We will be using Euclidean distance to measure how far apart the two embeddings are.

SciPy has a function for measuring Euclidean distance in its spatial module, which we imported earlier. So our final sorting key turns into:

lambda word: spatial.distance.euclidean(embeddings_dict[word], embedding)

Now if we want to rank all words by closeness to a given word, let’s say “king,” we can use:

find_closest_embeddings(embeddings_dict["king"])

This, however, will return every word, so if we want to shorten it, we can use a slice at the end to keep, let’s say, the closest five words.

find_closest_embeddings(embeddings_dict["king"])[:5]

Since the closest word to a given word will always be that word, we can offset our slice by one.

find_closest_embeddings(embeddings_dict["king"])[1:6]

Using my vectors, glove.6B.50d,

print(find_closest_embeddings(embeddings_dict["king"])[1:6])

prints: ['prince', 'queen', 'uncle', 'ii', 'grandson']

The reason the function takes an embedding directly, instead of transforming a word into an embedding, is so that when we add and subtract embeddings, we can find the closest approximate words to the resulting embedding, even if that embedding does not correspond exactly to any word.

Math with Words

Now that we can turn any word into a vector, we can apply any math operation usable on vectors to words.

For example, we can add and subtract words, just like numbers, e.g., twig - branch + hand ≈ finger.
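
In code, that looks something like this (the slice to the closest five results is my own choice):

print(find_closest_embeddings(
    embeddings_dict["twig"] - embeddings_dict["branch"] + embeddings_dict["hand"]
)[:5])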

The above code prints “fingernails” as its top result, which certainly passes as logical.

Visualizing the Vectors

Nothing helps to find insights in data more than visualizing it.

To visualize the vectors, we are first going to use a method known as t-distributed stochastic neighbor embedding, also known as t-SNE. t-SNE will allow us to reduce the data from, in my case, 50 dimensions down to 2 dimensions. After we do that, it’s as simple as using a Matplotlib scatter plot to plot it. If you would like to learn more about t-SNE, there are a few articles linked at the end.

Luckily, sklearn has a t-SNE class that can make our work much more manageable. To instantiate it, we can use:

tsne = TSNE(n_components=2, random_state=0)

n_components specifies the number of dimensions to reduce the data into.
random_state is a seed we can use to obtain consistent results.

After initializing the t-SNE class, we need to get a list of every word, and the corresponding vector to that word.

words = list(embeddings_dict.keys())
vectors = [embeddings_dict[word] for word in words]

The first line takes all the keys of embeddings_dict and converts them to a list.

The second line uses a list comprehension to obtain the value in embeddings_dict that corresponds to each word we chose, and puts those values into list form.

We can also manually specify words so that it will only plot those words, e.g., words = ["sister", "brother", "man", "woman", "uncle", "aunt"].

After getting all the words we want to use and their corresponding vectors, we now need to fit the t-SNE class on our vectors.
We can do this using:

Y = tsne.fit_transform(vectors[:1000])

If you would like, you can remove or expand the slice at the end of vectors, but be warned: this may require a powerful computer.

After the t-SNE class finishes fitting to the vectors, we can use a matplotlib scatter plot to plot the data:

plt.scatter(Y[:, 0], Y[:, 1])

This alone isn’t very useful since it’s just a bunch of dots. To improve it, we can annotate the graph by looping through each (x, y) point together with its label and calling plt.annotate with that point and label. The other inputs to the function are for formatting.
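
One way to do that (the exact formatting arguments, xytext and textcoords, are my own choice):

for label, x, y in zip(words, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords="offset points")

Note that zip stops at the shorter sequence, so only the 1,000 words we actually fit get annotated.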

Finally, we can show the plot with:

plt.show()
Scatter plot of the first 1,000 words: a bit crowded, but you can still see correlations, especially when zoomed in.

This may lag on less powerful computers, so you can either lower the number of words shown, by changing vectors[:1000] to something more like vectors[:250], or change words to a list of your own making.

Scatter plot of words: finger, hand, twig, branch
words = ["branch", "twig", "finger", "hand"]

Full code for visualizing the vectors:
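
Putting the pieces above together (the annotation arguments are, again, my own formatting choices):

tsne = TSNE(n_components=2, random_state=0)

words = list(embeddings_dict.keys())
vectors = [embeddings_dict[word] for word in words]

Y = tsne.fit_transform(vectors[:1000])

plt.scatter(Y[:, 0], Y[:, 1])
for label, x, y in zip(words, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords="offset points")
plt.show()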

Conclusion

The full code is available in Jupyter Notebook and Python file format on my GitHub here.

Only so much can be done with the pre-trained GloVe vectors. For higher-level usage, I would recommend consulting the official README for training your own vectors.

One important usage that was not mentioned above is initializing the embedding layer of a natural language processing model with these vectors. That would, in theory, considerably increase the accuracy of the model and save the time of training a new embedding from scratch.
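
As a rough sketch of that idea (not from the code above; word_index is a hypothetical word-to-index vocabulary you would build for your own model):

embedding_dim = 50  # matches glove.6B.50d
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_dict.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words missing from GloVe stay all zeros

The resulting matrix could then be used as the initial weights of the model’s embedding layer.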

Further Reading:

Papers:
Original GloVe Paper: https://nlp.stanford.edu/pubs/glove.pdf
Original t-SNE Paper: http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

More detailed GloVe explanations:
* https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representation-explained/
* https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/

More detailed t-SNE explanations:
* https://mlexplained.com/2018/09/14/paper-dissected-visualizing-data-using-t-sne-explained/
* https://distill.pub/2016/misread-tsne/
