Implementing GloVe from scratch — Word Embedding for Transformers

Deepanshusachdeva · Published in Nerd For Tech · Jun 23, 2023

In Understanding Transformers Step by Step — Word Embeddings, we learnt about the significance of word embeddings and why they are used.

In today’s article we will go in depth on yet another learned word embedding: GloVe (Global Vectors for Word Representation), an algorithm for learning word embeddings that aims to capture the global co-occurrence statistics of words in a corpus.

Here’s a step-by-step explanation of the GloVe algorithm:

  1. Constructing the Co-occurrence Matrix:
    GloVe starts by constructing a co-occurrence matrix from a large corpus. The co-occurrence matrix represents how frequently words co-occur within a given context window.
    The size of the co-occurrence matrix is determined by the vocabulary size, where each row and column corresponds to a specific word.
import numpy as np
from collections import defaultdict

corpus = ["I love chocolate", "I love ice cream", "I enjoy playing tennis"]

# Initialize vocabulary and co-occurrence matrix
vocab = set()
co_occurrence = defaultdict(float)

window_size = 4

# Iterate through the corpus to build vocabulary and co-occurrence matrix
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words)):
        word = words[i]
        vocab.add(word)
        # Look at the neighbouring words within the context window
        for j in range(max(0, i - window_size), min(i + window_size + 1, len(words))):
            if i != j:
                # Closer neighbours contribute more (distance-weighted count)
                co_occurrence[(word, words[j])] += 1.0 / abs(i - j)

In the last line, we use co_occurrence[(word, words[j])] += 1.0 / abs(i - j). The expression 1.0 / abs(i - j) is the co-occurrence weight. The matrix keeps track of which words appear together in the corpus, so the distance |i - j| between their positions measures how far apart they are: the smaller the distance, the stronger the relationship between the two words. We therefore take the reciprocal of the distance as the weight.
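To see what this produces, here is a small sanity check you could run on the co-occurrence dictionary built above; the printed values follow directly from the three-sentence toy corpus:

# Inspect the co-occurrence entries for one word from the toy corpus
for (w1, w2), value in co_occurrence.items():
    if w1 == "love":
        print((w1, w2), value)

# Expected entries (order may vary):
# ('love', 'I') 2.0         -> distance 1 in both of the first two sentences
# ('love', 'chocolate') 1.0 -> distance 1
# ('love', 'ice') 1.0       -> distance 1
# ('love', 'cream') 0.5     -> distance 2, so weight 1/2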
Now that we have our co-occurrence matrix, we jump to the next step, i.e. initializing the word embeddings:

embedding_dim = 10
word_embeddings = {
    word: np.random.randn(embedding_dim) for word in vocab
}

Here, as you can see, we can choose the embedding dimension ourselves, i.e. it is a hyper-parameter; it defines the number of features you want each word vector to have.
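As a quick check (assuming the vocab and word_embeddings built above), every word should now map to a vector of length embedding_dim:

# Each word in the toy vocabulary gets its own random 10-dimensional vector
print(len(word_embeddings))            # 8 unique words in the toy corpus
print(word_embeddings["love"].shape)   # (10,)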

Now comes the last step of this basic implementation of GloVe, and you guessed it: we just need to train the word embeddings.
To do that, we first need a loss function. So far the embeddings are just random numbers. For each word pair in the co-occurrence matrix, we take the dot product of the two embeddings as a similarity score and compare it to the logarithm of the observed co-occurrence value using a squared error. The gradient of this loss with respect to word_i's embedding is the error (dot_product - log(count)) multiplied by word_j's embedding, so we can update the embeddings by stepping in the direction of the negative gradient, i.e. plain gradient descent.

learning_rate = 0.1
num_epochs = 100

# Gradient descent to update word embeddings
for epoch in range(num_epochs):
    total_loss = 0
    for (word_i, word_j), observed_count in co_occurrence.items():
        # Dot product of the two word embeddings acts as the similarity score
        dot_product = np.dot(word_embeddings[word_i], word_embeddings[word_j])

        # Squared error between the score and the log of the observed count
        diff = dot_product - np.log(observed_count)
        total_loss += 0.5 * diff**2

        # Gradient with respect to word_i's embedding;
        # for simplicity, only word_i's vector is updated here
        gradient = diff * word_embeddings[word_j]
        word_embeddings[word_i] -= learning_rate * gradient

    print(f"Epoch: {epoch+1}, Loss: {total_loss}")

The training loop above repeats this update for num_epochs passes over the co-occurrence matrix. At the end of it, we have our trained word embeddings, which can then be used for a wide variety of NLP tasks such as machine translation, paraphrasing, etc.
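As a minimal sketch of how the trained vectors might be used (assuming the word_embeddings dictionary trained above; a three-sentence corpus is far too small to give meaningful results), cosine similarity between two embeddings gives a rough measure of how related the words are:

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(word_embeddings["chocolate"], word_embeddings["cream"]))
print(cosine_similarity(word_embeddings["chocolate"], word_embeddings["tennis"]))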

So, this was an implementation of the GloVe algorithm for word embeddings from scratch.
Note: this was just a basic implementation of the algorithm. Better results can be achieved by scaling the co-occurrence values; the original GloVe paper applies a weighting function to each squared-error term and adds per-word bias terms, which tones down the influence of rare and very frequent co-occurrences and results in a better fit.
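For reference, the weighting function in the original GloVe paper is f(x) = (x / x_max)^alpha, capped at 1, with x_max = 100 and alpha = 0.75 as the suggested defaults. A hedged sketch of how the loss above could incorporate it:

def weighting(x, x_max=100, alpha=0.75):
    # Down-weights rare co-occurrences and caps the weight of frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0

# Inside the training loop above, the weighted loss and gradient would become:
# weight = weighting(observed_count)
# total_loss += 0.5 * weight * diff**2
# gradient = weight * diff * word_embeddings[word_j]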

Colab Notebook: https://colab.research.google.com/drive/1IxAnnFSqk3mL3A8n1PKYWdEzDSd2Y9rF?usp=sharing

Keep Learning!
