Demystifying Transformers: Word Embeddings

The Magic behind Language Understanding

Dagang Wei
8 min read · Jan 23, 2024
Source: https://en.wikipedia.org/wiki/Word2vec

This article is part of the series Demystifying Transformers.

Introduction

Imagine a world where machines can understand the nuances of human language, not just its syntax but also the subtle shades of meaning embedded within words. This dream is closer than ever thanks to a powerful technique called word embeddings. But what exactly are word embeddings, and how do they work their magic?

What are Word Embeddings?

Think of a word as a complex tapestry woven from threads of meaning and context. Word embeddings translate this tapestry into a numerical representation, a vector of numbers that captures the essence of the word. Words with similar meanings end up with similar vectors, residing close together in this numerical space. This allows machines to understand the relationships between words, opening doors to a whole new level of language processing.

The key takeaway is that even though words like "king" and "queen" are different symbols, their vectors encode the semantic relationships between them. We can even perform simple arithmetic on these vectors to reveal those relationships:

King - Man + Woman ≈ Queen

This equation, though simplified, demonstrates how word embeddings capture not just individual word meanings but also the relationships between them. This ability to model semantic analogies is what makes word embeddings so powerful for various natural language processing tasks.
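To make the arithmetic concrete, here is a tiny sketch with hand-picked, hypothetical 3-dimensional vectors (real embeddings are learned from data and have hundreds of dimensions):

import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

result = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The closest vector to king - man + woman should be queen
closest = max(vectors, key=lambda w: cosine(vectors[w], result))
print(closest)  # queen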

Why Use Word Embeddings?

Traditional methods represent words as one-hot encoded vectors, where each word has a unique vector with all zeros except for a single “1” at its position. This approach suffers from two major drawbacks:

  • High dimensionality: Large vocabularies lead to huge vectors, making computations slow and inefficient.
  • Lack of semantics: One-hot vectors capture no inherent meaning, making it difficult for machines to understand relationships between words.

Word embeddings overcome these limitations by:

  • Dimensionality reduction: They compress words into lower-dimensional vectors, making them more efficient and manageable.
  • Semantic meaning: The vector captures the meaning of the word, allowing machines to perform tasks like synonym detection, sentiment analysis, and machine translation (see the short sketch after this list).
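To see the contrast in code, here is a minimal sketch with a made-up five-word vocabulary and invented 4-dimensional embedding values:

import numpy as np

vocab = ["cat", "dog", "car", "truck", "apple"]  # tiny, made-up vocabulary

# One-hot: one dimension per word, and every pair of distinct words is orthogonal
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["cat"] @ one_hot["dog"])  # 0.0 -- one-hot vectors carry no similarity signal

# Dense embeddings: far fewer dimensions, and similar words get similar vectors
# (the 4-dimensional values below are invented for illustration)
embedding = {
    "cat":   np.array([0.8, 0.1, 0.0, 0.2]),
    "dog":   np.array([0.7, 0.2, 0.1, 0.2]),
    "car":   np.array([0.0, 0.9, 0.8, 0.1]),
    "truck": np.array([0.1, 0.8, 0.9, 0.1]),
    "apple": np.array([0.2, 0.0, 0.1, 0.9]),
}
print(embedding["cat"] @ embedding["dog"])  # 0.62 -- "cat" and "dog" are close
print(embedding["cat"] @ embedding["car"])  # 0.11 -- "cat" and "car" are not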

How are Word Embeddings Calculated?

There are several algorithms for calculating word embeddings, with word2vec being one of the most popular. Word2vec analyzes a large corpus of text and identifies patterns in how words co-occur with each other. Based on these patterns, it assigns a unique vector to each word, such that words with similar contexts have similar vectors.

Word2vec utilizes a shallow, two-layer neural network. To understand this better, let's dive deeper into the structure and function of these two layers in the context of Word2vec's Skip-Gram architecture, in which the model predicts context words from a target word: given "apple", for example, it predicts related context words like "fruit", "red", and "juicy".

Skip-Gram neural network, source: https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
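Before building it from scratch later in this article, it may help to see the same idea with an off-the-shelf library. The sketch below uses gensim (not otherwise used in this article) and assumes the gensim 4.x API; the toy corpus is invented:

from gensim.models import Word2Vec

# A tiny toy corpus; real training uses millions of sentences
sentences = [
    ["the", "dog", "chases", "the", "ball"],
    ["the", "cat", "chases", "the", "ball"],
    ["the", "dog", "and", "the", "cat", "play"],
]

# sg=1 selects the Skip-Gram architecture (sg=0 would be CBOW)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(model.wv["dog"].shape)                 # (50,): the learned embedding for "dog"
print(model.wv.most_similar("dog", topn=3))  # nearest neighbors in the embedding space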

Basic Structure

Input Layer:

This is the first layer, where the input to the model is presented: a single one-hot encoded vector representing the target word.

Hidden Layer:

This layer doesn’t perform any complex calculations. It acts as a transfer station for the data, projecting the input from a high-dimensional space (one-hot encoding) to a lower-dimensional, continuous space (the word embedding space). The number of neurons in this layer corresponds to the desired dimensionality of the word embeddings. There are no activation functions applied in this layer in the Word2vec model.

You might be wondering about this hidden layer: "That one-hot vector is almost all zeros… what's the effect of that?" If you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 weight matrix, the multiplication effectively just selects the matrix row corresponding to the "1". This means the hidden layer is really just operating as a lookup table: its output is simply the "word vector" for the input word. Here's a small example to give you a visual.

Hidden Layer, source: https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
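Here is a quick numerical check of that lookup behavior, using toy sizes (5 words, 3 dimensions) instead of 10,000 x 300:

import numpy as np

vocab_size, embedding_dim = 5, 3  # toy sizes instead of 10,000 x 300
W1 = np.arange(vocab_size * embedding_dim).reshape(vocab_size, embedding_dim)

one_hot = np.zeros(vocab_size)
one_hot[2] = 1  # pretend word #2 is the input word

hidden = one_hot @ W1  # multiplying by the one-hot vector...
print(hidden)                         # [6. 7. 8.]
print(np.array_equal(hidden, W1[2]))  # True: ...just reads out row 2 of the weight matrix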

Output Layer:

This layer is where the final calculation is done to predict context words. Each output neuron produces a score for one vocabulary word, and a softmax turns those scores into probabilities. For efficiency, practical implementations typically replace the full softmax with hierarchical softmax or negative sampling during training.

Here’s an illustration of calculating the output of the output neuron for the word “car”.

Output Layer, source: https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
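In code, that calculation looks roughly like the sketch below, with toy dimensions, random weights, and a made-up index for "car" (the real model learns W2 during training):

import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))  # subtract max for numerical stability
    return e / e.sum()

embedding_dim, vocab_size = 4, 6                   # toy sizes
rng = np.random.default_rng(0)
h = rng.normal(size=embedding_dim)                 # hidden layer: the input word's embedding
W2 = rng.normal(size=(embedding_dim, vocab_size))  # hidden -> output weights

car_index = 3                                      # pretend output neuron 3 corresponds to "car"
score_car = h @ W2[:, car_index]                   # dot product for the "car" output neuron
probs = softmax(h @ W2)                            # softmax over all vocabulary scores
print(score_car, probs[car_index])                 # raw score and probability for "car"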

Training

  • The weights between the input and hidden layers become the word embeddings.
  • During training, these weights are adjusted to reduce the error in predicting the target word (CBOW) or context words (Skip-Gram).
  • The training process involves optimizing a loss function, typically through backpropagation and an optimization algorithm like Stochastic Gradient Descent (SGD); the loss being optimized is spelled out below.
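Concretely, for a target word w_t and an observed context word w_c, Skip-Gram with a full softmax minimizes the negative log-likelihood of the context word:

Loss = -log p(w_c | w_t), where p(w_c | w_t) = exp(u_c) / Σ_w exp(u_w)

Here u_w is the dot product between the hidden-layer vector for w_t (its embedding) and the output-layer weight vector for word w. The from-scratch implementation in the next section computes exactly this softmax, and the error term e = y_pred - y in its training loop is the gradient of this loss with respect to the scores u.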

Implementing word2vec from Scratch

Here's a simplified Python implementation demonstrating the basic idea of word2vec (Skip-Gram). The code is available in this Colab notebook:

import numpy as np

class Word2Vec:
    def __init__(self, corpus, embedding_dim, window_size):
        # Initial settings
        self.corpus = corpus  # Input corpus of sentences
        self.embedding_dim = embedding_dim  # Size of the word embedding vectors
        self.window_size = window_size  # Context window size
        self.sentences = self.tokenize_corpus()  # Tokenize the corpus into words
        self.vocab = self.build_vocab()  # Create a vocabulary from tokens
        self.vocab_size = len(self.vocab)  # Number of unique words in vocab
        self.training_data = self.generate_training_data()  # Generate training pairs

        # Initialize weights randomly
        # W1: Weight matrix from input layer to hidden layer
        # Shape: [vocab_size, embedding_dim]
        # This matrix stores the word embeddings (weights) for each word in the vocabulary
        self.W1 = np.random.rand(self.vocab_size, self.embedding_dim)
        print('W1 shape:', self.W1.shape)
        # W2: Weight matrix from hidden layer to output layer
        # Shape: [embedding_dim, vocab_size]
        # This matrix helps in predicting the context words in the output layer
        self.W2 = np.random.rand(self.embedding_dim, self.vocab_size)
        print('W2 shape:', self.W2.shape)

    # Tokenize each sentence in the corpus into words
    def tokenize_corpus(self):
        sentences = [sentence.split() for sentence in self.corpus]
        return sentences

    # Build a vocabulary dictionary mapping each unique word to an index
    def build_vocab(self):
        vocab = {}
        for sentence in self.sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = len(vocab)
        print(vocab)
        return vocab

    # Convert a word to its one-hot encoded vector
    def word_to_one_hot(self, word):
        one_hot_vector = np.zeros(self.vocab_size)
        one_hot_vector[self.vocab[word]] = 1
        return one_hot_vector

    # Generate training data: pairs of target words and context words
    def generate_training_data(self):
        training_data = []
        for sentence in self.sentences:
            sentence_length = len(sentence)
            for i, word in enumerate(sentence):
                w_target = self.word_to_one_hot(word)
                for j in range(i - self.window_size, i + self.window_size + 1):
                    if j != i and j >= 0 and j < sentence_length:
                        w_context = self.word_to_one_hot(sentence[j])
                        training_data.append([w_target, w_context])
        return np.array(training_data)

    # Train the model
    def train(self, epochs, learning_rate):
        for epoch in range(epochs):
            for x, y in self.training_data:
                # Forward pass
                h = np.dot(self.W1.T, x)  # Hidden layer: input -> hidden
                u = np.dot(self.W2.T, h)  # Output layer: hidden -> output
                y_pred = self.softmax(u)

                # Calculate error
                # Error is the difference between the predicted probability and actual output
                e = y_pred - y

                # Backward pass
                # The gradients are calculated for each weight
                dW2 = np.outer(h, e)  # Gradient for W2
                dW1 = np.outer(x, np.dot(self.W2, e))  # Gradient for W1

                # Update weights
                # Adjust the weights by a fraction of the gradient defined by the learning rate
                self.W1 -= learning_rate * dW1
                self.W2 -= learning_rate * dW2

    def softmax(self, x):
        # Apply the softmax function to convert logits to probabilities
        # The softmax function exponentiates each element of the input vector x
        exp_x = np.exp(x - np.max(x))  # Subtract max for numerical stability

        # The exponentiated values are then divided by the sum of all exponentiated values
        # This normalization ensures that the sum of the probabilities is 1
        return exp_x / exp_x.sum(axis=0)

    def word_vector(self, word):
        if word not in self.vocab:
            raise ValueError(f"Word '{word}' not in vocabulary.")
        word_index = self.vocab[word]
        return self.W1[word_index]

    def cosine_similarity(self, vec1, vec2):
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    def most_similar(self, word, top_n=5):
        # Get the vector for the specified word
        word_vec = self.word_vector(word)

        # Compute similarities with all other words
        similarities = {}
        for other_word in self.vocab:
            if other_word != word:
                other_vec = self.word_vector(other_word)
                similarity = self.cosine_similarity(word_vec, other_vec)
                similarities[other_word] = similarity

        # Sort by similarity
        sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

        # Return the top N most similar words
        return sorted_similarities[:top_n]


# Example usage
corpus = [
    "I am good",
    "I am fine",
    "I am great",
    "you look good",
    "you look great",
    "this is a good example",
    "this is a great example",
    "the dog is running after the ball",
    "the cat and the dog love the ball"]
embedding_dim = 100
window_size = 2
w2v = Word2Vec(corpus, embedding_dim, window_size)
w2v.train(epochs=1000, learning_rate=0.01)

# Get the embedding of a word
word_embedding = w2v.word_vector('example')
print(word_embedding)

similar_words = w2v.most_similar('good', top_n=3)
print('Word similar to good:', similar_words)

Output:

W1 shape: (28, 100)
W2 shape: (100, 28)
[ 2.02489092e-01 7.87826053e-01 2.12122495e-01 5.49380617e-01
-1.23953243e-01 5.08364189e-01 5.91681159e-01 6.12269661e-01
7.52795694e-01 1.68013259e-01 5.77403382e-01 5.38774329e-01
3.34214330e-01 6.44525857e-01 2.64736492e-01 5.82739463e-01
3.96980663e-01 7.06752423e-01 1.01392307e-01 9.08294874e-01
6.42693592e-01 3.69503839e-01 1.77432265e-01 -7.83970808e-02
2.69390523e-01 6.04354901e-01 4.09035829e-01 6.89237200e-01
2.01979587e-01 2.97523184e-01 3.06296989e-01 6.72962789e-01
7.54014187e-01 7.81821373e-01 9.80171511e-02 8.12635833e-01
1.45524960e-01 -5.97935284e-02 1.89211154e-01 4.47557534e-01
6.60666557e-01 6.83456539e-01 -4.80685994e-01 8.69786030e-01
-1.38245244e-01 4.49565065e-01 -1.38461497e-01 2.67919060e-01
3.44476747e-01 2.18233931e-01 6.66441738e-01 6.04339866e-01
-4.64484939e-02 4.24076679e-01 3.45853469e-01 7.76911366e-02
3.47163257e-01 -2.19237290e-01 3.22161043e-01 -3.41669900e-01
5.64418121e-01 1.93853809e-01 1.88587366e-01 3.54927038e-01
2.32597130e-01 4.55817458e-02 -1.75579628e-05 -9.12571817e-02
5.62141213e-01 3.99111478e-01 5.28767896e-02 5.27965302e-01
2.30041924e-01 -1.93457227e-01 4.69790009e-01 5.72684559e-01
4.75133151e-01 4.53629268e-01 -2.28444279e-01 4.15058636e-02
-1.02829206e-02 4.66743210e-01 1.65841396e-01 1.00376741e+00
2.61169527e-01 4.14299641e-01 3.85583026e-01 7.24166072e-01
5.85466492e-01 1.67735428e-01 9.51088576e-01 2.58980119e-01
1.90506875e-02 6.98980051e-01 5.68714149e-01 4.89752914e-01
3.85786790e-01 4.61709453e-01 5.15426900e-01 3.79242735e-01]
Word similar to good: [('great', 0.731331766286727), ('fine', 0.6897219267602511), ('last', 0.676365588361946)]

Conclusion

By demystifying word embeddings, we open a window into the fascinating world of how machines can understand and process language. As this technology continues to evolve, it promises to revolutionize how we interact with computers and unlock new possibilities for communication and understanding.

References

  • Word2vec, Wikipedia: https://en.wikipedia.org/wiki/Word2vec
  • Chris McCormick, "Word2Vec Tutorial - The Skip-Gram Model": https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/