Understanding Transformers Part 1

Dhruv Kabra
Published in Version 1
Jun 16, 2023

The Transformer is the revolutionary architecture that transformed the world of machine learning. To understand it, let's look at the architecture proposed in "Attention Is All You Need".

Source: Vaswani et al. (2017)

Let’s start by explaining each block one by one.

The first step is input embeddings. What are they?

Since a computer only understands numbers, or rather vectors of numbers, an embedding is created for each word. Why are word embeddings needed instead of one-hot encoding? Because one-hot encoding has several problems:

1. It creates huge vectors: Imagine you have a dictionary of 10,000 words. Now, for each word, you need a list of 10,000 items, with a ‘1’ at just one position and ‘0’s everywhere else. This is like carrying a big, heavy suitcase for just one tiny thing.

2. It wastes a lot of space: in the big list for each word, only one spot is used (the ‘1’), and the rest (the ‘0’s) are just empty. This is like having a giant parking lot with only one car.

3. It doesn’t understand word meanings. With one-hot encoding, the words “dog” and “car” are as different as “dog” and “puppy”. But we know that isn’t true, right? It’s like saying apples and oranges are as different as apples and basketballs.

4. It doesn’t consider the context: words can mean different things in different situations. For example, “bark” can mean a dog’s sound or the skin of a tree. One-hot encoding can’t understand this. It’s like seeing a picture but not understanding the story behind it.

The major issue is that one-hot vectors are all orthogonal to each other, so the cosine similarity between any two distinct words is zero. Words like "car" and "cat" look exactly as unrelated as "dog" and "puppy", which means one-hot encoding is unable to capture that two items are closely related, as the short sketch below shows.
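To see both the wasted space and the orthogonality problem concretely, here is a quick NumPy sketch (the three-word vocabulary is made up purely for illustration):

import numpy as np

vocab = ["dog", "puppy", "car"]          # toy vocabulary, for illustration only
vocab_size = len(vocab)

def one_hot(word):
    # A vector of length vocab_size with a single 1 -- mostly empty space
    vec = np.zeros(vocab_size)
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot("dog"), one_hot("puppy")))   # 0.0 -- related words look unrelated
print(cosine(one_hot("dog"), one_hot("car")))     # 0.0 -- same "distance" as an unrelated word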

Word embeddings are like a magic tool that helps computers to understand words, just like we humans do. They do this by looking at two main things:

  1. Semantics: This is all about the meaning of words. For example, in the sentence “The cat is lying on the floor and the dog is eating,” we can swap “cat” and “dog” to make a new sentence: “The dog is lying on the floor and the cat is eating.” The sentences still make sense because both cats and dogs are animals.
  2. Syntax: This is the grammar or structure of the sentences. The word embeddings understand that “dog” and “cat” play the same role in the sentence structure, even when swapped around.

To resolve these issues, word embeddings are trained using techniques such as Word2Vec.

Word2Vec is an algorithm developed at Google for training word embeddings. It relies on the distributional hypothesis, which states that words that often have the same neighbouring words tend to be semantically similar.

Source: Mikolov et al. (2013)

As you can see, the Word2Vec model is trained on the neighbourhoods in which words appear. Hence, words such as "garden" and "hose" end up close to each other in the embedding space, even though the cosine similarity of their one-hot vectors would tell us nothing.
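Here is a rough sketch of how such embeddings are trained in practice, assuming the gensim library (version 4 or later); the toy corpus below is made up and far too small to learn meaningful vectors, it only shows the shape of the API:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (a real corpus would be far larger)
sentences = [
    ["the", "dog", "is", "playing", "in", "the", "garden"],
    ["the", "cat", "is", "sleeping", "in", "the", "garden"],
    ["she", "watered", "the", "garden", "with", "a", "hose"],
]

# sg=0 selects the CBOW objective; sg=1 would select skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

print(model.wv["garden"][:5])              # first few dimensions of the learned vector
print(model.wv.similarity("dog", "cat"))   # similarity learned from shared neighbourhoods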

Mikolov et al. in 2013 proposed two models:

  1. Continuous Bag-of-Words Model
  2. Continuous Skip-gram Model

Continuous bag-of-words model

A Continuous Bag-of-Words (CBOW) model takes n words before and after the target word (wt) as context and predicts the target word itself. n can be any number.

Suppose we have the sentence, "The dog is playing in the park". We want our computer model to understand each word in this sentence and how it connects with the others. So we choose a small group of words around our target word, 'playing'. If we set our window size (n) as 2, then our chosen context becomes ['dog', 'is', 'in', 'the']: the two words on either side of 'playing'.

What's great about this model is that it doesn't have to spend time computing the probability of every word in the entire vocabulary at each step. Using tricks such as hierarchical softmax, the cost of each prediction drops from V operations to roughly log2(V), where V is the total number of different words we know. The lower this value, the faster the model works. So, by focusing on a small context window and a cheap output computation, we make things simpler and faster for our computer model.
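Below is a minimal PyTorch sketch of a CBOW model. The tiny corpus, window size and hyper-parameters are purely illustrative, and the training loop is kept deliberately simple.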

import torch
import torch.nn as nn

EMBEDDING_DIM = 100


def make_context_vector(context, word_to_ix):
    # Convert a list of context words into a tensor of vocabulary indices
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


raw_text = """A long time ago in a galaxy far, far away...
It is a period of civil war.""".split()

# Build the vocabulary and a word -> index lookup table
vocab = set(raw_text)
vocab_size = len(vocab)
word_to_ix = {word: ix for ix, word in enumerate(vocab)}


def CBOW(tokens, window_size=2):
    # For each target word, gather `window_size` words on either side as context
    data = []
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        target = tokens[i]
        data.append((context, target))
    return data


class CBOW_Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()

        # out: 1 x embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()

        # out: 1 x vocab_size
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # Sum the context word embeddings into a single 1 x embedding_dim vector
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        return out

    def get_word_embedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)


data = CBOW(raw_text)
model = CBOW_Model(vocab_size, EMBEDDING_DIM)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training
for epoch in range(50):
    total_loss = 0

    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)

        logits = model(context_vector)

        total_loss += loss_function(logits, torch.tensor([word_to_ix[target]]))

    # Optimize at the end of each epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
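Once trained, the embedding for any word in the (toy) vocabulary can be read straight out of the embedding layer, for example:

# Illustrative lookup after training ("galaxy" appears in the toy corpus above)
print(model.get_word_embedding("galaxy"))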

There are other techniques such as the skip-gram model, but an in-depth explanation of word embedding techniques is beyond the scope of this article.

You might ask why positional embeddings matter.

Consider the sentence "Even though she did not win the award, she was satisfied." Now compare it with "Even though she did win the award, she was not satisfied." The words are the same, but moving "not" changes the meaning completely. Earlier NLP techniques that ignored word positions could not tell such sentences apart, so in the revolutionary paper "Attention Is All You Need" the authors introduced a new set of vectors called positional encodings.
Each word embedding vector is added to its positional encoding vector, giving every position in the sequence a distinct representation. Why isn't a token's position in Transformer models represented by a single integer, such as its index value? Because the indices can become very large over long sequences, and if the index value is normalized to fall between 0 and 1, sequences of different lengths would be normalized differently.
Another intuitive option would be to encode positions as fractions, for example representing position pos in a sentence of N words as pos / (N - 1). Here again the problem is that sentences differ in length: position 2 would be encoded as 2/4 = 0.5 in a five-word sentence but as 2/9 ≈ 0.22 in a ten-word sentence, so the same relative position gets different values in different sentences, which can confuse the model.

Let's dig deeper into how position vectors can be represented. The authors came up with the brilliant idea of representing the position as a set of sinusoids of different frequencies. Let's try to explain it more intuitively. Think of a clock (cos and sin are, after all, just coordinates on the unit circle). Every two dimensions of the positional embedding specify one of the clock's hands (the hour hand, the minute hand, or the second hand, for example). Moving from one position to the next is then just rotating those hands, each at its own frequency.
Let’s see this positional vector:

PE(pos, 2i) = sin(pos / 10000^(2i/d))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Here d represents the size of the word embedding vector, pos represents the position of the word in the sentence (0 for the first word, and so on), and i indexes the pairs of embedding dimensions, so each pair gets its own frequency. For sentences of different lengths the encoding of a given position stays exactly the same, and the relationship between nearby positions is preserved; that's the beauty of the sinusoids. The authors of the original paper didn't use only the sine wave, they used a combination of sine and cosine functions: sine for the even dimensions and cosine for the odd ones.

As shown in the figure, with this technique the relative positions remain the same irrespective of sentence length. (source: @HEDUAI, via tumblr)
A positional vector for a single token in a given input sequence.
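Here is a small NumPy sketch of the sinusoidal encoding described above; the sequence length and model dimension are arbitrary choices for illustration:

import numpy as np

def positional_encoding(seq_len, d_model):
    # One row per position, one column per embedding dimension (d_model assumed even)
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]                   # shape: (seq_len, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)                      # sine on even dimensions
    pe[:, 1::2] = np.cos(position / div_term)                      # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16) -- one positional vector per token, added to its word embedding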

Before the Transformer, recurrent neural networks (RNNs) were the state of the art for machine translation and other sequence-to-sequence tasks, where a sequence is an ordered set of tokens such as the words of a sentence. However, RNNs have two major drawbacks.

The first is that they are slow: the inputs must be fed in one at a time, and the outputs are generated sequentially one at a time, so training is hard to parallelize. The second is that it is not clear they truly represent the context of a word. After all, the context of a word depends on the words that come before it as well as the words that come after it, yet a plain recurrent network only receives signals from the words that come before. Even bidirectional recurrent networks only patch over this by combining a separate left-to-right pass with a right-to-left pass, rather than letting every word look at every other word directly.

Why Attention is important and intuitive:-

Let's take a simple example: "My name is Dhruv". Notice that the word 'my' refers to the word 'Dhruv'; in other words, 'my' pays attention to 'Dhruv'. Suppose we build a matrix of attention between the words.

How attention relates words to each other.

Of course, to the computer each attention strength is represented by a number, but you get the idea: each word in a sentence has an affinity towards particular other words.

Let's take a deep dive into how multi-headed attention works in the Transformer.

Let's take the sentence in the picture below. The words are tokenized and vectorized according to the word embeddings and positional embeddings.

Vectorization of tokens (source: @arkaung, via YouTube)

Let's now take the dot product of a particular word's vector with every other word's vector; remember, the more closely related two words are, the higher their dot product will be. So in this example the other occurrence of 'bank' might give the highest dot product. A softmax function, used widely in classification problems, then normalizes the scores into weights.

As you can see, Sn1, Sn2, and Sn3 are the possible dot product scores, as represented in the picture.

Taking the dot product of each possible word pair, as shown (source: @arkaung, via YouTube)

These scores are normalized by a softmax function, since raw dot products can take values on an arbitrary scale. Let the resulting weights be represented by w₁, w₂, …, wₙ.

The contextualized outputs are then the weighted sums:

Y₁ = w₁₁v₁ + w₁₂v₂ + w₁₃v₃ + … + w₁ₙvₙ

Y₂ = w₂₁v₁ + w₂₂v₂ + w₂₃v₃ + … + w₂ₙvₙ

Yₙ = wₙ₁v₁ + wₙ₂v₂ + wₙ₃v₃ + … + wₙₙvₙ

These Y₁, Y₂, …, Yₙ are contextualized representations of each word token. Let's dig deeper into how they are computed. For every token, the Transformer derives three vectors:

Query Vector: — What am I looking for?

Key Vector: — What can I offer?

Value Vector: — What do I actually offer?

Let's try to explain the whole concept. Here K1, K2 and K3 represent the key vectors and Q1 is a query vector. For the sake of convenience, imagine each key and query vector as 1x50; they are produced by multiplying the token vectors by learned 50x50 matrices such as Mk (this will be explained later). The dot products of Q1 with each key vector are calculated, and the resulting scores S21, S22, … are normalized by a softmax function. Those weights are then used to combine the value vectors, as shown in the figure below, to form the output vector Y.
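To make the projection step concrete, here is a rough NumPy sketch; the dimensions and matrix names are illustrative (chosen to match the 1x50 example above rather than the sizes used in the paper), and the matrices are randomly initialised here, whereas in the real model they are learned:

import numpy as np

d_model = 50                       # illustrative embedding size from the example above
x = np.random.randn(3, d_model)    # three token vectors (word + positional embedding)

# Projection matrices (random here for illustration; learned in the real model)
M_q = np.random.randn(d_model, d_model)
M_k = np.random.randn(d_model, d_model)
M_v = np.random.randn(d_model, d_model)

Q = x @ M_q    # "what am I looking for?"
K = x @ M_k    # "what can I offer?"
V = x @ M_v    # "what do I actually offer?"

scores = Q @ K.T     # dot product of every query with every key
print(scores.shape)  # (3, 3) -- one score per (query, key) pair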

Let’s see how the original authors of the paper explained attention, as represented in the image below.

Source: Vaswani et al. (2017)

As shown, the query and key are multiplied (a dot product). The result is then scaled, i.e. divided by the square root of the key dimension dₖ, as shown in the figure above. The mask is optional: it is not used in the encoder phase, only in the decoder phase.

What is Scale in the diagram and what’s the intuition behind it?

Suppose you have to find the magnitude of a vector, as shown below.

For a three-dimensional vector a = (a₁, a₂, a₃), the formula for its magnitude is

∥a∥ = sqrt(a₁² + a₂² + a₃²), if you recall the vector magnitude formula from high school. So if every component is of roughly the same size, say 2, the magnitude is sqrt(3) x 2: it grows with the square root of the number of dimensions, and dividing by sqrt(dₖ) (here sqrt(3)) cancels out that growth. Why is this needed? Remember that the scores pass through a softmax function, and when the numbers are huge the gradients become very small, so learning slows down. The attention function can be summarized with three inputs: Query, Key and Value.

Source: Vaswani et al. (2017)

Let's write some simple attention code to reinforce our understanding of Q, K and V.

# Generate data
import numpy as np
import math

L, d_k, d_v = 4, 8, 8  # suppose 4 is the length of our token sequence: "My name is Dhruv"
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)

# Dot product of every query with every key
np.matmul(q, k.T)

# Example output (your values will differ, since q and k are random):
array([[ 1.9385252 ,  5.43647918, -0.38370563,  1.24225801],
       [ 1.35187753,  1.19807371, -1.70999851, -0.38129862],
       [ 1.06382646, -0.86860778, -1.86251774, -0.68520405],
       [ 2.21209236, -2.81995366,  5.32327746,  2.24049732]])


# Why we need sqrt(d_k) in the denominator
q.var(), k.var(), np.matmul(q, k.T).var()

# As you can see, the variance of the raw scores (the last number) is much higher,
# which might distort learning:
(0.8672192297664698, 0.9229851723027697, 5.1446872979260165)

# Scaled dot product brings the variance back in line with q and k
scaled = np.matmul(q, k.T) / math.sqrt(d_k)
q.var(), k.var(), scaled.var()
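To finish the attention computation, the scaled scores are passed through a row-wise softmax and used to weight the value vectors; this is a small continuation of the sketch above:

# Row-wise softmax turns each row of scores into weights that sum to 1
def softmax(x):
    exp_x = np.exp(x - x.max(axis=-1, keepdims=True))   # subtract the max for numerical stability
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

attention_weights = softmax(scaled)        # shape (L, L): how much each token attends to every other
output = np.matmul(attention_weights, v)   # shape (L, d_v): contextualized representation of each token

print(attention_weights.sum(axis=-1))      # each row sums to 1
print(output.shape)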

I hope this gives you an understanding of a single attention mechanism (one head) in the Transformer. To build attention across different relationships at once, the Transformer runs several copies of the same computation in parallel, each with its own projection matrices, and concatenates their outputs at the end, as shown above.
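Here is a rough NumPy sketch of that multi-head idea; the head count and dimensions are arbitrary, and the final output projection used in the paper is omitted for brevity:

import numpy as np

def softmax(x):
    exp_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

L, d_model, num_heads = 4, 16, 2
d_head = d_model // num_heads                    # each head works in a smaller subspace

x = np.random.randn(L, d_model)                  # token vectors (word + positional embedding)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

# Project, then split the last dimension into (num_heads, d_head)
Q = (x @ W_q).reshape(L, num_heads, d_head).transpose(1, 0, 2)   # (heads, L, d_head)
K = (x @ W_k).reshape(L, num_heads, d_head).transpose(1, 0, 2)
V = (x @ W_v).reshape(L, num_heads, d_head).transpose(1, 0, 2)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)              # (heads, L, L)
weights = softmax(scores)
heads = weights @ V                                              # (heads, L, d_head)

# Concatenate the heads back into a single (L, d_model) representation
multi_head_output = heads.transpose(1, 0, 2).reshape(L, d_model)
print(multi_head_output.shape)                                   # (4, 16)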

About the Author:
Dhruv Kabra is a Python Developer here at Version 1.
