Understanding Transformers

Jeydev
Dec 5, 2021 · 9 min read

How do they work? Pt-2 “WTH is a transformer”

A meme from https://www.meme-arsenal.com/

We know that Transformers produce state-of-the-art (SOTA) results on many NLP as well as CV tasks. Today, in the second episode of “WTH, is a Transformer”, we will try to understand what gives them this enormous power.

Recap:

In the last episode of “WTH, is a Transformer” series we’ve discussed:

  1. Why do we need the Transformer architecture?
  2. The intuition behind basic Attention architecture and its types.
  3. Problems with RNNs and LSTMs.
  4. A quick overview of Transformers architecture.

If you are new here check out the previous Episode, https://medium.com/@ai.paperdeck/but-wth-is-a-transformer-dbd6cf3700b5

Transformers In-Detail

From the previous episode, we know that

  1. Transformers don't process text sequentially like RNNs; rather, they process the whole set of text embeddings at once.
  2. Transformers consist of an Encoder and a Decoder.
  3. To overcome the lack of positional information about the words, the authors introduced “Positional Encodings”.
  4. We learned about the basic intuition behind the Attention Mechanism.

Alright, let's begin from here

The goal of the model (Task)

To make our understanding clear, let us imagine that our task is question answering and we have a training sample with the input sentence “Who is Elon Musk” and the desired output sentence “He is the CEO of SpaceX”. We will use this example to explain the whole architecture.

Tokenization

We all know that Deep Learning models can only understand numbers, so we need a method to convert text data into numbers, and this is where Tokenizers come in to help us.

Using a Tokenizer, we tokenize our sentence into words and assign each token an integer index that is stored in the “Tokenizer vocabulary”.

An image sampled from Medium representing Tokenizer Vocabulary

A Tokenizer Vocabulary looks like this: it contains the words as well as their word indexes.

For example, in this vocabulary the words “black island” are tokenized to the values [2409, 5282]; this becomes our input to the Transformer architecture, and the same is done for the outputs.
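As a rough illustration, here is a minimal sketch of this lookup step in Python with a tiny hypothetical vocabulary (the words, indexes and the tokenize helper are made up for illustration; real tokenizers are far larger and smarter):

# Hypothetical toy vocabulary mapping words to integer indexes
vocab = {"who": 1, "is": 2, "elon": 4, "musk": 5, "black": 2409, "island": 5282}

def tokenize(sentence):
    # lowercase, split on whitespace, and look up each word's index
    return [vocab[word] for word in sentence.lower().split()]

print(tokenize("black island"))      # [2409, 5282]
print(tokenize("Who is Elon Musk"))  # [1, 2, 4, 5]

A real tokenizer also has to handle words that are missing from the vocabulary, which brings us to the next question.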

But, How Do we process words that are not in our vocabulary?

Well, we use subword tokenizers like “BPE” and “WordPiece” to solve this problem. The basic goal of the BPE (Byte Pair Encoding) tokenizer is to split a word that is not in the vocabulary into smaller subword units that are in the vocabulary and combine them.

We will discuss the BPE tokenizer in depth in upcoming episodes.

In addition to this, <bos> and <eos> tokens are added at the beginning and end of the text to mark the beginning and end of the sequence.

We will understand why we need this soon.

Word Embedding and Positional Embedding

Now we are done with our preprocessing step, i.e. Tokenization. Let's understand what happens when we pass in these “tokenized indexes”.

Keep in mind: Transformers don't process these inputs one by one like an RNN does.

Embedding Layer:

The tokenized inputs are sent into the Embedding layer, which is essentially a trainable lookup table (equivalent to a Linear layer applied to one-hot inputs) that converts the inputs into a tensor of shape (N, embd_D), where N is the number of tokenized word indexes that we feed in and embd_D is the dimension of the embedding to which our inputs are transformed.

An implementation of the Embedding layer in PyTorch:

import torch
import torch.nn as nn

# Embedding module with a vocabulary size of 100 and an embd_D of 3
embedding = nn.Embedding(100, 3)

# Consider the input as the text "Who is Elon Musk" processed by the tokenizer
input = torch.LongTensor([[1, 2, 4, 5]])  # a batch of 1 sample with 4 token indexes
embedding(input)

# Result after passing through the embedding layer, shape (1, 4, 3):
>>> tensor([[[-0.0251, -1.6902,  0.7172],
             [-0.6431,  0.0748,  0.6969],
             [ 1.4970,  1.3448, -0.9685],
             [-0.3677, -2.7265, -0.1685]]])

Note that these Embedding layers are trainable, so the Transformer backpropagates through them and updates the parameters of the Embedding layer.

As you can see, this approach lacks positional information, because we process all the embeddings together. So, how do we solve that?

Our messiah Positional Encodings

Positional encodings are vectors that carry positional information; they are added to the word embeddings to get a vector that contains word information as well as positional information.

So, How do we get a Positional Encoded Vector?

To answer this question, let's start from a simple, naive approach and work our way towards a more sophisticated one.

Approach 1

Let's keep it simple: create a vector by just counting the positions of the embeddings, and then either add it to or concatenate it with the word embeddings.

From our example “Who is Elon Musk”, the naive positional encoding will look like [0, 1, 2, 3], where each position value corresponds to one word.

While this may look good, it has its problems:

A graph illustrating problems with the naive Positional Encoding approach
  1. Adding these positional vectors to the word embeddings would push related word embeddings far apart, because the position values grow without bound.
  2. Concatenating them would add an extra dimension to the word embeddings and can cause unstable training.

So, we need a vector with really small values.

Approach 2

A better idea: normalize the counted vector by dividing it by the sequence_length to produce a vector of smaller magnitude. But here we hit a problem: as the sequence length varies, the “Positional Encodings” vary too.

For a sequence_length of 4, as in our sentence “Who is Elon Musk”, the encodings would be [0, 0.25, 0.5, 0.75], and they change whenever the sequence length changes.

Now we know we need a vector that is small in magnitude, continuous, and constant w.r.t. the “sequence_length”; let's take these as our criteria.

Original Approach That solves the problem

The authors of “Attention Is All You Need” came up with a clever solution: the sine and cosine wave functions, i.e. “Sinusoidal Positional Encodings”, because they are continuous, bounded in magnitude, and constant w.r.t. the “sequence_length”.

But if you are familiar with trigonometric functions, you will realize that sines and cosines are periodic: the same values repeat over and over.

A repetitive sine wave

Here you can see that sin(0) is equal to sin(𝝅), sin(2𝝅), sin(3𝝅), and so on.

We don't need this sort of rapid repetition, so we decrease the frequency by a large amount, which increases the wavelength and gives us less repetition of values.

Here it is, the lower-frequency sine wave.

And here they are, the Positional Encodings from the paper:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

We now know the criteria for Positional Encodings. Now, let us do some math.

Don't worry about the above formula, we will cover it in a moment. PE here refers to the Positional Encoding given the position “pos” and the dimension “i” of the vector.

First, let's have a look at the sine wave function.

In the above formula, “pos” refers to the position of a word in the sequence, “2i” and “2i+1” refer to the even and odd dimensions of the embedding, and “d” refers to the total dimension of the embedding.

Note: pos and i vary, while d is fixed.

Understanding Graphically

image from https://www.youtube.com/watch?v=dichIcUZfOw

In the Positional Encoding formula, we divide the position by 10000 raised to the power 2i/d, which scales the frequency lower and lower as “i” changes over the dimensions of the encoding.

When we do this, you can see that the lower frequencies make the sine values change slowly with position, so they stay small and don't repeat within typical sequence lengths, which is one of our requirements.

Now let’s apply this formula

In our example “Who is Elon Musk”, after tokenizing the sentence let us assume that we got the token indexes [1, 2, 4, 5]; we now need to create a positional encoding matrix for this input.

Note: the positional encoding matrix must have the same shape as the embedding matrix so that they can be added.

For every even dimension “i” we compute the positional encoding with the sine wave; for every odd dimension “i” we use the cosine wave.

From the Embedding layer example, we will set our dimension “d” to 3, and since we have 4 positions, we need to produce a 4x3 matrix.

The dimension index “i” moves from zero up to d − 1, in our case 0, 1 and 2 since d = 3, and the position “pos” runs from 0 to 3 for our four tokens.

The positional encoding for the first token (pos = 0) at dimension 0 will be:

PE(0, 0) = sin(0 / 10000^(0/3)) = sin(0) = 0

We calculate this using sine because the dimension is even. Similarly, for every odd dimension we use the cosine wave; for example, the positional encoding at position 1, dimension 1 will be:

PE(1, 1) = cos(1 / 10000^(0/3)) = cos(1) ≈ 0.54

We repeat this process for every dimension “i” and every position “pos”.

What we get is a (pos, d) matrix, here a 4x3 matrix. This matrix is added to the word embeddings to get both positional and context information.

Word embeddings are trainable, whereas sinusoidal positional encodings are not, because they don't have any parameters that need to be learnt.
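To tie this section together, here is a minimal sketch of sinusoidal positional encodings in PyTorch (the function name is made up for illustration, and the tiny 4x3 setup simply matches our running example):

import torch

def sinusoidal_positional_encoding(seq_len, d):
    # Follows the formula above: even dimensions use sine, odd dimensions use cosine
    pe = torch.zeros(seq_len, d)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
    even_i = torch.arange(0, d, 2, dtype=torch.float)                 # 0, 2, 4, ...
    div_term = torch.pow(10000.0, even_i / d)                         # 10000^(2i/d)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term[: pe[:, 1::2].shape[1]])
    return pe

# 4 positions ("Who is Elon Musk") with embedding dimension d = 3
pos_enc = sinusoidal_positional_encoding(4, 3)
print(pos_enc.shape)  # torch.Size([4, 3])
print(pos_enc[1])     # tensor([0.8415, 0.5403, 0.0022]) -> note cos(1) ≈ 0.54, as computed above

This (4, 3) matrix is then simply added to the (1, 4, 3) output of the embedding layer from the previous section.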

Transformers Architecture

Overview of encoder and decoder

In the above image, on the left side you can see the “Encoder” and on the right you can see the “Decoder”.

The Encoder encodes the text embeddings to produce an attention-based, context-rich representation of the words in the sequence; the Decoder then uses this representation to produce a probability distribution over the output vocabulary with a softmax activation function, and the whole model is optimized with a Cross-Entropy loss function.

Encoder Architecture

We have now understood the mathematical concepts behind positional encodings and how they are added to the word embeddings; everything so far has been preprocessing. We will now look at the mechanism of the Encoder architecture.

Inside the Encoder, you can see that we have a Multi-Head Attention layer, a feed-forward neural network, and some residual connections.

Let’s try to understand them one by one,

Multi-Head-Attention

Multihead and dot product attention

From the previous blog, we know that Attention is a sort of mechanism that helps the model learn what words to pay attention to (focus) in a sequence. We also had a look into an example of an attention heatmap. Now let us try to understand how Transformers create these Attention_score_matrices. Mathematically, we are creating a mapping between words in a sequence.

The goal of the Encoder is to create a more meaningful representation of the words in the sequence. The input embeddings (combined with the positional encodings) are passed through 3 Linear (Dense) layers named Query (Q), Key (K) and Value (V); these are just different “manifestations” or “representations” of our input embeddings.

One of the simplest ways to create a mapping between words is matrix multiplication, and that is what we do in Self-Attention, which is given by the formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where d_k is the dimension of K.

Self-Attention. Image from Jay Alammar's blog

Here we multiply the Query (Q) with the Key's transpose (Kᵀ) to create mappings (similarity scores) between the words in the sequence, scale them by the square root of the dimension of K (in our case √3, where 3 is the dimension of the K matrix), apply the softmax activation function to get a probability distribution, and finally multiply that distribution with the Value (V) to get a weighted combination of the value vectors.

As a result of this process we get a Scaled Dot-Product Attention matrix. Running several of these attention heads in parallel gives us the Multi-Head Attention layer, where the outputs of the individual heads are “concatenated” and aggregated by passing them through a Linear layer (this Linear layer is trainable and has parameters).
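As a rough sketch (with made-up tensor sizes matching our running example, not the exact implementation from the paper), scaled dot-product attention looks like this:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # similarity scores between every pair of words: (seq_len, seq_len)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # softmax turns each row of scores into a probability distribution
    weights = F.softmax(scores, dim=-1)
    # weighted combination of the value vectors
    return weights @ V

# 4 words ("Who is Elon Musk"), embedding dimension 3
x = torch.randn(4, 3)  # embeddings + positional encodings
W_q, W_k, W_v = (torch.nn.Linear(3, 3) for _ in range(3))
out = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))
print(out.shape)       # torch.Size([4, 3])

A Multi-Head Attention layer simply runs several such attention computations in parallel, each with its own Q, K and V projections, concatenates their outputs, and passes the result through one more trainable Linear layer.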

To normalize the results produced by the Multi-Head Attention block, we add a layer normalization step (the paper uses Layer Normalization, which normalizes across the embedding dimension, rather than Batch Normalization). Also, to make sure that we don't encounter vanishing gradients, we add a “Residual Connection” around the block.

More Parameters

Now it's all set, and we pass the outputs of the attention block to a position-wise feed-forward network, increasing the model's learnable parameters, i.e. its complexity.

A position-wise feed-forward network is simply two Linear layers with a ReLU activation in between, applied to each position independently.

What we end up with is a single “Encoder block”; the Transformer stacks N such encoder blocks on top of each other, the output of one feeding into the next, to produce the final attention-based representation.
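Putting the pieces of this episode together, here is a minimal, simplified sketch of one encoder block (a hypothetical class that uses PyTorch's built-in nn.MultiheadAttention rather than writing the heads by hand):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # position-wise feed-forward network: Linear -> ReLU -> Linear
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # multi-head self-attention, then residual connection and layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # feed-forward network, then residual connection and layer norm
        x = self.norm2(x + self.ffn(x))
        return x

# stack N = 6 encoder blocks, as in the original paper
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
out = encoder(torch.randn(1, 4, 512))  # (batch, seq_len, d_model)
print(out.shape)                       # torch.Size([1, 4, 512])

Stacking the blocks with nn.Sequential works here because every block takes in and returns a tensor of the same shape (batch, seq_len, d_model).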

End of Episode

Hurray, that is what an Encoder is all about! We will discuss the “Decoder block” in detail, with intuition, in the next episode of “WTH, is a Transformer”.

Good bye, from Jeydev 👋
