Transformers from scratch — Part 2

Sanjithkumar
18 min read · Nov 29, 2023


Photo by Yancy Min on Unsplash

In the last blog we had a basic introduction to attention and went through some of the terminology used in the paper. In this blog, I will give a comprehensive explanation of the encoder part of an attention-based encoder-decoder transformer as introduced in the paper, with a full implementation in Python and PyTorch.

But before starting, it is important that you are quite well versed in Python and the PyTorch framework, and that you are aware of all the basic concepts related to NLP.

prerequisites:

  1. Intermediate Python.
  2. PyTorch (Deep Learning Framework).
  3. Basic NLP concepts.

The Transformer architecture:

The above architecture is the one given in the research paper, and it acts as the base for all kinds of LLMs that we see today, such as GPT and LLaMA. In this blog we will go through the encoder part of the architecture along with its implementation. In the transformer architecture, the encoder is responsible for capturing relevant information from word embeddings and providing a context vector (of the same dimension as the input) as the output.

The Encoder part of the transformer follows several stages to achieve this, but the prime features or blocks of the encoder are as follows.

  1. Positional Encoding
  2. Multi-Head Attention
  3. Feed Forward Network
  4. Residual Connections (Add)
  5. Layer Normalization

In the last blog, I gave a high-level explanation of multi-head attention. In this blog we will dive into a deeper intuition and implementation of the same.

Word Embeddings:

Word Embeddings are used as the input to the encoder. Word embeddings are nothing but a vectorized, high-level representation of text that carries a small amount of context-based correlation. I am not going to discuss word embeddings here, as I have already explained them in detail in previous blogs.

In the context of the paper, the input is of shape [ n, 512 ], where

  1. n — the number of word embeddings, defined by the number of tokens in a sentence.
  2. d_model — in this case 512, the dimension of the word embedding for each word.
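For instance, a minimal sketch of how such an input could be produced with an embedding layer (the vocabulary size and token ids below are made up purely for illustration):

import torch
import torch.nn as nn

vocab_size = 10000                 # hypothetical vocabulary size
d_model = 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([12, 405, 7, 981])  # a hypothetical 4-token sentence
word_embeddings = embedding(token_ids)       # shape [n, 512] with n = 4
print(word_embeddings.shape)                 # torch.Size([4, 512])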

Attention:

Now we have arrived at the most important part of any transformer — Attention. As we saw in the last blog, attention can be thought of as a similarity or context metric that measures the contextual relation between two words in a sentence. It is a mechanism in NLP that allows a model to focus on different parts of the input when processing information. The idea is inspired by human attention, where we selectively concentrate on specific aspects of our environment while processing information.

QKV:

The main components are the Query, Key and Value vectors. The Query, Key and Value are nothing but linear transformations of the input word embeddings encoded with positional encodings (which I will explain soon; for now assume it simply adds something to the input embeddings to make sense of the position of each word). By linear transformation, I mean the output of a fully connected layer, i.e. a linear layer. The Query, Key and Value vectors are optimized by adjusting their weight parameters during training.
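As a rough sketch of what such a projection could look like (separate Q, K and V layers shown here for clarity; the implementation later in this blog uses a single combined linear layer):

import torch
import torch.nn as nn

d_model = 512
x = torch.randn(4, d_model)        # 4 positionally-encoded word embeddings

# one learnable projection per vector; the weights are optimized during training
w_q = nn.Linear(d_model, d_model)
w_k = nn.Linear(d_model, d_model)
w_v = nn.Linear(d_model, d_model)

q, k, v = w_q(x), w_k(x), w_v(x)   # each of shape [4, 512], same as the input
print(q.shape, k.shape, v.shape)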

The Query, Key and Value vectors have the same dimensions as the input embeddings, [ n, 512 ], where

  1. n — the number of word embeddings, defined by the number of tokens in a sentence.
  2. d_model — in this case 512, the dimension of the word embedding for each word.

Now let's see how the Query, Key and Value can be calculated (for this simple example I am not going to use a linear layer):

import math
import numpy as np

L = 4
q = np.random.randn(L,512)
k = np.random.randn(L,512)
v = np.random.randn(L,512)
# Here L is the number of word embeddings, and 512 is the dimension of each
# word embedding. In this example we generate the Query (q), Key (k) and Value (v)
# vectors by randomly sampling from a normal distribution.
print(q,k,v)
print(q.shape, k.shape, v.shape)
Output:

q:
array([[-1.27770593, 0.03672528, -1.46536834, ..., -0.71041898,
0.04947546, -0.29105608],
[-0.42066666, -2.05813691, 0.20246402, ..., -1.10751744,
-0.02410065, 0.35876849],
[ 0.16383739, 0.16409655, -0.8262307 , ..., 0.17555454,
0.06995999, 0.25468763],
[-0.89421063, 0.69227657, -0.47519354, ..., -0.74691532,
-0.87534768, 0.6575931 ]])
k:
array([[-0.8294968 , -0.75119896, -1.33610499, ..., -0.64288954,
0.10495949, -0.17834863],
[ 0.1575462 , 0.70438204, -0.42744993, ..., -0.49763529,
-1.68921132, -1.39565574],
[-0.0944206 , -0.99666137, 0.48791437, ..., -0.00250844,
0.26089402, 0.1661775 ],
[-0.131246 , 1.15164104, 1.30330714, ..., -0.15484835,
-0.09791072, -2.25439309]])
v:
array([[-0.72236162, -0.95946675, 1.719967 , ..., 1.09807886,
1.69707081, -1.7295294 ],
[-0.36299911, 0.54129707, -1.90237278, ..., -0.19949992,
2.29075396, -0.41495035],
[-0.50139198, -2.33392621, 0.18808599, ..., -0.16915482,
-0.44726068, 1.01416103],
[-0.06032585, -0.11314428, 0.88582239, ..., 1.37514303,
-0.317385 , 2.0226732 ]])

shape:
((4, 512), (4, 512), (4, 512))

The Query and Key are used to calculate the similarity scores, and a matrix product with the Value vector is then taken to get the attention output.

Scaled Dot Product:

The dot product between the Q and K vectors is used to calculate the similarity score between each and every word in a sentence. Scaling is done to ensure that issues related to variance do not arise, and softmax is applied to turn the similarity scores into a probability distribution, so that it can later be multiplied with the V vector to get the attention output.

Q — represents the Query vector which contains the word embeddings for each word, for which you want to calculate attention score.

K — represents the Key vector which contains all the words you want to compare words in the Query vector with.

Q and K are of the same shape [N, d_model] but come from different weights.

similarity = Q.(Transpose(K))

np.dot(q,k.T)
array([[-36.0775864 ,   8.65167429,  11.74661448,   1.47797425],
[ 16.96995999, 3.52950887, 8.59227202, -28.10359651],
[ 2.64462919, -6.84866843, -6.56834508, -32.65924308],
[-35.35088011, 28.07837447, 4.11255417, 48.04031965]])

This computes the dot product of every query with every key, giving a similarity score between each pair of words.

Now the similarity score is of shape [N, N]

We scale the scores by the square root of d_k (the key dimension, which equals d_model = 512 in this single-head example). This is necessary because Q, K and V have similar variance, but when Q.(Transpose(K)) is computed the variance of the resulting scores is inflated (roughly by a factor of d_k), leading to a drastic difference between the variance of the similarity scores and that of the V vector.

similarity = Q.(Transpose(K))/√(d_k)

d_k = q.shape[-1] # 512, the dimension of the key vectors
print(q.var(), k.var(), np.dot(q,k.T).var(), v.var())
scaled = np.dot(q,k.T)/math.sqrt(d_k)
print(scaled.var())

#Output (variance of q, k, q.(k^T) and v, followed by the variance of the scaled scores):
#0.9412183158502878 (q), 0.9761431726981737 (k), 512.410630177069 (q.(k^T)), 1.0096111750421992 (v)
#1.000802012064588 (scaled)

Now we apply softmax to the scaled scores to get the final similarity weights.

V — represents the Value vector with which the similarity score is multiplied to get the attention values.

Now attention is calculated by multiplying the similarity score with the Value vector V.

attention = softmax(similarity).V

def softmax(x):
    return (np.exp(x).T/np.sum(np.exp(x),axis = -1)).T
att = softmax(scaled)
att = np.matmul(att,v)
print(att)
array([[-0.08976781, -0.08429105, -0.20217436, ...,  0.39974678,
0.58329796, -0.04130168],
[-0.35953674, -0.39419822, 0.26979537, ..., 0.39010005,
0.28514806, 0.35475273],
[-0.27807631, -0.31013953, 0.28492715, ..., 0.38306489,
0.39332261, 0.36831027],
[ 0.23664868, 0.52424767, 0.00580336, ..., 0.75514499,
1.36531607, 0.52680713]])

We finally get the output as an [N, d_model] matrix of attention values.
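Putting the steps above together, a minimal NumPy sketch of the whole scaled dot product attention computation might look like this:

import numpy as np

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    # similarity between every query and every key, scaled by sqrt(d_k)
    scores = np.dot(q, k.T) / np.sqrt(d_k)
    # softmax along the last axis turns the scores into a probability distribution
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # weighted sum of the value vectors gives the [N, d_model] attention output
    return np.dot(weights, v)

q = np.random.randn(4, 512)
k = np.random.randn(4, 512)
v = np.random.randn(4, 512)
print(scaled_dot_product_attention(q, k, v).shape)   # (4, 512)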

Multi-Head Attention:

Multi-Head Attention involves using multiple sets of attention values to increase the contextual awareness of the transformer. It involves h sets of QKV vectors calculating attention in parallel, whose outputs are then concatenated to get the final attention.
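In the notation of the paper, with h heads and a final output projection matrix W_o, this is

MultiHead(Q, K, V) = Concat(head_1, ..., head_h).W_o

head_i = Attention(Q.W_q_i, K.W_k_i, V.W_v_i)

where each head works on a lower-dimensional slice of size d_model/h (512/8 = 64 below).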

import math
import torch
import torch.nn as nn
import torch.nn.functional as fun

sequence_length = 4
batch_size = 1
input_dim = 512
d_model = 512

#we generate an input word embedding using the torch random normal generator
inp_x = torch.randn((batch_size,sequence_length,input_dim))
print(inp_x.size())

#Out: <torch.Size([1, 4, 512])> --> Here the first dimension 1 refers to the batch,
# the second dimension 4 refers to the number of words (tokens) in the sentence and the
# 3rd dimension refers to the dimension of each word embedding.

#Next we use a linear layer to create the three base vectors that we will use to
#create the query, key and value.
qkv_layer = nn.Linear(input_dim,3*d_model)
qkv = qkv_layer(inp_x)
print(qkv.size())

#Out: <torch.Size([1, 4, 1536])> --> Here the 1st dimension refers to the batch
#size, the 2nd dimension refers to the number of words (tokens) per sentence and the 3rd
#dimension refers to the size of the linear output projection, which can be split
#into 3 for Q, K, V.

#Multi-Head Attention involves using multiple attention heads to capture varied
#levels of contextual information from the input sentences. The paper presented in 2017
#used 8 heads, so we use 8 heads here; you may increase or decrease the number of
#heads based on your needs.
heads = 8
head_dim = d_model//heads
qkv = qkv.reshape(batch_size,sequence_length,heads,3*head_dim)# splitting the last dimension across the 8 heads.
print(qkv.shape)

#Out: <torch.Size([1, 4, 8, 192])> --> Here the 1st and 2nd dimension refers to the
#batch_size and sequence length and the 3rd dimension refers to the number of heads and
#the 4th dimension refers to the linear input dimension for each attention head.

#the permute() method helps to switch dimensions.
qkv = qkv.permute(0,2,1,3)
print(qkv.shape)

#Out: <torch.Size([1, 8, 4, 192])> --> As you can see the second and third dimensions
#are switched, now each head processes the sequences in parallel.

#The chunk method helps to divide a given dimension into n different chunks
q,k,v = qkv.chunk(3,dim = -1)
print(q.shape,k.shape,v.shape)

#Out: <(torch.Size([1, 8, 4, 64]), torch.Size([1, 8, 4, 64]), torch.Size([1, 8, 4, 64]))>
#As you can see the input tensor is divided along the 4th dimension by 3, Each for the
#Q, K, V vectors.


#We perform Scaled Dot Product, we use transpose to ensure the dimensions during matrix
#multiplication matches.
d_k = q.shape[-1]#64
scaled = torch.matmul(q,k.transpose(-2,-1)) / math.sqrt(d_k)
print(scaled.shape)

#Out: <torch.Size([1, 8, 4, 4])> is the attention score matrix for each of the 8 heads

#Now we apply softmax to the attention scores and use the result to weight the Value
#vectors; later we pass the result into a linear layer to get the final output.
atten = fun.softmax(scaled,dim = -1) #applying softmax to the scaled scores
values = torch.matmul(atten, v) #weighted sum of the Value vectors
print(values.shape)

#Out: <torch.Size([1, 8, 4, 64])> is the per-head attention output.

#We reshape the output back to its shape before the attention block, i.e. [batch_size, sequence_length, d_model].
#The head dimension is first moved back next to head_dim so that the heads are concatenated per position.
out = values.permute(0,2,1,3).reshape(batch_size,sequence_length,heads*head_dim)
print(out.shape)

#Out: <torch.Size([1, 4, 512])>

#We pass it into a linear layer for more contextual awareness.
lin = nn.Linear(d_model,d_model)
out = lin(out)
print(out.shape)

#Out: <torch.Size([1, 4, 512])>

Now that we have seen a detailed explanation of the Multi-Head Attention mechanism along with its step-by-step implementation, the code for the full attention block is given below.

import math
import torch
import torch.nn as nn
import torch.nn.functional as functional

class MultiHeadAttention(nn.Module):
    def __init__(self,d_model,num_heads):
        super(MultiHeadAttention,self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model//num_heads
        self.qkv = nn.Linear(d_model,3*d_model)
        self.linear_layer = nn.Linear(d_model,d_model)

    def _scaled_Dot_Product_Attention(self,Q,K,V,mask = None):
        d_k = Q.size()[-1]
        scaled = torch.matmul(Q,K.transpose(-1,-2))/math.sqrt(d_k)
        if mask is not None:
            scaled += mask
        attention = functional.softmax(scaled,dim = -1)
        values = torch.matmul(attention, V)
        return values,attention

    def forward(self,x,mask = None):
        batch_size,seq_len,d_model = x.size()
        qkv = self.qkv(x)
        qkv = qkv.reshape(batch_size, seq_len, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0,2,1,3)
        q,k,v = qkv.chunk(3,dim = -1)
        values,attention = self._scaled_Dot_Product_Attention(q,k,v,mask = mask)
        # move the head dimension back next to head_dim before concatenating the heads
        values = values.permute(0,2,1,3).reshape(batch_size,seq_len,self.num_heads*self.head_dim)
        out = self.linear_layer(values)
        return out
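A quick sanity check of the block with random inputs (shapes only), assuming the class and imports above:

attention_block = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(1, 4, 512)    # [batch_size, sequence_length, d_model]
out = attention_block(x)      # no mask is needed in the encoder
print(out.shape)              # torch.Size([1, 4, 512])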

Positional Encoding

When it comes to traditional RNN architectures such as LSTM or BiLSTM, the input is fed in sequentially over time, which gives these architectures an inherent sense of position. In a transformer, however, the whole input is passed in at once, so if nothing is done the context that comes with the position of each word is lost. This can lead to problems such as poor contextual encoding: without positional encoding, transformers would treat each word independently, rendering them incapable of understanding the context and meaning of sentences.

Importance of positional encoding:

  1. Understanding Word Order
  2. Context-Aware Embeddings

Various Methods for Positional Encoding:

There are several methods for applying positional encoding, each with its own strengths and weaknesses. Commonly used methods include:

  1. Sinusoidal Positional Encoding: This method uses sinusoidal functions to encode positional information, allowing the model to capture long-range dependencies effectively.
  2. Learned Positional Encoding: In this approach, the model learns positional encodings directly from the data, allowing it to adapt to specific language patterns (a short sketch of this option follows this list).
  3. Relative Positional Encoding: This method focuses on encoding relative positions between words, rather than their absolute positions, making it more efficient and memory-friendly.
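As a rough sketch of the second option (not what the paper uses and not part of the implementation below), a learned positional encoding can be as simple as an embedding table indexed by position; max_len here is an assumed maximum sequence length:

import torch
import torch.nn as nn

d_model = 512
max_len = 40                                    # assumed maximum sequence length

pos_embedding = nn.Embedding(max_len, d_model)  # learned jointly with the rest of the model

x = torch.randn(1, 4, d_model)                  # [batch_size, sequence_length, d_model]
positions = torch.arange(x.size(1))             # tensor([0, 1, 2, 3])
x = x + pos_embedding(positions)                # broadcasts over the batch dimension
print(x.shape)                                  # torch.Size([1, 4, 512])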

The paper uses alternating sine and cosine functions of different frequencies to encode the input embeddings before they enter the attention block.

PE(pos,2i) = sin(pos/(10000^(2i/d_model))) →1

PE(pos,2i+1) = cos(pos/(10000^(2i/d_model))) →2

Now, Let’s have a look at the code.

#Here "i" gives us the even (alternating) dimension indices within each input embedding.
i = torch.arange(0,d_model,2).float()
print(i)
#These are the even positions within each embedding; we apply eq<1> to them and
#apply eq<2> to the odd positions.


tensor([ 0., 2., 4., 6., 8., 10., 12., 14., 16., 18., 20., 22.,
24., 26., 28., 30., 32., 34., 36., 38., 40., 42., 44., 46.,
48., 50., 52., 54., 56., 58., 60., 62., 64., 66., 68., 70.,
72., 74., 76., 78., 80., 82., 84., 86., 88., 90., 92., 94.,
96., 98., 100., 102., 104., 106., 108., 110., 112., 114., 116., 118.,
120., 122., 124., 126., 128., 130., 132., 134., 136., 138., 140., 142.,
144., 146., 148., 150., 152., 154., 156., 158., 160., 162., 164., 166.,
168., 170., 172., 174., 176., 178., 180., 182., 184., 186., 188., 190.,
192., 194., 196., 198., 200., 202., 204., 206., 208., 210., 212., 214.,
216., 218., 220., 222., 224., 226., 228., 230., 232., 234., 236., 238.,
240., 242., 244., 246., 248., 250., 252., 254., 256., 258., 260., 262.,
264., 266., 268., 270., 272., 274., 276., 278., 280., 282., 284., 286.,
288., 290., 292., 294., 296., 298., 300., 302., 304., 306., 308., 310.,
312., 314., 316., 318., 320., 322., 324., 326., 328., 330., 332., 334.,
336., 338., 340., 342., 344., 346., 348., 350., 352., 354., 356., 358.,
360., 362., 364., 366., 368., 370., 372., 374., 376., 378., 380., 382.,
384., 386., 388., 390., 392., 394., 396., 398., 400., 402., 404., 406.,
408., 410., 412., 414., 416., 418., 420., 422., 424., 426., 428., 430.,
432., 434., 436., 438., 440., 442., 444., 446., 448., 450., 452., 454.,
456., 458., 460., 462., 464., 466., 468., 470., 472., 474., 476., 478.,
480., 482., 484., 486., 488., 490., 492., 494., 496., 498., 500., 502.,
504., 506., 508., 510.])
pos = torch.arange(sequence_length, dtype = torch.float).reshape(sequence_length, 1)
denominator = torch.pow(10000,i/d_model) #The denominator is the same for both equations; i already holds the even indices (the 2i of the equations)
print(denominator.shape)

#Out: <torch.Size([256])> --> used to create alternating positional encoding.

#even_PE and odd_PE give the encodings for the even and odd positions respectively
even_PE = torch.sin(pos/denominator)
odd_PE = torch.cos(pos/denominator)

print(even_PE.shape)
print(odd_PE.shape)
#Out: <torch.Size([4, 256])>,<torch.Size([4, 256])>

#Now we can stack these together to get the positional encoding for each position in an embedding
stacked = torch.stack([even_PE,odd_PE],dim=2)
PE = torch.flatten(stacked,start_dim = 1,end_dim = 2)
print(PE.shape)

#Out: <torch.Size([4, 512])> --> Now as we see we get the right shape as the output embeddings

#Now we add the positional encoding to the input that we assumed before
out = torch.add(inp_x,PE)
print(out.shape)

#Out: <torch.Size([1, 4, 512])>

Now we can pass “out” as the input to the attention block. Below I give the code for the entire positional encoding part. I have also added some linear layers, as explained previously under Learned Positional Encoding, to extract more features, but they are not part of the implementation given in the paper.

class PoisitionwiseFeedForward(nn.Module):

    def __init__(self, d_model, hidden, sequence_length, drop_prob = 0.1):
        super(PoisitionwiseFeedForward,self).__init__()
        self.d_model = d_model
        self.sequence_length = sequence_length
        self.linear1 = nn.Linear(d_model,hidden)
        self.linear2 = nn.Linear(hidden,d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p = drop_prob)

    def _get_encoding(self):
        pos = torch.arange(self.sequence_length, dtype = torch.float).reshape(self.sequence_length, 1)
        i = torch.arange(0,self.d_model,2).float()
        denominator = torch.pow(10000,i/self.d_model) # i already holds the even indices (the 2i of the equations)
        even_PE = torch.sin(pos/denominator)
        odd_PE = torch.cos(pos/denominator)
        stacked = torch.stack([even_PE,odd_PE],dim=2)
        PE = torch.flatten(stacked,start_dim = 1,end_dim = 2)
        return PE

    def forward(self, x):
        x = torch.add(self._get_encoding(),x)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x
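A quick shape check of the block above, assuming the imports from earlier; hidden = 2048 here is the feed-forward size used in the paper:

ffn = PoisitionwiseFeedForward(d_model=512, hidden=2048, sequence_length=4)
x = torch.randn(1, 4, 512)    # [batch_size, sequence_length, d_model]
print(ffn(x).shape)           # torch.Size([1, 4, 512])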

Layer Normalization:

Layer normalization is a technique that stabilizes the training of deep neural networks by normalizing the activations of each layer. It is a key component of the Transformer architecture.

How Layer Normalization Works

Layer normalization works by normalizing the activations of each layer to have a mean of zero and a standard deviation of one. This is done by computing the mean and standard deviation of the activations across each feature dimension, and then subtracting the mean and dividing by the standard deviation.

The following equation shows how layer normalization is computed:

x' = γ * (x - μ) / σ + β

where:

  • x is the input vector
  • μ is the mean of the input vector
  • σ is the standard deviation of the input vector
  • γ and β are learnable parameters

Benefits of Layer Normalization

  • Improved training stability: Layer normalization can help to stabilize the training of deep neural networks by preventing the activations from exploding or vanishing.
  • Better performance: Layer normalization can improve the performance of deep neural networks on a variety of tasks.
  • Reduced sensitivity to hyperparameters: Layer normalization can make deep neural networks less sensitive to hyperparameters such as learning rate.

Why Layer Normalization is Used in Transformers

Layer normalization is used in Transformers because it is well-suited to the architecture of Transformers. Transformers are made up of a stack of encoder and decoder layers, where each layer consists of a self-attention mechanism and a feed-forward network. Layer normalization is applied to the output of the self-attention mechanism in each layer.

Position-wise Feed-Forward Network with Pre-Layer Normalization

In the original Transformer architecture, layer normalization is applied after the feed-forward network. This is known as post-layer normalization. However, some researchers have found that it is better to apply layer normalization before the feed-forward network. This is known as pre-layer normalization.

Pre-layer normalization has been shown to improve the performance of Transformers on a variety of tasks. It is thought that this is because pre-layer normalization allows the feed-forward network to see the normalized activations of the self-attention mechanism, which can help it to learn better representations of the input data.
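Schematically, the two orderings look like this (a pseudocode-style sketch, not the exact classes used below):

# Post-LN (as in the original paper): normalize after adding the residual
def post_ln_sublayer(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LN: normalize the input of the sub-layer, then add the residual
def pre_ln_sublayer(x, sublayer, norm):
    return x + sublayer(norm(x))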

class LayerNormalization(nn.Module):
    def __init__(self,parameter_shape,eps = 1e-5):
        super().__init__()
        self.parameter_shape = parameter_shape
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(parameter_shape))
        self.beta = nn.Parameter(torch.zeros(parameter_shape))

    def forward(self,inputs):
        dims = [-(i+1) for i in range(len(self.parameter_shape))]
        mean = inputs.mean(dim = dims, keepdim = True)
        var = ((inputs-mean)**2).mean(dim = dims, keepdim = True)
        std = (var+self.eps).sqrt()
        y = (inputs - mean) / std
        out = self.gamma * y + self.beta
        return out
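As a quick sanity check, the output of this class should match PyTorch's built-in nn.LayerNorm (same normalization over the last dimension, same default eps), assuming the class and imports above:

x = torch.randn(1, 4, 512)
custom_ln = LayerNormalization(parameter_shape=[512])
builtin_ln = nn.LayerNorm(512)

print(custom_ln(x).shape)                                      # torch.Size([1, 4, 512])
print(torch.allclose(custom_ln(x), builtin_ln(x), atol=1e-5))  # expected: True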

Thus we are done with the main components of a transformer’s encoder architecture; now we just have one more topic to talk about.

Residual Connection and Fully Connected Block:

Residual connections are a technique that has been widely used in deep learning to improve the performance of neural networks. They were first introduced in the paper “Deep Residual Learning for Image Recognition” by He et al. (2015) and have since been shown to be effective for a variety of tasks, including natural language processing (NLP).

In the context of transformers, residual connections are used to connect the inputs and outputs of encoder and decoder layers. This helps to ensure that the network is able to learn from the long-range dependencies that are present in natural language.

There are two main types of residual connections that are commonly used in transformers:

  • Pre-layer normalization (Pre-LN): In this type of residual connection, layer normalization is applied to the input of the residual block before the feedforward network.
  • Post-layer normalization (Post-LN): In this type of residual connection, layer normalization is applied to the output of the residual block after the feedforward network.

The choice of whether to use Pre-LN or Post-LN is often dependent on the specific task and dataset. However, Pre-LN has been shown to be more effective for some tasks, such as machine translation.

Residual connections have been shown to be an effective way to improve the performance of transformers. They help to avoid the vanishing gradient problem, which can make it difficult to train deep neural networks. Additionally, residual connections can help to improve the stability of training and make the network less prone to overfitting.

Here are some of the benefits of using residual connections in transformers:

  • Reduces the vanishing gradient problem: The vanishing gradient problem is a phenomenon that can occur in deep neural networks, where the gradients of the loss function become very small as you go deeper into the network. This can make it difficult to train the network, as the updates to the weights will be very small. Residual connections help to mitigate the vanishing gradient problem by directly adding the input of a layer to its output. This means that the gradients of the input can propagate through the network even if the gradients of the layer itself are very small.
  • Improves the stability of training: Residual connections can help to improve the stability of training by making the network less prone to overfitting. Overfitting occurs when a network learns the training data too well and is unable to generalize to new data. Residual connections can help to prevent overfitting by making the network less sensitive to the specific details of the training data.
  • Improves the overall performance of the network: Residual connections have been shown to improve the overall performance of transformers on a variety of tasks. For example, residual connections have been shown to improve the performance of transformers on machine translation, natural language understanding, and question answering.

In the Encoder we apply residual connections at two points: one after the Multi-Head Attention block and another after the fully connected (feed-forward) layer. The fully connected layer helps to further contextualize the data.
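In code, the pattern applied at both of these points boils down to a small wrapper like the following (a sketch with the post-layer-normalization ordering used in the paper):

def residual_block(x, sub_layer, dropout, layer_norm):
    # sub_layer is either the multi-head attention block or the feed-forward block
    residual = x
    x = sub_layer(x)
    x = dropout(x)
    return layer_norm(x + residual)   # add the residual connection, then normalize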

Full implementation code for the encoder architecture:

import torch
import torch.nn as nn
import torch.nn.functional as functional
import math

class MultiHeadAttention(nn.Module):
    def __init__(self,d_model,num_heads):
        super(MultiHeadAttention,self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model//num_heads
        self.qkv = nn.Linear(d_model,3*d_model)
        self.linear_layer = nn.Linear(d_model,d_model)

    def _scaled_Dot_Product_Attention(self,Q,K,V,mask = None):
        d_k = Q.size()[-1]
        scaled = torch.matmul(Q,K.transpose(-1,-2))/math.sqrt(d_k)
        if mask is not None:
            scaled += mask
        attention = functional.softmax(scaled,dim = -1)
        values = torch.matmul(attention, V)
        return values,attention

    def forward(self,x,mask = None):
        batch_size,seq_len,d_model = x.size()
        qkv = self.qkv(x)
        qkv = qkv.reshape(batch_size, seq_len, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0,2,1,3)
        q,k,v = qkv.chunk(3,dim = -1)
        values,attention = self._scaled_Dot_Product_Attention(q,k,v,mask = mask)
        # move the head dimension back next to head_dim before concatenating the heads
        values = values.permute(0,2,1,3).reshape(batch_size,seq_len,self.num_heads*self.head_dim)
        out = self.linear_layer(values)
        return out


class LayerNormalization(nn.Module):
    def __init__(self,parameter_shape,eps = 1e-5):
        super().__init__()
        self.parameter_shape = parameter_shape
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(parameter_shape))
        self.beta = nn.Parameter(torch.zeros(parameter_shape))

    def forward(self,inputs):
        dims = [-(i+1) for i in range(len(self.parameter_shape))]
        mean = inputs.mean(dim = dims, keepdim = True)
        var = ((inputs-mean)**2).mean(dim = dims, keepdim = True)
        std = (var+self.eps).sqrt()
        y = (inputs - mean) / std
        out = self.gamma * y + self.beta
        return out


class PoisitionwiseFeedForward(nn.Module):

    def __init__(self, d_model, hidden, sequence_length, drop_prob = 0.1):
        super(PoisitionwiseFeedForward,self).__init__()
        self.d_model = d_model
        self.sequence_length = sequence_length
        self.linear1 = nn.Linear(d_model,hidden)
        self.linear2 = nn.Linear(hidden,d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p = drop_prob)

    def _get_encoding(self):
        pos = torch.arange(self.sequence_length, dtype = torch.float).reshape(self.sequence_length, 1)
        i = torch.arange(0,self.d_model,2).float()
        denominator = torch.pow(10000,i/self.d_model) # i already holds the even indices (the 2i of the equations)
        even_PE = torch.sin(pos/denominator)
        odd_PE = torch.cos(pos/denominator)
        stacked = torch.stack([even_PE,odd_PE],dim=2)
        PE = torch.flatten(stacked,start_dim = 1,end_dim = 2)
        return PE

    def forward(self, x):
        x = torch.add(self._get_encoding(),x)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x

class EncoderLayer(nn.Module):

    def __init__(self,d_model,fnn_hidden,num_heads,sequence_length,drop_prob):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model,num_heads)
        self.norm1 = LayerNormalization([d_model])
        self.dropout1 = nn.Dropout(drop_prob)
        self.fnn = PoisitionwiseFeedForward(d_model,fnn_hidden,sequence_length,drop_prob)
        self.norm2 = LayerNormalization([d_model])
        self.dropout2 = nn.Dropout(drop_prob)

    def forward(self,x):
        residual = x
        x = self.fnn(x) # the fnn block also adds the positional encoding (see PoisitionwiseFeedForward above)
        x = self.attention(x,mask = None)
        x = self.dropout1(x)
        x = self.norm1(x + residual)
        residual = x
        x = self.fnn(x)
        x = self.dropout2(x)
        x = self.norm2(x+residual)
        return x

class Encoder(nn.Module):
    def __init__(self, d_model, fnn_hidden, num_heads, drop_prob, max_sequence_length, num_layers):
        super().__init__()
        self.layers = nn.Sequential(*[EncoderLayer(d_model,fnn_hidden,num_heads,max_sequence_length,drop_prob) for _ in range(num_layers)])

    def forward(self,x):
        x = self.layers(x)
        return x

The Encoder class executes the encoder block num_layers times, to incorporate more contextual awareness.

Now let us run a sample.

d_model = 512
num_heads = 8
drop_rate = 0.1
batch_size = 64
max_sequence_length = 40
fnn_hidden = 1264
num_layers = 3

inp_x = torch.randn((batch_size,max_sequence_length,d_model))
encoder = Encoder(d_model = d_model, fnn_hidden = fnn_hidden, num_heads=num_heads, drop_prob=drop_rate,max_sequence_length=max_sequence_length, num_layers=num_layers)
out = encoder(inp_x)
print(out.shape)
print(out)
torch.Size([64, 40, 512]) #shape of the output, the 1st dim indicates the batch_size
# The 2nd dim indicates the max_sequence_length and the 3rd dim indicates the length
# of each word vector, in this case enriched with more contextual information.

#Output context vectors.
tensor([[[ 0.6655, 0.1361, 0.8126, ..., -0.2920, 1.2847, -0.0087],
[ 0.8477, -0.1586, 1.3966, ..., 0.8745, -1.8591, -0.4387],
[ 0.5206, -0.9075, 2.2957, ..., -1.3360, -0.8176, -0.0395],
...,
[-1.9075, 0.1863, 0.3541, ..., 0.3549, 0.1154, -1.3263],
[-0.3321, -3.0647, -0.5510, ..., 0.3771, -0.1022, 0.3470],
[-1.9302, -0.9350, -0.5568, ..., -1.2365, 0.6684, 0.8379]],

[[ 1.0789, -0.8911, 0.3333, ..., -1.3667, -0.3351, -0.8922],
[ 0.9204, 0.7196, 1.5533, ..., -0.5870, 0.1859, -0.4472],
[-0.5743, -0.3509, 1.3459, ..., 0.8918, 1.5060, -0.9195],
...,
[-2.1759, -0.0273, 1.2681, ..., -0.3552, 0.6408, -0.0839],
[ 0.0131, -0.8884, 0.4998, ..., -2.1502, -1.6683, 0.2925],
[-1.9016, 0.6277, 0.5923, ..., -0.2352, 0.3410, 0.2342]],

[[ 0.3605, -0.6696, 0.4470, ..., 0.1450, 0.2459, -1.2633],
[-0.0222, -0.7771, 0.3986, ..., 0.7763, 1.2560, 0.1798],
[-1.5896, -1.2116, 0.7199, ..., -0.7168, -0.5128, 0.2306],
...,
[-0.8741, -0.5346, 0.3203, ..., -0.2295, 2.6045, -1.3942],
[-2.2476, -1.7335, -0.3252, ..., 0.2329, -0.2357, -0.4452],
[-0.0817, -0.0796, -1.3318, ..., 0.6366, -1.7131, -1.0573]],

...,

[[-1.1127, -0.6404, 1.2334, ..., 0.6637, -1.1961, -2.1753],
[-0.4128, -1.4844, 0.9567, ..., 0.0892, -0.6646, -0.4929],
[-0.3610, -0.9177, -0.4646, ..., 1.0498, -1.4170, 0.2720],
...,
[-1.3369, -0.8566, -0.0355, ..., 1.6660, -0.7807, -1.5038],
[ 1.9652, 1.7274, -1.4894, ..., 0.4022, 0.7700, -1.7864],
[-0.5754, 0.3432, -0.2693, ..., -0.1672, -1.0679, -0.9068]],

[[ 0.1692, -1.6842, 1.6947, ..., 1.3753, -1.2960, -0.0302],
[ 0.1144, 1.4404, 1.6983, ..., -0.1036, 0.1971, 0.7051],
[ 0.5290, 0.0129, 0.0735, ..., 0.2095, -1.7825, 0.0102],
...,
[-0.2108, -1.1088, 1.4435, ..., -0.5519, -2.1240, 0.7283],
[-0.7239, -0.6731, 0.8940, ..., -0.4373, 0.2791, -0.1646],
[-0.0800, 0.5354, -0.9145, ..., 0.0087, -0.5125, -0.3893]],

[[-3.4398, -0.5187, -0.5283, ..., -1.7687, -1.6411, 0.1256],
[-1.0211, -0.6033, 0.5090, ..., 0.5041, 1.0072, -1.3615],
[ 0.0635, -2.1827, -0.1679, ..., 2.1011, 1.7504, -0.7565],
...,
[-0.6595, -0.4148, -0.4220, ..., -0.5586, -1.9765, -1.7538],
[-0.2886, 0.4294, 0.7907, ..., 1.7319, -1.2356, -0.8140],
[-1.3690, 0.6660, -0.2234, ..., -1.8908, -0.3114, -0.8896]]],
grad_fn=<AddBackward0>)

Training this model with the decoder will improve the context vectors over time.

Thus in this blog we delved deeply into the encoder part of a transformer and saw the implementation as well. In the next blog we will go through the decoder architecture, which is somewhat similar to this but with some significant differences. Hope this was really helpful.

