Attention Is All You Need!!!

vinodhkumar baskaran
6 min read · Jul 13, 2020


Yes, I need your attention…

Let's start to understand the Transformer attention architecture…

Transformer Attention

Does that look confusing? No issue, let's simplify it… come on…

The Transformer attention architecture can be divided into two layers:

  1. Encoder Layer
  2. Decoder Layer

First we will understand the architecture of the Encoder, followed by the Decoder. Actually, both are pretty simple…

Encoder

The job of the encoder layer is to transform the input, extract features from it, and send them to the decoder, which provides us the output.

The encoder layer has two sub-layers, namely:

  1. Attention [sub-layer 1]
  2. Feed-forward neural network [sub-layer 2]

Understanding the working mechanism of Attention will resolve most of the confusion one has about the Transformer.

Attention please!!! It's all about Attention…

Before we dive into attention, keep in mind that the input is in text format, so we need to convert it into embedding vectors. We can use GloVe or any other technique for the conversion, but keep in mind that the authors used byte-pair encoding on the text.
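As a minimal sketch of this step, here is how the token-to-vector lookup might look, assuming a toy vocabulary and a randomly initialized embedding table (the real model uses byte-pair encoding and embeddings learned during training):

```python
import numpy as np

# Toy vocabulary and random embedding table (hypothetical; the real
# model uses byte-pair encoding and a trained vocab x 512 matrix).
vocab = {"how": 0, "are": 1, "you": 2}
d_model = 512
embedding_table = np.random.randn(len(vocab), d_model)

tokens = ["how", "are", "you"]
# Look up one 512-dimensional vector per token: x1, x2, x3.
x = np.stack([embedding_table[vocab[t]] for t in tokens])
print(x.shape)  # (3, 512)
```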

We can understand attention with the diagram above. [E.g. "How are you"]

Step 1: Inputs are encoded into word embeddings x1, x2, x3 [vectors of size 512].

Please note that each single word/token is considered one input.

Step 2: From each input vector we need to calculate a Query vector, a Key vector, and a Value vector.

Calculating these vectors is not too complex. It is just a product between the input vector and the corresponding weight matrix, or kernel.

Now a question arises: where did these weight matrices pop out from?

The answer is simple: these weight matrices are trainable parameters, just like the kernels in a convolutional neural network.

At the end of the calculation, we end up with Query vectors [Q1, Q2, Q3], Key vectors [K1, K2, K3], and Value vectors [V1, V2, V3].

Remember that the authors reduced the vector dimension from 512 to 64 during this calculation.
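Here is a minimal NumPy sketch of Step 2, assuming three 512-dimensional input vectors; the weight matrices below are random placeholders, whereas in the real model they are learned during training:

```python
import numpy as np

d_model, d_k = 512, 64
x = np.random.randn(3, d_model)   # inputs x1, x2, x3 stacked as rows

# Trainable projection matrices (random here, for illustration only).
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = x @ W_Q   # Query vectors Q1, Q2, Q3 -> shape (3, 64)
K = x @ W_K   # Key vectors   K1, K2, K3 -> shape (3, 64)
V = x @ W_V   # Value vectors V1, V2, V3 -> shape (3, 64)
```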

Attention Internal Working, Fig 1.2

Step 3: Refer to Fig 1.2. At this stage we have the Query vectors, Key vectors, and Value vectors.

Next we want to calculate the Attention Score, which is the goal/output of the Attention layer.

Step 3.1: For simplicity, consider Q1 [the Query vector of input x1]. Take the dot product between Q1 and each of the Key vectors [K1, K2, K3]; as a result we end up with useful values called Scores [Q1·K1, Q1·K2, Q1·K3].

Step 3.2: We normalize the calculated Scores [Q1·K1, Q1·K2, Q1·K3] by dividing them by the square root of the key vector dimension [in our case the dimension is 64, so we divide by √64 = 8].

Step 3.3: After normalization, the values are fed into the Softmax function, which returns 3 values that sum to 1.

Step 3.4: The output numbers are then multiplied with their corresponding Value vectors. [The results are the weighted context vectors.]

Step 3.5: Finally, all the weighted vectors are summed together, which ends up giving us the Attention Score for the input x1.

Step 4: Step 3 is carried out for all the remaining inputs in parallel [in our case, for x2 and x3].
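Putting Steps 3.1 to 3.5 together, here is a minimal sketch of this scaled dot-product attention computed for all three inputs at once (random Q, K, V stand in for the projections computed earlier):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_k = 64
Q = np.random.randn(3, d_k)   # stand-ins for Q1..Q3, K1..K3, V1..V3
K = np.random.randn(3, d_k)
V = np.random.randn(3, d_k)

scores = Q @ K.T                 # Step 3.1: Qi . Kj for every pair
scores = scores / np.sqrt(d_k)   # Step 3.2: divide by sqrt(64) = 8
weights = softmax(scores)        # Step 3.3: each row now sums to 1
output = weights @ V             # Steps 3.4-3.5: weighted sum of values
print(output.shape)              # (3, 64): one output per input
```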

Step 5: The Attention Scores S1, S2, S3 are fed as input to the feed-forward neural network.

Importantly, each output of the Attention layer is fed to its own corresponding Feed-Forward Neural Network [FFNN]. The number of FFNNs equals the number of inputs [as shown in the figure], rather than everything going through a single FFNN as a whole (the FFNNs share the same weights; they are simply applied position by position).

Feed Forward Neural Network:

Like every neural network, the FFNN processes its input and produces an output.

Its input comes from the Attention layer. Here ends the Encoder layer.
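As a minimal sketch, the position-wise feed-forward network in the paper is two linear layers with a ReLU in between (the inner size of 2048 is the authors' value; the weights below are random placeholders rather than trained parameters):

```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def ffn(z):
    # FFN(z) = max(0, z @ W1 + b1) @ W2 + b2, applied to each
    # position independently (hence "position-wise").
    return np.maximum(0, z @ W1 + b1) @ W2 + b2

z = np.random.randn(3, d_model)   # one vector per input position
print(ffn(z).shape)               # (3, 512)
```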

Decoder Layer:

Decoder Layer

Note: Wherever you see the word Attention in the diagram above, it performs the same operation as discussed earlier.

Step 1: The first Attention sub-layer of the decoder layer takes as input the combined data of the positional encoding and the previous decoder layer's output.

Step 2: Its output is fed, together with the encoder's output, into the Encoder-Decoder Attention layer.

Step 3: The result of the Attention layer is passed as input to the Feed-Forward Neural Network.

Step 4: A linear function is applied to the FFNN output, which acts as a flatten operator.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector.

Step 5: The flattened vector is fed into the Softmax, which turns those scores into probabilities and produces a word as output at each time step.
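Here is a minimal sketch of this Linear + Softmax step, assuming a toy vocabulary size and a random projection matrix (real vocabularies run to tens of thousands of entries, and the projection is learned):

```python
import numpy as np

d_model, vocab_size = 512, 10000   # vocab size is illustrative only
W_out = np.random.randn(d_model, vocab_size)

decoder_output = np.random.randn(d_model)   # vector for one time step
logits = decoder_output @ W_out             # the "logits vector"

probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                 # softmax -> probabilities
next_word_id = int(np.argmax(probs))        # pick the most likely word
```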

Hopefully you now understand the Encoder and Decoder with respect to a single cell. Now let us understand the stacked Encoder and Decoder.

Stacked Encoder and Decoder

Stacked Encoder and Decoder Fig 2.1

The input ("How are you") is fed into the encoder stack, and after the input is processed, the output of Encoder 6 is fed into every layer of the decoder stack, as shown in Fig 2.1.

The decoder stack outputs the application's end result.

Optimizations used in the Transformer:

Transformer Attention, Fig 2.0
  • Layer normalization is applied after each attention sub-layer and each FFNN sub-layer.
  • Similar to ResNet, skip connections are introduced around trainable layers such as attention and FFNN. These skip connections help when a trainable layer has a negative effect or no effect on the network (see the sketch after this list).
  • The loss function used is cross-entropy loss, and the Adam optimizer is used for training.
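Here is a minimal sketch of that "Add & Norm" pattern around each sub-layer (skip connection followed by layer normalization); the sub-layer is stubbed out as an identity function purely for illustration:

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    # Normalize each position's vector to zero mean, unit variance.
    mean = z.mean(axis=-1, keepdims=True)
    std = z.std(axis=-1, keepdims=True)
    return (z - mean) / (std + eps)

def sublayer(x, f):
    # Skip connection around the sub-layer f, then LayerNorm.
    return layer_norm(x + f(x))

x = np.random.randn(3, 512)
out = sublayer(x, lambda z: z)   # stub; real code would pass the
                                 # attention or FFNN function here
```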

Unboxing the terminology of the architecture:

If you look at Fig 2.0 above, there are still some keywords that have not been discussed yet; let us discuss them now.

There is a term called "Positional Encoding", which is marked in the green box in Fig 2.0.

Remember that the RNN architecture deals with text while preserving its sequence order. But the Transformer takes the input text as a whole, so it fails to preserve the sequence. Hence, in order to preserve the order, we inject the sequence information into the input using "positional encoding".
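The paper does this with fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), which are simply added to the word embeddings. A minimal sketch:

```python
import numpy as np

def positional_encoding(n_positions, d_model=512):
    pos = np.arange(n_positions)[:, None]   # positions 0..n-1
    i = np.arange(0, d_model, 2)[None, :]   # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

x = np.random.randn(3, 512)      # word embeddings for 3 tokens
x = x + positional_encoding(3)   # inject the order information
```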

There is another term called "Multi-Head Attention", which is marked in the orange box in Fig 2.0.

This is not new; we have already discussed it. It is just the attention architecture, with one change: when we use multiple sets of weight matrices to obtain the Query, Key, and Value vectors, such attention is called "Multi-Head Attention".
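A minimal sketch of multi-head attention, assuming the paper's configuration of 8 heads of size 64 and random placeholder weights; each head runs the same scaled dot-product attention shown earlier, and the heads are then concatenated and projected back to 512 dimensions:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

d_model, n_heads, d_k = 512, 8, 64
x = np.random.randn(3, d_model)

# One separate set of Q/K/V weight matrices per head.
heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
    heads.append(attention(x @ W_Q, x @ W_K, x @ W_V))

# Concatenate the 8 heads (3, 8 * 64) and project back to d_model.
W_O = np.random.randn(n_heads * d_k, d_model)
out = np.concatenate(heads, axis=-1) @ W_O   # shape (3, 512)
```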

Conclusion:

The Transformer is a state-of-the-art architecture that led to other state-of-the-art models like BERT, ALBERT, etc.

Hope this article helps you grab the knowledge about Transformers.
