Transformers: Attention is all you need — Zooming into Decoder Layer
Please refer to the blogs below before reading this:
Introduction to Transformer Architecture
Transformers: Attention is all you need — Overview on Self-attention
Transformers: Attention is all you need — Overview on Multi-headed attention
Transformers: Attention is all you need — Teacher Forcing and Masked attention
Now let us look at the full decoder block; a sample figure is shown here.
Let's zoom into this and see what the decoder looks like. While training we have the full translation as input, but we cannot use it directly: this input must be used with masking. Now we will look at the first transformation. From the first input words and their embeddings we construct the vectors Q, K and V for each input word. The figure below shows these calculations. For each input word we generate the Q, K and V vectors using the WQ, WK and WV matrices, and we do this for every timestep. This is the first transformation the decoder layer performs.
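To make this concrete, here is a minimal NumPy sketch of that first transformation (the sizes T and d and the variable names are illustrative, not taken from the lecture): each projection is a single matrix multiplication that produces Q, K and V for every timestep at once.

```python
import numpy as np

T, d = 5, 8                      # 5 target words, embedding size d (illustrative values)
rng = np.random.default_rng(0)

X  = rng.normal(size=(T, d))     # decoder input embeddings, one d-dimensional row per timestep
WQ = rng.normal(size=(d, d))     # learned projection matrices, each (d, d)
WK = rng.normal(size=(d, d))
WV = rng.normal(size=(d, d))

# One matrix multiply per projection gives Q, K, V for all timesteps.
Q = X @ WQ                       # (T, d)
K = X @ WK                       # (T, d)
V = X @ WV                       # (T, d)
print(Q.shape, K.shape, V.shape) # (5, 8) (5, 8) (5, 8)
```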
Once we have Q, K and V for all the input words, self-attention computes a new representation of the word “Nenu” by looking at the other words of the sentence, with the caveat that it cannot see future words; that is where the masking comes in. We take Q, K and V along with a mask, where the mask matrix M has 0s in the lower triangle (including the diagonal) and -infinity in the upper triangle. After this masked self-attention block we get a new representation for each input word; while computing these representations, each word attends only to the words that came before its timestep, and all future words are masked out. The output vectors are given by softmax(QK^T / sqrt(d) + M) V.
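Continuing the sketch above, here is one way to write this step (the function name and the row-wise Q, K, V convention are my assumptions; the mask M, with 0s on and below the diagonal and -infinity above, follows the description given here):

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Masked (causal) self-attention over row-wise Q, K, V of shape (T, d)."""
    T, d = Q.shape
    # Mask M: 0 on and below the diagonal, -inf strictly above it,
    # so position t can only attend to positions <= t.
    M = np.triu(np.full((T, T), -np.inf), k=1)
    scores = Q @ K.T / np.sqrt(d) + M                         # (T, T)
    # Row-wise softmax; the -inf entries get zero attention weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                        # (T, d) new representations
```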
Every output vector is legitimate, i.e., it is a representation of its input word that has not seen any of the future word inputs. At this stage of processing, here is how the process diagram looks:
Te is the number of word inputs (sequence length) coming from the encoder
Td is the number of word inputs (sequence length) coming from the decoder
Now the encoder output and the masked self-attention output interact and, through a cross-attention mechanism, produce a new representation of all the vectors ‘s’. The output here is another new representation of the ‘s’ vectors (which are themselves already modified representations of the decoder inputs h), and this new representation is aware of all the inputs coming from the encoder. Hence we use the cross-attention mechanism to get the desired output.
In transformer terminology, whenever we are trying to compute a new representation of something, we call it a “Query”: it looks at everything else to find that new representation.
The Query always corresponds to the word whose new representation we are computing. In this case these are the words ‘s’ coming from the decoder.
The Key and Value come from the words that we are paying attention to; in this case they come from the encoder (e1).
The function F above basically computes softmax(QK^T / sqrt(d)) V.
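Here is a small sketch of F under the same row-wise convention (the function name and the matrices S and E are my notation): the queries are computed from the decoder representations and the keys and values from the encoder outputs.

```python
import numpy as np

def cross_attention(S, E, WQ, WK, WV):
    """Cross-attention sketch: S is (Td, d) decoder states, E is (Te, d) encoder outputs."""
    d = S.shape[1]
    Q = S @ WQ                                          # (Td, d) queries: what we re-represent
    K = E @ WK                                          # (Te, d) keys: what we attend to
    V = E @ WV                                          # (Te, d) values
    scores = Q @ K.T / np.sqrt(d)                       # (Td, Te)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (Td, d) new decoder-side representations
```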
Let us analyse the dimensions at each of the blocks:
For the decoder block:
- The inputs from the decoder have dimensions T x d, where each word input has ‘d’ dimensions
- The W matrices (WQ, WK, WV) are each of dimension d x d
- The total input is of dimension T x d
- The Q, K and V matrices obtained by multiplying the inputs with the Ws are each of dimension T x d
- The output passed from the masked self-attention is of dimension T x d (softmax(QK^T + M) is T x T and V is T x d, hence softmax(QK^T + M)V gives T x d)
For the cross-attention block, which uses the encoder outputs (similar to the above):
- The inputs from the encoder have dimensions T x d, where each word input has ‘d’ dimensions
- The W matrices are each of dimension d x d
- The Key matrix is T x d and the Value matrix is T x d (both computed from the encoder outputs); the Query matrix is T x d (computed from the decoder representations)
- The outputs obtained by multiplying the inputs with the Ws are of dimension T x d
- The output passed from the cross-attention is of dimension T x d (softmax(QK^T) is T x T and V is T x d, hence softmax(QK^T)V gives T x d); strictly, when the encoder and decoder lengths differ, softmax(QK^T) is Td x Te, V is Te x d, and the output is Td x d
The output from the above calculations is passed to a feed-forward network (FFN) with one hidden layer. Its weight matrices have dimensions (d x d1) and (d1 x d). Multiplying the cross-attention output, which is T x d, through these layers gives (T x d)(d x d1)(d1 x d), so the final output has dimensions T x d.
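A quick shape check of this FFN step (the sizes are illustrative; biases are omitted and ReLU is assumed as the nonlinearity, which is not specified above):

```python
import numpy as np

T, d, d1 = 5, 8, 32               # illustrative sizes
rng = np.random.default_rng(1)

H  = rng.normal(size=(T, d))      # cross-attention output, (T, d)
W1 = rng.normal(size=(d, d1))     # first FFN matrix, (d, d1)
W2 = rng.normal(size=(d1, d))     # second FFN matrix, (d1, d)

out = np.maximum(0, H @ W1) @ W2  # (T x d)(d x d1)(d1 x d) -> (T, d), ReLU in between
print(out.shape)                  # (5, 8)
```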
Everything shown above constitutes one decoder block, and its output goes into the next block. Since we used masking in the initial block, masking is needed in all the blocks that come after it as well. This block is repeated multiple times across the transformer network.
Decoder Output:
The final decoder output has to produce predictions at each and every position. Suppose the first input word <Go> is sent through multiple transformer blocks to predict a distribution over the vocabulary. The output from the blocks is a d-dimensional vector, but we need a V-dimensional vector. So how do we go from a d-dimensional to a V-dimensional vector?
Here WD is a matrix of dimension V x d; when it is multiplied with each d-dimensional output word vector we get a V-dimensional vector. What we want, however, is a probability distribution over the vocabulary V, which we obtain by applying a softmax.
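A minimal sketch of that projection and softmax (the vocabulary size and the names are illustrative):

```python
import numpy as np

def vocab_distribution(h, WD):
    """Project a d-dimensional decoder output h to a distribution over a vocabulary of size V."""
    logits = WD @ h                          # WD is (V, d), h is (d,) -> logits is (V,)
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()                   # probabilities over the vocabulary

d, V_size = 8, 100                           # illustrative sizes
rng = np.random.default_rng(2)
p = vocab_distribution(rng.normal(size=d), rng.normal(size=(V_size, d)))
print(p.shape, p.sum())                      # (100,) 1.0
```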
Please do clap 👏 or comment if you find it helpful ❤️🙏
References:
Introduction to Large Language Models — Instructor: Mitesh M. Khapra