Transformers: Attention is all you need — An Overview of Self-Attention

Shravan Kumar
10 min read · Nov 1, 2023


Please refer to the blog below before reading this:

Introduction to Transformer Architecture

How do we transition to transformers?

We have a basic encoder-decoder architecture, as in the figure below, with all the RNN connections, but this causes a problem (discussed in the previous blog).

Then we had the attention-based encoder-decoder model, where we kept all the encoder computations but no longer relied on a single final encoder state: we took the outputs of the encoder and computed an attention function at each step of the output, thereby producing a new context vector at each decoder step.

So with this model we were able to compute one time step in parallel, but NOT all the time steps t1, t2, t3, etc., because we have to wait for the previous timesteps' calculations to finish.

Hence this is the right time to shift to the transformer network, which again has an encoder and decoder architecture along with some other functional blocks. Here is how it looks below.

Here the outputs from the encoder are represented as z1 to z5. These are very similar to the outputs of a seq2seq RNN or an attention-based seq2seq RNN; the only thing that varies is the contextual representation of the outputs.

Now let us look at each of the yellow boxes above and deep dive into the mathematics behind each of the blocks.

What is Self-Attention?

The input sentence is initialized with random embeddings (not relying on any pre-trained embeddings such as GloVe/word2vec) and passed into the self-attention layer of the encoder to get output vectors that give a new representation of each input word. How do we compute attention here? Well, we know what attention is, and all it requires is a pair of vectors as input; so the attention calculation here also needs these two parameters.

At timestep 't' for the attention-based seq2seq RNN we used to compute attention roughly as follows
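αtj = softmax_j( fAtt(st-1, hj) ),   ct = Σj αtj · hj

where st-1 is the previous decoder state, hj are the encoder outputs, and ct is the context vector used at timestep t.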

If we want to know the representation of the word 'movie', call it s4, using the attention mechanism, can we calculate it as an attention-weighted sum of the representations of all the input words? Yes, we can.

Now, with the new attention mechanism, we will have an equation roughly of this format, and we will deep dive into what this fAtt is about later in this article
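α4j = softmax_j( fAtt(h4, hj) )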

then we can have
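s4 = Σj α4j · hj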

where the hj are the word embeddings for each timestep, and j runs from 1 to 5 as we have 5 words in the input sentence.

So the important point here is that we do not need to calculate all the alphas sequentially, since we already have all the hj (embeddings); hence we can calculate the alphas in parallel for all timesteps. As we are computing attention within the encoder, i.e., attention over the input itself, this is referred to as self-attention.

Let us take another example sentence to better understand this self-attention.

“The animal didn’t cross the street because it was too tired”

Does the word “it” in the sentence refer to “animal” or “street”? We know “it” is referring to the word “animal”. Let’s modify the sentence:

“The animal didn’t cross the street because it was congested”.

Now the word “it” is referring to the word “street”. Therefore it is important to establish a strong connection between the word “it” and the word “street” or “animal”, depending on the context.

This calls for an attention mechanism within the input sentence itself, which is why it is called self-attention (to distinguish it from cross-attention, which we will see later).

The goal

Given a word in a sentence, we want to compute the relational score between that word and the rest of the words in the sentence. This lets us calculate the similarity scores shown in the heatmap below; the numeric values are shown in the table.

We can think of the headers in the first column as si and the headers in the first row as hj (just for convenience). However, now both vectors si and hj, for all i and j, are available at all times (whereas in the seq2seq model, hj for all j were available but si was obtained one at a time in the decoder stage).

Now let us discuss what attention function to use. Recall the score function used in the seq2seq model, which took roughly this form:
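score(st-1, hj) = Vattᵀ · tanh( Uatt · hj + Watt · st-1 )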

There are 3 vectors (s, h, v) involved in computing the score at each timestep (of the decoder).

There are two linear transformations happening here, with Watt and Uatt respectively, and one non-linearity in the form of tanh. Hence a vector (the green vector below) is formed by the internal calculation done by all these transformations. Here is how it looks.

Finally, a dot product of 2 vectors (Vatt and the green vector) forms the score. In the earlier case, this st-1 was generated at every time step; but here, for hi and hj, we just have one word embedding each, whereas in the previous method we had 3 vectors (s, h, v) participating. So now, with only hi and hj, how do we get three vectors to participate so that we arrive at a scoring equation?

How do we get the 3 vectors for each word embedding? How do we do it?

Matrix transformation is the key here.

So for each hi we need three vectors: a query, a key and a value, which together will form the Query matrix (Q), Key matrix (K) and Value matrix (V). Let us assume we have these matrices as developed in the figure below.

WQ, WK and WV are the corresponding linear transformations (parameter matrices).

Let us focus first on calculating the first output (z1) from the self-attention layer based on the input word embedding vectors; z1 is the contextual representation of the word “I”.

Let us visualize that for each word embedding we have already calculated 3 vectors, as shown below. We will learn how all these numbers are calculated a little later.

We can see below that each of the embeddings, i.e., h1 to h5, has three arrows pointing outward, where we are going to calculate a different value from each vector.

As a start, let's take h1 as the first input word embedding and apply a linear transformation with WQ to form a vector q1; similarly, we apply transformations with WK and WV to form the vectors k1 and v1 respectively. So basically, one word embedding is converted into 3 different vectors by applying linear transformations.

Similarly, we can do the same vector calculations for all 5 words (h1 to h5). The figure below illustrates the calculations done for h1 and h5 respectively.

The initial representation of self-attention with h1 and h5 as inputs is shown below.

The first linear transformation, with WQ, gives a new vector called the query (q1).

The second linear transformation, with WK, gives a new vector called the key (k1).

The third linear transformation, with WV, gives a new vector called the value (v1).
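In equation form:

q1 = WQ · h1,   k1 = WK · h1,   v1 = WV · h1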

If all the linear transformations are applied to the last word as well, then this is how it looks.

Earlier, for attention with RNNs, we were calculating a score between st-1 (the state of the decoder at timestep t-1, which is fixed) and hj (the input words, giving the importance of the jth word at timestep t). Here too we have a scoring function between the ith and jth words, so our scoring function looks like score(q1, kj), where kj iterates through all 5 words, giving 5 scores for the query q1 with respect to all 5 input words.

The score function is basically a dot product of two vectors, and the alphas are calculated by applying a softmax transformation to the scores.

For example, if we are calculating α12, here is how the equation looks:
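α12 = exp(q1 · k2) / Σj exp(q1 · kj)

(the scaling of the scores by √dk, shown in the final block, is omitted here for simplicity)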

Here ‘q’ and ‘k’ have participated in calculating the alphas. Once the softmax is taken, it gives all the alphas; then how are we getting the z values? Here is how the equation for z is given; it depends on the value vectors ‘v’:
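z1 = Σj α1j · vj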

Similarly, for all the ‘z’ values, here is how we are going to calculate them with all the other modifications included.

Recap: how these 3 vectors (Q, K, V) are used and computed

Can we vectorize all these computations and compute the outputs (z1, z2, …., zT) in one go?

Yes, we can parallelize the outputs. Assume we have T input word representations; then we will get T output representations for the query.

For the query (Q) matrix, here is how the calculations are done with WQ and the 'd x T' matrix of input word vector representations. We can assume the input word vectors have 64 dimensions, hence the input matrix is 64 x T. Assuming WQ is a 64 x 64 matrix, the final query (Q) matrix has the same shape as the input word vector matrix, i.e., 'd x T', which in this case is 64 x T.
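Writing the input embeddings h1 to hT stacked as the columns of a d x T matrix H, this is simply Q = WQ · H: a (64 x 64) matrix times a (64 x T) matrix gives the (64 x T) query matrix Q.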

If we have ‘T’ input representations, we get ‘T’ output representations; what can change is the dimension parameter (it can be 64, 128, etc., as you wish). All the query computations are done in parallel for all ‘T’.

Similarly for key ‘K’ representation.

Similarly for value ‘V’ representation.
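As a rough sketch of these parallel projections (in NumPy, with H, W_Q, W_K, W_V as illustrative names and random values standing in for the learned parameters):

```python
import numpy as np

d, T = 64, 5                 # embedding size and number of input words
H = np.random.randn(d, T)    # input word embeddings h1..hT stacked as columns

# parameter matrices (learned during training; random here for illustration)
W_Q = np.random.randn(d, d)
W_K = np.random.randn(d, d)
W_V = np.random.randn(d, d)

# all T queries, keys and values are obtained with one matrix multiplication each
Q = W_Q @ H    # 64 x T
K = W_K @ H    # 64 x T
V = W_V @ H    # 64 x T
```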

By now we have parallelized all the vector calculations of K, Q and V. Can the rest of the equations also be calculated in parallel? Can we produce the entire output Z in one go? Yes, we can!
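With each of Q, K and V laid out as a d x T matrix, the whole output can be written roughly as

Z = softmax( (Qᵀ · K) / √dk ) · Vᵀ

where the softmax is applied row-wise and dk is the dimension of the key vectors (64 here).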

In the above equation we can calculate Q, K and V in parallel. Let us check this with an illustration for one of the z values, for example z2.

Here let us check the dimensions of each of the matrices:

Q is 64 x T and Qᵀ is T x 64

K is 64 x T

When Qᵀ and K are multiplied we get a T x T matrix, which is essentially the attention matrix.

V is 64 x T and Vᵀ is T x 64

When (Qᵀ K), with dimensions T x T, is multiplied by Vᵀ, we get a T x 64 matrix.

Let us multiply the above matrix with Vᵀ. Here is the output:

The final output for z2 is given as

z2 can also be represented in terms of the alphas as
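z2 = Σj α2j · vj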

All the matrix multiplications can be done in parallel, so we obtain the outputs z (z1 to z5) in a parallel way.

As a recap, the final vectorized formulation of the output is shown below.

This is the final, full self-attention block with all its internal calculations, as shown below: it starts with the linear transformations, followed by a matrix multiplication, scaling, softmax and a final matrix multiplication, which gives Z as the output containing all the values z1 to z5 (or zT in general).
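Putting everything together, here is a minimal end-to-end sketch of this self-attention block in NumPy, following the d x T layout used above (random matrices stand in for the learned parameters WQ, WK, WV):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """H is d x T (one embedding per column); returns Z as T x d (row i is z_i)."""
    d_k = H.shape[0]
    Q = W_Q @ H                        # linear transformations, all words in parallel
    K = W_K @ H
    V = W_V @ H
    scores = (Q.T @ K) / np.sqrt(d_k)  # T x T matrix of scaled dot-product scores
    alphas = softmax(scores, axis=-1)  # each row holds one word's attention weights
    return alphas @ V.T                # T x d: row i is z_i, the new representation of word i

# usage with the dimensions used in this article (d = 64, T = 5 words)
d, T = 64, 5
H = np.random.randn(d, T)
W_Q, W_K, W_V = (np.random.randn(d, d) for _ in range(3))
Z = self_attention(H, W_Q, W_K, W_V)
print(Z.shape)   # (5, 64)
```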

Conclusion

We have now seen how the self-attention layer uses parallel computation to produce a contextual representation of the input word vectors that are provided.

Please do clap 👏 or comment if you find it helpful ❤️🙏

References:

Introduction to Large Language Models — Instructor: Mitesh M. Khapra
