Introduction to Transformer Architecture

Shravan Kumar
4 min read · Oct 27, 2023

Let us take an example, or use case, of English sentence translation: e.g., “I enjoyed the movie transformers”, which gets translated to “Nenu sinima ṭrans‌pharmar‌lanu asvadinchanu” in Telugu. As this is a machine translation problem, it has to at least use an RNN-based encoder and decoder architecture to get the desired output.

Here is a sample figure illustrating the encoder and decoder based architecture.

Let us look at the sample RNN figure below.

If we are using RNNs, the input data is fed into the encoder block in one go. But the computations inside each of the encoder blocks (yellow rectangles below) do not happen in one go; they happen sequentially.

So what is the benefit of these sequential computations?

The major reason is that for each word in the input data, we already have embeddings that were precalculated based on the context of a corpus (e.g., Wikipedia) using word2vec, FastText, GloVe, etc. But what we want in this translation example is for the embeddings to be calculated based on the context of the entire input sentence that is passed into the encoder block.
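To see why precalculated embeddings are not enough, here is a minimal sketch (with a toy, made-up embedding table, not real word2vec output) showing that a static lookup returns the identical vector for a word no matter which sentence it appears in:

```python
import numpy as np

# Toy static embedding table (hypothetical 4-dim vectors, for illustration only).
embeddings = {
    "bank": np.array([0.1, 0.5, -0.2, 0.7]),
    "river": np.array([0.3, -0.1, 0.6, 0.0]),
    "money": np.array([-0.4, 0.2, 0.1, 0.9]),
}

def lookup(word):
    # A static lookup ignores the sentence the word appears in.
    return embeddings[word]

# "bank" gets the same vector in both sentences,
# even though its meaning differs with context.
v1 = lookup("bank")  # as in "I sat on the river bank"
v2 = lookup("bank")  # as in "I deposited money at the bank"
print(np.array_equal(v1, v2))  # True
```

A contextual encoder, by contrast, would produce different representations for “bank” in those two sentences, which is exactly what the RNN encoder computes step by step.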

This contextual representation can be achieved in different ways: one is the one-directional approach and the other is the bi-directional approach. The contextual representation is really good for our desired output, but what is not good is that, in the interest of building it, the computation becomes sequential. We need to reduce the computational complexity, especially in the encoder part where the sequential calculations happen!

The figure below shows the one-directional approach, but we can also develop this using the bi-directional approach. For simplicity, we focus on the one-directional approach only.

Here h0 is called the initialization vector, and the final state h5 is called the concept vector / thought vector / context vector / annotation. This h5 goes into the decoder model, and when we use traditional methods like RNNs, it is the only vector that is useful for the outcome.
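The encoder loop above can be sketched in a few lines. This is a bare-bones RNN with random weight matrices standing in for trained parameters (the names `W`, `U` and the hidden size are illustrative choices, not from the original figure); note how each step needs the previous hidden state, which is what forces sequential computation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # toy hidden size
W = rng.normal(size=(d, d))    # hidden-to-hidden weights (random stand-ins)
U = rng.normal(size=(d, d))    # input-to-hidden weights (random stand-ins)
x = rng.normal(size=(5, d))    # 5 input word embeddings

h = np.zeros(d)                # h0: the initialization vector
states = []
for t in range(5):             # each step needs h_{t-1}: inherently sequential
    h = np.tanh(W @ h + U @ x[t])
    states.append(h)

context_vector = states[-1]    # h5: the thought/context vector fed to the decoder
```

Only `context_vector` (h5) reaches the decoder in the vanilla RNN setup; the intermediate states h1–h4 are discarded.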

In the decoder part, we do not have much option to parallelize the computations, but for the encoder we need to parallelize, and this can be achieved using certain methodologies (the attention mechanism resolves this). The decoder output also does not capture the alignment of words between the source sentence and the target sentence, i.e., that the decoder output word ‘Nenu’ has to be aligned with ‘I’ on the encoder side, or that ‘sinima’ has to be aligned with ‘movie’. (The attention mechanism resolves this too.)

So we want a model similar to the one above, but we want it to work in a parallel manner, where we do not take something that happened at timestep ‘t-1’ and feed it as an input at timestep ‘t’. The encoder part needs to be changed to be non-sequential, but the decoder part can remain sequential.

Attention: a quick tour

Here we will try to build an attention-based RNN model. Suppose that we create copies of the hidden state vectors (h1, h2, …, h5) and make them available to the decoder. Once all the vectors are available to the decoder, we can drop the encoder block (until decoding completes).

The attention mechanism consumes these vectors (context-aware representations) as one of its inputs. So now, for the first word given as input to the decoder, the context vector for this input is given as
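As a rough sketch of what such a context vector computation looks like: the standard recipe scores each encoder state against the decoder's previous state, normalizes the scores with a softmax into attention weights, and takes the weighted sum of the encoder states. The dot-product score below is a simplification (Bahdanau-style attention uses a small feed-forward alignment network instead), and all tensors are random stand-ins:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, encoder_states):
    # Score each encoder hidden state against the previous decoder state.
    # (Dot-product score for simplicity; the original paper uses a learned
    # alignment network.)
    scores = np.array([s_prev @ h for h in encoder_states])
    alphas = softmax(scores)   # attention weights over input positions
    # Context vector: weighted sum of all encoder hidden states.
    context = (alphas[:, None] * encoder_states).sum(axis=0)
    return context, alphas

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 4))    # copies of h1..h5 from the encoder
s0 = rng.normal(size=4)        # the decoder's previous state
c1, alphas = attention_context(s0, H)
print(round(alphas.sum(), 6))  # 1.0 — the weights form a distribution
```

Unlike the vanilla RNN decoder, which sees only h5, this context vector blends all of h1–h5, weighted by relevance to the current decoding step.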

Let us look at the alignment of words and also the alignment function:

This is a matrix, or heatmap, of size: number of output words (rows) × number of input words (columns).
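Such a heatmap is simply the attention weights from every decoder step stacked into a matrix: one row per target word, one column per source word, each row summing to 1. A minimal sketch (random scores stand in for a trained alignment function) for our example sentence pair:

```python
import numpy as np

rng = np.random.default_rng(2)
out_words = ["Nenu", "sinima", "transformerlanu", "asvadinchanu"]
in_words = ["I", "enjoyed", "the", "movie", "transformers"]

# One row of alignment scores per decoder step (random stand-ins here;
# in a trained model these come from the alignment function).
scores = rng.normal(size=(len(out_words), len(in_words)))

# Row-wise softmax: each output word's attention over the input words.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
heatmap = exp / exp.sum(axis=1, keepdims=True)

print(heatmap.shape)  # (4, 5): output words x input words
```

In a trained model, the bright cells of this heatmap reveal the learned alignments, e.g., a high weight in the (‘Nenu’, ‘I’) cell.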

Takeaway 1:

Takeaway 2:

Major Limitation:

Everything about the RNN-based sequence-to-sequence model seems good so far. It performs well for translation using an attention mechanism. However, there is a major limitation in training the model.

Given a training example, we can’t parallelize the sequence of computations across time steps.

Wishlist:

Can we come up with a new architecture that incorporates the attention mechanism and also allows parallelization (and, of course, gets rid of the vanishing/exploding gradient problems)? Let’s see if we can get this.

Please do clap 👏 or comment if you find it helpful ❤️🙏

References:

Introduction to Large Language Models — Instructor: Mitesh M. Khapra
