A Deep Dive into Transformers

saravana alagar · Published in Analytics Vidhya · 5 min read · Feb 23, 2021

If you have not heard about Transformers in recent times in the field of NLP (Natural Language Processing) or Artificial Intelligence, then you are probably living under a rock. An array of Transformer-based architectures such as BERT, SpanBERT, Transformer-XL, XLNet, GPT-2, etc. has been released over the past couple of years. OpenAI’s GPT-3 took the internet by storm with its ability to perform extremely well on tasks such as Q&A, comprehension, and even programming (the best part was where it added the comments). Check it out here to see all that GPT-3 can do.

But all of this started with a research paper released back in 2017, “Attention Is All You Need”. The paper proposed a novel deep learning architecture to process sequential data without the use of RNNs or CNNs.

Transformer Architecture

The architecture looks reasonably simple, and the paper claims it achieves higher accuracy on language tasks with less training time than CNN/RNN-based models. The attention mechanism is the cornerstone of this architecture, so let us first try to understand attention.

Let us consider the sentence below.

In general, we would convert the words in this sentence into token embeddings and send the embeddings into the network.

Token Embeddings
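As a concrete illustration, here is a minimal sketch of how tokens could be turned into embeddings in PyTorch. The vocabulary, sentence, and dimensions below are made up purely for illustration and are not taken from the notebook.

```python
import torch
import torch.nn as nn

# A toy vocabulary and sentence (purely illustrative).
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
sentence = ["the", "cat", "sat", "on", "the", "mat"]
token_ids = torch.tensor([[vocab[w] for w in sentence]])   # shape: (1, 6)

embed_dim = 8                                # embedding size (d_model)
embedding = nn.Embedding(len(vocab), embed_dim)

token_embeddings = embedding(token_ids)      # shape: (1, 6, 8)
print(token_embeddings.shape)
```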

Unlike RNN-based models, the Transformer needs to see all the tokens at the same time since it has no recurrence. The attention mechanism tells the model which tokens to focus on, or attend to, based on position (order) and context. The correlation between the embeddings is calculated by taking the inner products among them. If we project these embeddings into n-dimensional space, the dot product yields a high scalar value when the angle between the vectors is small. For example:

Dot Product and Context

If θ is close to 0, the dot product results in a higher value (cos 0 = 1). You can consider each dimension of the embedding vector as a context. If two words are close in this contextual space, their correlation will be higher. Now, we take all the embeddings and compute the dot products.
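As a quick numerical aside on the angle intuition (the vectors below are invented for illustration), nearly parallel vectors give a large dot product, while unrelated ones give a value close to zero:

```python
import torch

# Two "embeddings" that point in similar directions, and one that does not.
cat = torch.tensor([0.9, 0.8, 0.1])
dog = torch.tensor([0.8, 0.9, 0.2])
car = torch.tensor([-0.1, 0.2, 0.9])

print(torch.dot(cat, dog))   # relatively large -> small angle, related context
print(torch.dot(cat, car))   # relatively small -> large angle, unrelated context
```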

Calculating Attention

Here we take two copies of our token embeddings and call them Key and Query. We take the dot product between Key and Query to get scalar values as a result. The darker the square, the higher the value of the product. Since these values can be high in magnitude, we divide the output by the square root of the vector size, which gives a nice scaling effect. Now we apply softmax along the y-axis to calculate the attention.

Contextualized Vectors

Now we multiply the attention weights with the original token embeddings (the Value) to get the contextualized embeddings. Every token in this embedding will have some amount of context diffused from the other tokens; the amount is given by the attention map. Take a little while and imagine how this happens.
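Putting the last few steps together, here is a minimal sketch of scaled dot-product attention. Since this is self-attention, Query, Key, and Value are all the same token embeddings; the tensor shapes are illustrative, not the notebook's exact ones.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (batch, seq_len, d_k)
    d_k = query.size(-1)
    # 1. Dot product between every Query and every Key -> (batch, seq_len, seq_len)
    scores = torch.matmul(query, key.transpose(-2, -1))
    # 2. Scale by sqrt(d_k) to keep the magnitudes in a comfortable range
    scores = scores / math.sqrt(d_k)
    # 3. Softmax turns each row into attention weights that sum to 1
    attention = F.softmax(scores, dim=-1)
    # 4. Weighted combination of the Values -> contextualized embeddings
    return torch.matmul(attention, value), attention

# Self-attention: Key, Query, and Value are all the token embeddings.
token_embeddings = torch.randn(1, 6, 8)          # (batch, tokens, embed_dim)
contextualized, attn = scaled_dot_product_attention(
    token_embeddings, token_embeddings, token_embeddings)
print(contextualized.shape, attn.shape)          # (1, 6, 8) (1, 6, 6)
```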

Now let us look into the encoder part, i.e., the left-hand side of the Transformer architecture.

Encoder

Once we understand the attention mechanism, the rest of the components in the architecture are reasonably straightforward; but wait a minute! Do you notice something called “Multi-Head Attention” in the Encoder? It merely splits the attention process along the embedding dimension to add versatility to the network.

Multi-Head Attention

Notice that there are feed-forward (linear projection) layers before the embedding heads; these learn to mix and match the embeddings into the multiple heads depending upon the context. In the end, the computed contextualized embedding heads are concatenated back.

Ok, now we are all set to code the encoder part!
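The complete implementation is in the Colab notebook linked at the end; below is only a minimal sketch of a single encoder layer, using PyTorch's built-in nn.MultiheadAttention for brevity (the embedding size, number of heads, and other dimensions are illustrative defaults, and batch_first=True assumes a reasonably recent PyTorch).

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention + feed-forward,
    each wrapped in a residual (skip) connection and layer norm."""

    def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                               dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # Self-attention: Query, Key, and Value are all the same sequence.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.dropout(attn_out))   # residual + norm
        ff_out = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_out))     # residual + norm
        return x

# Usage: a batch of 2 sentences, 6 tokens each, embedding size 512.
layer = EncoderLayer()
print(layer(torch.randn(2, 6, 512)).shape)   # torch.Size([2, 6, 512])
```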

Now the Multi-Head Attention.
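Here is a from-scratch sketch of multi-head attention itself: project the Query, Key, and Value, split them along the embedding dimension into heads, run scaled dot-product attention per head, and concatenate the heads back. Again, the class and dimensions are illustrative, not the notebook's exact code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Learned projections that "mix and match" the embeddings per head.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value, mask=None):
        batch, q_len, _ = query.shape
        k_len = key.size(1)

        # Project, then split the embedding dimension into separate heads:
        # (batch, seq, embed) -> (batch, heads, seq, head_dim)
        q = self.q_proj(query).view(batch, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(key).view(batch, k_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(value).view(batch, k_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention per head.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attention = F.softmax(scores, dim=-1)
        context = torch.matmul(attention, v)          # (batch, heads, q_len, head_dim)

        # Concatenate the heads back and project.
        context = context.transpose(1, 2).contiguous().view(batch, q_len, -1)
        return self.out_proj(context)

mha = MultiHeadAttention()
x = torch.randn(2, 6, 512)
print(mha(x, x, x).shape)   # torch.Size([2, 6, 512])
```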

The output of the encoder part (the latent vector) goes to the decoder part.

Decoder

The Decoder part is pretty simple: it takes the target sequence and applies positional encodings, token embeddings, and multi-head attention. But the multi-head attention layer takes both the encoder outputs and the target embeddings to compute attention. Self-attention over the target sequence is done by the “Masked Multi-Head Attention” layer, which simply masks the words that the network is supposed to predict.

Note that there are highway/skip (residual) connections running along both the encoder and decoder parts. These take care of the flow of positional information throughout the network (remember, there is no concept of recurrence here).

Let us code the Decoder part!
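As with the encoder, see the notebook for the complete code; this is only a minimal sketch of one decoder layer: masked self-attention over the target sequence (a causal mask hides the words still to be predicted), cross-attention where Queries come from the target and Keys/Values from the encoder output, then a feed-forward block, each with a residual connection and layer norm.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        self.masked_self_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                      dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, target, encoder_out):
        seq_len = target.size(1)
        # Causal mask: position i may only attend to positions <= i.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                            device=target.device), diagonal=1)

        # 1. Masked self-attention over the target sequence.
        self_out, _ = self.masked_self_attn(target, target, target, attn_mask=causal_mask)
        x = self.norm1(target + self.dropout(self_out))

        # 2. Cross-attention: Queries from the target, Keys/Values from the encoder.
        cross_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + self.dropout(cross_out))

        # 3. Position-wise feed-forward network.
        x = self.norm3(x + self.dropout(self.feed_forward(x)))
        return x

# Usage: 6 source tokens encoded, 5 target tokens generated so far.
layer = DecoderLayer()
print(layer(torch.randn(2, 5, 512), torch.randn(2, 6, 512)).shape)   # torch.Size([2, 5, 512])
```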

I trained a German–English translation model with a Seq2Seq Transformer network using the above-mentioned encoder and decoder design. The results are seen below.

Here are some of the translations with their respective attention maps:

Attention maps for German-English Translations

Refer to this Colab notebook for the complete code: https://colab.research.google.com/drive/1pA-mafHWx6Jh4xzEAKqk1Oi2RsX1vO7l?usp=sharing

You can train the Transformer for your own purpose, say Q&A, chatbots, comprehension, machine translation, etc., by engineering the input and output sequences accordingly.

Cheers!
