But WTH is a Transformer?

Jeydev
Dec 3, 2021 · 5 min read


An in-depth understanding of the SOTA "Transformer" architecture.

It all started back in 2017, when Vaswani et al. of Google published a revolutionary research paper named "Attention Is All You Need," with attention as its secret recipe. Just four years later, we now have large, billion-parameter language models like GPT, Turing NLG, BERT, etc., doing many impressive language tasks, from summarizing news to guiding us to write proper code.

An image of OpenAI's GPT-3 generating answers to questions (source: https://jalammar.github.io/)

But how do they all work? Let's dive in and find out.

Note: Before diving in, I assume you have a basic understanding of deep neural nets; otherwise, you may find this blog series difficult. I have divided this topic into several parts. In this post, we will build the basic intuitions that help you understand Transformers better. By the end of the series, you will have a fair understanding of Transformers and of training models in Hugging Face, not only Transformers but also BERT, GPT, and many more.

Problems with RNNs

Before diving into the Transformer architecture, we first need to understand why we need it. Why not a simple recurrent RNN or LSTM?

Well, there are a few problems and limitations with RNNs and LSTMs:

An RNN translates by processing inputs one by one, passing previous hidden states along to produce outputs.
  1. They often fail to give good results over long sequences of text (in this case) and tend not to take all the information in the sequence into account as the sequence length becomes larger and larger. You can think of it as a window sliding over the text: as the sequence gets larger, the window just can't cover all parts of it. A more mathematical way of putting it is that as the sequence gets longer, the model experiences the vanishing gradient effect: during backpropagation we compute gradients through a very long chain of functions (literally many matmuls) with hidden states flowing from previous units, and the gradients become so small that the weights are updated only by a tiny amount (see the short sketch after this list).
  2. To the rescue came LSTMs and GRUs. Unlike plain RNNs, they were able to handle fairly long sequences reasonably well. However, they were only good compared to RNNs, not super impressive.
  3. Pretraining on a large corpus and fine-tuning for different tasks doesn't work well with RNNs.
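
Here is a rough PyTorch sketch of the vanishing-gradient effect mentioned in point 1: we measure how much gradient signal from the last time step reaches the very first input token as the sequence grows. The numbers are purely illustrative; with the default initialization the gradient at the first token typically shrinks sharply for long sequences, though with some seeds it can instead blow up, which is the flip side of the same problem (exploding gradients).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=8, batch_first=True)  # a plain tanh RNN

for seq_len in (5, 50, 500):
    x = torch.randn(1, seq_len, 8, requires_grad=True)
    out, _ = rnn(x)
    out[:, -1].sum().backward()                  # backprop only from the last time step
    grad_first = x.grad[:, 0].norm().item()      # how much signal reaches token 0
    print(f"seq_len={seq_len:4d}  grad norm at first token = {grad_first:.2e}")
```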

Both architectures were improved over time using dropout and residual connections, and modified versions like the bidirectional LSTM and the LSTM with attention were introduced. Yes, you heard it right: "attention."

Attention is ALL you need

Let's begin with a basic question: what is attention?

Attention scores are matrices that help neural nets answer the question, "Which other words should a word focus on in a given sentence?" Consider the attention matrix of a well-trained neural translation model translating French to English: the word "She" pays more attention (focus) to the word "Elle," and the word "ate" pays more attention to "mange."

This gives the model more meaningful, context-rich information to process further. Of course, in practice, attention is computed between embedding matrices or hidden-state matrices. The two main types of attention are global and local attention: in global attention, we compute attention over all words in the sequence (like we did above), whereas in local attention we only compute attention over a subset of the input words. There is also "soft" and "hard" attention, which we will discuss later. Attention layers are parametric: the goal of the model is to build a meaningful, context-rich representation (the attention matrix) by updating the attention weights and biases during optimization and backpropagation.
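
To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the form of attention used in "Attention Is All You Need": the attention weights are softmax(QK^T / sqrt(d_k)), and the output is those weights applied to V. The Q, K, and V matrices below are random stand-ins for the learned projections of word embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how strongly each word relates to every other word
    weights = softmax(scores, axis=-1)        # the "attention matrix": each row sums to 1
    return weights @ V, weights               # context-rich output plus the weights themselves

seq_len, d_k = 4, 8                           # e.g. a 4-word sentence
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
print(weights.round(2))                       # each row shows where one word "focuses"
```

In a trained translation model, the row of this matrix for "She" would put most of its weight on "Elle," which is exactly the mapping described above.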

Let's talk transformers

The Transformer architecture, from "Attention Is All You Need"

Okay, let me tell you this: we will only have a quick overview of the Transformer architecture now, and we will discuss each individual component in depth in the next episode of this blog.

By looking at this architecture diagram, you can say that it is a seq2seq model with an encoder and a decoder processing text as sequences. But wait, there is a catch: Transformers don't actually process sequences one by one like RNNs and LSTMs do. The basic underlying idea is to process them in parallel, not sequentially (one by one).

To start off, we will talk about a supervised learning task using Transformers.

For example, let's say our task is question answering, and I have a data sample where I want the model to learn to map this input: "who is Elon Musk" to this output (shifted right): "He is the CEO of SpaceX". These inputs are not processed word by word like in an RNN; rather, the whole sequence is processed at once as a set of embeddings, as the sketch below illustrates.
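
Here is a small PyTorch sketch (with made-up token ids, purely for illustration) contrasting the two styles: an RNN consumes one embedding per time step and carries a hidden state forward, while a Transformer-style self-attention layer sees all positions of the sequence at once.

```python
import torch
import torch.nn as nn

token_ids = torch.tensor([[5, 12, 7, 42]])       # "who is Elon Musk" as made-up ids
embed = nn.Embedding(num_embeddings=100, embedding_dim=16)
x = embed(token_ids)                             # (1, 4, 16): the whole sentence as one matrix

# RNN style: a loop over time steps; step t cannot start before step t-1 finishes
rnn = nn.RNN(input_size=16, hidden_size=16, batch_first=True)
h = torch.zeros(1, 1, 16)
for t in range(x.size(1)):
    _, h = rnn(x[:, t:t+1, :], h)

# Transformer style: one self-attention call processes all 4 positions in parallel
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
out, weights = attn(x, x, x)
print(out.shape, weights.shape)                  # (1, 4, 16) and (1, 4, 4)
```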

Now a question should arise in your mind: if they are not sequential, how does our model know which word comes first and which comes second (positions)? Well, the paper's authors came up with a pretty cool technique called positional encoding, by which we inject positional information into our word embeddings. Don't worry about these terms; you will get the full intuition in the next episode, but a small sketch of the idea follows.
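
As a tiny preview, here is a NumPy sketch of the sinusoidal positional encodings from the paper: each position gets a unique pattern of sines and cosines, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), which is simply added to the word embeddings.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimensions 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # sine on even indices
    pe[:, 1::2] = np.cos(angles)                            # cosine on odd indices
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16): one positional vector per word, added to its embedding
```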

As an overview of this model architecture: we take the inputs and outputs and let the neural net process them in a way that produces context-rich results from the input during inference.

Key things to remember:

  1. Transformers do not process input sequentially.
  2. Since they are not sequential, they make use of positional encodings to get positional information.
  3. We will do some transformations with the attention mechanism, which we will discuss in the next episode.
  4. Also, take note of the problems with RNNs, and let's find out how Transformers solve them in the next episode.

If you found my content informative, make sure to follow my page so that you can learn a lot from me at no cost. With that said, we are ending our first part here. Let's meet in the next episode.

Good Bye from Jeydev 👋

Edit: part 2 is now out: https://medium.com/@ai.paperdeck/understanding-transformers-b1ec2517047e

