NLP be thy kingdom come oh great transformer

Transformer Neural Networks

One Architecture to rule them all.

Pickleprat
18 min read · Apr 2, 2024
The transformer architecture. Credit: Jay Alammar

“In the beginning was the RNN, and the RNN was with God, and the God was Transformer.” — John I. from the Bible, if he was a machine learning engineer.

As rightly quoted from our fictional bible hero, when the transformer architecture first came out, it was a literal god for the NLP community. A lifesaver of an architecture that was not only simple to understand but also capable of giving faaar superior results. Personifying the results attained from an RNN versus a Transformer would be like comparing a first-year Computer Science student with a senior dev with 10+ years of experience. (Juniors also happen to write complex but useless code, so that example really worked out.)

Story so far…

To summarize the theory covered in this NLP series of mine: so far we have discussed the Recurrent Neural Network architecture.

It had the following problems:

  • It didn’t have the ability to retain long-term memory, caused by the vanishing and exploding gradient problems. This restricted the length of the input sentence during training and testing.
  • It was sequential in nature, making it very inconvenient to maximize GPU usage and speed up the computations.

The next architecture introduced was LSTM.

And although this solved the long-term memory problem by introducing additive properties in the gradients, preventing them from vanishing or exploding, it still had its problems.

For starters, the architecture was still extremely complex and sequential in nature, which meant that you couldn’t use parallel processing to compute the output in the training and testing phases.

In the case of previous architectures like ANN and CNN, the inputs were not as complex as language. In an ANN the inputs are simply not that large, relatively speaking.

Whereas in a CNN, you do have images, but that’s it. They are of fixed size, and you can reduce the computation by applying convolution, pooling or flattening. At the end of the day, despite the data being large, it still is simple.

Language is a completely different story. In the case of using LSTMs for NLP, not only is the input size theoretically infinite, the words themselves have hierarchical relationships with one another, making the data much more complex to represent mathematically while conserving the true essence of the sentence.

This creates two new problems in NLP.

  1. There is no mechanism good enough to capture the complex nature or, for lack of a better word, essence of the language.
  2. Despite not capturing the essence of the language, the computations required for even simple NLP tasks like sentiment analysis take a very long training time, because of large sequence lengths and the data flow of RNNs going from one step to the next.

The state of the NLP community can be aptly described as “settling for the ginger girl with braces because no one else will ever love you”. She’s not good enough but she’s all you got. Ukwim?

Enter the Megan Fox of the NLP world: the transformer architecture. It fixed both issues. Not only did it make the computations a lot simpler to understand, but it also brought a significant improvement on NLP tasks. From simple sentiment analysis and NER, NLP jumped to complex language translation and text generation, in a jiffy.

The transformer architecture could be referred to as the beginning of the fourth industrial revolution. It didn’t simply kickstart NLP; it put us a leap forward in the field of Deep Learning itself, going from simple ANNs dealing with simple data types to understanding complex data types like language. The transformer architecture truly is the next generation of Deep Learning. It’s quite appropriately called the death of RNNs.

How does it do it? Let’s find out.

Disclaimer

You’ll find hundreds of articles calling Transformers the death of Recurrent Neural Networks (including this one). However, I have to say that the ominous statement in itself is not entirely accurate.

For NLP, the Transformer architecture is simply the better choice because it is faster. It enables parallel computation, which allows us to do more work in less time.

But is it really always the best choice? In NLP, probably yes. It is going to consistently give you better results. The more data you throw at it, the better it is going to perform. This is something that data scientists in the past didn’t see with other architectures for NLP, and it is one of the most remarkable achievements of the transformer architecture. All you need (along with attention) is better infrastructure and more data, and you’ll have a better model. Although this is expensive, it is still better than being uncertain about how to make a better model.

But does that mean RNNs and LSTMs are dead? I doubt it. For starters, if you’re tackling a small task, using a transformer would be overkill. LSTMs perform just fine for Sentiment Analysis or Named Entity Recognition on basic datasets.

Not to mention, LSTMs shine in domains like time series analysis.

The transformer architecture is the best thing to ever happen to deep learning; however, LSTMs aren’t useless. They aren’t dead. You don’t have to reach for a transformer every time. Sometimes the simpler answer is the best one.

The Solution to LSTM’s Problems

I’ve already discussed why the transformer architecture is such a big deal. It resolves all the issues with LSTM while not only getting far superior results but also being extremely simple.

Now by simple, I in no way intend to imply that the architecture is EASY. Simple and easy are two different things. The transformer architecture is simple but not easy.

The reason it is referred to as simple is that it does not use RNNs for language modelling. It uses simple feed-forward neural networks.

How fascinating is that? This simple fact makes it not only one of the easiest architectures to use but also super amenable to GPUs.

The computations for individual words are not only correlated with each other but can also be run in parallel on a GPU.

So this resolves all the issues of LSTM, right down to the nitty-gritty computations.

It’s faster! It’s simpler! It’s better! It’s PERFECT.

So how does it do what it do?

There are two key factors: self-attention and multi-head attention. Let’s understand the difference.

Source: https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

This image shows how self-attention works. The green blocks in it are known as Word Embeddings.

Word Embeddings

A word embedding is simply the context of an entire word encompassed in a single vector. If you’ve been in the NLP sphere for some time, I’m sure you’ve heard of the CBOW and Skip-gram architectures. CBOW is trained by using the previous and future words to predict the word in between, while Skip-gram does the reverse; either way, the contextual meaning of the word gets encompassed in a single vector. The simplest way to put it would be that it is a one-vector representation of the word.

It should be noted that the transformer architecture does not actually use the CBOW or Skip-gram architecture. To make the embeddings, the transformer has its own embedding layer, which is trained along with the rest of the model on the dataset. And since this layer should output word EMBEDDINGS and not LOGITS, there is no activation function attached to it.

Because of this, the embeddings are more specifically tuned to the dataset the model is trained upon.

So if you train a mini transformer for a specific use case like programming, the embeddings will be generated with the programming keywords in mind.

These embeddings are generated as follows:

  • Step 1: You take a large corpus of text and generate tokens using a specific tokenization algorithm, for example Byte Pair Encoding (BPE) or WordPiece.
  • Step 2: You convert the corpus into tokens. You create an embedding layer whose input layer has n neurons, where n is the total number of unique tokens generated from the corpus (the vocabulary size).
  • Step 3: You one-hot encode the tokens and send them as input to the embedding input layer.
  • Step 4: The embedding input layer sends them to the next embedding layer, the size of which you can choose. Typically it is 512 for small models and 768 or above for anything larger. This parameter is called the embedding size, typically denoted in transformer nomenclature as d_model.

Following these steps is going to generate a word vector of length d_model for each token provided as input.

For example: let’s generate embeddings for the corpus “Hello World”. A little too small, but hey, it will work just fine.

We will use basic space splitting as our tokenization algorithm. If you wish to look into the Byte Pair Encoding or WordPiece algorithms, you can find detailed tutorials on Hugging Face.

So our tokens are “Hello” and “World”. They go through one-hot encoding and get converted into the vectors: Hello == [0, 1] and World == [1, 0].

Let’s assume our d_model = 5.

In that case, inputs (2 x 2) * W_embed (2 x 5) = embeddings (2 x 5).

To be clear, the word embeddings are not generated based on just one INPUT. Here, our one single input happens to be our entire corpus. However, there could be multiple passages as input, and the concatenation of all those passages would be our corpus.
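To make this concrete, here is a minimal PyTorch sketch of the “Hello World” example above. The tokenizer, the one-hot step and the weight matrix W_embed are toy stand-ins for illustration (the exact index assignment of the one-hot vectors is arbitrary), not the transformer’s actual training code.

import torch

# toy vocabulary from space-splitting the corpus "Hello World"
vocab = {"Hello": 0, "World": 1}
d_model = 5

# one-hot encode the two tokens, one row per token
one_hot = torch.eye(len(vocab))              # shape (2, 2)

# randomly initialised embedding weights; in a real transformer these
# are learned along with the rest of the model
W_embed = torch.randn(len(vocab), d_model)   # shape (2, 5)

# (2 x 2) @ (2 x 5) = (2 x 5): one d_model-sized vector per token
embeddings = one_hot @ W_embed
print(embeddings.shape)                      # torch.Size([2, 5])

# in practice the one-hot multiplication is just a row lookup,
# which is exactly what torch.nn.Embedding does:
emb_layer = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
token_ids = torch.tensor([vocab["Hello"], vocab["World"]])
print(emb_layer(token_ids).shape)            # torch.Size([2, 5])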

However, note that the embeddings themselves do not carry any indication of the positions of the words. The information about the arrangement of the words is completely absent from the vectors. We don’t just need the word embeddings themselves; we also need information about the position of each word in the sentence. For this purpose we use Positional Encoding.

A detailed explanation for this is given in my article about GPT2.

You may not understand everything there, as I haven’t really discussed the transformer model yet, and since you are reading this, I have to assume you don’t know much about it (unless you’re a senior engineer just reviewing this article, in which case, got a job for me perhaps?). However, the positional encoding section is independent of the attention mechanism and can be read separately without reading the entire article.
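For completeness, here is a small sketch of the sinusoidal positional encoding scheme from the original “Attention Is All You Need” paper; learned positional embeddings (as in GPT-2) are an equally common alternative. It simply produces a (sequence length, d_model) matrix that gets added to the word embeddings.

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # positions 0..seq_len-1 as a column vector
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # frequencies shrink geometrically across the embedding dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                      # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)[:, : d_model // 2]   # odd dimensions
    return pe

# add positions to e.g. two token embeddings of size d_model = 512
embeddings = torch.randn(2, 512)
x = embeddings + sinusoidal_positional_encoding(seq_len=2, d_model=512)
print(x.shape)   # torch.Size([2, 512])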

Transformer Encoder

Self Attention And Multi Head Attention

Now that we know how to generate word embeddings, let’s understand how attention works.

Now, the thing about self-attention is that most explanations walk you through self-attention first and only then explain multi-head attention.

For me, this creates a sort of confusion as to what is really happening in the feed-forward pipeline of the transformer architecture.

I always thought the same self-attention procedure was repeated on the entire word embedding for each head, and that each head generated vectors based on the ENTIRE word embedding as input.

The point I’m trying to make is, we’re going to learn them both concurrently to understand the EXACT flow of how it happens inside the transformer architecture. This keeps the explanation a lot closer to reality and helps you get a better insight as to why it is done the way it is done.

  • Step 1: An input parameter for the total number of heads is provided. We’ll use the variable head_count to store this number. For the purposes of this tutorial we will use head_count = 8.
  • Step 2: The word embedding passes through a QKV layer. The QKV layer is typically of shape (d_model, 3 * d_model).
Embeddings going into the QKV layer.

Here the input token count is the number of tokens in the input sentence at an instance. Of course this is also batched, but we’ll ignore batches for the sake of simplicity.

  • Step 3: The QKV layer outputs an (input token count, 3 * 512) matrix, which is then split into 3 matrices: Query, Key and Value.
Breaking down the embedding

  • Step 4: Each of the Key, Query and Value matrices is further broken down into head_count parts. For our case, the transformer has 8 heads, so each matrix is split into 8 parts, each of which is an (input token count, 64) matrix.

Vector splits

Now (Qi, Ki, Vi) is sent to head i. This is what makes it Multi Head Attention: split parts of the Query, Key and Value matrices are sent to separate heads to be operated on independently. We haven’t yet covered what this “attention” is, but we will get to it.
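Here is a rough PyTorch sketch of Steps 2–4: a single linear QKV layer producing a (token count, 3 * d_model) matrix, which is split into Q, K, V and then split again across 8 heads. The variable names are mine, chosen to match the walkthrough, not any particular library’s API.

import torch
import torch.nn as nn

d_model, head_count = 512, 8
head_dim = d_model // head_count             # 64 dimensions per head
token_count = 6                              # e.g. a six-token input sentence

x = torch.randn(token_count, d_model)        # word embeddings (+ positional encoding)

qkv_layer = nn.Linear(d_model, 3 * d_model)  # the (d_model, 3 * d_model) QKV layer
qkv = qkv_layer(x)                           # (token_count, 3 * d_model)

# split into Query, Key, Value, each (token_count, d_model)
q, k, v = qkv.chunk(3, dim=-1)

# split each of them into head_count pieces of size head_dim
# shape becomes (head_count, token_count, head_dim) == (8, 6, 64)
q_heads = q.view(token_count, head_count, head_dim).transpose(0, 1)
k_heads = k.view(token_count, head_count, head_dim).transpose(0, 1)
v_heads = v.view(token_count, head_count, head_dim).transpose(0, 1)

print(q_heads.shape)   # torch.Size([8, 6, 64]) -> (Qi, Ki, Vi) for each head i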

At an attention head i the following happens:

Attention head i.

This follows the mathematical map of the following equation:

Attention equation: credit to this article
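Written out, the equation in the image above is the standard scaled dot-product attention:

Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

where d_k is the per-head dimension (64 in our running example); dividing by √d_k keeps the dot products from growing too large before the softmax.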

To understand better what these Query, Key and Value matrices are, let’s dive a little deeper intuitively.

Let’s say we have a sentence: “Sukuna is better than Gojo Satoru”.

Now, let’s say our query word is “better”. What we want to do is create an accurate word embedding for this word based on the words before and after it.

So what do I do? I take all the words in the sentence, which are stored in the Keys matrix, and identify which of them are related to the word “better”. In our example, the words “Sukuna” and “Gojo Satoru” are being compared with each other using the word “better”. So we can intuitively say that the vectors for Sukuna and Gojo Satoru should have high correlation with the current word vector “better”.

So we compute the correlations by multiplying the query vector with each of the key vectors (which are rough representations of all the words). This gives us a (count, count) matrix showing the correlation of each word with every other word in the sentence. Here count is the total number of tokens in the sentence.

Now, based on these correlations, we take the Value matrix V and compute an output in which each row is a weighted sum of the value vectors of all the words in the sentence, capturing the semantic meaning of the word in its context.

So the output vector for “better” will be a weighted sum of the value vectors of “Sukuna”, “is”, “better”, “than”, “Gojo”, “Satoru”, with the largest weights attached to “Sukuna”, “Gojo” and “Satoru”.
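Putting that intuition into code, here is a minimal sketch of scaled dot-product attention for a single head; the shapes and random tensors are purely illustrative, not taken from a trained model.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    # (count, d_k) @ (d_k, count) -> (count, count) correlation ("score") matrix
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # softmax over each row turns the scores into weights that sum to 1
    weights = F.softmax(scores, dim=-1)
    # each output row is a weighted sum of the value vectors
    return weights @ v, weights

count, head_dim = 6, 64                      # 6 tokens, 64 dims per head
q = torch.randn(count, head_dim)
k = torch.randn(count, head_dim)
v = torch.randn(count, head_dim)

z, weights = scaled_dot_product_attention(q, k, v)
print(z.shape, weights.shape)                # torch.Size([6, 64]) torch.Size([6, 6])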

  • Step 5: The Zi vectors coming from each attention head are concatenated and then passed through one final layer to produce the final attention-based word vectors.
Multi Head Attention is complete.
  • Step 6: This Z will once again be normalized and positionally encoded. Once that is done, it will go through a feed-forward Neural Network layer. Once it is out of the Feed Forward layer, it will go through a Key layer and a Query layer. These are the outputs of the Transformer Encoder.
Encoder Outputs.
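Gluing the pieces together, here is a rough sketch of one encoder block following the flow above: multi-head attention, an add-and-normalize step, a feed-forward layer, and finally the Query and Key projections that produce Q_enc and K_enc. This is a simplified reading of the diagrams, not a reference implementation; torch.nn.MultiheadAttention handles the QKV layer and the head splitting internally.

import torch
import torch.nn as nn

class MiniEncoderBlock(nn.Module):
    def __init__(self, d_model=512, head_count=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, head_count, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, ff_dim), nn.ReLU(), nn.Linear(ff_dim, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        # final projections giving the encoder outputs described above
        self.q_out = nn.Linear(d_model, d_model)
        self.k_out = nn.Linear(d_model, d_model)

    def forward(self, x):
        # self-attention over the whole input, then add & normalize
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # feed-forward layer, then add & normalize again
        x = self.norm2(x + self.ff(x))
        # project into the Q_enc and K_enc handed to the decoder
        return self.q_out(x), self.k_out(x)

encoder = MiniEncoderBlock()
tokens = torch.randn(1, 6, 512)              # (batch, token count, d_model)
q_enc, k_enc = encoder(tokens)
print(q_enc.shape, k_enc.shape)              # torch.Size([1, 6, 512]) twice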

So far, everything we’ve discussed has been the transformer encoder. The outputs of the encoder are Q_enc and K_enc.

This essentially is at the heart of the transformer architecture. It is the mechanism that allows the transformer architecture to understand language at a deeper level.

The overall transformer encoder looks as follows:

Encoder credit: Here

What you have in your hands right now is called a Transformer Encoder. By using just this half of the transformer, you can easily create a next-word predictor or a sentiment analysis model. But we came here for bigger and better things. We came here for the whole thing: not just the transformer encoder, but also the decoder. The whole nine yards of Neural Machine Translation.

So what happens next? Well, if we take the example of Machine Translation, the next step is to repeat the same few steps for the target translation of the input, but using something called Masked Multi-Head Attention.

But Machine Translation is boring and something everyone else does. Let’s try to be a little different and instead train a chatbot, using both the encoder and the decoder, unlike GPT which is a decoder-only architecture.

The training data for the chatbot will have user queries as the input and the desired chatbot responses as the output.

So something like this:

[
  {
    "query": "What is pi?",
    "response": "Pi is a Greek letter often used in math to represent the most famous irrational number, which goes 3.14159..."
  },
  ....
  ....
  ....
  {
    "query": "What's up bro?",
    "response": "Nothing much! What about you?"
  }
]

Transformer Decoder

In this case, our input, i.e. the user queries, will be encoded by our encoder architecture. A Q_enc and a K_enc will be produced by our encoder and received by the decoder. Why? We’ll look into that soon enough. Just bear with me.

Now! Our responses are going to be generated by the decoder. The issue with that is: earlier, in order to encode our vectors into the most apt representation of each word, we were using the future word context along with the past. We simply cannot afford to do the same with the decoder.

Because if the responses are to be generated by the decoder, it can only have access to the states of the previous tokens, not of anything ahead of them. So, in order to generate the output vector for the current token, we can only do a weighted sum over the previous tokens and the current one. We cannot use future tokens, because this is our output generator.

The idea is simply this:

  • Step 1: We use something called Masked Multi-Head Attention during training so that the target tokens can be processed and evaluated simultaneously.

Masked Multi Head Attention

Masking just involves one little extra step that hides the tokens ahead of the current word in the attention matrix.

To the product Q * K.T we add an extra matrix M, which is filled with zeroes in the lower triangular half and negative infinities in the rest.

This is because when you take the softmax of a single row of the matrix, the softmax of the negative infinities becomes zero, so the weights associated with any future tokens become zero.

From the source : Here

The infinities vanish when you take the softmax of the matrix, and our new Value vector is a weighted sum of only the current and previous tokens.

Masked Multi Head Attention

Here, Mask is the negative-infinity matrix we generated. As you can see, it is added to the attention matrix of associations, and it eliminates the future tokens from the weights generated by the softmax.

Since the softmax assigns 0 weight to the future tokens, they are not used to predict the output.

The masked attention layer, at each head, generates a (count, 64) matrix like before, which is concatenated and used to generate a Value vector instead of a Z vector like in the encoder.
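Here is a small sketch of how the mask can be built and applied; torch.triu builds the strictly upper-triangular part that gets filled with negative infinity, and everything else matches the scaled dot-product attention sketch from earlier.

import torch
import torch.nn.functional as F

def masked_attention(q, k, v):
    count, d_k = q.shape
    # strictly upper-triangular positions correspond to future tokens
    mask = torch.triu(torch.ones(count, count), diagonal=1).bool()
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # put -inf wherever a token would attend to a future token
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)      # future positions get weight 0
    return weights @ v, weights

q = torch.randn(4, 64)
k = torch.randn(4, 64)
v = torch.randn(4, 64)
z, weights = masked_attention(q, k, v)
print(weights)   # lower-triangular: each row only attends to itself and earlier tokens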

So far, this explanation might sound satisfactory to you. But think a level deeper: this explanation only works during the training phase. There, you can easily generate a Query, a Key and a Value, and you can use the Query and the Key from the output tokens because you have the entire output in your hand.

These Values generated from Masked Multi-Head Attention are only computed this way DURING the training phase, where we have the future tokens. During generation, the previously generated tokens themselves are sequentially provided as inputs to the layers ahead of the masked multi-head attention.

The values generated after going through the concatenation phase.

Multi Head Attention with masked Z’s from masked attention heads. Here Z is actually going to be used as Vector V for the next layer.

But in the inference phase, the outputs have to be generated sequentially: using the previous tokens, you must predict the next one. We CANNOT use the Query and Key from the output tokens, because that would mean using information that we won’t actually have during inference.

While producing an output, we will be provided with an <SOS> token, which is going to generate a certain output BASED on the K_enc and Q_enc vectors.

This output will then have to be used along with the <SOS> token to produce the next token, and so on.

So the Qi and Ki in this multi-head attention will have to come from the Q_enc and K_enc produced by the encoder architecture.

This will ensure that the previous inputs are also used in generation of the output text.

The Q_enc and K_enc are going to be used along with the Value vectors, which are either generated (during inference) or taken from the output (during training). During training, all the vectors go in all at once as a single matrix.

Whereas during generation, the next words will be generated sequentially.

Training Diagram

Multiple Attention Heads taking split Q_enc and K_enc from the encoder, and Vi generated from Masked Multi head attention and concatenation procedure.
The Value vector is generated in one of two ways: during training, all at once as a matrix; during generation, sequentially using the embedding layer.

The obtained V is then transformed using the normal multi head attention mechanism as follows.

Multi Head Attention With Encoder outputs.

The output Z’s are then concatenated once more to produce logits, which are used to predict the next word based on the previous tokens.

Using previous vectors to predict next words.

If the procedure is in the training phase, then all the words are sent at once as a matrix and the entire procedure happens in one pass, giving us an output of shape (count, vocabulary size), with each token predicted from the previous ones.

Because we pass the entire target output from the data as a matrix into the masked layer during training, we predict the next tokens not based on the tokens previously predicted by our transformer, but on the correct tokens that should have been predicted. This is known as teacher forcing; it results in better weight updates and prevents the transformer from learning incorrect repeated patterns.
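As a sketch of what that looks like during training (the token ids and logits here are hypothetical placeholders standing in for a real tokenizer and a real forward pass), the decoder input is the target sequence shifted right, and the loss compares every predicted position against the true next token in one parallel pass:

import torch
import torch.nn.functional as F

# hypothetical target response, already tokenized: <SOS> Nothing much ! <EOS>
target_ids = torch.tensor([[1, 57, 243, 19, 2]])      # (batch=1, count=5)

decoder_input = target_ids[:, :-1]   # <SOS> Nothing much !   -> fed through the masked layers
labels = target_ids[:, 1:]           # Nothing much ! <EOS>   -> what each position must predict

# logits would come from the full encoder-decoder forward pass;
# here we fake them with random numbers just to show the shapes
vocab_size = 1000
logits = torch.randn(1, decoder_input.size(1), vocab_size)   # (batch, count, vocab_size)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
print(loss)   # one scalar loss over all positions, computed in a single parallel pass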

During the generation phase, tokens are converted into word embeddings and then used directly to produce Value vectors, which, together with the K_enc and Q_enc from the encoder, produce the next token. Then, in the next iteration, the currently predicted token and the previous tokens together are sent as inputs to produce the next token.

During testing, the words go in sequentially, using the previous tokens to predict the next one, taking input context from the encoder and predicting the next word based on the previous words plus the input context.
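A minimal sketch of that generation loop might look like this; decoder_step is a hypothetical function standing in for the whole decoder forward pass (masked attention, attention over the encoder outputs, and the final logits layer), so here it just returns random logits to keep the sketch runnable.

import torch

SOS, EOS = 1, 2            # hypothetical special token ids
max_new_tokens = 50

def decoder_step(generated_ids, q_enc, k_enc):
    # placeholder: in the real model this runs the decoder and returns
    # logits over the vocabulary for the last position only
    return torch.randn(1000)

def generate(q_enc, k_enc):
    generated = [SOS]
    for _ in range(max_new_tokens):
        logits = decoder_step(torch.tensor(generated), q_enc, k_enc)
        next_token = int(torch.argmax(logits))   # greedy pick of the most likely next token
        generated.append(next_token)
        if next_token == EOS:                    # stop once the model emits end-of-sequence
            break
    return generated

# q_enc / k_enc would come from running the user query through the encoder
print(generate(q_enc=None, k_enc=None)[:10])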

At the end of the day, this is what the entire pipeline looks like:

Entire training or Testing cycle in Transformer Decoder architecture.

As you can see, the output during generation is fed back to the input layer. During training, however, the entire sentence goes in all at once, and each row of the matrix represents the current word together with the words before it, from which the next word is supposed to be predicted.

Transformer Pros

  • One of the pros of the transformer architecture is that, because the attention heads are split, the computations can be run in parallel. The issue of optimizing GPU usage for faster computations with LSTM is resolved.
  • Although the transformer decoder is sequential during text generation, predicting the next word based on the input context, it is not sequential during training. The entire input and output go in all at once during training, and the weight updates happen as soon as the inputs and outputs are processed and the results are produced.
  • The attention mechanism ensures that word correlations are well established. Because of this, not only are the embeddings well tuned to the meaning of a specific word in a certain context, but because the transformer looks at the entire input AT ONCE, it also doesn’t have the issue of the effect of a certain word diminishing over “time”.

And that is what makes up a transformer architecture.

Overall Transformer Architecture

There it is. The entire transformer architecture, broken down and explained. You might not have completely captured what exactly is going on on the first try, and you might be a little confused at this point, if this is your first time reading an article about the transformer architecture.

Don’t worry, that’s normal and happens to everyone. I was stuck on understanding this architecture for over a week. It certainly isn’t something you can grasp by reading it all at once. I’ve tried my best to express the entire flow in one go while covering the reasoning behind why it is arranged the way it is. In the next chapter we’ll look at a few implementations of the transformer architecture like BERT, Mixtral, LLaMA etc. and how they made it slightly better.

One last piece of information I’d like to put out: transformer encoders and decoders can be stacked on top of each other, so there doesn’t necessarily have to be one single encoder or decoder. The Q_enc and K_enc from the final encoder are sent as input to every single decoder in the decoder stack.
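For reference, PyTorch ships ready-made stackable layers for this; a rough sketch of a 6-layer encoder stack looks like the following, where the final encoder’s output (usually called the memory) is what every decoder layer consumes.

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder_stack = nn.TransformerEncoder(encoder_layer, num_layers=6)   # 6 encoders stacked

x = torch.randn(1, 6, 512)            # (batch, token count, d_model)
memory = encoder_stack(x)             # output of the final encoder, passed to every decoder
print(memory.shape)                   # torch.Size([1, 6, 512])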

In addition to that, you can also have standalone architectures built from transformers, like BERT, which is an encoder-only architecture, or GPT, which is a decoder-only architecture.

See you in the next article! Toodles!
