LLM Reading List: Understanding “Attention Is All You Need” Part II

Malachy Moran
12 min read · Apr 20, 2023


Photo by Matias North on Unsplash

I recently began a project with a simple question:

Can ChatGPT explain itself?

Each week we’ll be analyzing one of 5 important papers that ChatGPT recommended to explain itself. We’re starting off with the foundational paper Attention Is All You Need.

This article is the second part of a two part dive into the topic. The first article covered why we should care about this paper, and what the state of the art was when it was published. This article will go into the contents of the paper itself. I highly recommend you go read the first part, as it really helps to understand the subjects we’ll be talking about here. If you already have a decent understanding of the subject matter it’s not required reading, but I do think it’s fascinating.

If you haven’t yet, make sure you subscribe so you can get notified when each article comes out! That being said, let’s dive right in.

Critique of Previous Work

To the authors of the Attention paper (all of whom, I should mention, worked for either Google Research or the famous Google Brain Team) there were two critical problems with Natural Language Processing (NLP) models as they existed at the time:

  1. Sequential models such as the Recurrent Neural Network cannot be parallelized (spread across multiple processors). If you’re going to read each word in a sentence in order, you can’t hand some of the words off to another processor. As the length of the desired sequence grows (a page, or even a book), you also run into memory limits.
  2. Convolutional Neural Networks, while they are a bit more parallelizable, have difficulty relating words that are far apart. The complexity of the computations grows as the distance between two words increases.

What was needed was an approach that did not require an ordered sequence, and required the same computational complexity to find relationships between words, regardless of their distance apart.

Key Concepts

The Transformer

As was mentioned in the previous article, the primary contribution of the Attention paper is a model structure or “architecture” called the transformer, and I’m sure that at this point you are anxious to actually see the thing we’re going to be discussing. I’ve got a picture of it below, which you can take a look at as soon as you promise not to run screaming. I swear that by the time we’re done with this article you’ll understand what it does.

Do you promise?

Ok. Here:

For the moment, you can ignore what each of the boxes says and just look at the overall structure. If you squint and turn your head sideways, it kind of looks like that RNN model we discussed in the last article, doesn’t it?

The encoder/decoder from “Learning Phrase Representations using RNN Encoder — Decoder for Statistical Machine Translation”

The Transformer actually does have a bit in common with it. In both instances, we have two parallel processes running simultaneously. They even have the same names: the stack on the left-hand side of the Transformer is called the Encoder, and the right-hand side is the Decoder.

The major difference between the Transformer and the RNN is that the Transformer is not sequential, at least not in the traditional sense. Instead of giving it one word at a time in order, we process the whole sequence in parallel and try to learn the relationships between words in another way (attention, which we will cover in a bit).

We can actually demystify some of the layers immediately as well. This first step on the left simply means that we take the incoming word and transform it into its embedding, a vector of numbers representing the word, a process we mentioned in the previous article.
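
If you like seeing things in code, here is a minimal PyTorch sketch of that embedding lookup (the vocabulary size and token ids are made-up placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512   # the paper uses embedding vectors of size d_model = 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[17, 42]])   # a two-word input sentence as made-up token ids, shape (1, 2)
word_vectors = embedding(token_ids)    # shape (1, 2, 512): one vector per word
```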

Any time we see the term feed forward, this is simply a standard neural network layer. It just means “pass the input through some neurons”.

Add and Norm means exactly what it says. We add together all the things coming into this box and then normalize them. A normalization function is something that helps to make data more consistent and comparable. In this case they were using a technique called layer normalization.

You’ll also notice that sometimes the arrows seem to shortcut some of the boxes, such as below:

This means exactly what it looks like. We take the input, pass it through the feed forward layer, and then add the result back to the original input. If you like mathematical representations better, the shortcut arrow means output = x + FeedForward(x), and the Add & Norm box then applies LayerNorm(x + FeedForward(x)).
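
In PyTorch terms (a rough sketch using the layer sizes from the paper, not the authors’ code), the feed forward box, the shortcut arrow, and Add & Norm combine like this:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048   # sizes used in the paper

# "Feed forward" is just two ordinary linear layers with a ReLU in between.
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

# The shortcut arrow plus "Add & Norm": add the layer's output back to its
# input, then apply layer normalization.
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(1, 2, d_model)            # a stand-in input
out = layer_norm(x + feed_forward(x))     # LayerNorm(x + FeedForward(x))
```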

Moving to the right hand (decoder) side we begin by taking the word in the output that comes just before the one we are predicting and converting that to its embedding.

So if we are trying to translate the Icelandic sentence “Góðan daginn” (“Good day”) into English, and we are currently working on the word “daginn”, then this is the embedding for the English translation of the previous word “Góðan”, which is just “good.”

Finally, most complicated neural networks (for example, an image classification model) end with a final step where we combine the outputs together in a linear layer (which is the standard neural network layer, much like feed forward), and apply a final transformation to get them into the format we want. In this case we use softmax to turn a vector of numbers into a vector of probabilities, probabilities that describe which word is most likely to come next.
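
As a small sketch (assuming a made-up vocabulary of 10,000 words), that final linear-plus-softmax step looks something like this:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10_000   # vocab size is an illustrative placeholder

# Project the decoder output onto the vocabulary, then softmax the scores
# into probabilities over "which word comes next".
to_vocab = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(1, d_model)        # stand-in for one output position
scores = to_vocab(decoder_output)               # one raw score per vocabulary word
probabilities = torch.softmax(scores, dim=-1)   # all positive, sums to 1
next_word_id = probabilities.argmax(dim=-1)     # the most likely next word
```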

This leaves us with only two pieces we don’t understand. First is the positional encoding. For now we can think of that as simply adding in information about the relative location of the word.

The second is the Multi-Head Attention and its “Masked” variation, which we’ll tackle right after positional encoding.

Armed with all this information, let’s take a look at the original image again and see if we can explain it with words.

Hopefully less scary now

So on the left hand (encoder) side: First we take the input word, transform it into its embedding, and add some information about its position. From there we pass it through the attention process, add the result back to the original input (thanks to our shortcut arrow), and normalize it. We then pass that through a normal feed forward neural network, add the result to its unaltered input, and normalize again.

On the right hand (decoder) side we will take the previous word we predicted, transform it into its embedding, and add some information about its position. We pass it through attention, adding the original input and normalizing. Now we combine that with our encoder side, pass everything through attention again, adding the original input and normalizing again. Finally we have one more feed forward/normalization layer for everything, before we transform it into our vector of word probabilities.
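
If it helps to see that walkthrough condensed into code, here is a compact sketch of one encoder layer and one decoder layer in PyTorch. It mirrors the description above (d_model = 512 and 8 attention heads, as in the paper), but it is illustrative, not the authors’ implementation; a real Transformer stacks six of each, every layer with its own weights.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048   # sizes used in the paper

# One attention block, one cross-attention block, one feed forward block and one
# normalization, reused here for brevity. (A real model gives each layer its own weights.)
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
feed_forward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm = nn.LayerNorm(d_model)

def encoder_layer(x):
    # attention + shortcut + Add & Norm, then feed forward + shortcut + Add & Norm
    x = norm(x + self_attn(x, x, x, need_weights=False)[0])
    x = norm(x + feed_forward(x))
    return x

def decoder_layer(y, enc_out, causal_mask):
    # masked self-attention over the words produced so far
    y = norm(y + self_attn(y, y, y, attn_mask=causal_mask, need_weights=False)[0])
    # combine with the encoder side: queries come from the decoder,
    # keys and values come from the encoder output
    y = norm(y + cross_attn(y, enc_out, enc_out, need_weights=False)[0])
    y = norm(y + feed_forward(y))
    return y

src = torch.randn(1, 2, d_model)    # e.g. embeddings for a two-word input sentence
tgt = torch.randn(1, 2, d_model)    # embeddings for the words translated so far
mask = torch.triu(torch.ones(2, 2, dtype=torch.bool), diagonal=1)   # hide future words
out = decoder_layer(tgt, encoder_layer(src), mask)   # shape: (1, 2, 512)
```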

It’s long, but no individual piece is all that tough to understand. The only thing we have left to do is explain what positional encoding and attention actually mean.

Positional Encoding

The first thing we will tackle is the positional encoding, because it is a little easier to understand. Since our transformer is not sequential, we need to provide it some information about the relative position of the input word in the sequence. Is it the 5th word in the book or the 555th? We can accomplish this with waves.

We are going to take advantage of two things

  1. Waves repeat
  2. Waves with different frequencies repeat at different rates

If we look at the two waves below, starting at position (0,0) in the center of the images, we can see that as we move to the right they oscillate between 1 and -1 at different rates.

By Edward Ball. Screenshots of free software available under the GNU General Public License: https://academo.org/demos/virtual-oscilloscope/

The bottom wave is about twice as fast as the top one. We can use this to count. If we move two ticks to the right (so x = 2), we see the top wave is at about y = 0.2 and the bottom wave is at about y = 0.4. If we move right to x = 4, the top wave is at about y = 0.4, while the bottom wave has peaked and come back down to about y = 0.2.

If we wanted to transform this into a positional encoding, we could simply record the y value of each wave at a given value of x. Position x = 2 becomes [0.2, 0.4] and position x = 4 becomes [0.4, 0.2].

The advantage of this method is that it gives us more flexibility. If we just give the absolute positions of the words (1st, 5th, 555th), we are forced into a narrow set of relationships, such as a linear equation like 2 * position + 5. Even slightly more complex relationships like log(position) are still quite limiting. Using multiple waves that don’t all align, we can allow the importance of position to vary in a non-linear way.
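
The paper does exactly this with sine and cosine waves whose frequencies get slower and slower as you move along the embedding dimensions. A short sketch of that formula in PyTorch (the printed example is just for illustration):

```python
import torch

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encoding from the paper: even dimensions use sin,
    odd dimensions use cos, and each pair of dimensions gets its own frequency."""
    position = torch.arange(seq_len).unsqueeze(1)               # (seq_len, 1)
    freq = 10_000 ** (torch.arange(0, d_model, 2) / d_model)    # slower and slower waves
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / freq)
    pe[:, 1::2] = torch.cos(position / freq)
    return pe

# Each position gets its own pattern of wave heights, which is simply added
# to the word embeddings before the first attention layer.
pe = positional_encoding(seq_len=10)
print(pe[2, :4])   # the encoding of the 3rd word, first four dimensions
```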

(Self) Attention

The concept of attention in language modeling is meant to address the problem of relating different words to each other. It is more correct to refer to what is happening in the Attention Is All You Need paper as self-attention. “Self” in this case simply means that we apply the process of attention to only one sequence at a time. A sequence could be a sentence, a paragraph, or a whole book. If our sequence is a sentence, we only look at one sentence at a time; if it’s a book, we don’t consider other books.

Photo by Ryunosuke Kikuno on Unsplash

The question we are trying to address is: How do we know which input in a sequence is the most important to the current output step?

If we consider the task of moving between languages, we do not always get a one-word-to-one-word translation; it may take multiple words to express a single concept, or we may be able to express a complicated sentence in a single word. The Finnish language is a great example of this.

Maalaavatko — Do they paint?

We also have to deal with the fact that the order of words can change. Take the following sentences in English vs. Norwegian. Norwegian expresses a question by swapping the position of the actor and the action. The first sentence is fairly easy to translate step by step. Du is “you” and leser is “are reading,” but what about the second sentence?

Du leser — you are reading
Leser du? — are you reading?

What we need is a way to tell the model which words in the input are important to the current output step, and how important they are. Described in words, we need to:

  1. Determine which words in the input are the most important to our current output step
  2. Score them based on that importance
  3. Incorporate some information from the important words into our model.

The Attention is All You Need paper handles this process with matrices. Specifically we have three matrices. Each is learned when the model is trained, so we don’t have to know the answers in advance. The matrices are the query, key, and value matrices, and they all take the same format. Every word in the current sequence is represented with a vector of numbers. These vectors can be different for the same word in each of the matrices. The query, key, and value differ primarily in their use.

  1. Query: This is the matrix used to ask a question. Basically, if the word we are interested in translating is “leser”, we look up the entry in the query matrix that represents “leser”.
  2. Key: This matrix is used to answer the question. If the word we are trying to translate is “leser”, we compare its query against every entry in the key matrix (including the one for “leser” itself) to see how relevant each word is.
  3. Value: This is the information we send forward if a word is selected as important. If we decide that “du” is important for the word “leser” then we would send the entry for “du” in the value matrix forward.

Because this is a computer, we have to express each of these steps mathematically. As you can see in the image below on the left, this is done with matrix multiplication. We take the query for the current step and multiply it by the key matrix. We then scale the result so it falls in a reasonable range, and if necessary we apply a mask, which you may have noticed in the above transformer diagram. The mask simply hides the outcomes for words we haven’t seen yet (it keeps the model from “cheating” by looking ahead to information it doesn’t have yet).

We apply a softmax function to get a matrix of probabilities representing how likely it is that each word is important, and multiply these probabilities against the values matrix to get our final result.
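
Written as code, that whole sequence (multiply, scale, mask, softmax, weight the values) looks roughly like the sketch below. The random weight matrices are stand-ins for the learned query, key, and value weights, not the paper’s trained parameters.

```python
import math
import torch

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: compare queries with keys, scale,
    optionally mask future words, softmax, then blend the values."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # query-key similarity
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))       # hide words we can't see yet
    weights = torch.softmax(scores, dim=-1)                    # importance as probabilities
    return weights @ V                                         # weighted mix of the values

# Q, K and V all come from the same sequence, multiplied by three learned
# weight matrices (random stand-ins here).
seq_len, d_model = 2, 512                       # e.g. "Leser du?"
x = torch.randn(seq_len, d_model)               # word embeddings + positional encoding
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
output = attention(x @ W_q, x @ W_k, x @ W_v)   # shape: (2, 512)
```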

The image on the right shows what the paper calls multi-head attention, which is the same process we just described, but with a slight twist. Instead of performing attention just once, we apply it multiple times, each time on a slightly different version of the query, key, and value matrices, which have been altered by passing them through a single neural network layer first.
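
Reusing the attention function from the sketch above, a rough sketch of multi-head attention might look like this: eight heads, each with its own small linear projection of the queries, keys, and values, concatenated and projected back down.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_head = d_model // n_heads   # 64 in the paper

# One set of linear projections per head, plus a final layer to combine them.
proj_q = nn.ModuleList([nn.Linear(d_model, d_head) for _ in range(n_heads)])
proj_k = nn.ModuleList([nn.Linear(d_model, d_head) for _ in range(n_heads)])
proj_v = nn.ModuleList([nn.Linear(d_model, d_head) for _ in range(n_heads)])
combine = nn.Linear(d_model, d_model)

def multi_head_attention(x):
    # run attention once per head, each on its own projected Q, K and V
    heads = [attention(q(x), k(x), v(x)) for q, k, v in zip(proj_q, proj_k, proj_v)]
    return combine(torch.cat(heads, dim=-1))   # concatenate and mix back to d_model

out = multi_head_attention(torch.randn(2, d_model))   # shape: (2, 512)
```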

Advantages of Self Attention

It’s important to remember that the Google team did not simply set out to create something new. In order for the transformer to be useful, it needs to solve problems with the existing methods. Self-Attention makes three specific improvements over RNNs and CNNs:

  1. The individual mathematical operations involved in the attention process are less complex than those in RNNs and CNNs, making it faster.
  2. Because it is not sequential, it can easily be processed in parallel across multiple machines. We don’t have to wait for the previous word to be processed in order to work on the current word.
  3. Since the attention mechanism uses the query, key and value matrices, it’s easier to learn the relationship between distant words. For an RNN to learn the relationship between words that are 5 positions apart, it must process the words in between. CNNs suffer the same problem (albeit in a different way). Attention can compare them directly by multiplying the query and key matrices.

The end result is that Transformers are not just faster or more accurate; they are both.

Results and Contributions

When tested against state-of-the-art CNN and RNN models, there was no contest. For both English-German and English-French translation tests, early Transformers were already more accurate than the best models seen previously, while training faster and at roughly a quarter of the cost.

This flung open the doors of possibility. Because they were parallelizable, you could now solve training bottlenecks by adding more computing power. The reduced cost and increased speed made it possible to consider training truly massive models for the first time. The first step had been taken towards a Large Language Model.

Conclusion

Wow that was a long journey. If you stuck with me through the whole thing then I commend you. Together we have learned a bit about:

  1. Why we should care about Attention is All You Need
  2. The history of NLP
  3. RNNs and CNNs
  4. Transformers
  5. Positional Encoding
  6. Self-Attention

We now have a mastery of Transformers, which are the “T” in GPT. The next paper on our list is “Improving Language Understanding by Generative Pre-Training” by Radford et al. (2018), which will explain the “GP.”

Did this piece help you understand Attention? Let me know in the comments. Make sure to subscribe to receive an update when the next piece comes out!

The Author

With a Bachelor’s in Statistics and a Master’s in Data Science from the University of California, Berkeley, Malachy is an expert on topics ranging from significance testing, to building custom Deep Learning models in PyTorch, to how you can actually use Machine Learning in your day-to-day life or business.

References

Alammar, J. (n.d.). Visualizing a neural machine translation model (mechanics of seq2seq models with attention). Retrieved April 18, 2023, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
