LLM Reading List: Understanding “Attention Is All You Need” Part I

Malachy Moran
11 min read · Apr 20, 2023


Photo by Robina Weermeijer on Unsplash

I recently began a project with a simple question:

Can ChatGPT explain itself?

I asked it to tell me what papers I should read in order to understand how Large Language Models (LLMs), specifically GPT, worked. In return I received a list of 5 important papers. Each week we’ll be reading one. I’ll synthesize the key points of the paper, and try to explain it in a way that’s decently understandable. Don’t forget to subscribe so you can follow along!

This series is aimed at people who are familiar with Data Science and AI concepts, but not necessarily experts or researchers. The only assumption I will make is that you understand the basic concept of a neural network. If that’s not you, then I recommend reading just a little on the topic first.

I’ll keep code in this series to a bare minimum, just a few small illustrative sketches, and use as few equations as I can manage. The focus is a discussion of the paper’s concepts and what we can learn from them. I highly encourage you to read the papers yourselves as well.

We’ll be kicking off the series with the foundational paper that revolutionized not just language models, but computer vision as well:

Attention Is All You Need

This article will be the longest of the series because we have a lot of ground to cover to set the stage for our future exploration. I’ve decided to split the article up into two for ease of reading. Part one will be on why we should read this paper, along with the necessary background. Part two will be the actual contents of the paper.

Introduction and Significance

The major contribution of Attention, at its most basic, was the proposal of a new kind of structure (usually referred to as the “model architecture”) for language models. This new architecture, called a Transformer, solved some of the biggest problems with existing models by eliminating much of the complexity involved in what was, at that time, the state of the art. The one mechanism they kept was called attention, hence the title Attention Is All You Need.

Before we go into the details though, it’s worth explaining why we should care.

It’s difficult to truly comprehend how important this paper has been to the fields of Artificial Intelligence and Natural Language Processing (NLP), but it helps to start with some numbers.

  • According to the NASA-ADS paper repository, there are 2,569 papers that cite Attention Is All You Need.
  • Semantic Scholar has the paper with 11,883 “Highly Influential Citations” (where the cited paper is a major influence on the work it’s cited by).
  • Google Scholar says it’s been cited 71,465 times.

For reference, the median number of citations for a research paper is four.

If, on the other hand, you’re more interested in applications and performance, you can take a look at the current state of the art for NLP benchmarking tasks. At the top of almost every list is a model that is built on, or evolved from, Transformers.

Most of the top 10 models on these leaderboards are typically built on the Transformer architecture.

So clearly it works and people think it’s an important concept, but… what is a Transformer actually?

Background

Before jumping into the Transformer specifically, we’ll need just a little bit of understanding about what came before. The very first stage of NLP research was in logic and rules-based programming. The reasoning went: language is a logical process, so we should be able to explain to a computer how it works. Unfortunately, this approach hit a wall fairly quickly. Consider a famous sentence constructed by the renowned linguist Noam Chomsky:

Colorless green ideas sleep furiously.

You and I both understand that this sentence is nonsense, even though it is grammatically correct. It’s nonsense because of the interactions between specific words. Things that are colorless cannot be green, and ideas don’t sleep at all, let alone furiously. But there are thousands of words in the English language. Explaining how every single word interacts with every other word in every context and combination is a nigh impossible task. Expanding this process to include every known language borders on insanity.

Photo by Andisheh A on Unsplash

So the focus shifted. Perhaps instead of trying to understand the entirety of human language, the problem could be treated as probabilistic. Given a set of preceding words or a specific sentence structure, try to predict the next word, or fill in a gap in a sentence. This could actually be done fairly well; all you needed was enough text to build a huge probability table of what words tended to follow other words.
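To make that concrete, here is a toy sketch (not from any paper, just an illustration) of such a table in Python. The miniature corpus is made up for the example; a real table would be built from enormous amounts of text.

```python
from collections import Counter, defaultdict

# A tiny "corpus" standing in for the huge piles of text this really needs.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word tends to follow each word (a bigram table).
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

# Predict the next word as the most common follower of "the".
print(follows["the"].most_common(1))  # [('cat', 2)]
```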

If you wanted to read and understand text, for example to answer questions based on the contents of a paragraph or summarize a book, you needed something more complex than a probability table. Something that captures not only how often a word occurs, but its meaning in the context of the words around it. We could spend a lot of time delving into how difficult it is to represent words to a computer, which only understands numbers. We won’t do that, but I think this MIT Technology Review article from 2016 explains it nicely. What we are left with in the end is a three-step process:

  1. Somehow translate a word, set of words, paragraph etc. into a vector of numbers that capture the meaning, context and part of speech (called an embedding).
  2. Do math on those numbers.
  3. Translate the numbers back into words (a toy sketch of this pipeline follows below).
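Here is a minimal sketch of that three-step pipeline, assuming a made-up five-word vocabulary and random embeddings. The “do math” step is just an average, a stand-in for whatever model sits in the middle; nothing here is learned or trained.

```python
import numpy as np

# A toy vocabulary and a random embedding matrix (real models learn these).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))  # one 8-dim vector per word

# 1. Translate words into vectors of numbers.
sentence = ["the", "cat", "sat"]
vectors = embeddings[[word_to_id[w] for w in sentence]]

# 2. Do math on those numbers (here, just average them as a stand-in).
sentence_vector = vectors.mean(axis=0)

# 3. Translate the numbers back into a word: pick the vocabulary entry
#    whose embedding best matches the result (highest dot product).
scores = embeddings @ sentence_vector
print("closest word:", vocab[int(np.argmax(scores))])
```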

At the time of the Attention paper, there were two leading approaches, Recurrent Neural Networks and Convolutional Neural Networks.

Recurrent Neural Networks

The idea of a Recurrent Neural Network (RNN) is not new. The first paper on the idea was published in 1986, and one of its major developments, the Long Short-Term Memory (LSTM) network, dates from 1997. The concept is pretty simple to explain: an RNN is any network that has a “memory” of the earlier examples in a sequence it is processing. The way it processes the example it sees at time t is directly affected by the examples it saw at the previous times t-1, t-2, t-3…

This makes it different from other neural networks because now order matters. RNNs are very useful for time-based applications. Take the stock market for example. There are certainly a lot of different factors that might influence how a stock performs today, but one of the most important is going to be “how has the stock been doing in the past?”

Photo by Maxim Hopman on Unsplash

This makes intuitive sense when we shift to language. Let’s say we want to translate from one language to another. How I translate a specific word is going to depend on the words that came before.

For example, I speak decent Norwegian. If I encounter the word “gift”, how I translate it is going to depend pretty heavily on the words that precede it, because in Norwegian “gift” can mean either “married” or “poison”…

Some folks might say that these are the same thing

The way that this memory is accomplished is by using something called a hidden state, usually denoted as “h”. This is nothing more complicated than a vector (or matrix) of numbers that summarizes the things we’ve seen so far in the sequence. So the process goes:

  1. Feed the RNN the current word we want to translate, along with the current hidden state h
  2. Translate the word, and update the hidden state with information about the outcome
  3. Take the new hidden state and use it to inform translation of the next word.

By fdeloche — Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=60109157

The end result of this is that when we reach time t, the output we receive is influenced by the preceding words. Especially the most recent word.
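Here is a rough sketch of that loop for a plain RNN cell. The randomly initialized weights stand in for what a real network would learn, and the specific sizes and the tanh update are just illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 16, 8

# Randomly initialized weights stand in for learned parameters.
W_x = rng.normal(size=(hidden_size, embed_size))   # input -> hidden
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One step: mix the current word vector with the memory of past words."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a sequence of (fake) word embeddings one at a time, in order.
sequence = rng.normal(size=(5, embed_size))
h = np.zeros(hidden_size)          # empty memory before the first word
for x_t in sequence:
    h = rnn_step(x_t, h)           # h now summarizes everything seen so far
```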

When Attention was published, the most advanced version of this technique was called an Encoder/Decoder model, and it used an RNN not only to translate (decode) a word, but also to learn how we could best represent (encode) the words we want to translate.

Let’s consider an English to French sentence translation as an example. First the Encoder moves through the English sentence. As it sees each word it updates its hidden state to reflect what English words it has seen. After it has viewed the entire English sentence, it passes its hidden state over to the Decoder. The Decoder then uses this information to produce the first French word and updates h. It uses h and the previous French word to produce the next French word until it has produced the entire French sentence.

The encoder/decoder from “Learning Phrase Representations using RNN Encoder — Decoder for Statistical Machine Translation”
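To make the flow concrete, here is a bare-bones sketch of the encode-then-decode loop. Every weight matrix is a random placeholder, the “words” are fake embedding vectors, and the output step is a crude argmax; a real RNN encoder-decoder (such as the gated model in Cho et al., 2014) learns all of these pieces. Only the control flow is the point: the encoder updates h word by word, then hands it to the decoder, which emits one word at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size, french_vocab_size = 16, 8, 1000

# Placeholder parameters; a real encoder-decoder learns all of these.
enc_Wx = rng.normal(size=(hidden_size, embed_size))
enc_Wh = rng.normal(size=(hidden_size, hidden_size))
dec_Wx = rng.normal(size=(hidden_size, embed_size))
dec_Wh = rng.normal(size=(hidden_size, hidden_size))
out_W = rng.normal(size=(french_vocab_size, hidden_size))
french_embeddings = rng.normal(size=(french_vocab_size, embed_size))

# Encoder: read the whole English sentence, updating h word by word.
english_vectors = rng.normal(size=(6, embed_size))  # fake English embeddings
h = np.zeros(hidden_size)
for x in english_vectors:
    h = np.tanh(enc_Wx @ x + enc_Wh @ h)

# Decoder: start from the encoder's final h and emit French words one by one.
prev_word = np.zeros(embed_size)   # stand-in for a "start of sentence" token
for _ in range(10):                # cap the output length for this toy example
    h = np.tanh(dec_Wx @ prev_word + dec_Wh @ h)
    next_id = int(np.argmax(out_W @ h))   # greedily pick the likeliest word
    prev_word = french_embeddings[next_id]
```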

The results of this were huge. Models based on RNN Encoder-Decoders were immediately state-of-the-art for language translation tasks. They were able to more fluidly and contextually perform sentence to sentence translation, and were more accurate in testing.

RNNs had some major limitations, though. Because they rely on processing a sequence in order, they cannot be parallelized effectively: the workload can’t easily be spread out over multiple computers, since you must process all the words in a sequence one after another. RNNs also struggle to relate distant words, since the most important influence on h is going to be the word you just saw. The next approach addressed these issues slightly better.

Convolutional Neural Networks (CNNs)

The CNN approach simply took a method that already worked well for images and applied it to text.

The way that computers “see” an image is as a matrix of numbers where each entry in the matrix is the value for one pixel. For black and white images, this is a single 2D array. For color images it is a stack of three 2D arrays, one each for the red, green, and blue channels of the pixels.

We then scan across the image looking for specific features. The object that does this is called a “filter.” A very simple example of this might be a filter that looks for curves. We create “maps” of where we find each separate feature, turning one image into many images (this process is called convolution). We can then reduce the size of each new image by taking some function like the average or the maximum and using it to “pool” the values of multiple pixels together.

If you’re wondering how we come up with the features, we don’t need to; they are learned by the model during training. After repeating this convolve-and-pool step several times, we pass the result through a traditional flat neural network to output our desired result.
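Here is a tiny sketch of one convolve-and-pool round on a made-up 6x6 grayscale “image,” with a single random 3x3 filter standing in for the learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((6, 6))        # a tiny 6x6 grayscale "image"
kernel = rng.normal(size=(3, 3))  # one 3x3 filter (learned in a real CNN)

# Convolution: slide the filter over the image, recording how strongly
# each 3x3 patch matches it. The result is a 4x4 "feature map".
feature_map = np.array([
    [np.sum(image[i:i+3, j:j+3] * kernel) for j in range(4)]
    for i in range(4)
])

# Pooling: shrink the map by keeping only the max of each 2x2 block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled.shape)  # (2, 2)
```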

If this sounds complicated, I find it easier to imagine as a physical process. What we begin with, in the case of a color image, is a 3D shape. It has a height, a width, and three layers of color. Unfortunately a traditional neural network only takes in a 1D input: a vector with a length. So in order to fit our image into this network we need to change its shape. We do this in two steps. First we pull it out to make it longer (convolution) and then we roll it thinner (pooling).

What I’m saying is that we are making a Play-Doh snake. Check out the image below and tell me I am wrong.

https://www.researchgate.net/figure/The-overall-architecture-of-the-Convolutional-Neural-Network-CNN-includes-an-input_fig4_331540139
By Larry D. Moore, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1022662

We can do the same thing with text if we simply restructure the problem. Instead of an image, we create a 2D matrix of a sentence where each row is the embedding for a word in that sentence. This sentence “picture” can now be convolved and pooled the same way we would with an image.

Image of how a Sentence CNN works from “Convolutional Neural Networks for Sentence Classification” (Kim, 2014).

If you have sentences of different lengths, this can be handled using a final layer of pooling, since taking the average or max of a set of numbers always returns one number. The embeddings for each word can either be learned during the training process, or you can start off with embeddings someone else has created (such as the popular word2vec method).
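Here is a small sketch of that idea, assuming random filters and random word embeddings: each filter scans windows of three consecutive words, and max-pooling over positions collapses sentences of any length down to one number per filter.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_size, window, num_filters = 8, 3, 4

# Filters that each look at a window of 3 consecutive word embeddings.
filters = rng.normal(size=(num_filters, window, embed_size))

def sentence_features(sentence_matrix):
    """Convolve over word windows, then max-pool over positions.

    Max-pooling returns one number per filter no matter how many words
    the sentence has, reducing variable lengths to a fixed-size vector.
    """
    n_words = sentence_matrix.shape[0]
    maps = np.array([
        [np.sum(sentence_matrix[i:i+window] * f) for i in range(n_words - window + 1)]
        for f in filters
    ])
    return maps.max(axis=1)  # one value per filter

short = rng.normal(size=(5, embed_size))    # a 5-word "sentence"
longer = rng.normal(size=(12, embed_size))  # a 12-word "sentence"
print(sentence_features(short).shape, sentence_features(longer).shape)  # (4,) (4,)
```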

The results? Pretty good! At the time of publication, the CNN method beat the state of the art at tasks like classifying one-sentence movie reviews or opinions as positive/negative.

Unlike RNNs, a CNN can pay equal attention to all the words in a sentence, and handles all the words in a sequence simultaneously, so it can be parallelized. There is unfortunately a pretty big downside, and if you’ve been paying extra-close attention you may have noticed it. The complexity of this approach still grows as you try to consider more and more words. Each word you wish to consider adds to the size of the “image,” and the farther apart two words are, the more difficult it is to learn the relationship between them.

This approach works well for a sentence, or maybe a paragraph, but it could not possibly be used for something the size of a book. For that we would need something that is both computationally lightweight and capable of finding relationships between words that are far apart from one another. This is the task that the authors of Attention set for themselves.

Take a Breather!

If you made it through this whole article, congratulations! I hope you feel a little bit more educated on the shape of the NLP landscape as it existed when Attention entered the scene. As promised, I’m breaking it up into two posts, though both are available now.

So take a break, get some coffee, post your thoughts below, or maybe just disappear for a week! Whenever you’re ready, let’s jump into Part II of Understanding “Attention Is All You Need”.

Make sure to subscribe so you can see each piece as it’s published!

The Author

With a Bachelor’s in Statistics and a Master’s in Data Science from the University of California, Berkeley, Malachy is an expert on topics ranging from significance testing, to building custom Deep Learning models in PyTorch, to how you can actually use Machine Learning in your day-to-day life or business.

References

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.3115/v1/d14-1179

Gallagher, S., Rafferty, A., & Wu, A. (2004). NLP Overview — History. Natural Language Processing. Retrieved April 16, 2023, from https://cs.stanford.edu/people/eroberts/courses/soco/projects/2004-05/nlp/overview_history.html

IBM. (n.d.). What are convolutional neural networks? IBM. Retrieved April 16, 2023, from https://www.ibm.com/topics/convolutional-neural-networks

Knight, W. (2016, August 9). AI’s language problem. MIT Technology Review. Retrieved April 16, 2023, from https://www.technologyreview.com/2016/08/09/158125/ais-language-problem/

Lai, S., Xu, L., Liu, K., & Zhao, J. (2015). Recurrent Convolutional Neural Networks for Text Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1). https://doi.org/10.1609/aaai.v29i1.9513

Kim, Y. (2014). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.

