The Building Blocks of Transformers

Building an intuitive understanding of the Transformer architecture and its major components

Maddie Lupu
8 min read · Feb 28, 2024
Header image credit: quantumdesign

Note: Images in this article are adapted from deeplearning.ai materials. Original sources will be cited where applicable.

Deliberate focus and attention are key to studying and achieving complex goals. Similarly, Large Language Models achieve great results at understanding and responding to language thanks to the attention mechanism within the Transformer architecture. In this article, I am going to explore the mechanics of this architecture to give you an intuition of what's happening under the hood.

Introduced in the landmark 2017 paper Attention is All You Need, the Transformer architecture revolutionised natural language processing, laying the groundwork for today’s advancements in Generative AI (Gen AI).

Transformers vs Recurrent Neural Networks (RNNs)

With all the hype around LLMs, one might think generative algorithms popped up yesterday. Like any breakthrough, they built on years of work. Predecessors like RNNs paved the way. While powerful at the time, RNN-based models struggled to remember long chunks of text and needed huge amounts of computing power to work well.

RNNs predict words one at a time, looking at what came just before. This works fine for short sentences, but fails over longer text. It’s like trying to review a whole book after reading only the final sentences — impossible to get the full picture!

RNNs tend to forget essential information if it appears early in the text, which makes it hard for them to fully understand a passage. Throw in a few homonyms, those deceptive words with multiple meanings, and things get truly confusing.

Photo adapted by the author

Attention

Attention came to the rescue! Transformers use it to grasp how words relate to each other in a sentence, no matter how far apart they are. This makes Transformers better at getting the full picture. And we are talking plural: there is not just one attention mechanism, but many. Because Transformers process words non-sequentially, they can run these attention mechanisms in parallel on GPUs. This leads to faster training and response times compared to earlier models.

Sentence Processing — RNNs vs Transformers

RNNs process words sequentially
Transformers process words by analysing all the word relationships

Transformers use attention weights to create attention maps. These maps capture the relationships between words, learned during training on large amounts of text. This mechanism, called self-attention, allows the model to focus on the most relevant connections within a sentence. For example, here's how an attention map might look for the sentence above; notice the thicker lines representing stronger connections and thinner lines for weaker ones.

Attention map by the author

Architecture

Now that we have an intuition for the power of the attention mechanism, let's dive into the broader Transformer architecture. It is divided into two distinct components: the Encoder and the Decoder. To help us visualise the concepts within the architecture, I'll reference the simplified diagram from the original paper throughout this article (right-hand side of the diagram).

Transformers architecture — Attention is all you need paper

Let’s uncover how this architecture works by analysing its key components.

Translate words to numbers — tokenise

Machine Learning algorithms are big statistical calculators. They don't understand words directly; they understand numbers. To bridge that gap between language and numbers we use tokenisation. There are different ways to map words to numbers, but it's important to use the same method that was used to pre-train the model. This way, words become code numbers the model recognises from its own dictionary. On top of that, tokenisation helps the model handle unseen words by breaking them into familiar sub-word pieces.

Notice in the picture below how the German words Ich, mag etc. in the sentence Ich mag weiße Schokolade have each been assigned unique token IDs.

Words to tokens
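
To make this concrete, here's a minimal sketch using the Hugging Face transformers library; the library and the t5-small checkpoint are illustrative choices on my part. The exact IDs depend on the tokeniser the model was pre-trained with, so they won't match the figure.

```python
from transformers import AutoTokenizer

# Any pre-trained tokenizer works; t5-small is just an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

sentence = "Ich mag weiße Schokolade"
encoded = tokenizer(sentence)

print(tokenizer.tokenize(sentence))   # sub-word pieces, e.g. ['▁Ich', '▁mag', ...]
print(encoded["input_ids"])           # the numeric token IDs the model actually sees
```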

Tokens to vectors — embedding layer

Next, pass the tokens to the embedding layer. This layer acts like a skilled librarian. Having spent years organising a huge library with many shelves, the librarian instinctively knows where to find related topics and books. The embedding layer works similarly, taking each token (a word or subword) and placing it within a high-dimensional space. Like the librarian's carefully arranged shelves, words with similar meanings sit close together in this space. While it's challenging for us to fully visualise such a space, the original Transformer used vectors of length 512 to represent each token while capturing its meaning. Notice in the picture below how each token is allocated a vector.

Each token id is assigned a multidimensional vector
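
As a rough PyTorch sketch, the embedding layer is essentially a learned lookup table that maps each token ID to a vector of size 512. The vocabulary size and token IDs below are illustrative placeholders.

```python
import torch
import torch.nn as nn

vocab_size = 32000   # illustrative vocabulary size; the real size depends on the tokenizer
d_model = 512        # embedding size used in the original Transformer

embedding = nn.Embedding(vocab_size, d_model)

# Token IDs standing in for "Ich mag weiße Schokolade" (made up for illustration).
token_ids = torch.tensor([[156, 780, 83, 4024]])
token_vectors = embedding(token_ids)

print(token_vectors.shape)   # torch.Size([1, 4, 512]): one 512-dimensional vector per token
```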

To better grasp the embeddings concept, imagine a vector size of two to capture the relationships between words. Within this 2D space, word meanings would be plotted as points. Words with similar meanings, like chocolate and cake, would be placed closer together than an unrelated word like AI. To determine the similarity between two words, we can calculate the angle between their corresponding vectors. Smaller angles imply higher similarity, capturing shared context and meaning.

Measure similarity between two words
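
Here's a tiny illustration of that idea. The 2D vectors below are made up for the example; real models learn them during training. Cosine similarity is one common way to compare the direction of two vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means a small angle."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 2D embeddings purely for illustration; real models learn these values.
chocolate = np.array([0.9, 0.8])
cake      = np.array([0.85, 0.75])
ai        = np.array([-0.7, 0.6])

print(cosine_similarity(chocolate, cake))   # ~1.0: small angle, similar meaning
print(cosine_similarity(chocolate, ai))     # much lower: larger angle, unrelated words
```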

Positional Encoding

Next, add positional encodings. These act like a ticketing machine, assigning each word a unique label that reveals its place in the sentence. Since Transformers process words simultaneously rather than sequentially, positional encodings are essential for maintaining word order and context within the sentence.

Adding positional embeddings to token embeddings
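
The original paper computes these positional encodings with fixed sine and cosine waves of different frequencies. A compact NumPy sketch of that formula:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Fixed sine/cosine positional encodings from the original Transformer paper."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=512)
print(pe.shape)   # (4, 512): one position vector per token, added to its embedding
```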

After summing the token embeddings and positional encodings, the resulting vectors are passed to the self-attention layer. Here the model analyses how the different tokens in the sequence relate to each other and captures their dependencies. In practice this happens across several attention heads working in parallel, which is why it is called multi-headed self-attention. For a translation task, it's like having multiple language experts working together: one expert might focus on grammar, another on word order, and another on the overall meaning. The number of attention heads varies by model, with 12–100 being common.

Multi-headed self attention and attention maps
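
Under the hood, each attention head runs the same core computation, scaled dot-product attention. Here is a minimal NumPy sketch; the query, key and value matrices are random stand-ins for the learned projections of the token vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, the core step inside each head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)        # the attention map: each row sums to 1
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional queries/keys/values. The random values
# stand in for the learned projections of the token vectors.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
output, attention_map = scaled_dot_product_attention(Q, K, V)
print(attention_map.round(2))                 # a 4x4 map like the one pictured earlier
```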

After calculating attention weights based on how words relate to one another in a sentence, the model selectively emphasises the most relevant elements of the input. This refined representation of the data is then passed through a feed-forward neural network, which transforms it further. The network generates logits, which represent the model's confidence scores for each word in its vocabulary. Finally, a softmax layer converts these logits into probabilities, revealing the likelihood of each word in the current context. The word with the highest probability is the model's top prediction.
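
A tiny worked example of that last step, with a made-up four-word vocabulary and made-up logits:

```python
import numpy as np

# Made-up logits over a toy four-word vocabulary, purely for illustration.
vocab  = ["chocolate", "cake", "white", "I"]
logits = np.array([4.1, 1.3, 0.2, -0.5])

probs = np.exp(logits - logits.max())
probs = probs / probs.sum()        # softmax: logits become probabilities that sum to 1

for word, p in zip(vocab, probs):
    print(f"{word:10s} {p:.3f}")
print("prediction:", vocab[int(np.argmax(probs))])   # the highest-probability word
```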

Different Transformers Architectures

As mentioned earlier, the Encoder and Decoder are two distinct components of the architecture. There are actually three common ways LLMs use these components, as shown in the picture below (and in the short usage sketch that follows the list):

Transformers architecture types by deeplearning.ai
  • Encoder-Only Models focus on understanding and representing the meaning of input text. They're great for tasks like sentiment analysis, entity extraction and text classification. A few model examples: BERT, RoBERTa and DistilBERT.
  • Encoder-Decoder Models excel at transforming one sequence of information into another. Perfect for translation, text summarisation and question answering (Q&A). Examples: T5 and BART.
  • Decoder-Only Models generate text one word (or token) at a time. Great for text generation, chatbots, language modelling and predicting the next word in a sequence. Examples: GPT models and Jurassic.
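
Here's a hedged usage sketch with the Hugging Face pipeline helper; the checkpoints named below are illustrative choices, not recommendations.

```python
from transformers import pipeline

# Encoder-only: understanding the input (sentiment analysis with DistilBERT).
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I like white chocolate"))

# Encoder-Decoder: transforming one sequence into another (summarisation with BART).
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
# summariser(long_article_text)  # condenses a long passage into a shorter one

# Decoder-only: generating text one token at a time (GPT-2).
generator = pipeline("text-generation", model="gpt2")
print(generator("White chocolate is", max_new_tokens=10))
```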

Encoder-Decoder example

Now that you have a sense of how Transformers use the Encoder and Decoder components, let's take an example to illustrate what happens when we make inferences with an Encoder-Decoder model.

The Transformer architecture was actually originally designed for translation, so let’s translate the following German sentence to English: Ich mag weiße Schokolade.

How transformers process words to make inferences

In the diagram above, the sentence is first converted into token IDs: 156, 780 .., then these are turned into vectors x1, x2, .. before they are fed into the self-attention layers. The outputs of the multi-headed attention layers pass through a feed-forward network, and this becomes the output of the Encoder. At this point, the Encoder has created a deep representation of the input sentence's structure and meaning.

This representation is fed into the middle of the Decoder, where it influences the prediction of the next token. The output of the Decoder's self-attention layers flows through the Decoder's feed-forward network and a final softmax output layer. The Decoder repeats this process until a stop sequence is reached, much like reaching the period at the end of a sentence. Finally, the generated sequence of tokens is turned back into words, and there you have your output: I like white chocolate.
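
Put together, the inference loop looks roughly like the sketch below. The tokenizer, encoder and decoder objects are hypothetical stand-ins rather than a specific library's API, and greedy decoding is just the simplest strategy.

```python
# A schematic greedy-decoding loop for an Encoder-Decoder model. `tokenizer`, `encoder`
# and `decoder` are hypothetical stand-ins, not a specific library's API.
def translate(sentence, tokenizer, encoder, decoder, start_id, stop_id, max_len=50):
    # 1. Encode the source sentence once into a deep representation ("memory").
    source_ids = tokenizer.encode(sentence)        # e.g. "Ich mag weiße Schokolade"
    memory = encoder(source_ids)

    # 2. Generate the translation one token at a time.
    output_ids = [start_id]
    for _ in range(max_len):
        # The decoder attends to the tokens generated so far and to the encoder memory,
        # returning one row of logits (vocabulary scores) per output position.
        logits = decoder(output_ids, memory)
        next_id = int(logits[-1].argmax())         # greedy choice: most probable next token
        if next_id == stop_id:                     # stop sequence reached
            break
        output_ids.append(next_id)

    # 3. Turn the generated token IDs back into words.
    return tokenizer.decode(output_ids[1:])        # -> "I like white chocolate"
```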

Closing words

In this article, we've discovered the remarkable power of the attention mechanism and how it revolutionised natural language processing. We looked back at earlier generative models, their shortcomings, and how Transformers helped overcome them. We've delved into the inner workings of the major Transformer components, revealing how they collaborate to process information.

The future of Transformers extends far beyond language — their potential shines bright in computer vision, image classification, protein structure analysis, drug discovery, and countless other fields. As they continue to evolve, I have no doubts we will be seeing even more astonishing breakthroughs that will shape the way we interact with machines and the world at large.

Thank you for reading!

Enjoyed the read? Show some love with a few claps below 🙌
