Large Language Models & Transformer Architecture: The Basics

Long Nguyen
15 min read · Jul 25, 2023


Photo by ilgmyzin on Unsplash

I. Overview

Let’s start by going through a few basic concepts in the Generative AI world.

AI -> ML -> Generative AI -> LLM
Source: Unlocking the Power of Generative AI Models and Systems such as GPT-4 and ChatGPT for Higher Education: A Guide for Students and Lecturers

1. Generative AI

Generative AI is a branch of artificial intelligence focused mainly on generating new content. While traditional machine learning models are trained to recognise patterns, make predictions, or classify information, Generative AI is all about creation (hence the term “generative”). The content can range from images and music to, importantly for our discussion, text. Within the vast domain of machine learning, think of Generative AI as the segment focused on outputting new, original content based on its training.

2. Language Model

Simply put, a Language Model (LM) is an AI model trained to understand and generate human language. By processing vast amounts of text data, it learns grammar, vocabulary, context, and nuances, enabling it to produce relevant text based on a given prompt. For example, it can help autocomplete sentences, generate responses to questions, or even write articles.

3. Large Language Model

Large Language Models (LLMs) are essentially beefed-up versions of the standard language models. They’ve been trained on a significantly larger scale, processing more diverse and extensive datasets. Because of this vast training, they can understand context more deeply, generate more coherent text, and handle a wider array of linguistic tasks. Notable examples in this category include OpenAI’s GPT-3 and GPT-4.

4. Generative AI & Large Language Model

How does Generative AI relate to LLMs? In essence, LLMs are a subset of Generative AI. As previously mentioned, they’re simply specialised models that focus solely on text, generating sentences and paragraphs that can mimic human-written content. When we talk about Generative AI in the context of producing written content, we are often referring to these language models.

5. Foundation Models

In essence, a foundation model is like the general education we all receive early in our lives. It provides a broad and foundational understanding of the world, but not a specialisation. Similarly, foundation models in AI are pre-trained on enormous datasets, allowing them to grasp a wide range of general patterns and information. Their generality allows them to be fine-tuned or adapted for specific tasks with smaller, task-specific datasets. In many ways, LLMs can be seen as a type of foundation model. Examples include GPT, BERT, Llama, etc.

II. Transformer Architecture

1. What is it?

The Transformer architecture was proposed in the Attention Is All You Need paper in 2017 and has since become a very popular choice for building LLMs. The architecture allows models to learn the context of ALL words in a sentence, not just neighbouring words, using the attention mechanism. By assigning attention weights to the relationships between words, models can learn the relevance of each word to every other word, regardless of where they appear in the input text. Based on these weight values, the models can selectively choose to focus on, or attend to, certain words more than others when making predictions. These weights are learnt during training, often with a massive amount of data.

It’s worth noting that, unlike traditional NLP models that process text word by word sequentially, transformers can attend to multiple words at once, which enables parallel processing and, consequently, better performance.

A couple of important things that really set this architecture apart and make it revolutionary are:

  • Self-attention: This allows the model to weigh the importance of different words in the input sequence relative to a particular word, hence, “understand” a word in the context of the words around it. With training, this would build up sort of an internal representation or understanding of language automatically. The better this internal representation, the better it gets at any linguistic task.
  • Positional encoding: Transformers need a way to capture the order of the words in the input text — so they store information about the order in the data itself (each position is encoded and added to the input embeddings). As the model trains on more data, it learns to interpret those positional encodings better.

Transformer-based models can be considered a form of semi-supervised learning: they are first pre-trained on large, unlabelled datasets, and then fine-tuned with supervised training to improve their performance on specific tasks.

There are 2 main parts to this architecture: Encoder & Decoder. They share some similar core components — input tokenisation, embedding layers, positional encoding, self-attention layers, feed-forward network, etc.

Let’s try to use some analogies to understand at a high-level each core component before diving deeper into the technical details.

Transformer with Analogies

Tokenisation: Splitting a recipe into individual ingredients — Just as you dissect a recipe to understand the individual components (like eggs, flour, sugar), tokenisation breaks down a sentence into individual pieces or tokens. This helps the model process each word or sub-word distinctly.

Embeddings: Translating words into a secret code — Imagine words having a unique secret code that encapsulates not just the word, but its relationship and nuances in meaning. Embeddings convert words into dense vectors, which are like these secret codes, capturing semantic information.

Positional Encoding: Time-stamping each photo in a photo album — Just as timestamps tell us the sequence in which photos were taken, positional encodings provide context about a word’s position in a sentence. This ensures that the model knows the order of words, crucial for understanding meaning.

Self-Attention: In a group conversation, remembering who said what — Imagine a lively discussion where statements made by one person relate or refer to another’s. Self-attention is similar, determining relationships between words in a sentence, and how much each word should influence another.

Multi-Head Attention: Several detectives looking at the same case — Picture multiple detectives examining evidence for a case. Each one notices different details. Similarly, multi-head attention has multiple “attention heads” that focus on various parts of the input, offering diverse perspectives.

Feed-Forward Network: Passing a raw gemstone through different craftsmen — A gemstone becomes refined as it’s crafted by various experts. Similarly, the feed-forward network refines the information from the attention mechanisms, transforming and shaping it further.

Encoder: A librarian organising and cataloging books — The librarian ensures books are well-organised so information can be accessed efficiently. The encoder processes and organises input data, creating representations that decoders can use effectively.

Decoder (with Masking): Solving a jigsaw puzzle piece by piece — When solving a jigsaw, you use existing pieces as references without knowing the whole picture. The decoder, especially with masking, predicts the next word in a sequence based on previous ones, without “seeing” future words.

Residual Connection: A student attending a lecture with prior knowledge — a student who revisits topics can merge new knowledge with the old. Residual connections in transformers allow layers to retain original input information, merging it with new transformations.

Layer Normalisation: Standardising all runners to the same starting line. To ensure fairness in a race, all runners start at the same line. Layer normalisation balances and standardises activations (outputs) in the network, ensuring stable and faster learning.

Source: Attention Is All You Need

2. Input Tokenisation

The first thing we need to do is to tokenise the input words. Tokenisation is the process of converting raw text into smaller chunks or tokens, which can then be processed by the models.

Source: “Generative AI with LLMs” course on DeepLearning.AI

Once tokenised, each token is converted into a number (a unique ID), where each number represents a position in a dictionary of all the possible words the model can work with (the model’s vocabulary). This vocabulary is established during pre-training and typically contains tens of thousands (sometimes hundreds of thousands) of unique tokens.
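As a rough illustration only (real LLMs use learned sub-word tokenisers such as BPE or WordPiece rather than simple whitespace splitting), here is a toy Python sketch of mapping words to IDs using a tiny, made-up vocabulary:

```python
# Toy tokenisation sketch: a made-up vocabulary and naive whitespace splitting.
# Real models use learned sub-word tokenisers (e.g. BPE or WordPiece).
text = "The cat sat on the mat"

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<unk>": 5}

tokens = text.lower().split()
token_ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]

print(tokens)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```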

3. Embedding Layer

Once input is represented as numbers, we pass it to the embedding layers. Raw token IDs from the tokenisation process aren’t particularly useful for the models. The numbers don’t contain semantic information about the words they represent.

This layer is a trainable vector embedding space: each token is transformed into a vector (an embedding) that encodes the meaning and context of the token as a series of numeric values, and each vector occupies a unique location within that space. The processing of these input tokens is also designed to be done in parallel.

Words with similar meanings, or that appear in similar contexts, will have embeddings that are closer together, while words with different meanings will have embeddings that are further apart. For example, words like “apple”, “banana”, and “orange” will sit quite close together in the space, while being quite far away from “car”, “train”, and “bicycle”.

A common vector size is 512 (as in the original Transformer paper), which is already quite large, and larger models use considerably bigger embedding dimensions.

Source: “Generative AI with LLMs” course on DeepLearning.AI

You can imagine the vector size as the number of dimensions in the embedding space. A higher dimensionality allows the model to encode more intricate relationships and subtleties, but it also requires more data and computational power to train effectively.

As a simplified example, with a vector size of just 3 you can plot the words in a 3D space and see the relationships between them.

Source: “Generative AI with LLMs” course on DeepLearning.AI

These dense vector representations aren’t just randomly assigned. They’re learned. When training a language model (or any model that uses embeddings), the initial embeddings might start off random. However, as the model trains on vast amounts of text, it adjusts these embeddings to minimise its prediction error. Over time, the model places words with similar meanings or that appear in similar contexts closer together in this embedding space because doing so helps it make better predictions.
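To make this concrete, here is a minimal NumPy sketch (with made-up, tiny sizes) of what an embedding lookup does: each token ID simply selects a row of a trainable matrix. In a real model these values start random and are adjusted during training, as described above.

```python
import numpy as np

vocab_size, d_model = 6, 8        # tiny sizes for illustration; real models use e.g. 512 or more
rng = np.random.default_rng(0)

# The embedding layer is just a trainable (vocab_size x d_model) matrix.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [0, 1, 2, 3, 0, 4]    # "the cat sat on the mat" from the toy vocabulary above
embeddings = embedding_matrix[token_ids]   # one row per token, shape (6, d_model)
print(embeddings.shape)           # (6, 8)
```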

4. Positional Encoding

Transformers are designed to handle all parts of an input sequence simultaneously, in parallel. This means they can’t inherently recognise the order of tokens. Without a mechanism to consider sequence information, all words would seem independent and context-free.

Source: Transformer’s Encoder-Decoder: Let’s Understand The Model Architecture

At a high level, positional encoding involves adding a uniquely generated vector to each input token’s embedding. The Attention Is All You Need paper mentioned that the positional encodings have the same dimension/length as the embeddings, so that the two can be summed. It’s a way to ensure that, despite processing all words in parallel, the model doesn’t lose the sequential essence of the language.

This vector is determined solely based on the token’s position in the sequence, irrespective of the token itself. The goal is to ensure that the sum of the token embedding and positional encoding is unique for every position and provides enough information for the model to determine the original position of each word.

Source: Transformer’s Positional Encoding: How Does It Know Word Positions Without Recurrence?

Imagine trying to translate a sentence from English to French. The placement of adjectives and nouns often differ between the two languages. To produce a fluent translation, the model must recognise these positional differences. By being aware of the position of each word through positional encoding, the Transformer can produce contextually accurate translations, preserving the word order nuances between languages.
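For reference, here is a small NumPy sketch of the sinusoidal positional encoding used in the original paper (sine on even dimensions, cosine on odd ones); the resulting matrix has the same shape as the embeddings, so the two can simply be summed:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Positional encodings as in 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angle_rates = 1.0 / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * angle_rates)      # even dimensions
    pe[:, 1::2] = np.cos(positions * angle_rates)      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
# inputs_to_attention = embeddings + pe   # same shape as the embeddings, so they can be summed
```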

5. Self-Attention Layer

Once you’ve summed the token embeddings and the positional encodings, you pass the resulting vectors to the self-attention layer, where the model analyses the relationships between tokens in your input sequence.

Imagine reading a sentence and, for every word you read, being able to have a sense of how important every other word in that sentence is, in understanding the current word’s context. That’s essentially what self-attention does.

For every word in a sentence, self-attention assigns a weight to every other word, indicating how much “attention” it should pay to that word. This allows the model to attend to different parts of the input sequence and better capture the contextual dependencies between words. If a word has a high attention weight relative to another word, it means the model believes the two words are closely related in the given context.

Source: The Illustrated Transformer

It’s worth noting that it can assign attention to words close by or words far apart. This enables the model to capture both short-term and long-term dependencies in the data, making it particularly adept at understanding the context in long sequences like sentences or paragraphs.

As an example, if a model is trying to understand the word “bank” in the sentence “He sat by the river bank,” the attention mechanism can highlight the word “river” as particularly important (higher weight) for contextualising “bank.”
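Here is a minimal sketch of scaled dot-product self-attention for a single head, which is the core computation inside this layer. The projection matrices are random stand-ins for learnt weights, and the sizes are purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head; x is (seq_len, d_model)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) raw scores
    weights = softmax(scores)                 # each row: how much this word attends to every other word
    return weights @ V                        # context-aware representation of each token

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(6, d_model))             # e.g. embeddings + positional encodings for 6 tokens
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output = self_attention(x, W_q, W_k, W_v)     # shape (6, 8)
```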

Multi-head Attention

In the Transformer architecture, this self-attention mechanism isn’t applied just once. Instead, multiple sets of self-attention weights (aka heads) are learnt in parallel independently of each other. The intuition here is that each head can learn to pay attention to different aspects or relationships in the data.

For instance, one head might focus on syntactic relationships, another on semantic relationships, another on the relationships between the people or entities in the sentence and the activities being described, and yet another on specific word patterns or rhymes. Each head produces its own output, and these outputs are concatenated and linearly transformed to produce the final multi-head output. This output is a richer representation of the input, capturing various aspects and nuances.

Take an example of this sentence:

“The cat sat on the mat.”

Assume our model has a dimension of 512 (let’s call this value d_model), 2 attention heads, and operates on words rather than sub-words. The input embeddings going into the attention layer look something like:

| Word | d_model (512)                |
|------|------------------------------|
| The  | [0.1, 0.2, ..., 512th value] |
| cat  | [0.3, 0.1, ..., 512th value] |
| sat  | [0.9, 0.2, ..., 512th value] |
| on   | [0.5, 0.6, ..., 512th value] |
| the  | [0.1, 0.2, ..., 512th value] |
| mat  | [0.4, 0.8, ..., 512th value] |

The first attention head focuses on the activity “sat”, so it might put more weight on the subject of the activity (“cat”) and the location of the activity (“mat”). The second head might pay attention to the relationship between different entities — “cat” and “mat”. The computed output of each head for the word “sat” might look something like:

| Head   | d_model (512)                                 |
|--------|-----------------------------------------------|
| Head 1 | Weighted combination of ["cat", "sat", "mat"] |
| Head 2 | Weighted combination of ["cat", "mat"]        |

After the multi-head computation, we get the outputs from each head for the word “sat”, then we combine them by concatenating the vectors — the result will be a vector of size 2 * 512 = 1024.

concatenated = [value1,value2,...,512th value,value1,value2,...,512th value]

Now, we can’t feed a vector of size 1024 to subsequent layers that expect a size of 512. So, we use a linear transformation (basically, a matrix multiplication) to project this 1024-sized vector back down to 512.

combined = concatenated x weight_matrix

The resulting combined vector goes into the subsequent feed-forward network step. Remember that this is repeated for every word in the input sentence.
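Continuing the toy walkthrough above (and keeping its simplification that each head outputs a full 512-dimensional vector; in the original paper each head actually works in a smaller d_model / num_heads subspace), the concatenation and projection might look like this in NumPy, with random matrices standing in for learnt weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads = 512, 2

# Stand-ins for the outputs of each head for the word "sat".
head_1 = rng.normal(size=(d_model,))
head_2 = rng.normal(size=(d_model,))

concatenated = np.concatenate([head_1, head_2])           # size 2 * 512 = 1024

# A learnt output projection maps the concatenation back down to d_model.
weight_matrix = rng.normal(size=(num_heads * d_model, d_model))
combined = concatenated @ weight_matrix                   # size 512 again

print(concatenated.shape, combined.shape)                 # (1024,) (512,)
```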

The number of attention heads in the attention layer varies from model to model; common choices range from 8 or 12 up to nearly 100 heads in the largest models.

You don’t dictate ahead of time what aspects of language the attention heads will learn; the different attention patterns emerge naturally during training. The weights of each head are randomly initialised, and given sufficient training data and time, each head learns to capture different aspects of language. This flexibility is crucial for the model’s ability to adapt to a variety of tasks.

6. Feed-Forward Network

Now that all of the attention weights have been applied to your input, the output is processed through a fully-connected feed-forward network (FFN).

Source: Feed Forward Neural Network

A Feed Forward is an artificial neural network in which the connections between nodes do not form a cycle….The feed forward model is the simplest form of neural network as information is only processed in one direction. While the data may pass through multiple hidden nodes, it always moves in one direction and never backwards.

From “Feed Forward Neural Network” by DeepAI

The FFN layer applies a linear transformation to the output we just received from the multi-head attention layer, followed by a non-linear activation function, and then another linear transformation (for the detailed math equations, please see the original paper). This means it can capture non-linear relationships for each word in the input. Without the non-linear component, the network would not be very powerful, as it wouldn’t be able to learn complex patterns or relationships in the data beyond simple linear ones.

In simple terms, after the attention mechanisms figure out the “context” and relationships between words, the FFN applies a consistent set of transformations to refine that contextual information further. To draw an analogy: if the attention mechanism is like identifying the key relevant ingredients for a dish, the feed-forward network is like fine-tuning the flavours and making sure everything blends well together.

Depending on whether the FFN sits in the encoder or the decoder, its output is used differently. In the encoder, the output of the FFN layer is a sequence of transformed embeddings.

For the decoder, on the other hand, the final output is logits (the raw predictions that come out of the last layer of the neural network): one score for each token in the model’s vocabulary.

The logits are then passed through a final softmax function (a mathematical function that converts a vector of numbers into a vector of probabilities, where each probability is proportional to the relative scale of its value), which normalises them into a probability distribution over every possible next word/token. The token with the highest probability becomes the most likely prediction.
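As a rough sketch of these last two steps, here is a position-wise feed-forward computation followed by a softmax over vocabulary logits. The d_model and inner dimension follow the original paper (512 and 2048), while the weights and vocabulary size are made up for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: linear -> ReLU (non-linearity) -> linear, applied to each token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, vocab_size = 512, 2048, 10000
attention_output = rng.normal(size=(6, d_model))          # stand-in for the attention layer's output

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
ffn_out = feed_forward(attention_output, W1, b1, W2, b2)  # (6, 512) transformed embeddings

# In the decoder, a final projection to vocabulary size gives logits,
# and softmax turns them into a probability distribution over the next token.
W_vocab = rng.normal(size=(d_model, vocab_size))
logits = ffn_out[-1] @ W_vocab
probs = softmax(logits)                                   # sums to 1; the highest value is the likely next token
```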

6. Encoder & Decoder

So those were all of the core components; now we can put them together to form the encoder and decoder, with some additional tweaks.

In terms of the encoder, the majority of the components have been described previously:

  • Input tokenisation
  • Embedding layer
  • Positional encoding
  • Multi-head self attention
  • Feed-forward network
  • Residual connection & layer normalisation: Each of the steps (self-attention and feed-forward network) in the encoder is surrounded by a residual connection and followed by layer normalisation. This helps stabilise the activations and allows for deeper models by adding information carried over from the previous layers (see the sketch below).
Source: Transformer’s Encoder-Decoder: Let’s Understand The Model Architecture
  • Stacking encoders: The Transformer model doesn’t just have one encoder layer; it stacks multiple encoder layers together (typically 6). As information flows through these layers, the model is able to learn more and more about the relationships and patterns in the input.

The output of the final encoder layer is considered input features for the decoder — it is a context-rich representation of the input, wherein each token’s representation is influenced by its surrounding tokens.
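Putting the encoder pieces together, one encoder layer can be sketched roughly as follows (a simplified sketch only: real implementations add multiple attention heads, dropout, and learnt scale/shift parameters in layer normalisation):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    """Each sub-layer is wrapped in a residual connection ('x + ...')
    and followed by layer normalisation."""
    x = layer_norm(x + self_attention(x))   # multi-head self-attention sub-layer
    x = layer_norm(x + feed_forward(x))     # position-wise feed-forward sub-layer
    return x

def encoder(x, layers):
    """Stack of encoder layers (typically 6 in the original Transformer)."""
    for self_attention, feed_forward in layers:
        x = encoder_layer(x, self_attention, feed_forward)
    return x
```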

The decoder component is mostly similar to the encoder, but it works on the target output sequence and has some slight differences in its layers:

  • Input tokenisation: There are 2 primary inputs — one comes from the encoder, and the other is the ongoing sequence that the decoder is generating
  • Embedding layer
  • Positional encoding
  • Masked multi-head attention: The attention mechanism is slightly different from the encoder’s in that the input is partially masked. The “masked” part ensures that, while predicting a particular word, the model can only attend to earlier positions in the sequence (or the current position) but not future positions — masked positions effectively receive zero attention weight (see the small masking sketch after this list). This ensures that the prediction for a particular word doesn’t depend on future words in the sequence, preserving the auto-regressive property of the decoder.
Source: Transformer’s Encoder-Decoder: Let’s Understand The Model Architecture
  • Encoder-Decoder Attention Layer: This layer calculates attention over both the encoder’s output and the partial output continually generated by the decoder (each output token is passed back to the input to trigger the generation of the next token, until the model predicts an end-of-sequence token).
  • Feed-forward network
  • Residual connection & layer normalisation: similar to that of encoder.
  • Output: the output, as previously mentioned, is the logits, which are passed through the softmax function to produce a probability distribution over the vocabulary, determining the next word in the sequence.
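As a small illustration of the masking mentioned above (a sketch only, using NumPy), masked positions are typically set to a very large negative score before the softmax, so they end up with effectively zero attention weight:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.normal(size=(seq_len, seq_len))        # raw attention scores (Q @ K.T / sqrt(d_k))

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(mask, -1e9, scores)        # future positions get a huge negative score

weights = softmax(masked_scores)
print(np.round(weights, 2))                         # upper triangle is ~0: no attention to future tokens
```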

In essence, while the encoder’s role is to represent and compress the input data into a context, the decoder’s job is to unfold this context into a meaningful output sequence, attending to the most relevant parts of the encoded input and its own previous outputs as needed.

You can technically split these components apart to create variations of the architecture. In fact, there are a number of models that are encoder-only, encoder-decoder or just decoder-only.

Source: An In-Depth Look at the Transformer Based Models

Until next time. Happy reading!
