Step-by-Step Illustrated Explanations of Transformer

Yule Wang, PhD
The Modern Scientist
8 min read · Feb 27, 2023

My next post “An In-Depth Look at Transformer-Based Models” will deeply explore the training objectives and architectures of these models, including why GPT-1 to GPT-4 utilize a Transformer decoder-only architecture. Additionally, the post will examine why a decoder-based model is a better choice for a unifying network model for downstream tasks.

A BRIEF HISTORY BEFORE THE TRANSFORMER

Bag of words

Prior to the advent of the Transformer, the bag-of-words method was a commonly used approach in Natural Language Processing (NLP), treating each word or token as an independent entity in the context. The Term Frequency-Inverse Document Frequency (TF-IDF) model is an example of this methodology: it quantifies the frequency of each word in a document and measures its rarity or commonness across all documents. While this method provides embeddings for representing documents, it neglects the order of words and the context of sentences. One workaround is to use n-grams (contiguous sequences of words), but this approach does not work well for complex sentences.
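As a rough illustration of the idea, here is a from-scratch sketch of TF-IDF on a made-up toy corpus (not how libraries such as scikit-learn implement it, and the smoothing choice is just one common variant):

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, all_docs):
    # Term frequency: how often the term appears in this document
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Inverse document frequency: how rare the term is across all documents
    df = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / (1 + df))
    return tf * idf

print(tf_idf("cat", tokenized[0], tokenized))  # rare across the corpus -> positive score
print(tf_idf("the", tokenized[0], tokenized))  # common across the corpus -> score near zero
```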

RNN and LSTM

RNN

From a statistical standpoint, the objective is to predict the conditional probability P(wᵢ | w₁, …, wᵢ₋₁) of a word wᵢ, given its preceding words w₁, …, wᵢ₋₁ in a text sequence. Recurrent Neural Networks (RNNs) implement this statistical objective in a neural-network setting. An RNN comprises sequential RNN units, each taking one word as input. Because an RNN maintains a memory of past inputs, it can capture the temporal dependencies between words. However, RNNs suffer from vanishing and exploding gradients during back-propagation. They also have difficulty processing long sequences because the memory of past inputs decays over time, hindering the network’s ability to learn long-term dependencies.
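A minimal PyTorch sketch of an RNN language model that outputs P(wᵢ | w₁, …, wᵢ₋₁) at every position; the vocabulary size, dimensions, and toy batch below are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128   # hypothetical sizes

class RNNLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))          # hidden state at every time step
        return self.out(h)                           # logits for the next word at every step

model = RNNLanguageModel()
tokens = torch.randint(0, vocab_size, (2, 10))       # a toy batch of 2 sequences
logits = model(tokens)                               # shape (2, 10, vocab_size)
probs = logits.softmax(dim=-1)                       # P(w_i | w_1, ..., w_{i-1})
```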

LSTM

The Long Short-Term Memory (LSTM) architecture was proposed to address the challenges of RNNs in handling long sequences of data and the gradient vanishing problem. It uses memory cells with forget gates that allow the network to selectively retain or discard past information over time.
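In PyTorch, the gating logic lives inside nn.LSTM, so a sketch looks almost identical to the RNN above; the batch and dimensions are again made up:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(2, 200, 64)          # a toy batch: 2 sequences of 200 word embeddings
output, (h_n, c_n) = lstm(x)         # c_n is the cell state maintained by the forget/input gates
print(output.shape)                  # torch.Size([2, 200, 128])
```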

However, training LSTMs can be computationally expensive because the weights of each LSTM unit are learned sequentially, word by word, resulting in prolonged training times. LSTM models also typically handle only around 200 words effectively, so excessively long text sequences continue to compromise performance.

TRANSFORMER

The Transformer architecture addresses the issue of preserving long-term dependencies by leveraging (a) self-attention mechanisms to retain word-to-word relations and (b) positional encodings to represent each word’s position. This enables parallel computation over the entire text without disrupting the word order. The Transformer has an encoder for the input text and a decoder for generating text.

Originally designed for translation (“Attention Is All You Need”, Vaswani, et al., 2017), the Transformer proposed by the Google team has advanced the NLP field through Transformer-based models such as the encoder-based BERT, the decoder-based GPT, and T5, BART, etc. In the next sections, I will elaborate on the encoder and decoder functions.

Fig 1: Transformer neural network architecture. The left part is a stack of N encoders for the input; the right part is a stack of N decoders for generating text. (image source: Vaswani, et al., 2017)

A. Encoder

Fig 2: Mechanism of the First Encoder Stack. An encoder stack has four layers: a self-attention layer, two Add & Norm layers, and a feed-forward layer. (image by author)

(1). Positional Encodings

The positional encoding is a fixed-length vector added to word i’s embedding xᵢ to enable parallel processing of input sequences without disrupting the word order. The new embedding xᵢ^(¹), which combines xᵢ with its positional encoding tᵢ, is the input to the first encoder stack:

Eq. 1:  xᵢ^(¹) = xᵢ + tᵢ

The dim-th dimension of the i-th word’s positional encoding vector tᵢ follows

Eq. 2:  tᵢ[2k] = sin( i / 10000^(2k/d_model) ),  tᵢ[2k+1] = cos( i / 10000^(2k/d_model) )

where the even dimensions dim = 2k use the sine term, the odd dimensions dim = 2k + 1 use the cosine term, i is the word’s position, and d_model is the embedding dimension.
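A minimal NumPy sketch of Eq. 2 and Eq. 1 (the sequence length and d_model below are placeholders chosen for illustration):

```python
import numpy as np

def positional_encoding(num_words, d_model):
    """Sinusoidal positional encodings t_i for positions i = 0..num_words-1 (Eq. 2)."""
    positions = np.arange(num_words)[:, np.newaxis]        # shape (num_words, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]    # dimensions 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    t = np.zeros((num_words, d_model))
    t[:, 0::2] = np.sin(angles)                            # even dims use sine
    t[:, 1::2] = np.cos(angles)                            # odd dims use cosine
    return t

t = positional_encoding(num_words=50, d_model=512)
# Eq. 1: the encoder input would be the word embeddings plus these positional encodings,
# e.g. x1 = word_embeddings + t  (word_embeddings would have shape (50, 512))
```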

Here is a visualization of positional encodings for words at different positions and dimensions.

Fig 3: Positional encodings for words of different position i and dimension dim. Image Source: Zelun Wang, et al., 2019

(2). Self-Attention Layer

Self-attention calculates attention scores between different sentence parts, enabling selective focus on the most relevant dependencies.

Fig 4: Attention Scores of “its” in One Attention Head. In one of the attention heads, the word “its” shows strong relevance to both the word “law” and the word “application”. (image source: Vaswani, et al., 2017, revised)

In the self-attention mechanism, each input word embedding x (dimension d_model) is transformed into three vectors, query q, key k, and value v, by learnable weights. This mimics a database retrieval process: a query for a specific word is matched against the keys of all words (including itself) to retrieve their values. You can understand the three vectors as follows:

query q: a vector (d = dₖ) of a specific word i, used to calculate its attention scores.

key k: a vector (d = dₖ) of each word j in the whole text; its dot product with word i’s q gives their attention score and identifies the words most relevant to word i.

value v: a vector (d = dᵥ) that represents the meaning of word j in another sense, with a dimension that may differ from the word embedding x. In this way, the weighted sum of the values v represents the words most relevant to word i and can later be added to word i’s embedding xᵢ^(¹).

(i). One-Head Attention Mechanism:

Fig 5: Single Head Attention Mechanism. (Image by author)

step 1: Each word’s embedding is transformed into k, q and v.

step 2: Calculate the attention scores of word i with the other words j and form the weighted sum z.

step 3: Repeat step 2 for all words i. The vectors z stack into the matrix Z of the single-head self-attention layer.

With vector q → matrix Q, k → K, v → V (stacking all words), the whole process follows

Eq. 3:  Z = softmax( Q Kᵀ / √dₖ ) V
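A minimal NumPy sketch of steps 1–3 and Eq. 3 for a single head; the toy sentence length, random embeddings, and random weight matrices are placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 6, 512, 64, 64        # 6 toy words; dimensions follow the paper

X = rng.normal(size=(n, d_model))            # word embeddings (plus positional encodings)
W_q = rng.normal(size=(d_model, d_k))        # learnable projection weights (random here)
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# step 1: transform each word embedding into queries, keys and values
Q, K, V = X @ W_q, X @ W_k, X @ W_v
# steps 2-3 / Eq. 3: scaled dot-product attention for all words at once
scores = Q @ K.T / np.sqrt(d_k)              # (n, n) attention scores between word pairs
Z = softmax(scores, axis=-1) @ V             # (n, d_v) weighted sums of the values
```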

(ii). Multi-Heads Attention Mechanism:

The multi-head attention mechanism performs the self-attention operation multiple times in parallel, each head using different learned query, key, and value matrices Q, K, and V to capture complex relationships between sentence parts from different semantic and grammatical perspectives. Eight heads were employed in the Transformer, as indicated in the paper “Attention Is All You Need”.

Fig 6: Multi-Head Attention Mechanism. (image by author)

step 1: Different weight matrices Wᵢq, Wᵢₖ, Wᵢᵥ are learned in each head i to transform the embeddings into the query, key, and value matrices Qᵢ, Kᵢ, Vᵢ.

step 2: Eight different Zᵢ are obtained in parallel, each following Eq. 3 within its own head.

step 3: Z₁, …, Z₈ of the eight heads are concatenated into a single multi-head attention matrix Zconc with dimension dₙ × 8dᵥ, where dₙ is the number of words.

step 4: Zconc is transformed into Zoutput (with dimension dₙ × d_model, to prepare for the summation between Zoutput and X) by multiplying by the weight matrix Wo; this is the final output of the multi-head attention layer in the first encoder.
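A minimal NumPy sketch of steps 1–4 (a real implementation would run the heads in parallel rather than in a loop, and all weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8                    # 8 heads as in the original paper
d_k = d_v = d_model // h                     # 64 per head

X = rng.normal(size=(n, d_model))            # word embeddings (plus positional encodings)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for _ in range(h):
    # step 1: each head has its own projection matrices (random placeholders here)
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_v))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # step 2: Eq. 3 inside each head
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

Z_conc = np.concatenate(heads, axis=-1)      # step 3: shape (n, 8 * d_v)
W_o = rng.normal(size=(h * d_v, d_model))
Z_output = Z_conc @ W_o                      # step 4: back to (n, d_model) so it can be added to X
```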

Fig 7: Attention Scores of “making” in Eight Different Attention Heads. In each head, the word “making” recognizes its relevance to other words in a different sense. In head 2, “making” recognizes “2009” as its subject. In other heads, it recognizes “making … more difficult” as a phrase. (image source: Vaswani, et al., 2017, revised)

(3). Add & Norm Layer

Step 5 in Fig. 6 represents the summation of the matrices Zoutput and X^(¹) in the “Add & Norm” sublayer. The updated matrix X’ thus retains the original word meanings while incorporating information about the dependencies between each word and the other words.

X’ is then subjected to layer normalization, which normalizes across the different dimensions of each word within a single sample, rather than across different samples in a batch.
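A minimal NumPy sketch of the Add & Norm sublayer; the learnable gain and bias of standard layer normalization are omitted, and the matrices are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each word's vector across its d_model dimensions (not across samples)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 512))               # encoder input X^(1)
Z_output = rng.normal(size=(6, 512))         # output of the multi-head attention layer
X_prime = layer_norm(X1 + Z_output)          # "Add & Norm": residual sum, then normalization
```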

(4). Feed-Forward Layer

The resulting matrix (the normalized X’) is then processed by two fully-connected sub-layers with dropout and a ReLU activation function in between. These fully-connected sub-layers are applied to the updated embedding x’ of each individual word separately. See Fig. 2.
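A minimal NumPy sketch of this position-wise feed-forward sublayer (dropout omitted; the inner dimension 2048 follows the original paper, and the weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                    # inner dimension 2048 as in the paper

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    # Two fully-connected layers with a ReLU in between, applied to each word separately
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

X_prime = rng.normal(size=(6, d_model))      # output of the previous Add & Norm
out = feed_forward(X_prime)                  # same shape (6, d_model)
```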

After another Add & Norm layer, the final output Xout of the first encoder stack is fed into the next encoder.

(5). N stacks of Encoders

By repeating the above process N times, more complex relationships in the context, beyond the word-to-word level, can be captured.

B. Decoder

Fig 8: Decoder Side Mechanism. Left: N stacks of decoders. Right: Masked attention mechanism that prevents future words from taking part in the attention-score calculations. (Image Source: Left: Vaswani, et al., 2017, revised; Right: image by author)

In the decoder, the generated text sequence is produced word by word. Each output word is fed back as a new input, which is then passed through the decoder’s attention mechanisms. After N decoder stacks, a softmax output layer predicts the most probable next word.

Note: Although the process may seem sequential, during actual training the decoder can predict the next words for the entire sequence in parallel.
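As an illustration of the word-by-word generation at inference time, here is a toy greedy decoding loop; decoder_step is a hypothetical stand-in for the N decoder stacks plus softmax and simply returns random probabilities here:

```python
import numpy as np

vocab_size, eos_id, max_len = 1000, 2, 20
rng = np.random.default_rng(0)

def decoder_step(encoder_output, generated_ids):
    # Hypothetical stand-in for the N decoder stacks + softmax over the vocabulary;
    # a real model would attend over encoder_output and generated_ids here.
    return rng.dirichlet(np.ones(vocab_size))

encoder_output = rng.normal(size=(6, 512))    # encoder output for the input text
generated = [1]                               # assumed start-of-sequence token id

for _ in range(max_len):
    probs = decoder_step(encoder_output, generated)
    next_id = int(np.argmax(probs))           # pick the most probable next word (greedy)
    generated.append(next_id)
    if next_id == eos_id:                     # stop at the end-of-sequence token
        break
```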

The mechanism of each layer on the decoder side is similar to that on the encoder side, with some differences due to the causal masking effects.

Attention Mechanism

(1). Masked Multi-Head Attention Layer

This self-attention layer in the decoder also employs the attention mechanism, but with masks on future words to prevent access to future information; it is therefore also called a causal self-attention layer. The causal masking mechanism is depicted on the right side of Fig. 8.
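A minimal NumPy sketch of the causal mask: scores for future positions are set to −∞ before the softmax, so they receive zero attention weight (the raw scores here are random placeholders):

```python
import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))     # stand-in for Q K^T / sqrt(d_k)
causal_mask = np.triu(np.ones((n, n)), k=1).astype(bool)  # True above the diagonal = future words
masked_scores = np.where(causal_mask, -np.inf, scores)    # future positions cannot be attended

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(masked_scores)             # each row attends only to itself and earlier words
```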

(2). Cross Attention Layer

The subsequent attention layer is referred to as the cross-attention layer: it takes its queries from the decoder’s preceding “Add & Norm” layer, while its keys and values are computed from the encoder’s output embeddings, and performs another round of attention calculations.
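A minimal NumPy sketch of cross-attention under this description: queries come from the decoder side, keys and values from the encoder output; all matrices are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dec, n_enc, d_model, d_k = 4, 6, 512, 64

dec_hidden = rng.normal(size=(n_dec, d_model))   # output of the decoder's masked attention + Add & Norm
enc_output = rng.normal(size=(n_enc, d_model))   # output of the final encoder stack

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = dec_hidden @ W_q                             # queries from the decoder side
K, V = enc_output @ W_k, enc_output @ W_v        # keys and values from the encoder side

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

Z_cross = softmax(Q @ K.T / np.sqrt(d_k)) @ V    # each generated word attends to the input text
```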

BERT & GPT

Fig 9: BERT vs GPT. BERT: transformer encoder-based, bidirectional. GPT: transformer decoder-based, left-to-right. Image Source: Devlin, et al., 2018

After the Transformer was proposed in 2017, both Google and OpenAI leveraged parts of it to develop the BERT and GPT models, leading to significant achievements in the NLP field.

BERT: Google proposed BERT in 2018; it utilizes only the Transformer encoder architecture and is trained to predict randomly masked words. As it is bidirectional, it allows for better understanding and interpretation of long contexts.

GPT: OpenAI’s GPT uses only Transformer decoder stacks to predict the next word in a sequence, making it a left-to-right model. Although the architecture from GPT-1 to GPT-3 has remained largely the same, performance has improved significantly with larger models and datasets. GPT-3 has demonstrated a capacity for in-context learning, a phenomenal achievement!

In my next post “A Deep Look at Transformer Based Models”, I present detailed explanations of Google’s BERT, OpenAI’s GPT and other Transformer-based models.

Also check out my YouTube Video Transformer based BERT and GPT, relevant to today’s post (in Chinese though :p)!
