Mathematics Behind Transformer Language Models.

Shiv Vignesh · Published in Analytics Vidhya · Dec 29, 2023
Transformer language models built on multi-head attention have emerged as the de facto architecture choice for most NLP tasks. The architecture was first proposed in the 2017 paper “Attention Is All You Need” by Vaswani et al. As the name indicates, the attention mechanism enables the model to focus on different parts of the input sequence simultaneously, enhancing its ability to capture dependencies and relationships across the sequence. This blog intends to demystify attention mechanisms.

Word Embeddings and Positional Embeddings

Preprocessing text inputs for a language model involves converting the string representation of each word into a unique integer representation, a process called tokenization. Transformer language models tokenize input words into smaller units called subword tokens (a short example follows the list below). This approach is employed for several reasons:

  • Handling Out Of Vocabulary (OOV) Words: Languages often contain morphologically related words that share prefixes and suffixes. Breaking words into subword units captures these morphological similarities, letting the model generalize to and represent rare or unseen words from the basic lexical units of the language.
  • Contextual Understanding: Tokenizing different parts of speech separately helps the transformer model understand the context and relationships between words in a sentence.
  • Capturing Relationships, Dependencies and Structural Understanding: Conjunctions and prepositions play a crucial role in defining relationships between words or phrases. Tokenizing these words separately allows the model to understand the connections and dependencies between different parts of the sentence.
  • Efficient Representation: Subword tokenization allows a model to learn a more efficient and compact representation of language. Instead of having a separate token for every word, the model can learn from a smaller vocabulary of subword units, which are more reusable across different words.
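As a concrete illustration, here is a minimal sketch using the Hugging Face transformers library (an assumption on my part; the library and the model name bert-base-uncased are not mentioned in the article), showing how a word is split into reusable subword pieces:

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (a vocabulary of roughly 30k subword units).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A long or rare word is broken into subword pieces; word-internal pieces are
# prefixed with '##'. The exact splits depend on the learned vocabulary.
print(tokenizer.tokenize("unbelievably"))

# The same text as integer IDs, including the [CLS] and [SEP] special tokens.
print(tokenizer.encode("unbelievably"))
```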

The integer representation of a subword can be converted into a one-hot encoded vector of vocabulary size, and vice versa.

Mathematically, word embedding converts the sparse integer/one-hot encoded representation into a dense vector. This is essentially a linear transformation from a higher-dimensional space to a lower-dimensional space (a short sketch follows the list below). The embedding step is vital for several reasons:

  • One-hot encoded vectors are orthogonal to each other, so the dot product of any two distinct word vectors is always zero. This does not help capture linguistic relationships.
  • One-hot encoded vectors are extremely sparse, consisting of a single 1 and 0s everywhere else. They carry no inherent information about the semantic similarity or context between words: all words are equidistant from each other, leaving no measure of similarity or closeness between them.
  • Limited Expressiveness: Such representations lack expressiveness, hindering the model’s ability to understand nuances in language and context.
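A minimal numpy sketch of the point above: an embedding lookup is simply a linear transformation of the one-hot vector by a learned (vocab_size × embedding_dim) matrix. The sizes and the random matrix here are illustrative only.

```python
import numpy as np

vocab_size, embed_dim = 30522, 768              # BERT-Base-style sizes (illustrative)
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))    # learned embedding matrix (random here)

token_id = 101                                  # some token's integer ID
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

dense = one_hot @ E                             # linear transformation: sparse -> dense
assert np.allclose(dense, E[token_id])          # identical to a simple row lookup
```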

While word embeddings capture linguistic relationships, Positional Encoding enables the model to capture relative or absolute position information.
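For reference, the sinusoidal positional encoding from “Attention Is All You Need”, which the variable definitions below refer to:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/dim}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/dim}}\right)
$$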

pos — position of the word in the sequence.

dim — hidden dimension size of the embedding (768 for BERT-Base).

i — index of the sine/cosine dimension pair within the embedding.

Let’s break it down …..

Sine and Cosine functions help the model encode and generate smooth and continuous patterns across different positions in a sequence.

Because the sine function is continuous and periodic, a single frequency would produce repeating values across positions. Varying ‘i’ changes the frequency of each sine/cosine pair, so different dimensions oscillate at different wavelengths; together they generate distinct values for every position and mitigate the problem of repeating values.
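A minimal numpy sketch of the encoding described above (dimensions chosen to mirror BERT-Base; the function and variable names are my own):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Return a (seq_len, dim) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(dim // 2)[None, :]               # (1, dim/2): one frequency per sine/cosine pair
    angle = pos / np.power(10000, 2 * i / dim)     # lower i -> higher frequency
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, dim=768)
print(pe.shape)  # (128, 768)
```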

Attention Module

Fig (1): Multi-Head Attention Illustration. Fig (2): Single Attention Head Computational Graph.

Fig (2) shows the mechanism inside a single attention head. BERT (Bidirectional Encoder Representations from Transformers) uses multi-head attention, which parallelizes the process of applying attention across several heads.

12 is the number of heads used in BERT-Base. Base models have a hidden dimension size of 768, while Large models have a hidden dimension size of 1024.
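For reference, the scaled dot-product attention formula from “Attention Is All You Need”, which the steps below walk through (for BERT-Base, d_k = 768 / 12 = 64, so the scaling factor is √64 = 8):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,
\qquad d_k = \frac{hidden\_dim}{num\_of\_heads}
$$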

  • Query (Q), Key (K), and Value (V) matrices are calculated by multiplying the input embeddings with per-head weight matrices of size (hidden_dim x hidden_dim / num_of_heads).
  • The Query (Q) and Key transpose (Kᵀ) matrices are multiplied to produce a matrix of size (sequence_length x sequence_length). The softmax function converts these raw, unnormalized attention scores into the range [0, 1] and makes them interpretable: higher scores receive higher probabilities, reflecting the relative importance of different elements.
  • The scores are divided by the square root of the per-head dimension (hidden_dim / num_of_heads), a scaling factor used in the Transformer attention mechanism to control the variance of the dot products in self-attention.
  • Attention Concentration: This scaling factor also helps in controlling the magnitude of the attention scores. Lower magnitudes encourage a more focused attention distribution, preventing extremely sharp peaks in attention and allowing the model to attend to multiple parts of the input sequence more evenly.
  • This scaling ensures that the softmax function receives inputs of a reasonable range, which helps prevent gradients from becoming too small or too large during the training process. This stabilization aids in more stable and efficient learning.
  • The output of a single attention head is calculated by multiplying the softmax output with the Value (V) matrix of size (sequence_length x hidden_dim / num_of_heads).
  • Aggregating (concatenating and projecting) the outputs across all the attention heads enables the model to learn from the diverse range of information captured by each head, as sketched in the code below.
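A minimal numpy sketch of one forward pass through multi-head self-attention, following the steps in the list above. The weights here are untrained random matrices and the names are my own, not BERT’s actual parameter names.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)             # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads=12, seed=0):
    """X: (sequence_length, hidden_dim) embeddings -> (sequence_length, hidden_dim) output."""
    seq_len, hidden_dim = X.shape
    d_k = hidden_dim // num_heads                        # per-head size: 768 / 12 = 64
    rng = np.random.default_rng(seed)
    head_outputs = []
    for _ in range(num_heads):
        # Per-head projection matrices of size (hidden_dim, hidden_dim / num_heads).
        W_q, W_k, W_v = (rng.normal(size=(hidden_dim, d_k)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v              # each (seq_len, d_k)
        scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) scaled dot products
        weights = softmax(scores, axis=-1)               # each row sums to 1
        head_outputs.append(weights @ V)                 # (seq_len, d_k)
    concat = np.concatenate(head_outputs, axis=-1)       # (seq_len, hidden_dim)
    W_o = rng.normal(size=(hidden_dim, hidden_dim))      # final output projection
    return concat @ W_o

out = multi_head_self_attention(np.random.default_rng(1).normal(size=(10, 768)))
print(out.shape)  # (10, 768)
```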

The BERT model consists of multiple encoder layers; each encoder layer comprises the following:

Multi-Head-Attention Module > LayerNormalization > Feed Forward Neural Layer > LayerNormalization.
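A schematic numpy sketch of one encoder layer in that ordering, including the residual (skip) connections used in the architecture (simplified stand-ins throughout: ReLU instead of GELU, an identity function in place of the attention module from the sketch above):

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    # LayerNormalization: normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward layer: hidden_dim -> 4*hidden_dim -> hidden_dim.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2        # ReLU here for simplicity

def encoder_layer(x, attention_fn, ffn_params):
    # Multi-Head-Attention > LayerNormalization > Feed Forward > LayerNormalization,
    # each sub-layer wrapped in a residual connection.
    x = layer_norm(x + attention_fn(x))
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x

hidden_dim = 768
rng = np.random.default_rng(0)
ffn_params = (0.02 * rng.normal(size=(hidden_dim, 4 * hidden_dim)), np.zeros(4 * hidden_dim),
              0.02 * rng.normal(size=(4 * hidden_dim, hidden_dim)), np.zeros(hidden_dim))
x = rng.normal(size=(10, hidden_dim))
# Identity attention used as a stand-in; plug in the multi-head attention sketch above.
print(encoder_layer(x, attention_fn=lambda t: t, ffn_params=ffn_params).shape)  # (10, 768)
```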

Depth of representation lets the model capture more complex patterns and hierarchical relationships within the data. The 12 layers in BERT-Base allow for the gradual aggregation and refinement of information across layers, enabling the model to learn increasingly abstract and nuanced representations of the input text.

Each layer adds incremental improvements in the model’s ability to capture hierarchical and contextual information, contributing to more sophisticated representations.

I hope this article helped you gain a deeper understanding of how attention mechanisms let transformer language models learn linguistic relationships and dependencies across the words of the input text, enabling them to solve a wide range of tasks in Natural Language Processing and Natural Language Understanding.
