Large Language Models and Transformers

Nisha Varghese
10 min read · Dec 22, 2023

A Large Language Model (LLM) is a type of deep learning model trained on large amounts of text data to understand and generate human-like language. LLMs are built on the transformer architecture and are pre-trained on massive amounts of data. Examples of LLMs include BERT (Bidirectional Encoder Representations from Transformers) and the GPT (Generative Pre-trained Transformer) family, such as GPT-3.5, the model family behind ChatGPT. LLMs are designed to perform a wide range of natural language processing tasks, including language translation, text summarization, question answering, and much more.

The transformer is an exceptional innovation in the field of deep learning, introduced by Ashish Vaswani et al. (2017). It is among the most influential neural network models and has shown outstanding performance on various NLP tasks, including machine reading comprehension, machine translation and sentence classification. The attention mechanism and parallelization are its prominent features. Consequently, it can capture long-range dependencies without the vanishing or exploding gradient problems that affect earlier methods such as RNNs and LSTMs. The transformer follows an encoder-decoder design, and the original paper, “Attention Is All You Need”, uses a stack of six encoders and six decoders. Self-attention and a feed-forward neural network are the core components of each transformer encoder layer. As shown in the figure, the architecture also includes additional components such as multi-head attention, masked multi-head attention, positional encoding, a linear layer and a softmax layer. The encoder processes each entity in the input embedding, compiles the information from the vectors, and passes the captured information to the decoder, which produces the output.

Architecture of the Transformer [Ashish Vaswani et al. (2017), “Attention Is All You Need”]

Word embedding is the initial process and happens at the bottom of the first encoder layer. The process of converting a word to a vector in a vector space is termed word embedding or vectorization. The vectors in this space capture much of the semantic information of the words through basic algebraic operations. After word embedding, each vector flows in parallel through the two sub-layers of the encoding layer. The six encoders in the transformer are identical in structure but do not share weights. Each encoding layer breaks down into two sub-layers: a self-attention layer (with a padding mask) and a position-wise feed-forward network (FFNN) layer. There are dependencies between the vector flow paths in the self-attention layer. Each of these sub-layers has a residual connection around it, followed by layer normalization (with dropout applied to the sub-layer output), which helps to overcome the vanishing gradient problem in deep networks. The output of each sub-layer is calculated as LayerNorm(x + Sublayer(x)). There are no dependencies in the FFNN layer, so its vector flow paths are executed in parallel. The self-attention layer helps the encoder emphasize specific words, leading to a better encoding of each word. The output from the self-attention layer is fed to the FFNN layer. The decoder has an additional encoder-decoder attention layer between the self-attention and FFNN layers, which helps it emphasize the most relevant parts of the input context.
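
As a rough NumPy sketch of that sub-layer wrapping (the helper names here are made up for illustration; the real model also applies dropout and learned gain and bias parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector to zero mean and unit variance (gain/bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # Residual connection around the sub-layer, then layer normalization:
    # output = LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# Hypothetical usage inside one encoder layer:
# x = sublayer_connection(x, self_attention)
# x = sublayer_connection(x, feed_forward)
```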

The position-wise FFN for a given sequence of vectors h1, h2, h3, …, hn can be calculated with an activation function (Ruibin Xiong et al., 2020), the Rectified Linear Unit (ReLU), using equation (1), where W1, W2, b1 and b2 are the parameters. Layer normalization reduces covariate shift, that is, the gradient dependencies between layers, by normalizing the values in each layer to zero mean and unit variance, which allows the model to converge in fewer training iterations. Each hidden unit hi can be computed by equation (2), where g is a gain parameter and H is the number of hidden units in the layer. The standard deviation σ and mean µ can be calculated by equations (3) and (4) respectively.
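
For reference, the standard forms behind equations (1) to (4), as usually written in the transformer and layer normalization literature (the notation may differ slightly from the original figures), are:

```latex
% (1) Position-wise feed-forward network with ReLU activation
\mathrm{FFN}(h_i) = \max(0,\; h_i W_1 + b_1)\, W_2 + b_2

% (2) Layer normalization of a hidden unit, with gain g
\bar{h}_i = \frac{g_i}{\sigma}\,(h_i - \mu)

% (3) Standard deviation and (4) mean over the H units of the layer
\sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}(h_i - \mu)^2}, \qquad
\mu = \frac{1}{H}\sum_{i=1}^{H} h_i
```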

Self-attention (or intra-attention) plays a key role in the transformer’s understanding mechanism. The attention mechanism gives intensive attention to the most relevant words at each step and provides information specific to each input word. It can capture relevant information even from longer sentences by giving more weight to the relevant words. Attention is a mapping between a query and a set of key-value pairs. It can be seen as mimicking the retrieval of a value vector vi for a query vector q, based on the key vector ki, from a database, as shown in the figure.
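
As a toy illustration of this retrieval view (a minimal sketch, not the transformer’s actual matrix computation; the names and dimensions are invented for the example), attention behaves like a “soft” database lookup in which every value contributes in proportion to how well its key matches the query:

```python
import numpy as np

def soft_lookup(query, keys, values):
    # Score each key against the query (dot-product similarity).
    scores = keys @ query                          # shape: (num_items,)
    # Turn the scores into weights that sum to 1.
    weights = np.exp(scores) / np.exp(scores).sum()
    # Return the weighted combination of the values.
    return weights @ values                        # shape: (d_value,)

# Toy example with 3 key-value pairs of dimension 4
keys = np.random.randn(3, 4)
values = np.random.randn(3, 4)
query = np.random.randn(4)
print(soft_lookup(query, keys, values))
```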

Self Attention Calculation

In encoder-decoder attention, the queries come from the decoder hidden state, while the key-value pairs come from the encoder hidden state. These vector representations retrieve information in the attention layer by computing the similarity between the decoder queries and the encoder key-value pairs. Keys and values have separate matrices, but the matrices have the same dimensions. The output is calculated as a weighted sum of the values, where the weight of each value is determined by a compatibility function of the query with the corresponding key. In other words, the similarity between the query vector and each key vector determines how strongly the corresponding value contributes to the weighted combination.

Query-Key-Value Attention

Roughly, there are six steps in the attention calculation over the encoder’s input vectors. The first step is to create three vectors for each input word vector: a Query vector, a Key vector and a Value vector, which are the abstractions used in the attention calculation. These vectors are created by multiplying each input embedding by three weight matrices that are learned during training. The second step is score calculation: here we find the score of each word of the input context against the first input word. To recapitulate, ‘Thinking’ is the first input word, as represented in the figure (Attention Score Calculation). The score determines how much attention to pay to the rest of the input context while encoding a word at a specific position, and it is calculated as the dot product of the query vector with the key vector of the corresponding word. The self-attention scores for the word at the first position are therefore first score = q1·k1, second score = q1·k2, and so on. In the third step, all scores are divided by 8, because 8 is the square root of the key vector dimension of 64 (the default value in the paper). This division provides more stable gradients. The attention matrix output can be computed by equations (5) and (6).
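
For reference, equations (5) and (6) correspond, in the standard formulation of Vaswani et al. (2017), to the scaled score and the scaled dot-product attention over the full Q, K and V matrices:

```latex
% (5) Scaled score of query q_1 against key k_j
\mathrm{score}(q_1, k_j) = \frac{q_1 \cdot k_j}{\sqrt{d_k}}, \qquad d_k = 64

% (6) Scaled dot-product attention
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V
```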

Then, in step 4, the resulting scores are passed through a softmax. The softmax normalizes the scores so that they are all positive and sum to 1, and it determines the weight given to each word at this position. The word with the highest softmax value gets the highest weight, and sometimes other words that are relevant to the current word also receive useful emphasis. In step 5, each softmax score is multiplied by the corresponding value vector. The intuition behind this multiplication is to drown out irrelevant words and preserve the relevant words from the input context. The final step produces the output by summing up the weighted value vectors for the first position. The resulting vector is passed along to the FFNN.
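
The six steps can be condensed into a short NumPy sketch (a toy illustration with randomly initialized weights, not a trained model; shapes follow the paper’s d_model = 512 and d_k = 64, so the division by 8 appears as sqrt(d_k)):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, d_k=64):
    # Step 1: create query, key and value vectors for every input word.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Step 2: score every word against every other word (dot products q_i . k_j).
    scores = Q @ K.T
    # Step 3: divide by sqrt(d_k) = 8 for more stable gradients.
    scores = scores / np.sqrt(d_k)
    # Step 4: softmax so the scores at each position sum to 1.
    weights = softmax(scores, axis=-1)
    # Steps 5-6: weight the value vectors and sum them for each position.
    return weights @ V

# Toy run: 2 words ("Thinking", "Machines"), embedding size 512, d_k = d_v = 64
X = np.random.randn(2, 512)
Wq, Wk, Wv = (np.random.randn(512, 64) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)   # (2, 64)
```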

The transformer encoder uses a multi-headed self-attention mechanism that learns contextual relations between words or sub-words in the corpus. Multi-head attention improves the performance of the attention layers by expanding the model’s ability to focus on different positions and by giving the attention layer multiple representation subspaces. As represented in the architecture, multi-head attention consists of four parts: linear layers that split the input into heads, scaled dot-product attention, concatenation of the heads, and a final linear layer. Each block receives three inputs: Query (Q), Key (K) and Value (V). These input representations flow through linear (dense) layers and are split into multiple heads. The transformer uses eight attention heads, so each encoder and decoder ends up with eight randomly initialized sets of Q/K/V weight matrices.

After training, each set is used to project the input word vectors (or the vectors from the lower encoder or decoder layer) into a different representation subspace. But the input to the FFNN must be a single matrix instead of eight matrices, so the eight matrices have to be condensed into one. To produce a single matrix, the eight attention-head matrices are first concatenated and then multiplied by a weight matrix Wo, which is trained jointly with the model. The result is a Z matrix that captures information from all the attention heads, and this Z matrix is passed to the FFNN. Multi-head attention is used in the transformer in three different ways. First, in the encoder-decoder attention layers, the queries come from the previous decoder layer while the keys and values come from the encoder output. Second, in the encoder’s self-attention layers, it allows each position to attend to all positions in the previous encoder layer. Finally, it is used in the decoder’s self-attention layers with the same functionality, restricted to earlier positions. Multi-head attention can be calculated using equations (7) and (8).
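
Along the same lines, here is a minimal NumPy sketch of multi-head attention, assuming eight heads of size 64 and a made-up toy input: each head produces its own Z matrix, and the concatenation of the heads is multiplied by Wo to give the single matrix that is passed to the FFNN, matching equations (7) and (8).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=8, d_k=64):
    heads = []
    for h in range(num_heads):
        # Each head has its own projection matrices, indexed per head here.
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        scores = (Q @ K.T) / np.sqrt(d_k)
        heads.append(softmax(scores) @ V)          # one Z matrix per head
    # Concatenate the eight head outputs and project with the trained matrix Wo.
    return np.concatenate(heads, axis=-1) @ Wo     # single Z matrix for the FFNN

# Toy shapes: model dimension 512, 8 heads of size 64, so the concatenation is 512 wide
X = np.random.randn(2, 512)
Wq = [np.random.randn(512, 64) for _ in range(8)]
Wk = [np.random.randn(512, 64) for _ in range(8)]
Wv = [np.random.randn(512, 64) for _ in range(8)]
Wo = np.random.randn(512, 512)
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)   # (2, 512)
```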

Masked multi-head attention in the decoder layer operates on the (right-shifted) output embeddings. In masked attention, the probabilities of masked positions are nullified to prevent them from being attended to, because in the decoding layer an output should depend only on the previous outputs. Masked attention over the Query, Key and Value matrices can be computed using equation (9), where M is a mask matrix of 0’s and −∞’s. Unlike the encoder layer, the masked attention in the decoder is unidirectional.
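
A small NumPy sketch of how such a mask can be built and applied (the scores here are made up for illustration): entries above the diagonal are set to −∞ so that, after the softmax, each position attends only to itself and earlier positions.

```python
import numpy as np

def causal_mask(seq_len):
    # 0 on and below the diagonal, -inf above it (future positions).
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.randn(4, 4)              # raw attention scores for 4 tokens
masked = scores + causal_mask(4)            # add M before the softmax, as in equation (9)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                 # entries above the diagonal are exactly 0
```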

Positional Encoding

Positional encoding (PE) provides information about the order of the tokens in the sequence. To obtain PE, the transformer adds an embedding vector that distinguishes each token position; vectorization places the tokens in a d-dimensional vector space. The word embeddings on their own do not encode the relative positions of words: tokens end up in adjacent positions of the d-dimensional space based on the similarity of their meaning rather than their order in the sentence. PE therefore supplies information about the relative position of the words. The positional encodings can be either fixed or learned; the fixed (sinusoidal) version is chosen because it may extrapolate to longer sentences. Unlike the encoder self-attention layer, the self-attention layer in the decoder is only allowed to attend to earlier positions in the output sentence, which is achieved by masking future positions; this masking is applied before the softmax in the attention calculation. The following figure shows the result of positional encoding with dimension d = 512. Sinusoidal positional embeddings use the sin(x) and cos(x) functions to generate the encoding, which can be calculated with equations (10) and (11).

Positional Encoding dimension — 512
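
Here is a NumPy sketch of the sinusoidal encoding described by equations (10) and (11), with even dimensions using sin and odd dimensions using cos (d = 512 to match the figure above):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                    # token positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]                 # dimension-pair index
    angle = pos / np.power(10000.0, (2 * i) / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                          # equation (10): even dimensions
    pe[:, 1::2] = np.cos(angle)                          # equation (11): odd dimensions
    return pe

pe = positional_encoding(50)
print(pe.shape)   # (50, 512); this matrix is added to the token embeddings
```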

Fully Connected Linear Layer and SoftMax Layer

The linear layer is a simple fully connected neural network that converts the output of the decoder stack into word scores. It projects the decoder output into a much larger vector called the logits vector, which has one cell for each word in the vocabulary, each carrying a score for that word. The role of softmax in the transformer is to take a sequence of arbitrary real numbers, positive or negative, and turn them into positive values that sum to 1. The largest score in the input sequence dominates the output: as the scale of any specific input increases, the softmax assigns that input a value close to 1 and pushes the remaining values in the sequence close to 0.
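
A small numerical example of that behaviour (the logits here are invented for a toy four-word vocabulary): softmax turns arbitrary real scores, including negative ones, into positive values that sum to 1, with the largest score dominating.

```python
import numpy as np

logits = np.array([2.0, 1.0, -1.0, 0.5])   # one score per word in a toy 4-word vocabulary
probs = np.exp(logits - logits.max())      # subtract the max for numerical stability
probs /= probs.sum()
print(np.round(probs, 3))   # ~[0.609 0.224 0.03 0.136]; the largest logit dominates
print(probs.sum())          # 1.0
```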

The transformer shows unprecedented performance over previous RNN-based deep learning models such as LSTMs and GRUs. Its advantages over traditional models are that it is easier to train, it can be pre-trained on unlabeled text, and transfer learning works efficiently, so pre-trained and fine-tuned models can be reused for downstream natural language processing tasks. Nevertheless, the transformer has some shortcomings. The attention mechanism can only deal with fixed-length text strings, so the context needs to be split into several chunks before being fed into the input embedding. This chunking, or context fragmentation, may not respect semantic boundaries; it can split semantically interconnected sentences, so the model cannot preserve all the relevant semantics of the context.

Large Language Models (LLMs) are a class of artificial intelligence models specifically designed to understand and generate human-like language. They have broad applications in natural language processing tasks and can be fine-tuned for specific domains. Transformers, in turn, are a type of neural network architecture that has proven highly effective in various machine learning tasks, including natural language processing. They employ a self-attention mechanism that allows them to weigh the importance of different parts of the input data when making predictions or generating output. Transformers have become the backbone of many modern LLMs because they capture long-range dependencies in data efficiently. The transformer architecture has significantly improved the performance of language models, enabling the development of large-scale models that excel at understanding and generating human-like text.

Nisha Varghese

I am Dr. Nisha Varghese, Assistant Professor at Christ University, Bangalore. Passionate about Natural Language Processing research.