Fine-tuning Large Language Models series: Internal mechanism of LLMs

Dhanoop Karunakaran
Intro to Artificial Intelligence
9 min read · Jul 28, 2023
Source: [14]

My intention is to post multiple articles about fine-tuning LLMs. In this article, I discuss the underlying technologies behind LLMs.

Large Language Model (LLM)

An LLM is a deep learning model that is trained on a large corpus of text. It can recognise, summarize, translate, predict, and generate text and other forms of content[1]. There are cost and computation challenges in training LLMs: models like GPT require huge processing and memory power. One approach is to use smaller pre-trained LLMs such as LLaMA and fine-tune them for specific domain tasks. This way, we can reduce the cost and computation requirements. Pre-training of LLMs is done in an unsupervised setting, whereas fine-tuning is a supervised learning technique.

Transformer

Before we get into the fine-tuning part, I would like to discuss the Transformer. It is the deep learning architecture that powers today's LLMs. The transformer was first introduced in 2017 by Google in the “Attention is all you need” paper[5]. It outperforms many architectures, such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent units (GRUs)[5], which were dominant in natural language processing (NLP) before the transformer’s arrival[6]. The popularity of the transformer is due to two main factors[4]:

  1. Before the transformer, recurrent networks such as LSTMs were the dominant models, and they required large labelled datasets, which were expensive and time-consuming to produce.
  2. Unlike previous models, which processed tokens sequentially, transformers are capable of parallel processing, which makes them faster.
Encoder-decoder architecture. Source:[2]

The original transformer proposed in Google’s paper is based on an encoder-decoder architecture. The encoder converts the input sentence in one language into a latent space, which can be thought of as an imaginary intermediate language in this case[3]. The decoder then takes this latent space and translates it into the output format. Initially, the encoder and decoder are not familiar with the latent space for translation; they learn this space during the training process[3].

The mechanism of Transformers

The original transformer architecture proposed in the “Attention is all you need” paper.

We start by tokenizing the words. Tokenization simply assigns a numeric representation to each input word, where each number corresponds to a position in a dictionary of all possible words the model can work with.

Tokenization. Source:[7]
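To make this concrete, here is a minimal sketch of word-level tokenization with a hypothetical toy vocabulary. Real LLMs use subword tokenizers (BPE, WordPiece, SentencePiece), but the principle of mapping text to positions in a dictionary is the same.

```python
# A minimal word-level tokenization sketch with a hypothetical toy vocabulary.
vocab = {"<unk>": 0, "the": 1, "animal": 2, "didn't": 3, "cross": 4,
         "street": 5, "because": 6, "it": 7, "was": 8, "too": 9, "tired": 10}

def tokenize(sentence):
    # Look each lower-cased word up in the dictionary; unknown words map to <unk>.
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

print(tokenize("The animal didn't cross the street because it was too tired"))
# [1, 2, 3, 4, 1, 5, 6, 7, 8, 9, 10]
```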

This is our input represented as tokens. The tokenized words are now passed to the embedding layer of the encoder, which produces the embeddings, i.e. the vector representations of the words.

Two-dimensional view of word embeddings. Source: [8]

The embeddings are a high-dimensional representation of words/sentences/text in which similar content sits closer together. The output of the embedding layer is then combined with positional encoding.

Combination of positional encoding and word embedding vectors. Source: [10]

Word order is essential to any language, as it defines the grammar and thus the semantics of the text[9]. In the RNN/LSTM setup, we pass the sentence word by word sequentially, and that ordering implicitly provides positional information in those architectures. A transformer, however, has no recurrence mechanism: every word is passed through the model simultaneously, so the model has no inherent notion of where each word sits in the sentence. Adding the positional encoding gives the transformer an understanding of the position of each word.
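As a rough illustration, the sketch below implements the sinusoidal positional encoding from the original paper and adds it to a random stand-in for the embedding matrix; the exact scheme varies between models, but the principle of adding position information to the embeddings is the same.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encoding from "Attention is all you need":
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions get cos
    return pe

# The encoding is simply added, element-wise, to the word embeddings.
seq_len, d_model = 10, 16
embeddings = np.random.randn(seq_len, d_model)          # stand-in for learned embeddings
encoder_input = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```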

The next stage of the encoder is the self-attention layer. It allows the model to learn the contextual dependencies between the words; in other words, it captures the relative importance of each word with respect to every other word in the sentence. For instance, suppose we want to translate the sentence “The animal didn’t cross the street because it was too tired”. What does the pronoun “it” refer to, the street or the animal? This is a simple question for a human to answer, but it can be difficult for an algorithm to determine which referent is correct[11].

The relative importance of the word “it” with respect to all other words in the sentence. Source: [11]

Self-attention allows the model to associate the word “it” with the word “animal” as the model processes the sentence[11]. In the encoder, we actually have multi-head self-attention, meaning multiple self-attention mechanisms running in parallel. Each self-attention head learns different aspects of language: one head may learn the relationships between people and entities, while another may focus on the activity described in the sentence. Theoretically, we can have hundreds of attention heads.
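To give a feel for the mechanism, here is a minimal single-head sketch of the scaled dot-product attention used inside each head, with random matrices standing in for the learned query/key/value projections; multi-head attention simply runs several of these in parallel and concatenates the results.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # pairwise relevance of tokens
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V, weights

# Toy example: 5 tokens, 8-dimensional projections (random stand-ins for the
# learned query, key, and value projections of the token embeddings).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print(attention_weights.shape)                   # (5, 5): one weight per token pair
```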

Encoder part of the transformer. Source:[13]

The output of this layer is the input to the feed-forward network. A feed-forward neural network processes each attention vector, transforming it into a format that the next layer in the transformer can work with[12]. The output of this network is a vector of logits, one for every token in the tokenized dictionary. These logits are passed to the final softmax layer, which outputs a probability score for each word; the single word with the highest probability is the predicted word.
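Here is a minimal numerical sketch of this step, with random stand-ins for the learned weights: a position-wise feed-forward network transforms each token's attention output, the result is projected to one logit per vocabulary entry, and a softmax turns the logits into probabilities.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied to each token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy shapes: 5 tokens, model width 8, hidden width 32, vocabulary of 100 entries.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff, vocab_size = 5, 8, 32, 100
attn_output = rng.normal(size=(seq_len, d_model))          # output of the attention layer
ffn_out = feed_forward(attn_output,
                       rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                       rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
logits = ffn_out @ rng.normal(size=(d_model, vocab_size))  # one logit per dictionary entry
probs = softmax(logits)                                    # probability score per word
print(probs.shape, probs[0].argmax())                      # (5, 100) and the predicted word id
```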

The output of the encoder serves as the features for the decoder, since the encoder extracts the features of the input sentence. Source: [15]

The output of the encoder is passed to the middle section of the decoder, where it feeds the decoder's attention layer. Put another way, the encoder provides the features the decoder needs to perform its task. For instance, if we need to translate a sentence from English to French, the output of the encoder is the set of features extracted from the input, and the decoder uses these features to produce the translated sentence.

The output of the encoder is passed to the decoder. The output of the encoder acts as the features for the decoder. Source: [15]

A start-of-sequence (<sos>) token is given to the decoder as its first input, and the decoder then outputs the next token (the first French word, in the translation example) by utilizing the features provided by the encoder. The predicted token is fed back into the decoder as input to predict the subsequent token. This process continues until the task is complete; in the translation example, next-token prediction repeats until the sentence has been fully translated.
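Sketched as code, the loop looks roughly like the following; `decoder_step` is a hypothetical stand-in for one full pass through the decoder that returns the id of the most probable next token.

```python
def greedy_decode(encoder_features, decoder_step, sos_id, eos_id, max_len=50):
    """Auto-regressive (greedy) decoding sketch."""
    generated = [sos_id]                                     # start with the <sos> token
    for _ in range(max_len):
        next_id = decoder_step(encoder_features, generated)  # predict the next token
        if next_id == eos_id:                                # stop at end-of-sequence
            break
        generated.append(next_id)                            # feed the prediction back in
    return generated[1:]                                     # drop <sos>
```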

In a nutshell, the encoder encodes the input into a deep contextual representation of its structure and meaning, producing one vector per token. The decoder, on the other hand, accepts input tokens and generates the next token by utilizing the encoder's features. Today's LLMs are powered by the transformer architecture. In the next section, we discuss the different types of transformer architectures that power various LLMs.

Types of transformers

There are three types of transformer architecture: encoder-only, decoder-only, and encoder-decoder.

Encoder-only architecture

As the name suggests, this type utilizes only the encoder part of the transformer architecture originally proposed by Google. These are also called auto-encoding models. Their pre-training is done with masked language modelling (MLM).

Example of masked language modelling. Source: [16]

In MLM, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence. This type of architecture is mainly used for sentiment analysis, named entity recognition, and word classification. BERT and RoBERTa are example models with encoder-only architecture.
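As a rough sketch of how MLM training examples can be built (real BERT-style masking has additional rules, such as occasionally keeping or randomizing the selected tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with [MASK]; the targets are the original tokens."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)        # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)       # no prediction needed at this position
    return masked, targets

masked_input, targets = mask_tokens("the teacher teaches the student".split())
print(masked_input, targets)
```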

Decoder-only architecture

Decoder-only models are called auto-regressive models, and they utilize only the decoder part of the originally proposed transformer architecture. They are pre-trained using causal language modelling (CLM).

An example of causal language modelling (CLM). Source: [16]

The training objective of CLM is to predict the next token based on the previous sequence of tokens; the predicted token is then fed back as input to produce the following token. The main use cases for this type of model are text generation, code generation, and so on. Popular LLMs of this kind are GPT and BLOOM.
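A minimal sketch of how CLM training pairs are formed: the target sequence is just the input shifted by one position, so each position learns to predict the token that follows it.

```python
tokens = ["the", "teacher", "teaches", "the", "student"]
inputs, targets = tokens[:-1], tokens[1:]   # targets are the inputs shifted left by one
for i, target in enumerate(targets, start=1):
    print(f"given {inputs[:i]} -> predict {target!r}")
# given ['the'] -> predict 'teacher'
# given ['the', 'teacher'] -> predict 'teaches'
# ...
```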

Encoder-decoder architecture

These are sequence-to-sequence models, and they are applicable to language translation, text summarization, and question answering. As the name suggests, they utilize both the encoder and decoder parts of Google's proposed transformer architecture.

Span corruption example. Source: [17]

The pre-training of these models is based on span corruption. In this approach, random spans of input tokens are replaced with special tokens called sentinel tokens. Sentinel tokens are not actual words, but they are added to the vocabulary so that the model can learn to predict them. The decoder's task is then to reconstruct the masked tokens auto-regressively: its output is a sentinel token followed by the predicted masked tokens. T5 and BART are two well-known LLMs in this category.
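A hand-built illustration of what a span-corruption training pair can look like (the T5-style sentinel names <extra_id_0>, <extra_id_1>, ... are used here for concreteness):

```python
original = "the teacher teaches the student in the classroom".split()

# Suppose the spans "teaches the" and "classroom" are randomly chosen for corruption.
encoder_input  = ["the", "teacher", "<extra_id_0>", "student", "in", "the", "<extra_id_1>"]
decoder_target = ["<extra_id_0>", "teaches", "the", "<extra_id_1>", "classroom"]

print("encoder input :", " ".join(encoder_input))
print("decoder target:", " ".join(decoder_target))
```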

Fine-tuning

Training LLMs is challenging in terms of computation and cost. For instance, a model with 1 billion parameters needs roughly 4 GB of GPU RAM just to store its weights in 32-bit precision; imagine how much would be required for a model like GPT-3 with 175 billion parameters. Pre-training such large LLMs is not affordable for an individual, as it usually requires distributed training. The standard way to tackle this issue in a specific domain is to fine-tune smaller pre-trained LLMs on the specific domain task.
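A back-of-the-envelope check of the numbers above: storing the weights alone in 32-bit floats costs 4 bytes per parameter, before counting the gradients, optimiser states, and activations needed during training.

```python
def weight_memory_gb(num_params, bytes_per_param=4):   # 4 bytes per 32-bit float
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(1e9))     # 4.0   -> ~4 GB for a 1B-parameter model's weights
print(weight_memory_gb(175e9))   # 700.0 -> ~700 GB for GPT-3-scale weights
```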

Fine-tuning variants

There are mainly three types of fine-tuning approaches: full fine-tuning, parameter-efficient approach, and fine-tuning with RLHF.

Full fine-tuning

In this approach, we update all the weights of the model for a specific task. If we use smaller pre-trained LLMs, this is relatively easier than fine-tuning larger LLMs. However, it still has substantial computation requirements.

Parameter-efficient fine-tuning

In this approach, we update only a small set of parameters that are added to the pre-trained model. The pre-trained LLM's weights are kept frozen; instead, we add a small number of parameters/layers to the original LLM and train only those.
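A minimal PyTorch sketch of the idea, with a hypothetical tiny stand-in for the pre-trained model: the original weights are frozen and only a small added head is handed to the optimiser.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny "pre-trained" backbone and a small task head.
pretrained_model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64))
task_head = nn.Linear(64, 2)                 # the newly added, trainable parameters

for param in pretrained_model.parameters():
    param.requires_grad = False              # keep the pre-trained weights frozen

# Only the new parameters are optimised during fine-tuning.
optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in task_head.parameters())
total = trainable + sum(p.numel() for p in pretrained_model.parameters())
print(f"training {trainable} of {total} parameters")
```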

Fine-tuning with RLHF

Reinforcement learning from human feedback (RLHF) is a machine-learning technique that uses human feedback to train a reward model. The reward model is then used to optimize the agent's policy with reinforcement learning, aligning the LLM with human values.

Other articles in the finetuning LLMs series

  1. Parameter Efficient Finetuning (PEFT) of LLM: it specifically covers a PEFT approach called LoRA.

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.

References

  1. https://machinelearningmastery.com/what-are-large-language-models/
  2. https://pradeep-dhote9.medium.com/seq2seq-encoder-decoder-lstm-model-1a1c9a43bbac
  3. https://www.researchgate.net/publication/336267803_Comprehensive_Review_of_Artificial_Neural_Network_Applications_to_Pattern_Recognition,
  4. https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.
  6. https://www.researchgate.net/publication/336267803_Comprehensive_Review_of_Artificial_Neural_Network_Applications_to_Pattern_Recognition,
  7. https://towardsdatascience.com/why-are-there-so-many-tokenization-methods-for-transformers-a340e493b3a8
  8. https://www.ruder.io/word-embeddings-1/
  9. https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
  10. https://kikaben.com/transformers-positional-encoding/
  11. https://jalammar.github.io/illustrated-transformer/
  12. https://builtin.com/artificial-intelligence/transformer-neural-network
  13. https://theaisummer.com/transformer/
  14. https://blog.opendream.ai/what-are-large-language-models-llm
  15. https://kikaben.com/transformers-encoder-decoder/
  16. https://www.holisticai.com/blog/from-transformer-architecture-to-prompt-engineering
  17. https://medium.com/analytics-vidhya/t5-a-detailed-explanation-a0ac9bc53e51
