Ultimate guide to fine tuning LLMs

Shine Mohammed
8 min read · Jun 17, 2024


Fine-tuning of generative AI models involves adapting pre-trained models to perform specific tasks or behaviors by updating their parameters with new data. This process improves the model’s performance and accuracy when generating content for the targeted application. Fine-tuning lets developers take the general knowledge and skills of a large, powerful model and apply them to a specific field or objective. For instance, a pre-trained model that can generate natural language text can be fine-tuned to write poems, summaries, or jokes. A particularly popular family of techniques for adapting LLMs to a business use case while updating only a small fraction of their weights is called Parameter Efficient Fine Tuning (PEFT), which most of this guide focuses on.

Fine tuning vs RLHF

Fine-tuning large language models (LLMs) and reinforcement learning from human feedback (RLHF) are two distinct methods for adapting pre-trained LLMs to specific tasks or domains.

Fine-tuning updates the model’s parameters using a labeled dataset to align it with the desired task or domain. This can be done with techniques such as instruction tuning, where the model is trained on labeled instructions and responses, or supervised fine-tuning (SFT), where the model is trained on a labeled dataset to predict specific outputs. The goal of fine-tuning is to improve the model’s performance and accuracy for a particular task or domain.

RLHF, on the other hand, uses human feedback to align LLMs. A reward model is first trained on human preference data to score generated outputs, and the LLM (the policy) is then optimized with reinforcement learning so that it produces outputs the reward model scores highly. This approach is particularly useful for tasks that require complex decision-making or a nuanced understanding of human preferences, such as content moderation or conversational AI, and it has been used to develop powerful LLMs like GPT-3.5 and Claude.

The choice between fine-tuning and RLHF depends on the specific use case and its goals and requirements. Fine-tuning is ideal for tasks that require a high degree of accuracy and precision, while RLHF is better suited to tasks that hinge on complex decisions or human preferences. In some cases a hybrid approach that combines the strengths of both methods works best: human feedback is used to kickstart the fine-tuning process, and the model trained on that feedback is then used to generate feedback for further training.
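
As a concrete reference point for the supervised fine-tuning (SFT) side of this comparison, here is a minimal sketch of full-parameter SFT with Hugging Face transformers. The model name, the data file, its field names, and the hyperparameters are illustrative assumptions, not recommendations.

```python
# Minimal supervised fine-tuning (SFT) sketch with Hugging Face transformers.
# Model name, data file, field names, and hyperparameters are illustrative.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

model_name = "gpt2"  # any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A labeled instruction/response dataset, flattened into plain text.
dataset = load_dataset("json", data_files="sft_data.jsonl")["train"]

def tokenize(example):
    text = example["instruction"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=tokenized,
    # Causal LM collator: labels are the input tokens shifted by one.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```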

Prevailing Fine tuning landscape

Different kinds of model fine tuning

  • Retraining all parameters: This method retrains all the parameters of a pre-trained model on a new dataset. It can be computationally expensive and may lead to overfitting if the new dataset is small. It is useful when the new dataset is large and diverse and the model needs to learn genuinely new patterns and relationships.
  • Transfer learning: This method uses a pre-trained model as a starting point and fine-tunes only the top layers, or a subset of the model’s parameters, on a new dataset (see the sketch after this list). It is useful when the new dataset is related to the original training data, so the model can leverage the knowledge it has already learned. It is computationally efficient and can outperform retraining all parameters.
  • Parameter efficient fine tuning (PEFT): This method fine-tunes a pre-trained model by updating only a small subset of its parameters, or a small set of newly added parameters, rather than all of them. It is useful when the new dataset is small or the model only needs to learn specific patterns or relationships. It is computationally efficient and often matches the performance of retraining all parameters.
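
To make the transfer-learning bullet concrete, here is a minimal PyTorch sketch that freezes everything except the last encoder layer and the classification head; the model name and the number of unfrozen layers are illustrative assumptions.

```python
# Sketch: transfer learning by freezing all but the top layers.
# Model name and layer choices are illustrative, not prescriptive.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze every parameter, then unfreeze only the last encoder layer
# and the classifier head.
for param in model.parameters():
    param.requires_grad = False
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable:,} of {total:,} parameters")
```
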
General workflow of LM fine tuning

Below, we discuss in detail the different ways of doing PEFT.

Parameter Efficient Fine tuning techniques

Adapters

Adapters are small sub-modules inserted into each encoder block to adapt the model to a custom dataset. During fine-tuning, only the weights of these adapters are updated, while everything else stays frozen. The most straightforward approach is to insert an adapter module after each encoder layer and add a classifier layer on top of the pre-trained model. The main objective of this type of fine-tuning is to reduce complexity and computational expense. Internally, an adapter is a bottleneck: a down-projection, a non-linear activation, and an up-projection back to the hidden size, usually wrapped in a residual connection.

How adapter layers are inserted into an encoder layer and tuned during the fine-tuning process
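
To make the bottleneck structure concrete, here is a minimal PyTorch sketch of an adapter module; the hidden and bottleneck sizes are illustrative assumptions rather than values from a specific paper.

```python
# Minimal bottleneck adapter sketch (hidden and bottleneck sizes are illustrative).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # down-projection
        self.act = nn.ReLU()                            # non-linearity
        self.up = nn.Linear(bottleneck, hidden_size)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter only learns a small correction.
        return x + self.up(self.act(self.down(x)))

# During fine-tuning, only adapter (and classifier) parameters are trained;
# the pre-trained transformer weights stay frozen.
adapter = Adapter()
print(sum(p.numel() for p in adapter.parameters()))  # ~100k params per adapter
```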

Adapters also have drawbacks: they can overfit when the training data is limited, may generalize poorly to other tasks or domains, depend heavily on the quality of the pre-trained model, and add a small amount of extra computation at inference time because the adapter layers sit in the forward pass. They can also be less flexible for complex tasks or dynamic environments, and in some settings they suffer from catastrophic forgetting and reduced robustness. The LoRA and QLoRA methods discussed below were designed to address several of these issues.

LoRA

Image shows the matrix addition of the weight matrix W with the update matrix ΔW (which is decomposed into A and B)

LoRA (Low-Rank Adaptation) is a technique designed to fine-tune large models efficiently by updating only a small part of the model’s weights, specifically targeting those that have the most significant impact on the task at hand. This approach contrasts with traditional fine-tuning methods, where a large portion of the model’s weights might be updated, requiring substantial computational power and time.

In simple terms, suppose that during fine-tuning we have a weight matrix W of size 512×512 that needs to be updated to align with our specific use case. With a full weight update we would have to compute and store gradients for a matrix the same size as the original weight matrix W. Instead, we learn an update matrix ΔW that is added to the original: W′ = W + ΔW. Extensive research has shown that these ΔW matrices tend to have low rank, i.e. they contain largely linearly dependent directions that add little independent information to the overall matrix. So we can represent the update as the product of two smaller matrices: A of shape 512×r and B of shape r×512, giving ΔW = A·B, where r is the chosen rank of ΔW. r is a hyperparameter of the method, and only the weights of A and B are updated during fine-tuning. For r = 8, that means training 2 × 512 × 8 = 8,192 parameters instead of 512 × 512 = 262,144.
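
Here is a minimal PyTorch sketch of a LoRA-style linear layer following the shapes above; the rank, scaling, and initialization are illustrative simplifications of the method, and the code uses the standard PyTorch out×in weight layout, so the roles of A and B are transposed relative to the prose.

```python
# Minimal LoRA-style linear layer sketch (rank and scaling are illustrative).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features=512, out_features=512, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight W.
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Trainable low-rank factors: deltaW = B @ A, (out x r) @ (r x in).
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * update

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 512 * 8 = 8,192 trainable parameters
```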

QLoRA

QLoRA is a fine-tuning technique that combines the benefits of LoRA and quantization to improve the efficiency and performance of large language models (LLMs). Here are the key steps involved in QLoRA fine-tuning:

  1. Quantization: QLoRA quantizes the base model’s weights from 16- or 32-bit floating point down to a 4-bit data type (NF4, 4-bit NormalFloat). This drastically reduces the memory requirements of the model.
  2. LoRA: QLoRA uses LoRA to fine-tune the model. LoRA adds a small number of trainable parameters to the model while keeping the original model parameters frozen. This allows the model to adapt to the specific task or dataset without requiring full retraining.
  3. Double Quantization: QLoRA uses double quantization to further reduce the precision of the model’s weights. This involves quantizing the quantization constants used during the quantization process, which helps to save additional memory.
  4. Paged Optimizers: QLoRA uses paged optimizers, which rely on unified memory to page optimizer states between GPU and CPU memory as needed. This prevents out-of-memory failures during the memory spikes that occur with long sequences or large batches.
Otherwise the method follows the standard LoRA fine-tuning process. Before LoRA training begins, the base model’s weights are quantized to 4-bit precision for memory efficiency (and dequantized on the fly for each forward pass), while the LoRA update weights themselves are not quantized and remain at higher precision.
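
Putting these pieces together, here is a minimal sketch of a QLoRA-style setup using the Hugging Face transformers, peft, and bitsandbytes stack; the model name, target modules, and hyperparameters are illustrative assumptions, and the exact arguments can vary across library versions.

```python
# QLoRA-style setup sketch: 4-bit NF4 quantization + LoRA adapters + paged optimizer.
# Model name, target modules, and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections get LoRA updates
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# A paged optimizer keeps optimizer-state memory spikes from causing OOM errors.
training_args = TrainingArguments(output_dir="qlora-out", optim="paged_adamw_8bit")
```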

Prefix Fine tuning

Prefix fine-tuning is a method for adapting pre-trained language models to specific tasks or domains by prepending trainable, task-specific prefix vectors (virtual tokens) to the input of each layer’s attention mechanism. During training, only these prefix parameters are updated, while the rest of the model’s parameters are frozen. The prefix learns task-relevant patterns that steer the final output, allowing the model to adapt to the target task without modifying the base model. This approach is computationally efficient, requiring far fewer parameters to be updated than traditional fine-tuning, and can achieve performance comparable to full fine-tuning with a much smaller parameter count.

In essence, we prepend a learned piece of guidance as a prefix that tells the model how to treat the input sequence, and only this prefix is fine-tuned to get the best result on the target task.

Schematic of how prefix tuning works
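
With the Hugging Face peft library this takes only a few lines; the model name and the number of virtual tokens below are illustrative choices.

```python
# Prefix-tuning sketch with the peft library (model and settings are illustrative).
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

peft_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,   # length of the trainable prefix
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Only the prefix parameters are trainable; the base model stays frozen.
```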

IA3

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) is a fine-tuning method for large language models (LLMs) that, rather than adding adapter layers or low-rank weight updates, learns small rescaling vectors. These vectors element-wise multiply, and thereby inhibit or amplify, the keys and values in the attention blocks and the intermediate activations of the feed-forward layers. The pre-trained weights stay frozen, and only these vectors are trained, so the number of trainable parameters is tiny, typically even smaller than with LoRA. Because the learned vectors can be merged into the base weights after training, IA3 adds no extra inference latency.

IA3 is thus an extremely parameter-efficient fine-tuning method: it can match or approach the performance of other PEFT techniques while training far fewer parameters. It is particularly useful when a model has to be adapted to many tasks or domains cheaply, since each adaptation only requires storing a small set of rescaling vectors.
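
As with the other methods, the peft library exposes IA3 through a config object; the model name and the target/feed-forward module names below are illustrative and depend on the architecture being tuned.

```python
# IA3 sketch with the peft library (model and module names are illustrative).
from transformers import AutoModelForCausalLM
from peft import IA3Config, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

peft_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "mlp.c_proj"],  # modules whose activations are rescaled
    feedforward_modules=["mlp.c_proj"],       # which of those are feed-forward layers
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Only the learned rescaling vectors are trainable; base weights stay frozen.
```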

Conclusion

As new fine-tuning techniques keep appearing, staying up to date is harder than ever. Researching the techniques available when starting a project, understanding its business use case, and deciding which technique fits best will keep project life cycles more streamlined and less cluttered.

References

https://ljvmiranda921.github.io/notebook/2023/05/01/peft/

https://www.theaidream.com/post/fine-tuning-large-language-models-llms-using-peft

https://www.theaidream.com/post/reinforcement-learning-from-human-feedback

https://huggingface.co/docs/peft/main/en/conceptual_guides/lora

https://huggingface.co/docs/peft/main/en/conceptual_guides/ia3

https://huggingface.co/docs/peft/main/en/conceptual_guides/adapter
