Fine-tuning a large language model on Kaggle Notebooks for solving real-world tasks — part 1

Luca Massaron
5 min read · Oct 23, 2023

--

Exploring, with simple words and concepts, some theory and ideas on adapting LLMs to your needs

credits: DALL·E 3

Thanks to their in-context learning, generative large language models (LLMs) are a feasible solution if you want a model to tackle your specific problem: you can provide the LLM with a few examples of the target task directly in the input prompt, even though the model was never explicitly trained on that task. However, this can prove unsatisfying, because the LLM may fail to grasp the nuances of complex problems, and you cannot fit many examples in a prompt. Moreover, by hosting a model on your own premises, you keep control of your data instead of handing it over to external providers. The solution is fine-tuning a local LLM, because fine-tuning both changes the behavior and increases the knowledge of the model of your choice.

Fine-tuning requires more high-quality data, more computation, and some effort, because you must prompt and code a solution. Still, it rewards you with LLMs that are less prone to hallucinate, can be hosted on your own servers or even your own computer, and are better suited to the tasks you want the model to execute at its best. In these two short articles, I will present all the theory basics and tools to fine-tune a model for a specific problem in a Kaggle notebook, easily accessible to everyone. The theory part owes a lot to the writings of Sebastian Raschka in his community blog posts on lightning.ai, where he systematically explored the fine-tuning methods for language models.

Since we’ll be working with a Llama 2 model taken from Kaggle Models, you must visit the model’s page on Kaggle (https://www.kaggle.com/models/metaresearch/llama-2) and follow the instructions there to request access to the model from Meta (you can use this page: https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

Fine-tuning language models already has a history: it has long been applied both to generative models like GPT, based on decoder architectures, and to embedding-centric models like BERT, which rely on encoder architectures (the E in BERT stands for encoder). The classic recipe consists of keeping a larger or smaller part of the language model frozen (that is, excluded from weight updates) and attaching either a machine learning classifier (typically a logistic regression model, but it can be a support vector classifier, a random forest, or an XGBoost model) or some additional neural layers to the end of the model. The more of the original language model you leave unfrozen, the more of it, especially the embeddings, will adapt to your problem (and the better your evaluation scores will be), but that requires a lot of computational power if the model is large (and LLMs are incredibly huge in terms of weights and layers) and also a lot of data, because you need plenty of evidence to correctly update many parameters. Suppose instead that you have only a few labeled examples of your task, which is extremely common in business applications, and not many resources. In that case, the right solution is to keep most of the original model frozen and update only the parameters of its terminal classification part.
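To make the idea concrete, here is a minimal sketch of this frozen-backbone recipe, using a BERT encoder from Hugging Face transformers and a scikit-learn logistic regression as the terminal classifier; the model name, texts, and labels are only illustrative placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Load a pretrained encoder and keep it completely frozen (no weight updates).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()

# Toy labeled examples standing in for your real task data.
texts = ["great product, works perfectly", "arrived broken, very disappointed"]
labels = [1, 0]  # 1 = positive, 0 = negative

# Turn each text into a fixed-size feature vector using the [CLS] embedding.
with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    features = encoder(**batch).last_hidden_state[:, 0, :].numpy()

# Only this small classifier on top of the frozen encoder is actually trained.
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
print(classifier.predict(features))
```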

Therefore, the limitations grow when it comes to large language models, because you cannot easily muster the computational power and the volume of data needed to update their layers. Fortunately, various ingenious approaches for fine-tuning LLMs have been devised in recent years, ensuring excellent modeling results with minimal parameter training. These techniques are commonly known as parameter-efficient fine-tuning methods, or PEFT. PEFT methods fall into three families: prompt modification, adapter methods, and parameter updates:

  • Prompt modification involves altering the input prompt to attain the desired results. It can be achieved by hard changes, when we directly modify the prompt by trial and error, or by soft changes, when we rely on backpropagation to figure out how to enhance the embeddings of the existing prompt by learning an additional tensor of free embeddings (see the sketch after this list). These methods intervene at the beginning of the LLM’s architecture.
  • Adapter methods involve adding, inside the architecture of the LLM, a few adaptable layers that are updated by backpropagation and modify the model’s behavior. These methods intervene in the middle of the LLM’s architecture.
  • Parameter updates may involve a specific subset of the network weights or the network as a whole, through a low-rank adaptation of the weights (https://arxiv.org/abs/2106.09685), a method that “can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by three times”.
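As an example of the first family, here is a minimal sketch of soft prompt modification (prompt tuning) with the Hugging Face peft library; the small GPT-2 backbone, the number of virtual tokens, and the initialization text are arbitrary choices for illustration, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

model_name = "gpt2"  # small placeholder backbone so the sketch runs anywhere
model = AutoModelForCausalLM.from_pretrained(model_name)

# A small tensor of trainable "virtual token" embeddings is learned and
# prepended to every prompt, while the LLM weights themselves stay frozen.
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the sentiment of this review:",
    num_virtual_tokens=8,              # size of the learned soft prompt
    tokenizer_name_or_path=model_name,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the virtual-token embeddings are trainable
```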

In particular, parameter update by low-rank adaptation (LoRA) is a way of hacking the regular backpropagation updates: the update matrix is split into two smaller matrices that, multiplied together, give back an approximation of the original update matrix. This is similar to matrix decomposition (such as SVD), where a reduction is obtained by accepting some loss of the information contained in the original matrix. In our case, when training LLMs for specific tasks, some loss of their original complexity is actually permissible as long as the LLM gains expertise on our task of interest.

Therefore, if the update matrix for a layer is 1,024 by 1,024, which equates to 1,048,576 numeric values, decomposing it into two matrices sized 1,024 by 16 and 16 by 1,024, which multiplied together return something close to the original matrix, decreases the numeric values to be handled to 32,768.
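A quick back-of-the-envelope check of these numbers, with the weight update factored into two low-rank matrices (plain NumPy, purely illustrative):

```python
import numpy as np

d, r = 1024, 16
full_update = d * d                # 1,048,576 values in the dense update matrix

# LoRA-style factors: B starts at zero and A with small random values,
# so the initial update B @ A is zero and training starts from the base model.
A = np.random.randn(r, d) * 0.01   # 16 x 1,024
B = np.zeros((d, r))               # 1,024 x 16
low_rank_update = A.size + B.size  # 32,768 values in the two factors

delta_W = B @ A                    # 1,024 x 1,024, same shape as the dense update
print(full_update, low_rank_update, full_update // low_rank_update)  # 32x fewer values
```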

This matrix decomposition is learned through the backpropagation of the neural network, and the hyperparameter r lets us designate the rank of the low-rank matrices used for adaptation. A smaller r corresponds to simpler low-rank matrices, reducing the number of parameters to learn during adaptation. Consequently, this can accelerate training and potentially lower computational demands, but it also limits how much the model can adapt. In LoRA, selecting the value of r is therefore a trade-off between model complexity, adaptation capability, and the potential for underfitting or overfitting, and conducting experiments with various r values is crucial to strike the right balance.

Moreover, after finishing fine-tuning, we can keep just the low-rank matrices we used for the updates, which do not weigh much, and reuse them by multiplying them into the original LLM, without any need to update the weights of the model itself directly. At this point, we can save further memory and disk space by reducing the size of the LLM to which LoRA is applied. The way to do so is quantizing the original LLM, reducing its precision to 4-bit. It is just like compressing a file: the LLM is kept compressed (i.e., quantized) and only de-quantized when it is necessary to compute the forward pass and the LoRA updates. In this way, you can tune large language models on a single GPU while preserving the performance of the LLM after fine-tuning. This approach is called QLoRA, based on the work by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer (see https://arxiv.org/abs/2305.14314). It is also available as an open-source project on GitHub.
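In practice, the peft and bitsandbytes libraries make QLoRA a matter of a few configuration objects. The sketch below shows the general shape of such a setup; the model identifier (which requires the Meta/Kaggle access mentioned earlier), the target modules, and all the hyperparameter values are assumptions for illustration, and the exact settings for the Kaggle notebook will follow in part 2.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model quantized to 4-bit (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # de-quantize to bfloat16 for computation
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder: gated model, access required
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach the trainable low-rank matrices on top of the frozen, quantized weights.
lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA matrices are trainable
```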

In the upcoming second part of this article, I will offer references and insights into the practical aspects of working with LLMs for fine-tuning tasks, especially in resource-constrained environments like Kaggle Notebooks. I will also demonstrate how to effortlessly put these techniques into practice with just a few commands and minimal configuration settings.

--


Luca Massaron

Data scientist molding data into smarter artifacts. Author on AI, machine learning, and algorithms for Wiley, Packt, Manning. 3x Kaggle Grandmaster.