An update on the best way to fine-tune LLMs such as Llama 2, Mistral 7B Instruct v0.2, or Phi-2, using consumer-grade GPUs or free online resources such as Google Colab or Kaggle Notebooks

credits: DALL·E 3

One month in the real world is like one year in the world of LLMs; such is the speed and momentum of innovation. Therefore, I recently had to update the Kaggle notebook I prepared at the end of last October, because it no longer worked for many readers:

A few new things make the notebook a perfect starting point for fine-tuning any open language model using Hugging Face’s TRL, Transformers, and Datasets packages at their most recent versions.

First of all, the packages used are:

  • PyTorch 2.1.2 (previously 2.0.0)
  • transformers 4.36.2 (previously 4.31)
  • datasets 2.16.1
  • accelerate 0.26.1 (previously 0.23.0)
  • bitsandbytes 0.42.0 (previously 0.41.1)

As for trl, I picked a commit from GitHub published on Jan 22, 2024, and for peft, I retrieved another commit published on the same date (so both packages are as fresh as possible).
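If you want to reproduce this setup, the following sketch shows how such an installation cell might look in Colab or Kaggle. The exact commit hashes are not reported here, so the trl and peft lines simply install from the main branch; append @<commit-hash> to pin the precise snapshot you need.

# Install the pinned releases used in the notebook
!pip install -q torch==2.1.2 transformers==4.36.2 datasets==2.16.1 accelerate==0.26.1 bitsandbytes==0.42.0

# Install trl and peft from GitHub; add @<commit-hash> after .git to pin a specific commit
!pip install -q git+https://github.com/huggingface/trl.git
!pip install -q git+https://github.com/huggingface/peft.git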

There are other differences in the code.
The trl library now simplifies the process of setting up a model and tokenizer for conversational AI tasks with the help of the setup_chat_format() function. This function performs the following tasks:

  1. Introduces special tokens to the tokenizer, such as <|im_start|> and <|im_end|>, which signify the beginning and end of each message in a conversation.
  2. Adjusts the model’s embedding layer to accommodate these newly added tokens.
  3. Defines the chat template of the tokenizer, responsible for formatting input data into a conversation-like structure. The default template is chatml, which was inspired by OpenAI.
  4. Additionally, users have the option to specify the resize_to_multiple_of parameter, enabling them to resize the embedding layer to a multiple of the provided argument (e.g., 64).

Here is an example of how to use this function:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Set up the chat format with the default 'chatml' template
model, tokenizer = setup_chat_format(model, tokenizer)

Adding special tokens to a language model during fine-tuning is crucial, especially when training chat models. These tokens are pivotal in delineating the various roles within a conversation, such as the user, assistant, and system. By inserting these tokens strategically, the model gains an understanding of the structural components and the sequential flow inherent in a conversation.

In other words, the setup provided by setup_chat_format assists the model in recognizing the nuances of conversational dynamics. The model becomes attuned to transitions between different speakers and comprehends the contextual cues associated with each role. This enhanced awareness is essential for the model to generate coherent, contextually appropriate responses in a chat environment.
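To make this concrete, here is a small sketch (assuming the model and tokenizer prepared with setup_chat_format above) of how the chatml template renders a conversation; the roles and messages are purely illustrative:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does LoRA stand for?"},
]

# Render the conversation with the tokenizer's chat template, without tokenizing,
# and append the assistant header so the model knows it is its turn to answer
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected output (roughly):
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# What does LoRA stand for?<|im_end|>
# <|im_start|>assistant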

Another change is the addition of the parameter target_modules="all-linear" to LoraConfig. The target_modules parameter of LoraConfig can be expressed as a list of module names or as a special string value. In some examples you find online, the target modules are ["query_key_value"]; elsewhere they are something different (in our case, the linear layers, expressed by the "all-linear" string value), but they always refer to components of the Transformer architecture. The choice of which layers to fine-tune actually depends on what you want to achieve (and what works best for your problem).

As stated in the LoRA paper (Hu, Edward J., et al., "LoRA: Low-Rank Adaptation of Large Language Models", arXiv preprint arXiv:2106.09685, 2021, https://arxiv.org/abs/2106.09685), "we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters", although the authors "limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules … both for simplicity and parameter-efficiency". The paper also states that "we leave the empirical investigation of adapting the MLP layers, LayerNorm layers, and biases to a future work", implying that you can fine-tune whatever layers you want, based on the results you obtain and your "parameter budget" (the more layers you fine-tune, the more computation and memory are required). This is stated even more clearly in section 7.1 of the paper, "WHICH WEIGHT MATRICES IN TRANSFORMER SHOULD WE APPLY LORA TO?", where the authors justify their choices by their parameter budget; you are not limited to those choices, and you should look for the best overall performance given your architecture and problem.
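As a point of reference, a configuration following the original paper's attention-only choice might look like the sketch below; the module names "q_proj" and "v_proj" apply to Llama/Mistral-style models and differ for other architectures, and the rank, alpha, and dropout values are illustrative:

from peft import LoraConfig

# Adapt only the attention query and value projections, as in the original LoRA paper
attention_only_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,                        # illustrative scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture
)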

The default LoRA settings in peft adhere to the original LoRA paper, incorporating trainable weights into each attention block's query and value layers. This is what I did in the first implementation of the fine-tuning. However, in the QLoRA paper (https://huggingface.co/papers/2305.14314), research revealed that introducing trainable weights to all linear layers of a transformer model enhances performance to match that of full fine-tuning. Given that the selection of modules may differ based on the architecture, and you would otherwise have to search manually through the architecture of your chosen model for such linear layers, they have introduced a user-friendly shorthand: simply specify target_modules="all-linear" and let the peft package take care of the rest.
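In the notebook this translates into something like the following sketch (again, r, lora_alpha, and lora_dropout are illustrative values, not necessarily the exact ones used):

from peft import LoraConfig

# Adapt every linear layer of the model, as suggested by the QLoRA paper
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",  # shorthand resolved by peft to all linear layers
)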

Given such changes, the overall accuracy on the test set rises to 0.863 from the initial 0.823.

--

Luca Massaron

Data scientist molding data into smarter artifacts. Author on AI, machine learning, and algorithms for Wiley, Packt, Manning. 3x Kaggle Grandmaster.