Memory requirements for fine-tuning Llama 2

TL;DR: Fine-tuning large language models like Llama-2 on consumer GPUs can be challenging because of their massive memory requirements. However, Parameter-Efficient Fine-Tuning (PEFT) methods, specifically QLoRA, can cut the memory footprint by up to 90%, to around 9–14 GB, by quantizing the model to 4-bit precision and training only a small fraction (0.1–1%) of the total parameters. This makes fine-tuning far more accessible and affordable, even on the free tier of Google Colab with 16 GB of GPU memory!

Why Llama-2 (and 7B-chat)?

Llama-2, released by Meta in 2023, is one of the most widely used open-source Large Language Models (LLMs) today. These models belong to a class called foundation models: models trained on massive amounts of data (roughly 2 trillion tokens) that can then be fine-tuned for specific tasks. The Llama-2 chat models are one example, foundation models fine-tuned for dialogue use cases such as building chatbots with LLMs. The smallest of these is Llama-2 7B Chat, with 7 billion parameters. With the fewest parameters in the family, it is a powerful yet accessible LLM and an ideal candidate for getting started with fine-tuning.

Naively fine-tuning Llama-2 7B takes over 110 GB of GPU memory!

Even fine-tuning a relatively small model like Llama-2 7B on regular consumer GPUs can be challenging because of its significant memory requirements, for the following reasons:

  1. Memory requirements for loading the model: Llama-2 7B has 7 billion parameters. Loaded in full precision (float32, i.e. 4 bytes per parameter), the model alone requires numberOfParams * bytesPerParam = 7 billion * 4 bytes = 28 GB of memory. Since many consumer GPUs and the free tiers of services like Google Colab or Kaggle [4] have tight memory limits (e.g., an NVIDIA T4 with 16 GB on Google Colab), the model cannot even be loaded!
  2. Memory requirements for fine-tuning: For full fine-tuning with the regular AdamW optimizer on a half-precision model (2 bytes per parameter), we need to allocate per trainable parameter: 2 bytes for the weight, 2 bytes for the gradient, and 12 bytes for the optimizer states [4]. That adds up to 16 bytes per trainable parameter, or over 110 GB of GPU memory in total (see the calculation sketch after this list)! Such a run would need at least three A40 GPUs with 48 GB of VRAM each, putting fine-tuning out of reach for most of the public.
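
These numbers are easy to reproduce with a few lines of arithmetic. Below is a minimal back-of-the-envelope sketch in Python of the two estimates above; the byte counts are the ones cited from [4], and activation memory is ignored.

    # Back-of-the-envelope memory estimates for Llama-2 7B (activations ignored)
    NUM_PARAMS = 7e9  # 7 billion parameters

    # 1. Loading in full precision: float32 = 4 bytes per parameter
    load_fp32_gb = NUM_PARAMS * 4 / 1e9                   # ≈ 28 GB

    # 2. Full fine-tuning with AdamW in mixed precision:
    #    2 bytes weight + 2 bytes gradient + 12 bytes optimizer states = 16 bytes/param
    full_finetune_gb = NUM_PARAMS * (2 + 2 + 12) / 1e9    # ≈ 112 GB

    print(f"Loading in float32: {load_fp32_gb:.0f} GB")
    print(f"Full fine-tuning:   {full_finetune_gb:.0f} GB")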

PEFT lowers memory requirements by 90%

Parameter-Efficient Fine-Tuning (PEFT) methods drastically reduce the number of trainable parameters of an LLM while maintaining performance [6].

A very popular PEFT method for fine-tuning LLMs is Low-Rank Adaptation (LoRA), which sharply reduces the number of parameters that need to be modified [5].

Low Rank Adaptation (LoRA) for efficient fine-tuning

To make fine-tuning more efficient, LoRA keeps the large pretrained weight matrix frozen and learns its update as the product of two much smaller, low-rank matrices. Only these low-rank matrices are trained to adapt to the new data, which keeps the overall number of changed parameters low.

Training only a small percentage of the total parameters with LoRA can produce fine-tuned models whose performance is comparable to fully fine-tuned models, while requiring a fraction of the compute resources.
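
To see where the savings come from, here is a minimal, self-contained sketch of the LoRA update for a single linear layer. The 4096×4096 shape and rank 8 are illustrative example values, not Llama-2's actual configuration.

    import numpy as np

    d, r = 4096, 8                     # hidden size and LoRA rank (example values)

    W = np.random.randn(d, d)          # frozen pretrained weight, never updated
    A = np.random.randn(r, d) * 0.01   # trainable low-rank factor
    B = np.zeros((d, r))               # trainable low-rank factor, starts at zero

    # Effective weight in the forward pass: the frozen W plus the low-rank update
    W_adapted = W + B @ A

    full_params = W.size               # 16,777,216
    lora_params = A.size + B.size      # 65,536
    print(f"Trainable fraction: {lora_params / full_params:.2%}")   # ≈ 0.39%

Only A and B receive gradients, so the gradient and optimizer-state memory scales with the tiny low-rank factors rather than with the full weight matrix.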

Making fine-tuning more efficient: QLoRA

QLoRA is a fine-tuning technique that combines a low-precision storage format for the base model with higher-precision computation [1]. This keeps the loaded model small while preserving the model's performance and accuracy.

QLoRA applies 4-bit quantization (specifically, 4-bit NormalFloat (NF4) quantization) together with LoRA, which lowers the memory needed for fine-tuning and lets the LoRA adapters correct the small residual quantization errors.

Figure: Different fine-tuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers (a paged variant of AdamW) to handle memory spikes [1]. Image credit: [1]
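
In practice, a QLoRA setup with the Hugging Face transformers, peft, and bitsandbytes libraries typically looks like the sketch below. The model ID, rank, and target modules are illustrative choices rather than recommendations from the sources above.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "meta-llama/Llama-2-7b-chat-hf"   # example model ID (gated on the Hub)

    # Quantize the frozen base model to 4-bit NF4, computing in bfloat16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    # Attach LoRA adapters to the attention projections (rank/alpha are example values)
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()   # typically well under 1% of all parameters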

How does QLoRA reduce memory to 14GB?

Below is the calculation to determine the memory requirements for fine-tuning Llama-2 7B with QLoRA.

Memory requirement for loading the model: The Llama-2 7B base model has about 7 billion parameters (more precisely, 6.7 billion), and with 4-bit quantization each parameter takes 0.5 bytes. Loading the quantized model therefore takes about 3.5 GB (≈ 7 billion parameters × 0.5 bytes).

Memory requirement per trainable parameter consists of:

  • Weight: 0.5 bytes
  • LoRA parameters: 2 bytes
  • AdamW optimizer states: 2 bytes
  • Gradients (always in fp32): 4 bytes
  • Activations: variable (depend on factors like sequence length, hidden size, and batch size)

Therefore, the memory per trainable parameter is 8.5 bytes (≈ 0.5 + 2 + 2 + 4), excluding activations.

Total memory requirement for trainable parameters: Since LoRA typically leaves about 0.4–0.7% [7] of the parameters trainable, assume 0.6% of the parameters are trainable. The total memory requirement for the trainable parameters is then:

Trainable parameters memory
= memory per trainable parameter × number of trainable parameters
= 8.5 bytes × 42 million (0.6% of 7 billion)
≈ 0.36 GB

Total memory requirement for QLoRA training: Adding the memory for the base model (≈ 3.5 GB) and the memory for the trainable parameters (≈ 0.36 GB) gives a total training memory requirement of about 4–5 GB, depending on the number of trainable parameters.
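
Putting the pieces together, the whole training-memory estimate can be reproduced in a few lines; activation memory is left out of this sketch, which is why the figure above is quoted as a 4–5 GB range rather than a point value.

    NUM_PARAMS = 7e9                                       # Llama-2 7B (more precisely ~6.7B)

    # Base model quantized to 4 bits: 0.5 bytes per parameter
    base_model_gb = NUM_PARAMS * 0.5 / 1e9                 # ≈ 3.5 GB

    # Per trainable parameter: 0.5 (weight) + 2 (LoRA) + 2 (AdamW states) + 4 (gradient)
    bytes_per_trainable = 0.5 + 2 + 2 + 4                  # 8.5 bytes

    trainable_params = 0.006 * NUM_PARAMS                  # assume 0.6% trainable ≈ 42M
    trainable_gb = trainable_params * bytes_per_trainable / 1e9   # ≈ 0.36 GB

    print(f"QLoRA training memory ≈ {base_model_gb + trainable_gb:.1f} GB (plus activations)")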

Memory required for inference: If we load the base model in 16-bit precision and merge in the LoRA weights of the fine-tuned model, we would use at most about 14 GB of GPU memory for a sequence length of 2048. This figure comes from loading the model in float16 precision and includes activations, temporary variables, and hidden states [3], which are kept in full-precision (float32) format and depend on many factors, including sequence length, hidden size, and batch size [2].

Total memory requirements: So, the total memory requirement for QLoRA training with a 4-bit base model and mixed precision, including loading the merged 16-bit model for inference, comes to almost 14 GB, depending on the sequence length.
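
The inference-side figure follows from the same kind of estimate; the gap between the raw weight memory and the quoted 14 GB is the allowance for activations, temporary buffers, and hidden states at a sequence length of 2048.

    NUM_PARAMS = 6.7e9                        # Llama-2 7B, more precisely

    # Merged model loaded in float16: 2 bytes per parameter
    weights_fp16_gb = NUM_PARAMS * 2 / 1e9    # ≈ 13.4 GB

    # Activations, temporary buffers and hidden states add on top of this,
    # bringing the total to roughly 14 GB for a 2048-token sequence [3].
    print(f"fp16 weights alone: {weights_fp16_gb:.1f} GB")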

Thus, quantization techniques like QLoRA, combined with PEFT, can reduce memory requirements by up to 90%, making fine-tuning significantly more accessible and affordable!

Credits

Sources:

[1] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. “QLoRA: Efficient Finetuning of Quantized LLMs.” arXiv preprint arXiv:2305.14314, 2023.

[2] Hugging Face. “Anatomy of Models Memory.” Accessed March 28, 2024. https://huggingface.co/docs/transformers/perf_train_gpu_one#anatomy-of-models-memory.

[3] Dell Technologies. “LLAMA-2: Efficient Fine-tuning Using Low-Rank Adaptation (LoRA) on Single GPU.” Accessed March 16, 2024. https://infohub.delltechnologies.com/en-US/p/llama-2-efficient-fine-tuning-using-low-rank-adaptation-lora-on-single-gpu/.

[4] PyTorch Team. “Finetune Large Language Models in PyTorch.” PyTorch Blog, March 16, 2024. https://pytorch.org/blog/finetune-llms/.

[5] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv preprint arXiv:2106.09685, 2021.

[6] Hugging Face. “Trl-PEFT: An Efficient Approach for Fine-tuning Large Language Models.” Accessed March 22, 2024. https://huggingface.co/blog/trl-peft.

[7] Databricks Team. “Efficient Fine-Tuning of Large Language Models: A Guide to LoRA.” Databricks Blog, 2024. https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms.
