Guanaco: A QLoRA and LLaMA Based Model
Every day brings new advancements, and today is no exception with the highly anticipated launch of QLoRA and Guanaco. QLoRA is one of the most significant recent breakthroughs in language model fine-tuning: it makes it possible to fine-tune colossal models on a single GPU. It has also produced Guanaco, the first language model fine-tuned with this methodology, which the researchers claim reaches 99.3% of ChatGPT's performance. In this article, I will walk through how QLoRA works and examine Guanaco's performance. Here are the Colab notebook for the Guanaco demo and the Hugging Face repository.
Introduction:
In previous approaches, fine-tuning a large AI model demanded substantial computational power and multiple expensive GPUs, putting it out of reach for many. The model would be fine-tuned in a higher-precision data type and then put through 4-bit quantization so it could run on a consumer-grade GPU after fine-tuning. Although this process reduced resource usage, it compromised the model's full potential and diminished the overall results. Enter QLoRA, which offers a win-win scenario: it fine-tunes large LLMs through a quantized 4-bit base model on a single GPU while preserving the performance of full 16-bit fine-tuning.
The QLoRA method successfully fine-tuned a 65 billion parameter model using a single 48-gigabyte GPU within a mere 24 hours. This accomplishment is undeniably revolutionary.
Moreover, the code for QLoRA, including CUDA kernels for 4-bit training, has been released under the MIT license. The Guanaco model family, derived from the LLaMA models, is not open-source and requires access to the LLaMA weights, but the QLoRA code itself is open. First, let's talk about what quantization is.
Quantization:
Quantization reduces data size by converting high-bit data types to lower-bit types; for example, a 32-bit floating point number can become a lower-bit integer. The process rescales the input data into the target range through normalization. However, outliers in the data can skew the rescaling, leaving most quantization bins poorly used and wasting computational resources. To address this, the input tensor is split into smaller blocks that are quantized independently, each with its own quantization constant, which makes much better use of the available bins. Known as block-wise k-bit quantization, this method improves outlier resistance and computational efficiency and boosts performance when handling large tensors.
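To make the idea concrete, here is a minimal NumPy sketch of block-wise absmax quantization to 8-bit integers; the block size and data types are illustrative choices, not the exact kernels QLoRA ships with.

```python
import numpy as np

def blockwise_absmax_quantize(x, block_size=64):
    """Quantize a 1-D float tensor to int8, with one scaling constant per block."""
    x = x.reshape(-1, block_size)                      # split into independent blocks
    absmax = np.abs(x).max(axis=1, keepdims=True)      # per-block quantization constant
    q = np.round(x / absmax * 127).astype(np.int8)     # rescale each block into [-127, 127]
    return q, absmax

def blockwise_dequantize(q, absmax):
    """Recover an approximation of the original values."""
    return q.astype(np.float32) / 127 * absmax

x = np.random.randn(1024).astype(np.float32)
x[10] = 50.0                                           # an outlier only affects its own block
q, absmax = blockwise_absmax_quantize(x)
x_hat = blockwise_dequantize(q, absmax)
print("max reconstruction error:", np.abs(x.reshape(-1, 64) - x_hat).max())
```

Because each block carries its own constant, the outlier at index 10 only degrades the precision of its own block of 64 values rather than the whole tensor.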
In addition to the benefits of quantization, LoRA (Low-Rank Adaptation) further enhances computational efficiency and has emerged as a game-changing technique in the realm of Natural Language Processing (NLP), particularly for large-scale language models like GPT-3 175B. In LoRA, the focus shifts towards keeping the pre-trained model weights fixed and introducing trainable rank decomposition matrices into each layer of the Transformer architecture, which serves as the backbone for models like GPT and BERT. By adopting this strategy, LoRA significantly reduces the number of parameters that need to undergo training for specific tasks or domains.
The impact is substantial: LoRA can shrink the number of trainable parameters by a factor of up to 10,000 while cutting down GPU memory requirements by approximately three times when compared to the conventional fine-tuning method.
LoRA introduces adapters, small sets of trainable parameters, to refine the process further. These adapters sit alongside the fixed pre-trained weights, and gradients flow through the frozen model to optimize only the adapters during training. Optimizing the adapters drives down the loss and ultimately improves the model's performance on the given task.
Concretely, LoRA adds a factorized projection alongside the existing linear projection: the layer's output becomes the frozen weight's projection plus a low-rank update computed from two small matrices. This introduces a compact, adaptable component to the projection equation and keeps the fine-tuning process efficient.
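As a rough sketch of the idea, not the exact implementation from the paper or the peft library, a LoRA-augmented linear layer in PyTorch might look like this; the rank r and scaling alpha are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)  # frozen pre-trained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trainable down-projection
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))         # trainable up-projection (zero-init)
        self.scaling = alpha / r

    def forward(self, x):
        base = x @ self.weight.T                          # frozen path
        update = (x @ self.lora_A.T) @ self.lora_B.T      # low-rank adapter path
        return base + self.scaling * update

layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")    # only A and B are trained
```

With rank 8 on a 4096x4096 projection, only about 65K of the roughly 16.8M parameters are trainable, which is where the dramatic reduction in trainable parameters comes from.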
QLoRA:
QLoRA (Quantized Low-Rank Adaptation) is a method that enables fine-tuning of quantized 4-bit models without performance degradation. QLoRA quantizes a pre-trained model to 4-bit with a high-precision technique and adds learnable low-rank adapter weights that are trained by backpropagating gradients through the frozen, quantized model. With QLoRA, the average memory needed to fine-tune a 65B-parameter model drops from more than 780 GB to under 48 GB. It achieves this through several innovations that reduce memory usage while maintaining performance, including 4-bit NormalFloat quantization, Double Quantization, and Paged Optimizers.
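Putting the pieces together, a typical QLoRA-style setup with the Hugging Face transformers, peft, and bitsandbytes libraries looks roughly like the sketch below; the model id, target modules, and hyperparameters are placeholders rather than the exact settings used for Guanaco.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # placeholder model id; LLaMA weights require access
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable LoRA adapters on top of the frozen 4-bit base model
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # illustrative subset of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```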
4-bit NormalFloat Quantization:
The 4-bit NormalFloat (NF4) data type, an advancement in quantization, ensures an equal distribution of values in each quantization bin. It builds on Quantile Quantization, which estimates quantiles through the empirical cumulative distribution function. Exact quantile estimation can be computationally expensive, and fast approximation algorithms such as SRAM quantiles mitigate this but introduce quantization errors for outliers, which are often the most important values. However, when the input tensors come from a fixed distribution up to a quantization constant, the quantiles can be computed exactly in advance, avoiding both the costly estimation process and the approximation errors. Since pre-trained neural network weights are approximately zero-centered and normally distributed, NF4 fixes the quantiles of a standard normal distribution once and only rescales each block by its quantization constant.
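To build intuition, here is a simplified sketch of how an NF-style 4-bit code book can be built from normal quantiles; the offset value and the symmetric construction are illustrative simplifications (the paper uses an asymmetric construction that guarantees an exact zero code).

```python
import numpy as np
from scipy.stats import norm

def normal_float_levels(bits=4, offset=0.96):
    """Simplified NF-style code book: quantiles of N(0, 1) rescaled into [-1, 1]."""
    n = 2 ** bits
    probs = np.linspace(1 - offset, offset, n)   # clip away the infinite 0 and 1 quantiles
    levels = norm.ppf(probs)                     # evenly spaced quantiles of the standard normal
    return levels / np.abs(levels).max()         # normalize into the [-1, 1] range

def nf_quantize_block(x, levels):
    """Rescale one block by its absmax constant, then snap each value to the nearest level."""
    absmax = np.abs(x).max()
    idx = np.abs(x[:, None] / absmax - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

levels = normal_float_levels()
block = np.random.randn(64).astype(np.float32)   # weights are roughly normal, so bins fill evenly
idx, absmax = nf_quantize_block(block, levels)
reconstructed = levels[idx] * absmax
print("codes used per bin:", np.bincount(idx, minlength=len(levels)))
```

Because the code book matches the roughly normal distribution of the weights, the 16 available codes are used far more evenly than they would be with a uniform grid.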
Double Quantization:
To further enhance memory efficiency, Double Quantization (DQ) is introduced. DQ quantizes the quantization constants themselves, yielding additional memory savings. Precise 4-bit quantization requires a small block size, but a small block size means many quantization constants and therefore a significant memory overhead; DQ alleviates this concern and achieves more efficient memory utilization.
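To see why this matters, here is the back-of-the-envelope arithmetic reported in the QLoRA paper (block size 64 for the weights, 8-bit constants grouped in blocks of 256 for the second level):

```python
# Without DQ: one 32-bit constant per block of 64 parameters
overhead_plain = 32 / 64                      # 0.5 bits per parameter

# With DQ: first-level constants quantized to 8-bit, with a second-level
# 32-bit constant shared by every 256 first-level constants
overhead_dq = 8 / 64 + 32 / (64 * 256)        # ~0.127 bits per parameter

print(overhead_plain, overhead_dq)            # saves ~0.373 bits per parameter
```

Across a 65B-parameter model, those roughly 0.373 bits per parameter translate into about 3 GB of memory savings.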
Paged Optimizers:
Paged Optimizers manage memory efficiently by using NVIDIA unified memory for seamless CPU-GPU transfers. Optimizer states are allocated in paged memory, so they are automatically evicted to CPU RAM when the GPU runs short of memory and paged back into GPU memory for the optimizer update step. This avoids out-of-memory errors during memory spikes and enables smooth training in memory-constrained situations.
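In practice, paged optimizer states can be enabled through bitsandbytes; a minimal sketch (reusing the hypothetical model from the earlier snippet, and assuming a recent bitsandbytes/transformers version) might look like this.

```python
import bitsandbytes as bnb

# Optimizer states live in paged (unified) memory: they are evicted to CPU RAM
# when the GPU runs short and paged back in for the update step.
optimizer = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)

# Equivalently, with the Hugging Face Trainer the paged optimizer can be selected by name:
# training_args = TrainingArguments(..., optim="paged_adamw_32bit")
```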
Guanaco:
The top QLoRA-tuned model, Guanaco 65B, outperforms other open-source chatbot models and competes with ChatGPT. Based on automated and human evaluations, Guanaco 65B and 33B have an expected win probability of 30% against GPT-4 according to Elo ratings from human annotators. Guanaco 65B achieves 99.3% of ChatGPT's performance, surpassing other open models. Guanaco 33B, fine-tuned in 4-bit precision, outperforms Vicuna 13B with a smaller memory footprint (21 GB vs. 26 GB). Additionally, Guanaco 7B fits in just 5 GB, small enough for modern phones, while significantly outperforming Alpaca 13B.
Dataset and training:
Regarding the training data, it is worth highlighting that the Guanaco models are trained on the multilingual OASST1 dataset, which includes prompts in many languages. An interesting direction for future research is to explore how this multilingual training improves performance on instructions in languages other than English, and how it contributes to the performance gap between Vicuna 13B (trained only on English data) and Guanaco 33B and 65B on the OA benchmark.
To guard against data contamination, the authors performed fuzzy string matching and manual inspection and found no overlapping prompts between the OASST1 dataset and the Vicuna benchmark prompts.
It is worth noting that the model is trained solely with cross-entropy loss (supervised learning), without reinforcement learning from human feedback (RLHF). This highlights the need for further investigation into the tradeoffs between simple cross-entropy training and RLHF. The introduction of QLoRA makes such large-scale analysis possible without excessive computational resources.
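Since training is plain supervised learning, a single training step on the 4-bit LoRA model from the earlier sketches reduces to standard causal language modeling with token-level cross-entropy; the tokenizer id and prompt format below are only illustrative.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")   # placeholder tokenizer id
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-4    # only the LoRA adapters train
)

batch = tokenizer("### Human: Hello!\n### Assistant: Hi there!", return_tensors="pt").to(model.device)
outputs = model(**batch, labels=batch["input_ids"])                # token-level cross-entropy loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```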
Discussion:
This article has presented an overview of QLoRA, a novel method for achieving 16-bit full fine-tuning performance with a 4-bit base model and Low-rank Adapters (LoRA). While the authors demonstrate evidence of QLoRA's capability, they acknowledge that the method has not yet been shown to match full 16-bit fine-tuning performance at larger scales such as 33B and 65B, due to resource limitations. This remains an area for future exploration.
Another limitation concerns the evaluation of instruction-finetuned models: the authors provide evaluations on MMLU, the Vicuna benchmark, and the OA benchmark, but not on other benchmarks such as BigBench, RAFT, and HELM, and different benchmarks may capture different aspects of model performance.
Here is the QLoRA paper for an in-depth study, along with its implementation in a Colab notebook.