E3 : Quantized Low-Rank Adaptation (QLoRA)

Praveen Thenraj
4 min read · Jun 10, 2023


Quantizing frozen pre-trained model weights and combining them with LoRA adapters yields a memory-efficient fine-tuning technique without compromising much on performance.

Paper Name : QLoRA: Efficient Finetuning of Quantized LLMs

URL : https://arxiv.org/abs/2305.14314

Authors : University of Washington - Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

Conference : NeurIPS 2023

Please find the annotated paper here

Problem Statement :

  1. As LLMs rapidly improve, their size grows along with them.
  2. This remains a major barrier to adopting LLMs in academic or personal research, or in organisations that cannot afford GPUs with huge memory capacity.
  3. Though LoRA is effective, it reduces the memory footprint of the weight parameters only. In practice, activation gradients consume considerably more memory than the parameters themselves.

Solution :

  • QLoRA = Quantised frozen pre-trained model weights + LoRA
  • The solution introduces 3 major innovations (see the loading sketch after this list):
    1. a new 4-bit NormalFloat (NF4) datatype
    2. Double quantization
    3. Paged optimisers
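
To make the recipe concrete, here is a minimal loading sketch assuming the Hugging Face stack (transformers, peft, bitsandbytes), which integrates the paper's method. The model name and LoRA hyperparameters are illustrative choices, not the paper's exact configuration.

```python
# Minimal QLoRA setup sketch (assumed: transformers + peft + bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4 bit
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat datatype
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for computation
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # illustrative model choice
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],    # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only LoRA weights receive gradients
model.print_trainable_parameters()
```

For the paged optimisers discussed below, the same stack exposes paged variants of AdamW (e.g. `optim="paged_adamw_32bit"` in transformers' TrainingArguments).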

Approach :

  • Block-wise k-bit quantization - The new NF4 datatype holds values normalised to the range [-1,1], constructed using quantile quantization. The pre-trained FP32 weights (W) are divided into blocks, and each block is quantised to NF4 by rescaling with its absolute maximum so that its values fall within [-1,1]. Quantizing block-wise prevents outlier weights from distorting the quantile bins (a toy sketch follows this list).
  • Double quantization - Quantising the quantization constants themselves to further reduce the memory footprint during fine-tuning. This shrinks the memory spent on constants from 0.5 bits/parameter to 0.127 bits/parameter.
  • Paged Optimisers - Gradient checkpointing helps contain the memory consumed by activation gradients, but back propagation can still produce memory spikes that cause out-of-memory (OOM) errors on the GPU. With paged optimisers, optimiser states are evicted from GPU to CPU memory when a spike occurs and paged back to the GPU when needed.
  • QLoRA stores the pre-trained weights in the NF4 datatype and computes with input features in BF16. The stored weights are dequantized to BF16 during the forward and backward passes, and gradients are computed only for the LoRA weights. No gradient updates are applied to the pre-trained weights.
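
The following toy PyTorch sketch illustrates block-wise absmax quantization against the NF4 code book. The 16 code values are the published NF4 levels (rounded here for readability) and the block size of 64 matches the paper's default; everything else is a simplified illustration, not the paper's optimised CUDA implementation.

```python
import torch

# The 16 NF4 code values (rounded for readability). Each is a quantile of a
# standard normal distribution, normalised into [-1, 1].
NF4_CODES = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000,
])

def quantize_blockwise(w: torch.Tensor, block_size: int = 64):
    """Quantise an FP32 tensor to NF4 indices, one absmax constant per block."""
    blocks = w.reshape(-1, block_size)
    absmax = blocks.abs().max(dim=1, keepdim=True).values  # FP32 constant per block
    normalized = blocks / absmax                           # rescale into [-1, 1]
    # snap each value to the nearest NF4 code (stored as a 4-bit index)
    idx = (normalized.unsqueeze(-1) - NF4_CODES).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax

def dequantize_blockwise(idx, absmax, shape):
    """Recover BF16 weights for computation in the forward/backward pass."""
    return (NF4_CODES[idx.long()] * absmax).reshape(shape).to(torch.bfloat16)

w = torch.randn(256, 64)            # stand-in for a pre-trained weight matrix
idx, absmax = quantize_blockwise(w)
w_hat = dequantize_blockwise(idx, absmax, w.shape)

# Overhead of the constants: 32 bits / 64 params = 0.5 bits/parameter.
# Double quantization stores them in 8 bit, in blocks of 256 with FP32
# second-level constants: 8/64 + 32/(64*256) ≈ 0.127 bits/parameter.
```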

Experimental Setup :

  • The Guanaco family of models (7B, 13B, 33B, 65B) was fine-tuned with the QLoRA technique.
  • Instruction fine-tuning was done on datasets such as OASST1 and HH-RLHF.
  • Plain cross-entropy loss was used during instruction fine-tuning rather than RLHF.
  • Models were evaluated on benchmarks such as Vicuna, MMLU, and GLUE.
  • Elo rating - a tournament-style scheme in which models compete against each other and the outcomes are judged by both human annotators and GPT-4 (a toy Elo update is sketched below).
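
For readers unfamiliar with Elo, here is a toy sketch of the pairwise rating update; the K-factor and starting rating are illustrative defaults, not necessarily the paper's exact settings.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# e.g. one pairwise comparison where the judge prefers model A
r_a, r_b = elo_update(1000.0, 1000.0, score_a=1.0)  # -> (1016.0, 984.0)
```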

Observations :

  • QLoRA with the NF4 datatype (with and without double quantization) on LLaMA models yielded better performance than a standard 4-bit Float datatype. This was evaluated zero-shot on Winogrande, HellaSwag, PIQA, Arc-Easy, and Arc-Challenge.
  • QLoRA with the NF4 datatype (with and without double quantization) matches the performance of both 16-bit LoRA fine-tuned models and 16-bit fully fine-tuned models.
  • This implies that the precision lost in 4-bit quantization (from 32-bit) is recovered by the LoRA adapter fine-tuning that is part of QLoRA.
  • The Guanaco 65B model, instruction fine-tuned with QLoRA, reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark with just 24 hours of fine-tuning on a single 48GB GPU.
  • The Guanaco 33B model, instruction fine-tuned with QLoRA, reaches 97.8% of ChatGPT's performance level on the Vicuna benchmark with just 12 hours of fine-tuning on a single 24GB GPU.
  • The Guanaco 7B model, instruction fine-tuned with QLoRA, consumes only 5GB of GPU memory while still achieving strong results.
  • These chatbot models were evaluated with Elo ratings judged by both human annotators and GPT-4.
  • The benchmark results also show that Guanaco models instruction fine-tuned with plain cross-entropy loss can outperform models trained with RLHF.
  • This opens the way to reducing the cost of instruction fine-tuning LLMs, which could lead to more ChatGPT- and Bard-like models.

Conclusion :

  • The quantised version of LoRA opens the door to efficient and quicker ways of adapting LLMs to downstream tasks.
  • LoRA and QLoRA show that fine-tuning with mixed precision (NF4 storage and BF16 computation) and low-rank adapters on attention parameters can still match models fully fine-tuned in single precision (FP32).
  • Memory-efficient LLMs that run on affordably priced consumer GPUs could be the way forward for the evolution of more advanced and capable LLMs.
