QLoRA: Fine-Tuning Large Language Models (LLMs)

Dilli Prasad Amur
12 min read · Nov 27, 2023


In this blog I aim to explain the concepts and important terminology behind QLoRA, with a focus on the terms used in the QLoRA paper. This article is intended for NLP and machine learning enthusiasts who fine-tune LLMs on their own data and want to understand how parameter-efficient fine-tuning with QLoRA works under the hood. By the end of this article, you should know everything you need to about QLoRA.

In this blog we will look into the following key sections, which will help in understanding QLoRA:

  • Fine tuning
  • Parameter-Efficient Fine-Tuning (PEFT)
  • LoRA
  • QLoRA
  • 4-Bit Normal Float
  • Quantization
  • Problems with outliers
  • Block wise k-bit quantization
  • Double Quantization
  • Dequantization
  • Paged Optimizers

Fine tuning:

Fine-tuning is a process where a pre-trained model is further trained on a custom dataset to adapt it for particular tasks or domains. In the context of LLMs, fine-tuning involves training a model such as LLaMA, Falcon, or another open-source model on a specific dataset to enhance its performance in specific contexts.

Why fine tuning?

We should also understand why fine-tuning is so helpful and important in the field of Natural Language Processing. Even though pre-trained models are trained on large datasets and have a broad understanding of language, they might not be specialized in certain tasks. Fine-tuning adjusts the model to be more proficient in specific tasks such as sentiment analysis, question answering, text summarization, customer support, or medical inquiries. It also adapts the model to different domains (such as legal, medical, or technical fields), improving its accuracy and effectiveness in those areas.

For example, fine-tuning a Large Language Model (LLM) like GPT for question answering, even when it is already trained on general English text, can significantly improve its performance and accuracy on this specific task. To fine-tune it for medical question answering, the model is trained on a Q&A dataset so that it learns to understand the questions and how to retrieve and formulate appropriate answers.

Challenges of fine tuning:

Fine-tuning LLMs offers numerous benefits, but it also comes with significant challenges. Depending on the size of the model and the fine-tuning dataset, the process can take a significant amount of time, and high-performance GPUs or TPUs are often required to handle the computational load. LLMs are large, and storing the parameters of these models, especially when multiple versions are maintained (pre-trained and fine-tuned), requires considerable storage capacity. Finally, when an LLM is fully fine-tuned on a specific task or dataset, the model can perform better in that area while losing its ability to perform well on the more general tasks it was originally trained on.

Parameter-efficient Fine-tuning (PEFT):

Parameter-efficient fine-tuning addresses the problems of limited consumer hardware and storage cost by fine-tuning only a small subset of the model's parameters while freezing the weights of the original pretrained LLM, significantly reducing the computational expense. Additionally, this mitigates catastrophic forgetting, a behavior seen when LLMs are fully fine-tuned.

When employing PEFT methods, the storage required is only a few MBs for each downstream task, while still attaining performance comparable to full fine-tuning. By contrast, a fully fine-tuned checkpoint of a large model can require on the order of 40GB of storage for each downstream task. The frozen pretrained LLM is combined with the small set of trained weights produced by the PEFT technique, and the same base model can therefore be reused across numerous tasks. This approach achieves performance similar to full fine-tuning with far fewer trainable parameters. There are several parameter-efficient fine-tuning techniques, including the following:

  • Adapter
  • LoRA
  • Prefix tuning
  • Prompt tuning
  • P-tuning
  • IA3

In the below figure we can see that adapter layers are added after the multi-head attention and feed-forward layers in the transformer architecture. Only the parameters of these added layers are updated during fine-tuning, while the rest of the parameters are kept frozen.

LoRA:

LoRA (Low-Rank Adaptation of Large Language Models) is a fine-tuning technique used to train LLMs on specific tasks or domains. It introduces trainable rank-decomposition matrices into each layer of the transformer architecture and reduces the number of trainable parameters for downstream tasks while keeping the pre-trained weights frozen. The LoRA paper says that this method can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by 3 times, while still performing on par with or better than full fine-tuning in model quality on various tasks.

The adapter fine-tuning method introduces inference latency, a problem that LoRA resolves: instead of adding new layers, it adds low-rank updates to existing weight matrices in the transformer. In low-rank decomposition, a large matrix is expressed as the product of two smaller matrices. This rests on the assumption that large weight matrices, especially in high-dimensional spaces, contain a lot of redundancy, so the update to them can be captured with far fewer parameters.

Rather than altering every component of a layer's weight matrix W, LoRA creates two smaller matrices, A and B, whose product approximates the modification to W. The adapted weight can be expressed mathematically as W' = W + AB, where A and B are the low-rank matrices: if W is an m×n matrix, A is m×r and B is r×n, where the rank r is much smaller than m and n. During fine-tuning only A and B are updated, enabling the model to learn task-specific features.
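To make this concrete, here is a minimal PyTorch-style sketch of a LoRA-adapted linear layer. It is illustrative only: the names LoRALinear, rank, and alpha are my own, and real implementations such as the peft library handle many more details.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update W + AB."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights W
        m, n = base.out_features, base.in_features
        # A: m x r (initialized to zero), B: r x n, matching the W + AB formulation above
        self.A = nn.Parameter(torch.zeros(m, rank))
        self.B = nn.Parameter(torch.randn(rank, n) * 0.01)
        self.scale = alpha / rank            # a common scaling convention

    def forward(self, x):
        # y = x W^T + x (AB)^T, so only A and B receive gradients
        return self.base(x) + (x @ self.B.T @ self.A.T) * self.scale
```

Wrapping, say, the attention projection layers of a transformer with such a module reproduces the "freeze W, train only A and B" behavior described above.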

Read more about LoRA at Low-Rank Adaptation of Large Language Models

QLoRA:

QLoRA is an extension of LoRA that quantizes the weight parameters of the pretrained LLM to 4-bit precision. Typically, the parameters of trained models are stored in a 32-bit format, but QLoRA compresses them to a 4-bit format. This significantly reduces the memory footprint of the LLM, making it feasible to fine-tune large models on a single, less powerful GPU, including consumer GPUs.

According to the QLoRA paper:

QLORA introduces multiple innovations designed to reduce memory use without sacrificing performance: (1) 4-bit NormalFloat, an information theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit Integers and 4-bit Floats. (2) Double Quantization, a method that quantizes the quantization constants, saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model). (3) Paged Optimizers, using NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length.
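In practice, these three ideas are usually enabled through configuration flags rather than implemented by hand. Below is a hedged sketch of how QLoRA fine-tuning is commonly set up with the Hugging Face transformers, bitsandbytes, and peft libraries; the argument names reflect those libraries at the time of writing, and the model id and LoRA hyperparameters are placeholders, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapters on top of the frozen, quantized base model
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections to adapt is a design choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```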

The above figure shows the results of the LLaMA 2 7B model trained with different floating-point data types and evaluated on various tasks. The models trained with NF4 and 4-bit float give better results than the LoRA and LLaMA 2 7B base models, with 4-bit NormalFloat performing slightly better than the float4 data type. QLoRA substantially decreases the memory requirements by using the NF4 type. However, the trade-off is a slower training time, which is to be expected due to the quantization and dequantization steps.

4-bit Normal Float:

NF4 is a data type specifically designed for AI applications, particularly for quantizing the weights of neural networks to significantly reduce the memory footprint of models while attempting to maintain performance. This is crucial for deploying large models on less powerful hardware. NF4 is information-theoretically optimal for data with a normal distribution, which is a common property of neural network weights, and can therefore represent these weights more accurately than a standard 4-bit float can within the same bit budget.

A standard 4-bit float, by contrast, is a general-purpose floating-point representation that is not optimized for any particular application. Because of its very limited precision and range, the standard 4-bit float is less common in AI and machine learning applications, especially for tasks requiring high precision in calculations.

Let's see how a given number is stored for various floating-point data types.

Each floating-point format contains 3 different parts that store details about the number: the sign, the exponent, and the fraction (also known as the mantissa). The number is first converted into binary and then stored in the data type. Each data type differs in the number of bits it uses, and hence in its precision and range. For example, FP32 can represent numbers approximately between ±1.18×10⁻³⁸ and ±3.4×10³⁸. FP32 is the single-precision binary floating-point format and has been the default format for storing weights and biases in deep learning. By comparison, a signed 8-bit integer covers the range [-128, 127] and a signed 4-bit integer covers [-8, 7], while NF4 stores a 4-bit index into 16 fixed values in [-1, 1].
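These ranges can be checked directly; here is a small illustrative snippet using PyTorch's finfo/iinfo utilities (the printed values are properties of the formats themselves, not of any model):

```python
import torch

# Floating-point formats: largest representable value and precision near 1.0
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  eps={info.eps:.1e}")

# Integer format often used as a quantization target
i8 = torch.iinfo(torch.int8)
print(f"torch.int8      range=[{i8.min}, {i8.max}]")
```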

QLoRA uses the brain floating point 16 (bfloat16) data type to perform computations, i.e., during the forward and backward passes. Brain Floating Point was developed by Google for use in machine learning and other applications that require a high throughput of floating-point operations.

Quantization:

Quantization is a technique that helps reduce the size of a model by converting high-precision data to low precision. In simple terms, it converts a data type with more bits to one with fewer bits. For example, converting FP32 values to 8-bit integers is a form of quantization.
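As a simple illustration, here is a hedged sketch of absolute-maximum (absmax) quantization from FP32 to int8, which maps values into the symmetric range [-127, 127]; the helper names are my own.

```python
import numpy as np

def absmax_quantize_int8(x: np.ndarray):
    """Quantize an FP32 tensor to int8 using the absmax scaling constant."""
    c = 127.0 / np.max(np.abs(x))          # quantization constant
    q = np.round(c * x).astype(np.int8)    # int8 codes
    return q, c

def dequantize_int8(q: np.ndarray, c: float) -> np.ndarray:
    """Recover approximate FP32 values from the int8 codes."""
    return q.astype(np.float32) / c

x = np.array([0.1, -0.5, 1.2, -2.4, 0.7], dtype=np.float32)
q, c = absmax_quantize_int8(x)
print(q)                     # e.g. [   5  -26   64 -127   37]
print(dequantize_int8(q, c)) # approximately recovers the original values
```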

4-bit NormalFloat Quantization:

4-bit NormalFloat Quantization is a method designed to efficiently quantize the weights of neural networks into a 4-bit format. The NormalFloat data type is designed to quantize data optimally, particularly for use in neural networks, and is based on a method called "Quantile Quantization", which ensures that each bin (or category) in the quantization process receives an equal number of values from the input data (in this case, the weights of a neural network). Quantiles are essentially cutoff points that divide the data into equal parts, with the number of parts determined by the bits of the data type (for example, the NF4 data type has 4 bits for quantization, so we have 2⁴ = 16 distinct values).

The weights of pretrained neural networks are assumed to follow a zero-centered normal distribution, meaning they are distributed around a central value of zero. The weights are normalized into the range [-1, 1] by dividing each weight by the absolute maximum value of its tensor (this is also called absolute maximum rescaling). By normalizing the input data this way, the weights are kept around zero and fewer bits are needed to represent each weight value.

To quantize a normal distribution into a 4-bit data type we have 16 different values (bins). To find these values, we split the standard normal distribution into 16 regions of equal probability mass (quantiles), so that each bin is expected to receive the same number of weights, and take a representative value for each bin. The exact values for the NF4 data type (16 bins) are as follows:

[-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453, -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0, 0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 0.33791524171829224, 0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0]

quantized 4-bit index = argmin_i | q_i − x_FP32 / absmax(X_FP32) |

where q_0, …, q_15 are the 16 NF4 values listed above and absmax(X_FP32) is the quantization constant of the tensor (or block).

Consider a weight that equals 0.686 after normalizing into [-1, 1]. This value is compared to the bin values above, and the closest one is 0.7229, the 15th of the 16 values (index 14 when counting from 0). So instead of storing 0.686 in the Float32 data type, the 4-bit representation stores only that index after quantization.
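Here is a small illustrative sketch of that nearest-bin lookup and the corresponding dequantization. The bin table is the NF4 value list above; the function names are my own, and this is not the bitsandbytes implementation.

```python
import numpy as np

# The 16 NF4 bin values listed above
NF4_BINS = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_quantize(x: np.ndarray):
    """Normalize by absmax, then store the index of the nearest NF4 bin."""
    absmax = np.max(np.abs(x))
    normalized = x / absmax
    indices = np.abs(normalized[:, None] - NF4_BINS[None, :]).argmin(axis=1)
    return indices.astype(np.uint8), absmax    # 4-bit codes + quantization constant

def nf4_dequantize(indices: np.ndarray, absmax: float) -> np.ndarray:
    """Look up the bin value and rescale by the quantization constant."""
    return NF4_BINS[indices] * absmax

x = np.array([0.686, -0.184, 1.0, -1.0], dtype=np.float32)
codes, absmax = nf4_quantize(x)
print(codes)                      # [14  5 15  0]
print(nf4_dequantize(codes, absmax))  # 0.686 comes back as roughly 0.7229
```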

Problem with outliers:

Outliers are very important in neural networks. Even though model weights are roughly normally distributed, there are some outliers that matter a lot: removing or changing them will affect the quality of the model. However, outliers also disrupt the quantization process. For example, in the below figure an outlier at -10 stretches the distribution so that the bins between -10 and -3 remain empty and only 8 quantization bins are actually filled with values, which effectively turns 4-bit quantization into 3-bit quantization. This issue can be severe and can degrade performance.
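A tiny numeric illustration of this effect, using an absmax-style integer quantizer (an assumed helper written for this example, not a library function):

```python
import numpy as np

def absmax_quantize(x: np.ndarray, bits: int = 4):
    """Symmetric absmax quantization onto a signed integer grid."""
    levels = 2 ** (bits - 1) - 1                 # 7 levels on each side for 4 bits
    c = levels / np.max(np.abs(x))
    return np.round(c * x).astype(np.int32)

weights = np.random.normal(0, 1, size=10_000)
print(len(np.unique(absmax_quantize(weights))))               # uses most of the grid

weights_with_outlier = np.append(weights, -10.0)
print(len(np.unique(absmax_quantize(weights_with_outlier))))  # far fewer bins actually used
```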

Block wise k-bit quantization:

Block-wise quantization divides the input tensor into smaller blocks and quantizes each block independently, which mitigates the outlier problem. In this process we split the input tensor into chunks, and each chunk is quantized independently with its own quantization constant (computed from the absolute maximum of that block). Even with outliers present, block-wise k-bit quantization gives much higher quantization precision and stability, because each outlier only distorts its own block.
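A minimal sketch of the idea (the block size of 64 matches the value used in the QLoRA paper; the code itself is illustrative, not the bitsandbytes implementation):

```python
import numpy as np

def blockwise_absmax_quantize(x: np.ndarray, block_size: int = 64, bits: int = 4):
    """Quantize a flat tensor block by block, one absmax constant per block."""
    levels = 2 ** (bits - 1) - 1
    codes, constants = [], []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        absmax = np.max(np.abs(block))
        constants.append(absmax)                              # stored per block
        codes.append(np.round(levels / absmax * block).astype(np.int8))
    return np.concatenate(codes), np.array(constants)

weights = np.random.normal(0, 1, size=1024)
weights[100] = -10.0                                          # an outlier
codes, constants = blockwise_absmax_quantize(weights)
# Only the block containing index 100 has a large constant; the other blocks are unaffected.
print(constants[:4])
```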

Double quantization:

Double quantization is the process of quantizing the quantization constants themselves, reducing the memory needed to store them even further. To perform dequantization we need to keep the quantization constants, and if we employ block-wise quantization we end up with one constant per block, stored in the original data type. For large LLMs this amounts to a substantial number of quantization constants that must be stored, leading to noticeable memory overhead.

According to the QLoRA paper:

For example, using 32-bit constants and a blocksize of 64 for W, quantization constants add 32/64 = 0.5 bits per parameter on average. Double Quantization helps reduce the memory footprint of quantization constants. More specifically, Double Quantization treats quantization constants c₂^FP32 of the first quantization as inputs to a second quantization. This second step yields the quantized quantization constants c₂^FP8 and the second level of quantization constants c₁^FP32. We use 8-bit Floats with a blocksize of 256 for the second quantization as no performance degradation is observed for 8-bit quantization, in line with results from Dettmers and Zettlemoyer [13]. Since the c₂^FP32 are positive, we subtract the mean from c₂ before quantization to center the values around zero and make use of symmetric quantization. On average, for a blocksize of 64, this quantization reduces the memory footprint per parameter from 32/64 = 0.5 bits, to 8/64 + 32/(64 · 256) = 0.127 bits, a reduction of 0.373 bits per parameter.

Even though the saving is only 0.373 bits per parameter, for large models with 70B parameters this has a significant impact on the overall memory footprint.
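A quick back-of-the-envelope check of those numbers (pure arithmetic; the 70B parameter count is used only for illustration):

```python
# Overhead of storing one FP32 constant per 64-weight block
first_level = 32 / 64                      # 0.5 bits per parameter

# Double quantization: FP8 constants per 64-weight block,
# plus FP32 constants per group of 256 blocks for the second level
double_quant = 8 / 64 + 32 / (64 * 256)    # ~0.127 bits per parameter

saving_bits = first_level - double_quant   # ~0.373 bits per parameter
params = 70e9                              # hypothetical 70B-parameter model
print(f"saving ~ {saving_bits * params / 8 / 1e9:.2f} GB")   # ~3.26 GB
```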

Dequantization:

Dequantization is the inverse of quantization. During quantization, the weights of the pre-trained model are converted to smaller data types, such as from 32-bit floats to 4-bit NormalFloat, which significantly reduces the memory required for training. However, for inference and for the forward and backward passes during training, these quantized weights (which are frozen and not trained) must be dequantized back to a higher-precision compute data type.

During fine-tuning we need to compute gradients that indicate how each trainable weight should be altered to minimize the loss function, so computations must be performed in the higher-precision compute data type, which requires dequantization. If the raw quantized values were used directly for forward propagation, the calculated gradients would become less accurate and unstable. Dequantization is performed using the quantization constant and the quantized values.

Returning to the example from the quantization section, after dequantizing back to the original data type we recover roughly 0.72, while the original value was 0.686, a difference of about 0.037. Even though this difference looks small, as data moves through several layers the accumulated error can affect the performance of the model. If double quantization is used, the quantization constants must themselves be dequantized back to the original data type (FP32) before being used.

Paged optimizers:

The concept of paged optimizers is used to manage memory usage during the training of large language models (LLMs). When training models with billions of parameters, running out of GPU memory is a common problem, and paged optimizers address the memory spikes that occur during training.

NVIDIA unified memory facilitates automatic page-to-page transfers between the CPU and GPU, similar to regular memory paging between CPU RAM and disk. When the GPU runs out of memory, the optimizer states are moved to CPU RAM and transferred back into GPU memory when they are needed.
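In practice this is usually enabled with a single setting. Below is a hedged sketch using the paged 8-bit AdamW optimizer from bitsandbytes; the model is a placeholder, and the Hugging Face Trainer route mentioned in the comment is an alternative way I believe is commonly used.

```python
import torch
import bitsandbytes as bnb

# A placeholder model; in a real QLoRA run this would be the PEFT-wrapped LLM.
model = torch.nn.Linear(1024, 1024)

# Paged 8-bit AdamW: optimizer states live in unified memory and can be
# paged between GPU and CPU RAM when memory spikes occur.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)

# (With the Hugging Face Trainer, the same effect is typically achieved via
#  TrainingArguments(optim="paged_adamw_8bit", ...).)
```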

References and useful resources:

1. QLoRA Paper

2. QLoRA: Quantization for Fine Tuning

3. Democratizing Foundation Models via k-bit Quantization by Tim Dettmers

4. LoRA And QLoRA: An Efficient Approach to Fine-tuning Large Models Under the hood by Srikaran
