LoRA and QLoRA: Effective Methods to Fine-tune Your LLMs, in Detail

Levin M S
10 min read · Dec 5, 2023


PEFT Approach

What is Fine-tuning of LLM Models?

  • Many open-source pre-trained LLMs are freely available; you can find thousands of transformer models on Hugging Face's Hub. Each of these models was trained on a particular objective and then uploaded as a saved checkpoint. To use such a model for our own use case, we fine-tune it, i.e. adjust some of its parameters, before putting it to work.
  • Fine-tuning can therefore be defined as the process of adapting an LLM to a specific task by adjusting its parameters, making it suitable for downstream NLP (Natural Language Processing) tasks. It builds upon the knowledge the model acquired during pre-training.
LLMs require a large number of parameters

PEFT: Parameter-Efficient Fine-Tuning (for billion-scale models on low-resource hardware)

Why PEFT?

Issues with Full Fine-Tuning:

  1. We end up with a huge number of weights to train.
  2. It needs more compute power to train: as models grow in size and depth, we need bigger (and more) GPUs to train and fine-tune them.
  3. Increase in file size
    - Full fine-tuning hurts the portability of the model, because earlier approaches produce checkpoints that take up tens of GBs of storage.
    - For example, Google's T5-XXL and BigScience's mt0-xxl are transformer models that take up roughly 40–50 GB of storage, so fully fine-tuning them produces a 40–50 GB checkpoint for every downstream dataset. These models easily have 10–20B parameters and are typically fine-tuned with relatively high learning rates (1e-4 and 3e-4).

PEFT Methods:

  • LoRA
  • Prefix tuning
  • P tuning
  • Prompt Tuning
  • QLoRA

We will focus only on LoRA and QLoRA in this blog; the remaining methods will be covered in future posts.

Full Fine-Tuning of Models without the PEFT Approach

In normal fine-tuning without PEFT, a hidden layer computes h(x) = W0·x, where the weight matrix W0 has (d x k) trainable parameters, x is a (k x 1) input vector, and h(x) is the (d x 1) output. With this said, it is clear that the number of parameters is huge, so pre-training and fine-tuning the model gets more and more difficult.

LoRA: Low-Rank Adaptation Method

Working of LoRA:

  • The base transformer is kept in 16-bit precision (in contrast to QLoRA's 4-bit quantization, discussed later).
  • It allows us to fine-tune only a small number of extra weights in the model while we freeze most of the parameters of the pre-trained network.
  • So we are not training the original weights. Instead, we add some extra weights to the model and then we train those.
  • The advantage is that we still have the original weights. This also tends to help prevent catastrophic forgetting.
  • Catastrophic forgetting: when a model forgets what it learned during pre-training because fine-tuning has overwritten too much of it. This happens when we fine-tune too aggressively.
Fine-Tuning using LoRA Approach
  • In fine-tuning with LoRA, the hidden layer computes its output with the weight (W0 + ΔW), where the original weight W0 is kept frozen, i.e. it is used in the forward pass but never updated during fine-tuning.
  • ΔW is the newly added weight update, where ΔW = BA, with B and A being two matrices: B has dimension (d x r) and A has dimension (r x k).
  • r is the rank of the update matrices, expressed as an 'int'. A lower rank results in smaller update matrices with fewer trainable parameters. In the figure above, r is taken as 2, a small number, so we end up with very few trainable parameters.
  • Now we train only ((d + k) x r) parameters, far fewer than the original (d x k) parameters, as the quick sketch below shows.
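As a quick sanity check of these parameter counts, here is a minimal sketch. The dimensions are made up purely for illustration; a real transformer layer would take d and k from its hidden sizes:

import numpy as np

# Illustrative dimensions (hypothetical, just to compare the counts)
d, k, r = 768, 768, 2

full_params = d * k                # parameters updated in full fine-tuning
lora_params = (d + k) * r          # parameters in the LoRA matrices B (d x r) and A (r x k)

print(full_params)                 # 589824
print(lora_params)                 # 3072 -> roughly 0.5% of the full count

# The adapted forward pass: h(x) = W0 @ x + B @ A @ x, with W0 frozen
W0 = np.random.randn(d, k)
B = np.zeros((d, r))               # B starts at zero in the LoRA paper, so training begins from W0
A = np.random.randn(r, k)
x = np.random.randn(k, 1)
h = W0 @ x + B @ A @ x             # identical to W0 @ x at the start of training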

Code Implementation:

Let’s perform sentiment analysis using pre-trained Transformers from Hugging Face.

Necessary import functions

from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np

Base Model: I’m taking DistilBERT

model_checkpoint = 'distilbert-base-uncased'
#define label maps
id2label = {0: 'Negative', 1:'Positive'}
label2id = {'Negative':0, 'Positive':1}

#generate classification model from model_checkpoints
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)

Load Dataset: I am going with sst2 dataset

# sst2
# The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels.
dataset = load_dataset("glue", "sst2")
dataset
# Output
'''
DatasetDict({
train: Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 67349
})
validation: Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 872
})
test: Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 1821
})
})'''
# display % of training data with label=1
np.array(dataset['train']['label']).sum()/len(dataset['train']['label'])
# Output
'''
0.5578256544269403
'''

Preprocess data

# create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["sentence"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs
# apply it to all texts in the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset
# Output
'''
DatasetDict({
train: Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
num_rows: 67349
})
validation: Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
num_rows: 872
})
test: Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
num_rows: 1821
})
})'''

Data Collator: we use this for dynamic padding. Instead of padding every sequence in the dataset up front to one global maximum length, the collator pads each batch on the fly, only up to the length of the longest sequence in that batch. This saves time and avoids wasting memory on unnecessary padding tokens.

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
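As a minimal illustration of what the collator does (the two sentences are made up; only the padding behaviour matters here):

# Two raw examples of different lengths (hypothetical sentences)
sample_batch = [tokenizer("a short sentence"),
                tokenizer("a slightly longer example sentence here")]

# The collator pads both to the length of the longer one and returns PyTorch tensors
padded = data_collator(sample_batch)
print(padded["input_ids"].shape)    # (2, length of the longest sequence in the batch)
print(padded["attention_mask"][0])  # trailing zeros mark the padded positions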

Evaluation Metrics:

# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")

# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    # accuracy.compute already returns a dict of the form {"accuracy": value}
    return accuracy.compute(predictions=predictions, references=labels)

Apply untrained model to text

# define list of examples
text_list = ["this one is a pass.", "not a fan, don't recommend.", "it 's just incredibly dull.",
"impresses you with its open-endedness and surprises."]

print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)
    print(text + " - " + id2label[predictions.tolist()])
# Output
'''
Untrained model predictions:
----------------------------
this one is a pass. - Negative
not a fan, don't recommend. - Negative
it 's just incredibly dull . - Positive
impresses you with its open-endedness and surprises. - Negative
'''

As we can see, the model performs very poorly before training.

Train Model: Fine-tuning with LoRA

peft_config = LoraConfig(task_type="SEQ_CLS",      # sequence classification
                         r=4,                       # intrinsic rank of the trainable update matrices
                         lora_alpha=32,             # scaling factor for the LoRA updates
                         lora_dropout=0.01,         # dropout probability for the LoRA layers
                         target_modules=['q_lin'])  # apply LoRA to DistilBERT's query projection (named 'q_lin')

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
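print_trainable_parameters() reports how few weights are now trainable. If you want to verify this yourself, a quick manual count (plain PyTorch, nothing PEFT-specific) looks like this:

# Count only the parameters that will actually receive gradient updates
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")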

Defining Hyperparameters and training arguments:

# Hyperparameters
lr = 1e-3 # size of optimization step
batch_size = 4 # number of examples processed per optimization step
num_epochs = 10 # number of passes through the training data

# training arguments
training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Trainer Object

trainer = Trainer(
    model=model,                                   # our PEFT model (DistilBERT + LoRA adapters)
    args=training_args,                            # training arguments / hyperparameters
    train_dataset=tokenized_dataset["train"],      # training data
    eval_dataset=tokenized_dataset["validation"],  # validation data
    tokenizer=tokenizer,                           # tokenizer
    data_collator=data_collator,                   # dynamic sequence padding
    compute_metrics=compute_metrics,               # model performance evaluation metric
)

# train model
trainer.train()
Output of the above code segment (per-epoch training and validation metrics).

Generate the post-training predictions

model.to('cpu') # mps for Mac

print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to("cpu")  # .to('mps') for Mac

    logits = model(inputs).logits
    predictions = torch.max(logits, 1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])
# Output
'''
Trained model predictions:
----------------------------
this one is a pass. - Positive
not a fan, don't recommend. - Negative
it 's just incredibly dull . - Positive
impresses you with its open-endedness and surprises. - Positive
'''
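After training, only the small LoRA adapter weights need to be saved, and they can optionally be merged back into the base model for deployment. A minimal sketch (the output directory names are just examples):

# Save only the LoRA adapter weights (a few MB, not a full model checkpoint)
model.save_pretrained("distilbert-lora-sst2-adapter")

# Optionally merge the adapters into the base weights for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("distilbert-lora-sst2-merged")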

QLoRA: Quantized Low-Rank Adaptation Method

How is it different from LoRA:

  • The base transformer is stored in 4-bit precision.
  • QLoRA is a fine-tuning technique that combines a high-precision compute data type with a low-precision storage data type. This keeps the model's memory footprint small while still making sure the model remains highly performant and accurate.
  • QLoRA uses LoRA as an accessory to correct the errors introduced by quantization.

Working of QLoRA:

  • QLoRA works by introducing 3 new concepts that help to reduce memory while retaining the same quality performance. These are 4-bit Normal Float, Double Quantization, and Paged Optimizers.

4-bit Normal Float (NF4):

  • 4-bit NormalFloat is a new data type and a key ingredient to maintaining 16-bit performance levels. Its main property is this: Any bit combination in the data type, e.g. 0011 or 0101, gets assigned an equal number of elements from an input tensor.
  • The pre-trained weights are quantized to 4 bits, while the injected adapter weights (LoRA) are kept and trained in higher precision.
  • QLoRA has one storage data type (NF4) and a computation data type (16-bit BrainFloat).
  • We dequantize the storage data type to the computation data type to perform the forward and backward pass, but we only compute weight gradients for the LORA parameters which use 16-bit BrainFloat.

1. Normalization: The weights of the model are first normalized to have zero mean and unit variance. This ensures that the weights are distributed around zero and fall within a certain range.

2. Quantization: The normalized weights are then quantized to 4 bits. This involves mapping the original high-precision weights to a small set of low-precision values. In the case of NF4, the quantization levels are not evenly spaced: they are chosen as the quantiles of a standard normal distribution, so that each level is assigned roughly the same number of weights from the normalized tensor.

3. Dequantization: During the forward pass and backpropagation, the quantized weights are dequantized back to full precision. This is done by mapping the 4-bit quantized values back to their original range. The dequantized weights are used in the computations, but they are stored in memory in their 4-bit quantized form.

Quantization places the data into “buckets” or “bins”: several distinct numbers, say 2 and 3, can fall into the same bucket and be stored as the same quantized value. This lets us represent the data with far fewer distinct numbers, at the cost of “rounding off” each value to its nearest quantile. The toy sketch below walks through this normalize, quantize, dequantize round trip.
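To make the three steps concrete, here is a toy sketch. For simplicity it uses a handful of evenly spaced levels rather than the true NF4 quantile levels, so it only illustrates the round trip and the rounding error it introduces:

import numpy as np

weights = np.array([0.31, -1.20, 0.05, 2.10, -0.77])

# 1. Normalization: zero mean, unit variance
mean, std = weights.mean(), weights.std()
normalized = (weights - mean) / std

# 2. Quantization: map each value to the nearest of a small set of levels
#    (the real NF4 data type uses 16 levels placed at normal-distribution quantiles)
levels = np.linspace(-1.5, 1.5, 16)
indices = np.abs(normalized[:, None] - levels[None, :]).argmin(axis=1)  # 4-bit codes 0..15

# 3. Dequantization: look the levels back up and undo the normalization
dequantized = levels[indices] * std + mean

print(np.round(dequantized, 3))             # close to the original weights
print(np.abs(dequantized - weights).max())  # the rounding error introduced by quantization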

Double Quantization:

  • Double quantization refers to quantizing the quantization constants used in the 4-bit NF4 quantization itself. While it may seem inconspicuous, the QLoRA paper reports that it reduces the memory overhead of these constants from about 0.5 bits to roughly 0.127 bits per parameter (see the quick calculation after this list). This optimization proves particularly beneficial because QLoRA employs block-wise k-bit quantization: instead of quantizing all weights with a single constant, the weights are split into distinct blocks or chunks that are quantized independently.
  • Block-wise quantization therefore produces a large number of quantization constants. Interestingly, these constants can themselves undergo a second round of quantization, offering an opportunity for additional space savings. The overhead of this second level stays small because there are comparatively few constants to store.
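A back-of-the-envelope calculation, using the block sizes reported in the QLoRA paper (64 weights per quantization block, and one second-level constant shared by 256 first-level constants), shows where these numbers come from:

block_size = 64    # weights per first-level quantization block

# Without double quantization: one 32-bit float constant per block
single_quant_overhead = 32 / block_size                             # = 0.5 extra bits per parameter

# With double quantization: constants stored in 8 bits, plus one 32-bit
# second-level constant shared by every 256 blocks
double_quant_overhead = 8 / block_size + 32 / (block_size * 256)    # ~0.127 bits per parameter

print(single_quant_overhead, round(double_quant_overhead, 3))       # 0.5 vs 0.127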

Paged Optimizers:

  • Paged optimizers use NVIDIA unified memory to page optimizer states out of GPU memory into CPU RAM and back, which absorbs the memory spikes that would otherwise cause out-of-memory errors when processing long sequences (this appears as the optim setting in the configuration sketch at the end of this section).
  • Why does QLoRA need error correction at all? As demonstrated earlier, quantile quantization creates buckets or bins that cover a wide range of numerical values, so multiple distinct numbers get mapped to the same bucket; for example, both 2 and 3 may be stored as the same quantized value, and dequantizing the weights then introduces an error of up to 1.
  • Visualizing these errors across a broader weight distribution in a neural network reveals the inherent challenges of quantile quantization. This discrepancy underscores why QLoRA functions more as a fine-tuning mechanism than a standalone quantization strategy, despite its applicability for 4-bit inference. During fine-tuning with QLoRA, the LoRA tuning mechanism comes into play, involving the creation of two smaller weight update matrices. These matrices, maintained in a higher precision format such as brain float 16 or float 16, are then utilized to update the neural network weights.
  • It’s noteworthy that throughout backpropagation and the forward pass, the weights of the network undergo de-quantization, ensuring that actual training occurs in higher precision formats. Although the storage remains in lower precision, this deliberate choice introduces quantization errors. However, the model training process itself exhibits the capacity to adapt and mitigate these inefficiencies inherent in the quantization process. In essence, the LoRA training approach with higher precision aids the model in learning about and actively reducing quantization errors.
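In practice, all three ideas surface as a handful of configuration flags when loading a model with the transformers, peft, and bitsandbytes libraries. A minimal sketch (the base model name and the LoRA settings are illustrative choices, not recommendations):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bit
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for the forward/backward pass
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                    # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

training_args = TrainingArguments(
    output_dir="qlora-sketch",
    optim="paged_adamw_8bit",               # paged optimizer to absorb memory spikes
    per_device_train_batch_size=4,
    num_train_epochs=1,
)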

Variants of QLoRA

1. QALoRA: Quantization Aware LoRA

  • QALoRA was mainly released for fine-tuning diffusion models, but, just like LoRA, it can easily be generalized to training any type of model.
  • The difference between QLoRA and QALoRA is that QALoRA is quantization-aware, meaning the weights of the LoRA adapters are also quantized along with the weights of the model during the fine-tuning process. This makes training more efficient, as there is no need for a conversion step when updating the model during backpropagation.

2. LongLoRA:

  • LongLoRA, a distinctive variant of the LoRA fine-tuning technique, is tailored for training models with extended context lengths. It leverages a concept known as shifted sparse attention (sometimes called shift short attention): tokens are organized into chunks or groups, and attention is computed independently within each group. This lets LongLoRA scale to much longer contexts by distributing the computational workload efficiently.

Blogs by Levin M S. Follow for more.
