Comprehensive Guide to Adapters, LoRA, QLoRA, LongLoRA with implementation

Rania Hossam
25 min read · Oct 18, 2023


In today’s fast-paced technological landscape, large AI models are propelling breakthroughs across diverse domains. However, tailoring these models to specific tasks or datasets can be a computationally expensive and resource-intensive endeavor.

But why is fine-tuning LLMs difficult?

Consider the sizes of popular models: RoBERTa base (125M), RoBERTa large (355M), DeBERTa XXL (1.5B), GPT-2 Medium (355M), GPT-2 Large (774M), GPT-3 (175B). Full fine-tuning takes a lot of GPU memory and time, and you need to store a full copy of the model for every task. Fine-tuning GPT-3, for example, needs about 1.2 TB of VRAM!

WHAT IS PEFT (Parameter-Efficient Fine-Tuning)?

As mentioned above, it has become a necessity to fine-tune and use bigger models when it comes to production-grade applications. PEFT techniques allow you to fine-tune the models efficiently and save money and time as a result. This is done by fine-tuning only the most important and relevant parameters in the neural network. The techniques introduce new parameters in the network or freeze the whole model except for some parts to make it easier to train the model.


Let’s Start with Adapters:

Adapters were one of the first parameter-efficient fine-tuning techniques released. In the original paper, the authors showed that you can add extra layers to the pre-existing transformer architecture and fine-tune only those instead of the whole model, and that this technique results in performance similar to complete fine-tuning.

In the figure from the paper, on the left is the modified transformer architecture with added adapter layers: the adapter layers are inserted after the attention stack and after the feed-forward stack. On the right is the architecture of the adapter layer itself. The adapter layer uses a bottleneck architecture: it takes the input, narrows it down to a smaller-dimensional representation, passes it through a non-linear activation function, and then scales it back up to the dimension of the input. This ensures that the next layer in the transformer stack can receive the adapter layer’s output. In the paper, the authors show that this method of fine-tuning is comparable to complete fine-tuning while consuming much less compute and training time: they come within 0.4% of full fine-tuning performance on the GLUE benchmark while adding only 3.6% of the parameters per task.
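To make the bottleneck idea concrete, here is a minimal sketch of an adapter block in PyTorch. This is my own illustrative code, not the authors’ implementation; the hidden size and bottleneck size are arbitrary choices.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> non-linearity -> up-project, plus a residual."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # zero-init the up-projection so the adapter starts out close to the identity
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Example: applied to hidden states coming out of an attention or feed-forward stack
hidden = torch.randn(2, 16, 768)   # (batch, seq_len, hidden_dim)
adapter = Adapter(hidden_dim=768)
out = adapter(hidden)              # same shape as the input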

You can have a look at the paper here.

These earlier approaches have drawbacks, though: adapter layers introduce inference latency; prefix-embedding tuning and prefix-layer tuning, which directly optimize the prompt, are hard to optimize; and reserving a part of the sequence length for adaptation reduces the sequence length available to process the downstream task.

So enter LoRA (Low-Rank Adaptation), a groundbreaking and efficient fine-tuning technique that harnesses the power of these advanced models for custom tasks and datasets without straining resources or incurring excessive costs.

LoRA has taken the AI community by storm in recent months. In this blog post, we’ll delve into the reasons behind its meteoric rise. We’ll explore the principles underpinning LoRA, its effectiveness in various domains, and the impact it’s having on the open-source community.

Background Concepts

Before diving into LoRA, let’s review some fundamental linear algebra concepts. If you’re already comfortable with the basics of linear algebra (particularly matrix rank), feel free to skip ahead.

Matrix Rank

The rank of a matrix is the dimension of the vector space generated by its columns, which is given by the number of linearly independent columns (or rows) in a given matrix. It can be proven that the number of independent columns (known as column rank) is always equal to the number of independent rows (called row rank). Hence, for a matrix A with m rows and n columns (represented as Aₘₙ), rank(A) ≤ min(m, n).

Types of Matrices

Based on its rank, a matrix can be primarily classified into two types.

Full-Rank Matrix

A matrix Aₘₙ is called a full-rank matrix if rank(A) = min(m, n), i.e. all of its columns (or rows) are linearly independent. The 3×3 identity matrix, for example, is a full-rank matrix.

Rank-Deficient Matrix

The opposite of a full-rank matrix is rank deficient, i.e. rank(A) < min(m, n). For example, a matrix in which every column is a multiple of the first has a rank of 1, because its columns are not linearly independent of one another.

Low-Rank Matrix: A rank-deficient matrix Aₘₙ is called a low-rank matrix if its rank is significantly lower (no fixed threshold) than the minimum number of rows and columns. Mathematically, rank(A) << min(m, n).
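You can check these definitions quickly with NumPy; here is a small illustrative sketch with made-up matrices:

import numpy as np

full_rank = np.array([[1.0, 0.0],
                      [0.0, 1.0]])            # columns are independent
rank_deficient = np.array([[1.0, 2.0],
                           [2.0, 4.0]])       # second column = 2 * first column

print(np.linalg.matrix_rank(full_rank))       # 2 -> full rank
print(np.linalg.matrix_rank(rank_deficient))  # 1 -> rank deficient

# A "low-rank" 100x100 matrix built from a rank-2 product
low_rank = np.random.randn(100, 2) @ np.random.randn(2, 100)
print(np.linalg.matrix_rank(low_rank))        # 2 << min(100, 100)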

By thinking of matrix rank as the dimensions of the feature space it represents, we can demystify this fundamental concept and grasp its real-world implications.

Matrix Rank as Feature Space Dimensions

Imagine a matrix as a data container, where each column represents a different feature or characteristic of the data. These features could be anything from temperature readings, pixel values in an image, to numerical attributes in a dataset. The rank of a matrix then reveals the effective number of independent features it contains. In other words, it quantifies the richness of information present in the data.

Low-Rank vs. Full-Rank Matrices

Let’s explore this concept further using two types of matrices: low-rank and full-rank.

1. Low-Rank Matrix: A low-rank matrix, despite having the same physical dimensions as a full-rank matrix, encapsulates fewer features or, equivalently, resides in a lower-dimensional feature space. Think of it as a dataset with redundant or highly correlated features. In this scenario, you can imagine that some of the columns in the matrix are merely linear combinations of others. Consequently, the information content is reduced, and the matrix’s rank is lower.

2. Full-Rank Matrix: Conversely, a full-rank matrix represents a feature space with the highest possible dimensionality, given its size. Each column is linearly independent, meaning that no feature is a duplicate or combination of others. This matrix captures the full breadth of information within the data.

Rank Decomposition: Rank decomposition involves factorizing a matrix Aₘₙ into two constituent matrices, Cₘᵣ and Fᵣₙ, where rank(A) = r. This factorization is a fundamental mathematical tool that captures the essential structure and information within a matrix in a more compact form; it is analogous to breaking complex data down into its fundamental building blocks.

LoRA and Rank Decomposition: With LoRA, the weight matrices of these models are decomposed into lower-rank approximations. This process reduces the number of parameters, effectively compressing the model while preserving its critical information. Here’s how it works:

1. Matrix Factorization: LoRA constrains the update to the weight matrices to be the product of two much smaller matrices, in the spirit of a truncated factorization such as the Singular Value Decomposition (SVD) of a low-rank matrix. This decomposition simplifies the structure of the update without sacrificing the model’s representational power.
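As an illustration of rank decomposition itself (not of LoRA’s training procedure, which learns the two factors directly), here is a sketch that factorizes a low-rank matrix Aₘₙ into Cₘᵣ and Fᵣₙ with a truncated SVD and compares the storage cost; the sizes are arbitrary:

import numpy as np

m, n, r = 512, 512, 8
A = np.random.randn(m, r) @ np.random.randn(r, n)   # a rank-8 matrix

U, S, Vt = np.linalg.svd(A, full_matrices=False)
C = U[:, :r] * S[:r]        # m x r
F = Vt[:r, :]               # r x n

print(np.allclose(A, C @ F))                 # True: the product reconstructs A
print("full matrix params:", m * n)          # 262144
print("factorized params:", m * r + r * n)   # 8192, roughly 32x fewer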

The full code is here: https://github.com/rania-hossam/Lora_Implementation_With_Pytorch

LoRA: Low-Rank Adaptation of Large (Language) Models

Low-Rank Adaptation (LoRA) is an innovative technique designed to efficiently fine-tune pre-trained language models by injecting trainable low-rank matrices into each layer of the Transformer architecture. It aims to reduce the number of trainable parameters and the computational burden while maintaining or improving the model’s performance on downstream tasks.

Let’s See How It Works:

Many previous works have shown that over-parametrized large models reside on a low intrinsic dimension. The main idea behind LoRA is that the change in weights during model adaptation also has a low intrinsic rank/dimension. Concretely, if Wₙₖ represents the weights of a single layer and ΔWₙₖ represents the change of weights during model adaptation, the authors propose that ΔWₙₖ is a low-rank matrix, i.e. rank(ΔWₙₖ) << min(n, k).

The Logic Behind Low-Rank Adaptation in Large Models:

Large AI models are designed to be versatile and capture a broad range of features within their respective domains. Whether it’s understanding language, processing audio and text, or generating images, these models are engineered to excel in general representation. Their ability to perform well across a wide spectrum of tasks, even those they’ve never encountered before (zero-shot tasks), is a testament to their power.

However, the challenge arises when we want to fine-tune these large models for specific tasks or datasets. In these cases, we don’t need to re-invent the wheel or re-learn all the features the model has already mastered. Instead, we only need to emphasize or fine-tune a subset of those features that are relevant to the specific task at hand.

This is where the concept of a low-rank matrix comes into play. When we adapt a large model to a specific task, we’re essentially making minor adjustments, represented by an update matrix (ΔW). In many cases, these adjustments are focused on a limited set of features, which implies that ΔW can be thought of as a low-rank matrix.

In simpler terms, rather than revamping the entire model from scratch, we’re making efficient, targeted modifications. This makes the fine-tuning process faster and computationally more efficient. It’s akin to fine-tuning a musical instrument; you don’t need to re-learn how to play the entire instrument; you just tweak the strings that need adjustment.

Methodology

The technique constrains the rank of the update matrix ΔW using its rank decomposition. It represents ΔWₙₖ as the product of two low-rank matrices, Bₙᵣ and Aᵣₖ, where r << min(n, k). This implies that the forward pass of the layer, originally Wx, is modified to Wx + BAx. A is given a random Gaussian initialization while B is initialized to 0, so BA = 0 at the start of training. The update BA is additionally scaled by a factor α/r.
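Here is a minimal sketch of that idea in PyTorch. It is an illustration, not the official LoRA implementation; the layer sizes, rank and alpha are arbitrary.

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trainable."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False               # frozen pre-trained weight W
        self.A = nn.Parameter(torch.empty(r, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, r))  # B = 0 => BA = 0 at the start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))      # random init for A
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, r=8, alpha=16)
y = layer(torch.randn(4, 768))   # at initialization this equals the frozen layer's output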

But why not just initialize both A and B randomly?

Let’s go on a quick tangent. How would you initialize A and B? If you initialized both randomly, consider what would happen at the beginning of training: in each forward pass we would add random noise to the output of an adapted module, and we would have to wait for the optimizer to correct the wrong initialization step by step, leading to instabilities at the start of fine-tuning. To mitigate this, we typically use lower learning rates, smaller initialization values, or warm-up periods that limit the effect these wrong parameters can have, so that we do not destabilize the weights too much. In the LLaMA-Adapter paper [3], the authors introduce zero gating: they start the value of an adapter’s gate (which is multiplied with the actual weights) at 0 and increase it over the course of training. An alternative would be to initialize both A and B with 0, but then you could not break symmetry, and in the learning process all parameters might be treated as one parameter. That is why LoRA initializes A randomly and B with zeros: the update BA starts at 0, so training begins exactly from the pre-trained model, while the random A still breaks symmetry.

What are the main advantages of LoRA?

1. Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.

2. A pre-trained model can be shared and used to build many small LoRA modules for different tasks.

3. LoRA makes training more efficient, lowering the hardware barrier by up to 3 times, since we do not need to calculate the gradients or maintain the optimizer states for most parameters.

4. The previous pre-trained weights are kept frozen, so the model is not as prone to catastrophic forgetting.

5. No extra inference latency: the learned update BA can be merged back into W before deployment.

6. LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning.

About the implementation: in the Transformer architecture, there are four weight matrices in the self-attention module (Wq, Wk, Wv, Wo) and two in the MLP module. In the paper, LoRA only adapts the attention weights for downstream tasks and freezes the MLP modules.

We use Wq, Wk, Wv, and Wo to refer to the query/key/value/output projection matrices in the self-attention module.
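With the Hugging Face peft library, restricting LoRA to the attention projections like this is just a matter of listing those modules. A short sketch, assuming a LLaMA-style model whose attention projections are named q_proj, k_proj, v_proj and o_proj:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only, MLP stays frozen
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)  # wraps an already-loaded transformers model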

During full fine-tuning, the model is initialized to the pre-trained weights Φ₀ and updated to Φ₀ + ΔΦ by repeatedly following the gradient.

One of the main drawbacks of full fine-tuning is that for each downstream task we learn a different set of parameters ΔΦ whose dimension equals that of the full model Φ₀.

LoRA adopts a more parameter-efficient approach, where the task-specific parameter increment ΔΦ = ΔΦ(Θ) is further encoded by a much smaller set of parameters Θ, with |Θ| far smaller than |Φ₀|.

https://github.com/rania-hossam/Lora_Implementation_With_Pytorch

QLoRA, or Quantized And Low Rank Adapters

QLoRA is a new approach to fine-tuning large language models (LLMs) that uses less memory while maintaining speed. It works by first quantizing the LLM to 4 bits, reducing the model’s memory footprint roughly 8x compared to 32-bit weights, and then fine-tuning the quantized model with Low-Rank Adapters (LoRA). QLoRA is based on the assumption that the bulk of the information in a large language model is contained in its weights, and that this information can be approximated at lower precision without affecting the model’s accuracy much. The result is a fine-tuned model that preserves most of the accuracy of the original LLM while being significantly smaller and quicker to train.

Before going deep into the architecture, let’s talk about the hardware requirements:

GPU: For models with fewer than 20 billion parameters, such as GPT-J, a GPU with at least 12 GB of VRAM is suggested; an RTX 3060 12 GB, for example, can be used. If you have a bigger GPU with 24 GB of VRAM, you can use a 20-billion-parameter model such as GPT-NeoX-20B.

RAM: It is suggested that you have at least 6 GB of RAM. Most current computers meet this criterion.

Hard Drive: Because GPT-J and GPT-NeoX-20B are large models, you should have at least 80 GB of free space on your hard disk.

Imagine a plot for a large language model (LLM) with time on the x-axis and memory on the y-axis. To fine-tune the LLM without sufficient computational power, we need to apply various optimization techniques, and the tradeoff primarily involves sacrificing time to reduce the model’s memory footprint, making fine-tuning feasible. However, it’s crucial to be aware that this approach may lengthen the fine-tuning process and potentially result in some information loss.

Now let’s go deeper into the architecture.

The first optimization is converting all the weight tensors of the LLM from 32-bit to 4-bit. These 4-bit tensors have a range from -8 to 7, effectively offering 16 distinct levels of representation. In contrast, 8-bit tensors cover a range from -127 to 127, and 32-bit floats can represent a vast interval from roughly 1.18e-38 to 3.4e38.

QLoRA combines 4-bit quantization of the frozen weights with parameter-efficient fine-tuning of injected adapter weights (LoRA) kept at higher precision. QLoRA has one storage data type (usually 4-bit NormalFloat) and one computation data type (16-bit BrainFloat). The weights are dequantized from the storage data type to the computation data type to perform the forward and backward pass, but weight gradients are computed only for the LoRA parameters, which use 16-bit BrainFloat. In other words, QLoRA extends LoRA by quantizing the weight values of the original network from a high-resolution data type, such as Float32, to a low-resolution 4-bit data type, which reduces memory demands and speeds up calculation.

So first, let’s talk about computation. Computation refers to the mathematical operations performed on the weights and activations of the network during both the forward pass (when making predictions) and the backward pass (when updating the weights during training). In a typical neural network these computations use 32-bit floating-point numbers, because they offer a good balance between precision (the ability to represent numbers accurately) and range (the span of numbers that can be represented). But using 32-bit floating-point numbers for all computations is memory-intensive. This is where quantization comes in: quantization is a technique to reduce the precision of the numbers used in the model. In the case of 4-bit quantization, the weights of the network are compressed from 32-bit floating-point numbers to 4-bit integers, and a 4-bit signed integer can range from -8 to 7.

Have you asked yourself what makes QLoRA better than plain LoRA?

The first ingredient is the normalization step: the weights of the model are first normalized to have zero mean and unit variance, which ensures they are distributed around zero and fall within a known range.

The second ingredient is quantization: the normalized weights are then quantized to 4 bits. This involves mapping the original high-precision weights to a small set of low-precision values. In NF4 the quantization levels are chosen to match the distribution of the normalized weights (the worked example below uses evenly spaced levels for simplicity). Since a 4-bit data type can only store 16 numbers, each zero-centered weight is mapped to the nearest of these 16 positions, and instead of storing the weight itself we store the index of that nearest position. Here is an example.

Let’s say we have an FP32 weight with a value of 0.7113, and that we split the range from -1 to 1 into 16 evenly spaced 4-bit positions (adjacent positions are then about 0.1333 apart).

0.7113 is closest to 0.7329, which is position 13 counting from zero (the 14th of the 16 positions). Instead of saving the FP32 value 0.7113, we store that 4-bit index.

int4Tensor = round( (totalNumberOfPositions / absmax(inputTensor)) * FP32WeightsTensor )

Here the total number of positions is 16.

The value totalNumberOfPositions / absmax(inputTensor) is called the quantization constant. Obviously there is some loss of information when we normalize and quantize, since we move from FP32, a high-resolution data type, to a low-resolution data type. The loss is not huge as long as there are no outliers in the input tensor, because an outlier would inflate absmax() and upset the distribution.

To avoid that issue, we generally quantize the weights independently in smaller blocks, which limits the effect of outliers.

The other half of the trick is dequantization. During the forward pass and backpropagation, the quantized weights are dequantized back to the computation precision by mapping the 4-bit values back to their original range. The dequantized weights are used in the computations, but in memory they remain stored in their 4-bit quantized form.

dequantizedTensor = int4Tensor / (totalNumberOfPositions / absmax(inputTensor))

So, for the example above, the dequantization error is 0.7329 - 0.7113 = 0.0216 (remember, adjacent levels are about 0.1333 apart, so the error is a fraction of one level).
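Here is a small NumPy sketch of this absmax-style round trip. It is an illustration only: it uses evenly spaced levels (real NF4 uses normal-distribution quantiles and blockwise constants), and the quantization constant is (n_levels/2 - 1)/absmax so that the integer codes fit the signed 4-bit range [-7, 7].

import numpy as np

def quantize_absmax(x, n_levels=16):
    c = (n_levels / 2 - 1) / np.abs(x).max()   # quantization constant
    q = np.round(c * x).astype(np.int8)        # 4-bit-style integer codes in [-7, 7]
    return q, c

def dequantize(q, c):
    return q / c

weights = np.array([0.7113, -0.4421, 0.9882, 0.0013], dtype=np.float32)
q, c = quantize_absmax(weights)
recovered = dequantize(q, c)
print(q)                                  # stored low-precision codes
print(recovered)                          # approximate FP32 values after dequantization
print(np.abs(recovered - weights).max())  # dequantization error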

Double Quantization (DQ)

is a technique introduced to reduce the memory footprint of quantization constants without degrading performance. It is particularly useful for achieving additional memory savings when using small block sizes for precise 4-bit quantization.

Double Quantization works by treating the quantization constants of the first quantization (c₂, stored in FP32) as inputs to a second quantization. This second step yields the quantized quantization constants (c₂ in FP8) and a second level of quantization constants (c₁ in FP32). The second quantization uses 8-bit floats with a block size of 256, as no performance degradation is observed for 8-bit quantization.

Since the FP32 c₂ values are positive, the mean is subtracted from c₂ before quantization to center the values around zero and make use of symmetric quantization.

On average, for a block size of 64, Double Quantization reduces the memory footprint per parameter from 32/64 = 0.5 bits to 8/64 + 32/(64*256) = 0.127 bits, a reduction of 0.373 bits per parameter.
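That per-parameter arithmetic is easy to verify (a quick sanity check of the numbers above, using block size 64 for the first quantization and 256 for the second):

block1, block2 = 64, 256

without_dq = 32 / block1                        # one FP32 constant per 64 weights
with_dq = 8 / block1 + 32 / (block1 * block2)   # FP8 constant + second-level FP32 constant

print(round(without_dq, 3))            # 0.5 bits per parameter
print(round(with_dq, 3))               # 0.127 bits per parameter
print(round(without_dq - with_dq, 3))  # 0.373 bits saved per parameter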

By applying Double Quantization, QLoRA can conserve memory while maintaining high performance during the fine-tuning process.

Another component is Paged Optimizers, a technique introduced in QLoRA to address out-of-memory errors during gradient checkpointing. Gradient checkpointing conserves memory during backpropagation by dividing the computation into smaller units that can fit into memory, but it can still produce memory spikes that exceed the available GPU memory, leading to out-of-memory errors.

Paged Optimizers work by allocating the optimizer states in paged memory. When a memory spike would exceed the available GPU memory, the pages holding optimizer states are evicted to CPU RAM and paged back into GPU memory when the optimizer update needs them. This enables QLoRA to fine-tune large models on a single GPU without running out of memory.

The technique uses NVIDIA unified memory to manage the memory pages, enabling the optimizer to seamlessly move data between the CPU and GPU as needed. This keeps the paging transparent and lets QLoRA maintain high throughput during the fine-tuning process.

Experiments conducted with QLoRA show that Paged Optimizers can achieve up to a 1.9x speedup compared to standard optimizers while maintaining the same level of accuracy. This makes them a key component in enabling QLoRA to fine-tune large models on a single GPU without memory constraints.
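In practice, enabling a paged optimizer is a one-line change; the fine-tuning notebook later in this post does exactly this through the transformers TrainingArguments (a minimal sketch):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",   # paged 8-bit AdamW: optimizer states can spill to CPU RAM
)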

Let’s understand the whole process:

Let us say we have an LLM with 140 billion parameters. In the forward pass we then have 140 billion frozen weights quantized to 4-bit precision, plus injected LoRA parameters amounting to roughly 1 percent of the original parameter count. Picture the LoRA parameters as a small dark-blue box of 32-bit tensors in the center, with the rest being the 4-bit LLM model weights.

It gets interesting here: whenever the forward pass sweeps over these layers, the frozen model weights are dequantized to the higher-precision computation format, but only for the computation itself; the dequantized copies are not kept in memory. The forward pass, as a function, produces the prediction of the system.

During the backward pass, the frozen weights are again dequantized on the fly to the computation precision. They remain frozen with respect to memory and updates, but they fully participate in the calculation: they are part of the attention layers, and without them it would make no sense to compute gradients for the comparatively few LoRA tensors. The moment a pass is done, everything is switched back to 4-bit. So after the forward and backward pass, only the higher-precision LoRA weight tensors (the dark-blue box) are updated and stored in memory, which is why this system is so memory friendly: we adjust the small set of 32-bit LoRA weight tensors, while the mass of 4-bit frozen weight tensors remains an integral part of the computation during the forward and backward pass but is never stored in dequantized form.

In that picture, imagine 100 frozen 4-bit weight tensors with a single 32-bit LoRA weight tensor in the middle.

When those remaining frozen 4-bit tensors are dequantized, each weight can only take one of the 16 representable values; that is a consequence of how dequantization works and of the dequantization error we talked about earlier in the blog. This small error exists for every single frozen weight of the model, but there is no such error in the LoRA parameters, since they were never quantized or dequantized. So when we fine-tune the LoRA weights for a specific task, the forward and backward passes in which the (error-carrying) dequantized frozen weights participate are what update the LoRA weight tensors. With QLoRA fine-tuning, the network learns the new task given the actual dequantization error of every frozen 4-bit weight tensor (which is kept frozen and never updated), and only the modified 32-bit LoRA weight tensors are stored in memory.

The results reported in the QLoRA paper show that NFloat4 with Double Quantization gives the strongest fine-tuning results among the 4-bit data types.

The implementation part is here!

When thinking about which dataset to use for fine-tuning, I chose the CoT Collection (https://github.com/kaistAI/CoT-Collection), the repository for the paper “The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning”, which includes 1.84M CoT rationales extracted across 1,060 tasks.

the paper link: https://arxiv.org/abs/2305.14045

You can have a look at the full code: https://github.com/rania-hossam/FINE_TUNING_LLAMA2_QLoRA

!pip install torch accelerate bitsandbytes datasets transformers peft trl scipy
import argparse
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
import os
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, BitsAndBytesConfig, DataCollatorForLanguageModeling
from torch import cuda, bfloat16
import transformers

import torch
import torch.nn as nn
from huggingface_hub import notebook_login

notebook_login()
#model_id = 'meta-llama/Llama-2-13b-chat-hf'
model_id = "meta-llama/Llama-2-7b-hf"

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=True
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=True
)
model.eval()
print(f"Model loaded on {device}")
mem = model.get_memory_footprint()
print("Memory footprint: {} ".format(mem))
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=True
)
# Load the dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("kaist-ai/CoT-Collection", split="train")
print(f'Number of records: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

'''
Number of records: 1837928
Column names are: ['source', 'target', 'rationale', 'task', 'type']
'''
dataset_cot = dataset.filter(lambda example: example['type'] == "CoT")
print(f'Number of records: {len(dataset_cot)}')
print(f'Column names are: {dataset_cot.column_names}')
def create_prompt(rec):
    start = "Read the Instruction below and provide an answer."
    question = f"### INSTRUCTION:\n{rec['source']}\n\n"
    response = f"### RESPONSE:\n{rec['rationale']}\n"
    answer = f"Therefore the answer is {rec['target']}\n\n"
    end = "### End"

    parts = [part for part in [start, question, response, answer, end] if part]

    formatted_prompt = "\n\n".join(parts)
    formatted_prompt = formatted_prompt.replace('\\n', '\n')

    rec["text"] = formatted_prompt

    return rec
p = create_prompt(dataset_cot[30000])
print(p)
print(p["text"])
dataset = dataset_cot.map(create_prompt)
dataset = dataset.map(
    batched=True,
    remove_columns=['source', 'target', 'rationale', 'task', 'type']
)
print(dataset[30000]["text"])
#max length of the model
def get_max_length(model):
    max_length = None
    # different architectures expose the context size under different config names
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length
mx = get_max_length(model)
mx
#tokenize dataset
dataset = dataset.map(lambda samples: tokenizer(samples['text']), batched=True)
dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < mx)
seed = 42
set_seed(seed)
dataset = dataset.shuffle(seed=seed)
for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable() # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit  # if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
modules = find_all_linear_names(model)
print(modules)

#['v_proj', 'up_proj', 'down_proj', 'k_proj', 'o_proj', 'q_proj', 'gate_proj']
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,  # LoRA rank r (dimension of the low-rank update matrices)
    lora_alpha=64,  # alpha scaling factor; the update is scaled by alpha/r
    target_modules=modules,  # apply LoRA to all 4-bit linear layers found above
    lora_dropout=0.1,  # dropout probability applied to the LoRA layers
    bias="none",
    task_type="CAUSAL_LM",  # CAUSAL_LM for decoder models like GPT/Llama; SEQ_2_SEQ_LM for encoder-decoder models like T5
)
##Get the PEFT Model using the downloaded model and the loRA config
model = get_peft_model(model, config)
# Print Trainable parameters
trainable_params = 0
all_param = 0
for _, param in model.named_parameters():
    all_param += param.numel()
    if param.requires_grad:
        trainable_params += param.numel()
print(
    f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
)
tokenizer.pad_token = tokenizer.eos_token
trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=100,  # 20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False # re-enable for inference to speed up predictions for similar inputs

trainer.train()
model.push_to_hub("Venkat-Ram-Rao/Llama2_7B_qlora_CoT_FT-v2",
                  use_auth_token=True,
                  commit_message="fine tuned on kaist-ai/CoT-Collection",
                  private=True)
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "Venkat-Ram-Rao/Llama2_7B_qlora_CoT_FT"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)
mem = model.get_memory_footprint()
print("Memory footprint: {} ".format(mem))
tst = """Read the Instruction below and provide an answer.

### INSTRUCTION:
In this task, you are given an input list A. You need to find all the elements of the list that are numbers and calculate their sum.

['i', '33', 'h', '849', '77']



### RESPONSE:"""
batch = tokenizer(tst, return_tensors='pt').to(model.device)
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=90)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))
The generated output ends with: Therefore the answer is 915

Let’s look at another technique: LongLoRA.

Before we dive into the architecture, let’s discuss some of the problems it addresses.

Sometimes you want to apply LLMs to books or long research papers. That leads to longer context lengths, which increases the number of possible token combinations the model must learn to predict accurately. This enables more robust long-range modeling but also requires more memory and processing power, leading to higher training costs.

We also have to know that context length refers to the maximum number of tokens the model can take into account when generating text. A longer context window allows the model to understand long-range dependencies in text better: models with longer contexts can build connections between ideas far apart in the text, generating more globally coherent outputs.

Training sequences must contain documents, books, articles, etc., with thousands of tokens, and the length of the training data sets a limit on the usable context length.

We also have to keep in mind that the standard attention mechanism is expensive: scaling transformers to longer sequences faces challenges due to the quadratic complexity, O(N²), of full attention.
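A quick back-of-the-envelope sketch shows why this matters. It is my own rough illustration; the head count and fp16 storage are assumptions, and real implementations differ in what they actually materialize.

def attention_matrix_gib(seq_len, n_heads=32, bytes_per_value=2):
    """Memory to materialize one layer's N x N attention scores in fp16."""
    return seq_len * seq_len * n_heads * bytes_per_value / 1024**3

print(attention_matrix_gib(2_048))    # ~0.25 GiB per layer
print(attention_matrix_gib(32_768))   # ~64 GiB per layer
print(attention_matrix_gib(100_000))  # ~596 GiB per layer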

How can we improve the efficiency of the attention mechanism?

Reducing the quadratic cost has become an active area of research in recent years. We can split these improvements into two categories: approximating attention, and exact attention combined with hardware-aware optimization.

Let’s start with the first category.

Sparse Attention

Approximation techniques constrain interactions between sequence positions. Sparse attention limits the number of non-zero attention weights per attention head, while local attention restricts interactions to a sliding window. These approximations reduce computational cost but may degrade accuracy on complex tasks. [2] Recent work has focused on optimizing attention to leverage GPU architectures. Sparse attention approximates attention by only computing the attention weights for a subset of the input tokens instead of all possible pairs, thus saving time and memory. There are different ways to implement sparse attention, such as using fixed or static patterns (e.g., local, strided, or block attention) or dynamic, adaptive patterns that depend on the input sequence (e.g., entmax or dynamic sparse attention).

Sparse attention can improve the efficiency and scalability of Transformers, especially for long sequences, but it may also sacrifice some representation power and accuracy. Quadratic attention can achieve high performance and quality, but it may also be computationally expensive and impractical for large-scale applications. Therefore, there is a trade-off between sparsity and complexity in attention mechanisms.

Stacked self-attention layers allow modeling long-range dependencies in text. The standard attention mechanism used in Transformers, which computes the attention weights for all possible pairs of input tokens, has a complexity of O(n²). It means that the computation and memory requirements grow quadratically with the input sequence length, limiting Transformers’ scalability and efficiency. When generating text, the model has to compute the attention matrix first. With a 100K context and quadratic attention, it can take minutes before the model starts generating text.

Flash Attention

The fundamental intuition is to avoid materializing the large N x N attention matrix, which requires quadratic reading/writing in the sequence length N. FlashAttention applies two techniques: tiling and recomputation. Tiling splits the input into blocks, loaded into fast GPU on-chip SRAM. Attention is computed block-by-block to avoid materializing the entire matrix. Recomputation stores just enough information to reconstruct the attention matrix on-chip during backpropagation, avoiding storing the large intermediate. [3] The authors analyze the IO complexity, proving FlashAttention requires O(N²/M) memory accesses versus O(N²) for standard attention, where M is the SRAM size. This IO-awareness allows FlashAttention to run faster despite the increased FLOPs from recomputation.

Experiments validate the speedups: FlashAttention trains BERT 15% faster than the MLPerf record, GPT-2 3x faster, and Long Range Arena 2.4x faster. This idea was further developed in FlashAttention-2. The improvements focus on enhancing parallelism across sequence blocks and optimizing work partitioning between thread blocks and warps on GPUs. Key techniques include reducing non-matrix-multiply operations, partitioning attention computation across threads to increase occupancy, and distributing work between warps to reduce shared-memory traffic. Empirical validation shows FlashAttention-2 achieves around a 2x speedup over FlashAttention, reaching up to 73% of theoretical peak FLOPs on A100 GPUs. When used to train GPT models end-to-end, training throughput reaches 225 TFLOPs/s per A100, translating to 1.3x faster training than with FlashAttention. These improvements promise to enable training models on much longer sequences than before at a similar cost. Accelerating attention speeds up inference and training, but fitting long text into the model while maintaining high output quality remains an issue.

So let’s now talk about LongLoRA:

LongLoRA is an innovative fine-tuning method that revolutionizes the context capacity of LLMs without the staggering computational demands. Traditionally, expanding the context size of these models has been an arduous and resource-intensive process: for instance, training an LLM with an 8192-token context requires a mind-boggling 16 times more computational resources than a 2048-token context. LongLoRA is set to change the game by offering a cost-effective approach to super-sizing LLMs.

A Paradigm-Shifting Training Method

LongLoRA’s development hinges on two groundbreaking approaches. First, it leverages the power of sparse local attention, specifically the shift short attention (S2-Attn) technique, during the fine-tuning process. This strategic move efficiently extends the model’s context while delivering substantial computational savings and maintaining performance levels akin to traditional fine-tuning with standard attention.

The second approach involves a reexamination of the parameter-efficient fine-tuning strategy for context expansion. The research findings underscored the effectiveness of LoRA when combined with trainable embeddings and normalization layers. LongLoRA shows impressive empirical results across various tasks, employing Llama 2 models ranging from 7B and 13B up to 70B. The context expansion achieved by LongLoRA is remarkable, stretching from 4k tokens to an astounding 100k for Llama 2 7B, or to 32k for Llama 2 70B, all achievable on a single 8x A100 machine. Notably, LongLoRA seamlessly integrates with existing techniques, including FlashAttention-2.
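Here is a rough sketch of the S2-Attn idea in PyTorch. It is my own simplification (it ignores causal masking and padding): attention is computed within fixed-size groups of tokens, and half of the heads operate on a half-group-shifted copy of the sequence so that information can still flow across group boundaries.

import torch

def shifted_short_attention(q, k, v, group_size):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len must be divisible by group_size."""
    B, H, N, D = q.shape
    half, shift = H // 2, group_size // 2

    def roll_half_heads(x, offset):
        # shift the second half of the heads along the sequence dimension
        return torch.cat([x[:, :half], x[:, half:].roll(offset, dims=2)], dim=1)

    q, k, v = (roll_half_heads(t, -shift) for t in (q, k, v))

    # reshape so that attention is only computed inside each group of tokens
    g = N // group_size
    q, k, v = (t.reshape(B, H, g, group_size, D) for t in (q, k, v))
    scores = torch.softmax(q @ k.transpose(-2, -1) / D**0.5, dim=-1)
    out = (scores @ v).reshape(B, H, N, D)

    # undo the shift for the second half of the heads
    return roll_half_heads(out, shift)

out = shifted_short_attention(torch.randn(1, 8, 256, 64),
                              torch.randn(1, 8, 256, 64),
                              torch.randn(1, 8, 256, 64), group_size=64)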

Practicality Enhanced: Meet the LongQA Dataset

To further enhance the practicality of LongLoRA, the research team developed the LongQA dataset for supervised fine-tuning. This extensive dataset comprises over 3,000 question-answer pairs, each embedded in a lengthy context, and serves as a valuable resource for fine-tuning exercises that reinforces LongLoRA’s real-world applicability.

Crucial Insights Uncovered

Long-sequence Language Modeling: The study conducted an exhaustive evaluation using the Proof-pile and PG19 datasets. The results unequivocally demonstrate that models trained with longer context sizes outperform their counterparts; simply put, more information during training leads to superior results. For instance, perplexity improved from 2.72 to an impressive 2.50 when the context window size increased from 8192 to 32768.

Maximum Context Length: The research also delved into the limits of context length a single machine could handle. Even when stretched to accommodate incredibly long contexts, the models maintained commendable performance, albeit with a slight dip at smaller context sizes.

Retrieval-based Evaluation: In addition to language modeling, the research evaluated the models on tasks involving the retrieval of specific topics in lengthy conversations. Remarkably, these models stood toe-to-toe with state-of-the-art counterparts, often surpassing them, and demonstrated superior adaptability to open-source data.

The Significance of Context Length

Recent discussions surrounding language models, such as LLaMA and Falcon, have shifted the focus from merely increasing model parameters to considering the number of context tokens or context length. LongLoRA’s emergence underscores the pivotal role context length plays in the evolving landscape of language models, providing a cost-effective avenue for expanding their capabilities.
