Comparative Analysis of LoRA Parameters on Llama-2 with Flash Attention

Drishti Sushma
8 min read · Sep 11, 2023


Updated: 26/01/24

Objective of the Study

The key objective of this study was to analyze the effects of the LoRA parameters r, lora_alpha, and lora_dropout on the performance of Llama-2 with Flash Attention. A secondary goal was to train a model that can generate instructions from given inputs.

Dataset and Model Used

We utilized databricks-dolly-15k, an open-source dataset of ~15k instruction-following records generated by thousands of Databricks employees across the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. As the base model for our experiments we used "NousResearch/Llama-2-7b-hf".
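
For reference, the dataset is available on the Hugging Face Hub and can be loaded with the datasets library:

from datasets import load_dataset

# 15k instruction-following records with instruction/context/response/category fields
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")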

LoRA

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large pre-trained models. In a world where ever-growing models have become the norm, it's crucial to have a strategy that lets us leverage their power without retraining from scratch. LoRA addresses this by freezing the original weights and learning a small low-rank update, the product of two narrow matrices, that is added to them during fine-tuning. This ensures that the core knowledge of the pre-trained model remains intact while permitting tailored adjustments to new data.
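
A minimal sketch of the mechanism in PyTorch (illustrative only, not the actual peft implementation): the pretrained weight stays frozen, only the two low-rank factors are trained, and their product is scaled by lora_alpha / r before being added to the output.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: freeze the base layer, train only A and B."""
    def __init__(self, base: nn.Linear, r: int, lora_alpha: float, lora_dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # core pretrained knowledge stays frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection to rank r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection back out
        nn.init.zeros_(self.lora_B.weight)          # update starts at zero: no change at step 0
        self.dropout = nn.Dropout(lora_dropout)     # applied to the adapter branch only
        self.scaling = lora_alpha / r               # the alpha/r scaling used by peft

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

# e.g. wrap one attention projection of a 4096-wide model
layer = LoRALinear(nn.Linear(4096, 4096), r=64, lora_alpha=16, lora_dropout=0.1)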

Understanding the Parameters of LoRA

To harness the full power of LoRA, it’s essential to grasp the significance of its key parameters. These parameters control how the adaptation matrix behaves and, consequently, the nature of the fine-tuning process.

  1. lora_alpha: This parameter controls the scaling of the low-rank matrices used in the adaptation. It acts as a scaling factor, determining how much influence the low-rank matrices have during the fine-tuning process.

Theoretical Implications:

  • Adjusting lora_alpha is a balancing act between leveraging the knowledge contained in the pretrained model and adapting it to a new task.
  • A higher lora_alpha value means greater influence of the adapted weights on the model's behavior, leading to more significant changes that can improve the model's ability to adapt to the task it's being fine-tuned for. A lower lora_alpha value means less influence, thereby preserving more of the original pretrained model's characteristics (see the scaling sketch after this list).
  • With a higher lora_alpha, the risk of overfitting may increase, especially if lora_alpha is set too high. This is because the model may adapt too strongly to the specific characteristics of the training data. A lower lora_alpha might help mitigate overfitting but could also lead to underfitting if it is too low.
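
Note that in the peft implementation the adapter's output is multiplied by lora_alpha / r, so with r held fixed at 64 (as in our experiments) the effective scaling grows linearly with lora_alpha:

# Effective adapter scaling (lora_alpha / r) for the values swept later, at r=64
r = 64
for alpha in [16, 32, 64, 128, 256]:
    print(f"lora_alpha={alpha:>3} -> scaling={alpha / r:.2f}")
# lora_alpha= 16 -> scaling=0.25 ... lora_alpha=256 -> scaling=4.00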

2. r: The r parameter in LoRA denotes the rank of the low-rank matrices used to adapt the original model's weight matrices during fine-tuning. This rank determines the size of the update matrices, making the process efficient by training far fewer parameters than the full weight matrix would require. The choice of r balances computational efficiency against the expressiveness of the adaptation.

Theoretical implications:

  • For smaller ranks, the number of parameters that need to be fine-tuned during adaptation is reduced. This can lead to faster training, lower memory use, and reduced computational requirements, making LoRA suitable for devices with constrained resources (see the quick calculation after this list).
  • A higher rank matrix means more parameters will be adapted, potentially allowing for a better fit to new data. Conversely, a smaller rank matrix means fewer parameters will change, ensuring more of the pre-trained model’s original structure is preserved.
  • By restricting the rank of the adaptation matrix, one may prevent the model from overfitting to the fine-tuning data. This helps maintain the model’s generalization capability.
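
As a quick illustration of the savings, consider a single 4096 x 4096 projection matrix from Llama-2-7B (its hidden size is 4096), assuming LoRA factors of shape (4096, r) and (r, 4096):

# Trainable parameters of a LoRA adapter vs. the full 4096x4096 weight matrix
d = 4096
full = d * d  # 16,777,216 parameters in the frozen matrix
for r in [16, 32, 64, 128, 256, 512]:
    lora = 2 * r * d  # matrix A (d -> r) plus matrix B (r -> d)
    print(f"r={r:>3}: {lora:>9,} trainable params ({lora / full:.1%} of full)")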

3. lora_dropout: This parameter refers to the dropout rate applied during the fine-tuning process. Dropout is a regularization technique where a proportion of neurons (or parameters) are randomly “dropped out” or turned off during training to prevent overfitting.

Theoretical Implications:

  • A higher dropout rate means more neurons are turned off during each training iteration.
  • While dropout can help prevent overfitting, setting it too high might lead to underfitting or the model not learning enough from the training data.
  • On the flip side, too low a dropout might make the model memorize the training data, reducing its generalization capabilities on unseen data.

Experimental Setup

  1. All tests were conducted on an A100 GPU.
  2. model_id = "NousResearch/Llama-2-7b-hf"
  3. dataset = databricks-dolly-15k
  4. BitsAndBytesConfig Setup

# BitsAndBytesConfig int-4 config
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4-bit
    bnb_4bit_use_double_quant=True,         # nested quantization for extra memory savings
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bfloat16
)
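
The post doesn't show the model-loading step itself; here is a minimal sketch assuming the Hugging Face transformers API (the Flash Attention flag varies by version: recent releases use attn_implementation="flash_attention_2", older ones used use_flash_attention_2=True):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,           # the 4-bit config above
    attn_implementation="flash_attention_2",  # assumption: recent transformers release
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token     # Llama-2 has no pad token by default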

  5. LoRA Config

# LoRA config based on QLoRA paper
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=LORA_ALPHA,  # the parameter under study; varied per run
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)


# prepare model for training
from peft import prepare_model_for_kbit_training, get_peft_model

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
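
As a sanity check, peft can report how small the trainable fraction actually is:

model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...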

  6. Training Arguments

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=f"llama-7-int4-dolly-15k-flash-attn-r-64-lora-alpha-{LORA_ALPHA}",
    num_train_epochs=3,
    per_device_train_batch_size=6,
    optim="paged_adamw_32bit",  # paged optimizer from the QLoRA paper
    logging_steps=200,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    lr_scheduler_type="constant",
    disable_tqdm=True,          # disable tqdm since with packing the reported values are incorrect
)
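
The trainer itself isn't shown in the post; here is a minimal sketch using trl's SFTTrainer (the exact signature varies across trl versions, and format_instruction is a hypothetical helper that builds the instruction prompt from a dolly record):

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=2048,                 # assumption: context length used for packing
    tokenizer=tokenizer,
    packing=True,                        # pack short examples into full-length sequences
    formatting_func=format_instruction,  # hypothetical prompt-building helper
    args=args,
)
trainer.train()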

Evaluating the Impact of r, lora_alpha, and lora_dropout on Model Performance

A) lora_alpha: The parameter lora_alpha was adjusted across the values [16, 32, 64, 128, 256], while lora_dropout=0.1 and r=64 were held constant. The following findings were observed (a sketch of how such a sweep might be scripted follows the list):

Key Observations Obtained After Varying lora_alpha:

  1. Training Loss Trend: There's a significant decrease in training loss as lora_alpha increases, from 1.2216 (α=16) to 1.1358 (α=256). This suggests that a stronger scaling of the adapter updates lets the model adapt more closely to the data.
  2. Training Time Insight: Despite the increase in lora_alpha, the training time varies only minimally, between 33min 52s (α=16) and 34min 52s (α=256). This points to the efficiency of the LoRA technique: since lora_alpha merely rescales the adapter updates, it adds no parameters and barely affects training cost.
  3. Inference Time Efficiency: A remarkable improvement in inference time was noted as lora_alpha rose, dropping from 3.48s (α=16) to 2.56s (α=256). Since a higher lora_alpha adds no parameters, the faster inference possibly hints at the model's increased prediction confidence or at inherent optimizations.
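
As promised above, a sketch of how the sweep might be scripted (train_and_evaluate is a hypothetical helper wrapping the trainer setup and the loss/time measurements):

# Sweep lora_alpha while holding lora_dropout=0.1 and r=64 fixed
for LORA_ALPHA in [16, 32, 64, 128, 256]:
    peft_config = LoraConfig(
        lora_alpha=LORA_ALPHA,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
    )
    train_and_evaluate(peft_config)  # hypothetical: trains, then records loss and timings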

B) lora_dropout: For this experiment, we varied lora_dropout across [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], while keeping lora_alpha=16 and r=64 constant. Here are the observed results:

Key Observations Obtained After Varying lora_dropout:

  1. Consistent Training Loss: The training loss remains relatively stable across varied dropout rates, staying around the 1.22 range.
  2. Minor Training Time Fluctuations: There’s only a slight difference in training time across the dropout values, ranging from 33min 27s (0 dropout) to 34min 52s (0.3 dropout).
  3. Inference Time Anomalies: A dropout rate of 0.4 leads to the highest inference time at 3.48s, yet a higher rate of 0.5 results in the shortest time of 1.26s. No consistent relationship between dropout rate and inference time is observed.
  4. Impact of lora_dropout: The minimal variation in training loss across the dropout values suggests that within the tested range, lora_dropout doesn’t greatly influence the model’s fit to the training data.
  5. Inconsistencies in Dropout Impact: The lack of a clear pattern in training and inference times with varying dropout values implies that the relationship between lora_dropout and these metrics is complex.

C) r: The value of r was varied across [16, 32, 64, 128, 256, 512] while keeping lora_alpha=16 and lora_dropout=0.1 constant, and the following results were obtained:

Key Observations Obtained After Varying r:

  1. Consistent Training Loss: Variations in r result in only minor changes in training loss, suggesting that the choice of r may not significantly impact this metric.
  2. Training Time and r: Larger r values tend to increase the training time, highlighting the added computational cost with more parameters.
  3. Inference Time Variation: Despite expectations, higher r values do not consistently result in longer inference times. For example, r=32 and r=64 have notably different inference durations.
  4. Optimal Balance with r=32: Weighing both training and inference time, r=32 appears to be the most efficient choice for this dataset.

So…Which LoRA Parameter has a More Pronounced Effect?

Well, the answer is lora_alpha!

The impact of lora_alpha is more pronounced in the following ways:

i) Training Loss Decline: As lora_alpha increases, there’s a clear and consistent reduction in training loss. This is particularly notable when comparing the loss at α=16 (1.2216) with that at α=256 (1.1358). This represents a significant decline and indicates better fine-tuning with higher lora_alpha values.

ii) Stable Training Time: Despite the improved training loss, the training time doesn't increase with lora_alpha. The difference between the training times at the smallest and largest lora_alpha values (α=16 and α=256) is only one minute, demonstrating that a larger scaling factor costs essentially nothing in training time.

iii) Inference Time Improvement: The inference time drops as lora_alpha increases, especially when transitioning from α=16 (3.48s) to α=64 (2.58s). This indicates that the model might become more efficient or more certain in its predictions as the adapter scaling increases.

iv) Clear Trend: Unlike r and lora_dropout, whose trends in training loss, training time, and inference time can seem somewhat erratic, lora_alpha presents a more predictable and discernible pattern, especially in training loss and inference time.

v) Data Adaptation: A higher value of lora_alpha amplifies the low-rank update relative to the frozen pretrained weights, strengthening the adaptation without adding parameters. The data suggests that as this scaling increases, the model can better fit the data, as seen in the falling training loss.

Future Work

  1. Scaling Experiments: Test the effects of LoRA parameters on larger-scale versions of the Llama-2 model to discern if there’s any change in the observed trends.
  2. Benchmarking with Other Models: Conduct a similar comparative analysis with other state-of-the-art models to determine if the observed effects of LoRA parameters on Llama-2 are universal or model-specific.

Conclusion

In examining the impacts of varying both r and lora_dropout, a few salient observations emerge. Primarily, while different values of r have minimal impact on training loss, they do significantly influence training time, with larger values being more computationally intensive. Surprisingly, an increase in r doesn't consistently lead to longer inference times, with r=32 standing out as the most efficient choice for this dataset. Varying lora_dropout showed that training loss remains consistent around the 1.22 mark across tested values, and no clear pattern emerged in inference times at higher dropout rates. Lastly, when adjusting the lora_alpha parameter, there is a noticeable improvement in training loss with higher values, suggesting that a stronger scaling of the adapter updates makes the model more adaptable. Even at the largest lora_alpha values, both training and inference times remained efficient, possibly reflecting the model's enhanced prediction confidence or the inherent optimizations of the LoRA technique.

References

  1. https://www.philschmid.de/instruction-tune-llama-2
  2. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
