Analyzing the Impact of lora_alpha on Llama-2 Quantized with GPTQ

Drishti Sushma
3 min read · Sep 14, 2023


Updated: 26/01/24

Objective of the Study

The key objective of this study was to analyze the effects of the LoRA parameters r, lora_alpha, and lora_dropout on the performance of Llama-2 quantized with GPTQ and fine-tuned with Flash Attention. This post focuses on lora_alpha.

Dataset and Model Used

We used the Abirate/english_quotes dataset from Hugging Face, which consists of quotes scraped from Goodreads and is suitable for both multi-label text classification and text generation. As the base model for our experiments, we used "TheBloke/Llama-2-7B-Chat-GPTQ".
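
As a rough sketch, the dataset and quantized base model can be loaded as follows (assuming the datasets, transformers, optimum, and auto-gptq packages are installed; the original post does not include its loading code):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# The dataset has a single "train" split of quote/author/tags records
dataset = load_dataset("Abirate/english_quotes")

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading GPTQ weights through transformers requires the optimum and
# auto-gptq packages; the quantized weights are dequantized on the fly.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```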

LoRA

LoRA (Low-Rank Adaptation) is a method for efficiently fine-tuning large pre-trained models. Instead of updating the full weight matrices, LoRA freezes the pre-trained weights and injects pairs of small, trainable low-rank matrices into selected layers, so only a tiny fraction of the parameters is trained. This preserves the foundational knowledge of the pre-trained model while enabling targeted adaptation to new datasets.
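
Concretely, for a frozen weight matrix W, LoRA trains two small matrices A and B of rank r and computes h = Wx + (lora_alpha / r) · BAx, so lora_alpha directly scales how strongly the adaptation shifts the frozen model. A minimal PyTorch sketch of the idea (illustrative only; the actual PEFT implementation differs in details such as initialization and dropout):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer."""

    def __init__(self, base: nn.Linear, r: int = 8, lora_alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # Trainable low-rank factors: delta_W = B @ A has rank at most r
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = lora_alpha / r  # lora_alpha sets the update's magnitude

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```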

Experimental Setup

  • Hardware: all experiments were run on an A100 GPU.
  • LoRA configuration: r=8 and lora_dropout=0.10, with lora_alpha varied across runs (16, 32, 64, and 128).
  • Target modules: ["k_proj", "o_proj", "q_proj", "v_proj"]
  • Training batch size: 1
  • Evaluation batch size: 1
  • Evaluation strategy: evaluated every 20 steps
  • Learning rate: 2e-4
  • Optimizer: adamw_hf
  • max_steps = 100
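
For reference, here is a hedged sketch of how this configuration could be expressed with Hugging Face peft and transformers; the original training script was not published, so details such as output_dir are illustrative:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments

peft_config = LoraConfig(
    r=8,
    lora_alpha=32,  # varied across runs: 16, 32, 64, 128
    lora_dropout=0.10,
    target_modules=["k_proj", "o_proj", "q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# `model` is the GPTQ base model loaded earlier
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir="llama2-gptq-lora",  # illustrative path
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=2e-4,
    optim="adamw_hf",
    max_steps=100,
    evaluation_strategy="steps",
    eval_steps=20,
    logging_steps=20,
)
```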

Evaluating the Impact of lora_alpha on Model Performance

  1. Optimal lora_alpha value: lora_alpha = 32 provides the best balance between training and validation loss, reaching a training loss of 3.8675 and a validation loss of 4.2374 at step 100. These are among the lowest values observed across the runs, indicating optimal performance.
  2. Loss fluctuations: As lora_alpha increases from 16 to 32, both training and validation loss improve significantly. Beyond lora_alpha = 32, however, performance tends to decline. This could be due to overfitting or the model becoming too aggressive in its adaptation for the given dataset.
  • Notably, for lora_alpha values of 64 and 128, training and validation losses are identical, suggesting that the impact of this parameter plateaus in this range before deteriorating further.
  3. Training runtime: Average training runtime remains roughly constant across lora_alpha values, with only slight variations. In terms of computational time, there is no significant penalty or benefit tied to adjusting lora_alpha.
  4. Overall training loss: Overall training loss is lowest at lora_alpha = 32 and increases at higher values. This could indicate that while the model might be capturing more nuances, it might also be overfitting and losing generalizability.
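
One plausible mechanism behind this trend, assuming the standard LoRA scaling rule: the low-rank update is multiplied by lora_alpha / r, so with r fixed at 8 the tested values imply progressively larger effective updates:

```python
# Effective scaling applied to the LoRA update, delta_W = (lora_alpha / r) * B @ A
r = 8
for lora_alpha in (16, 32, 64, 128):
    print(f"lora_alpha={lora_alpha:>3} -> scaling = {lora_alpha / r:g}x")

# lora_alpha= 16 -> scaling = 2x
# lora_alpha= 32 -> scaling = 4x
# lora_alpha= 64 -> scaling = 8x
# lora_alpha=128 -> scaling = 16x
```

Larger effective scales amplify the adaptation relative to the frozen weights, which is consistent with the instability observed beyond lora_alpha = 32.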

Conclusion

For the Abirate/english_quotes dataset and TheBloke/Llama-2-7B-Chat-GPTQ, lora_alpha = 32 yielded the lowest training loss (3.8675) and validation loss (4.2374) at step 100, demonstrating the best performance. Notably, as lora_alpha rises beyond 32, performance tends to decline, potentially due to overfitting. Interestingly, identical losses were observed for lora_alpha values of 64 and 128, indicating a performance plateau. Training runtimes remained consistent across lora_alpha values, suggesting no substantial computational impact. The upward trend in overall training loss at higher lora_alpha values further hints at reduced model generalizability.
