LLaMA3 Suffers Severely from Quantization Degradation

Aakash Varma
2 min read · May 16, 2024

LLaMA3’s weights are trained on roughly 15 trillion tokens, which lets the model capture complex relationships in the data and make full use of the precision available in the BF16 datatype. The older LLaMA2 models, trained on far fewer tokens (around 2 trillion), leave more of that representational capacity unused. As a result, quantizing LLaMA3 causes substantial performance degradation: the information loss introduced by quantization strips away precision the model actually relies on.
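
To make the information-loss argument concrete, here is a minimal round-to-nearest (RTN) quantization sketch in PyTorch. This is not the paper’s implementation; the tensor shape, per-tensor scaling, and bit-widths are illustrative assumptions. It simply shows that mapping BF16-precision weights onto a 4-bit grid discards information that 8 bits would largely preserve.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric round-to-nearest (RTN) quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = w.abs().max() / qmax          # simple per-tensor scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                      # dequantized (lossy) weights

# Toy BF16 weight matrix standing in for a LLaMA3-8B projection layer.
w = torch.randn(4096, 4096, dtype=torch.bfloat16).float()

# Reconstruction error grows sharply as the bit-width shrinks.
for bits in (8, 4):
    mse = torch.mean((w - quantize_rtn(w, bits)) ** 2).item()
    print(f"{bits}-bit RTN reconstruction MSE: {mse:.6f}")
```

The point of the toy example is only the trend: the lower the bit-width, the larger the reconstruction error, and a model that already uses its full BF16 precision has less slack to absorb that error.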

The paper evaluates LLaMA3 models under both PTQ (post-training quantization) and LoRA-FT (LoRA finetuning) methods.

PTQ Results

LoRA-FT Results

The key finding is that low-rank finetuning on the Alpaca dataset does not recover the performance of quantized LLaMA3-8B models; in fact, it tends to worsen the degradation caused by quantization. Across the LoRA-FT quantization methods evaluated, 4-bit quantized LLaMA3-8B models finetuned with LoRA perform worse than their non-LoRA counterparts at the same bit-width.
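
For reference, a LoRA-FT setup of this kind can be sketched with Hugging Face transformers and peft. This is a minimal illustration under assumptions, not the paper’s exact code: the model id, LoRA rank, alpha, and target modules are all placeholder choices.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit (NF4) quantized base model, roughly the LoRA-FT setting discussed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",        # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; rank/alpha are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Finetuning on Alpaca would then proceed with a standard training loop;
# the paper's finding is that this does not close the quantization gap.
```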

References

[1] How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study. arXiv:2404.14047. https://arxiv.org/abs/2404.14047
