[Research Paper Summary] A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
Original Paper: https://arxiv.org/abs/2409.11055
By: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon
Abstract
Prior works evaluating quantized LLMs cover only a handful of metrics, such as perplexity or a few basic knowledge tasks, often on outdated datasets.
Moreover, how quantization affects recent large-scale models such as Llama 3.1, which scales up to 405B parameters, is not yet well understood.
To investigate this, we evaluate instruction-tuned LLMs quantized with four techniques, namely GPTQ, AWQ, SmoothQuant, and FP8, across model sizes ranging from 7B to 405B parameters.
We measure performance on 13 benchmarks covering six task types: natural language understanding, commonsense question answering, instruction following, hallucination detection, mathematical reasoning, and open-domain dialogue.
The key findings are:
Quantizing a larger LLM down to a size comparable to a smaller FP16 LLM generally yields better results on most benchmarks, with the exceptions of hallucination detection and instruction following.
Performance varies considerably with the quantization method, model size, and bit-width, and weight-only methods tend to preserve accuracy better in larger models.
Task difficulty has only a minor influence on the accuracy degradation caused by quantization.
However, the MT-Bench evaluation methodology lacks the sensitivity to differentiate among recent high-performing LLMs.
Summary Notes
Recent developments in machine learning and artificial intelligence have brought Large Language Models (LLMs) to the forefront of NLP.
However, the massive size of these models is a major bottleneck for deployment in resource-constrained settings.
Quantization has therefore emerged as an attractive direction for reducing memory and compute requirements while retaining acceptable levels of performance.
This study evaluates recent quantization techniques on instruction-tuned LLMs with up to 405 billion parameters.
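To put the memory argument in perspective, here is a back-of-the-envelope estimate of weight storage for a 405B-parameter model at several bit-widths; the group size and per-group scale overhead are illustrative assumptions, not figures reported in the paper.

```python
# Back-of-the-envelope weight-memory estimate for a 405B-parameter model.
# Group size and per-group FP16 scale overhead are illustrative assumptions.

PARAMS = 405e9  # parameter count

def weight_gb(bits, group_size=None):
    """Weight storage in GB; group-wise schemes add one FP16 scale per group."""
    total_bytes = PARAMS * bits / 8
    if group_size is not None:
        total_bytes += (PARAMS / group_size) * 2  # 2 bytes per FP16 scale
    return total_bytes / 1e9

print(f"FP16: ~{weight_gb(16):.0f} GB")                 # ~810 GB
print(f"FP8 : ~{weight_gb(8):.0f} GB")                  # ~405 GB
print(f"INT4: ~{weight_gb(4, group_size=128):.0f} GB")  # ~209 GB (weights only)
```

Even these rough numbers show why low-bit quantization is what makes a 405B-class model approachable outside of large multi-node clusters.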
Analyzing the Landscape: A Basis for Inquiry
This study investigates a significant issue: the effective compression of Large Language Models while maintaining their performance across diverse tasks.
This research examines various quantization techniques, specifically GPTQ, AWQ, SmoothQuant, and FP8.
The methodologies were implemented across models with parameter counts varying from 7 billion to 405 billion. A total of 13 benchmarks were utilized to evaluate six distinct task categories.
The tasks include commonsense question answering, language understanding, mathematical reasoning, instruction following, hallucination detection, and open-domain dialogue.
Methodology: Techniques for Quantization and Evaluation Framework
Principal Techniques for Quantization
- GPTQ: a layer-wise post-training quantization method that uses inverse-Hessian information to compensate for quantization error in the remaining weights, with an emphasis on minimizing accuracy loss.
- AWQ (Activation-Aware Weight Quantization): preserves the small fraction of weights that matter most for accuracy by scaling them according to activation magnitudes before quantizing (a simplified weight-only quantization sketch follows this list).
- SmoothQuant: migrates quantization difficulty from activations to weights via per-channel scaling, which makes 8-bit quantization of both weights and activations practical (see the scale-migration sketch after this list).
- FP8 (8-bit floating point): stores values in an 8-bit floating-point format supported by recent hardware, trading precision for memory savings and throughput.
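To make the weight-only family concrete, the sketch below performs plain group-wise symmetric INT4 quantization of a weight matrix in NumPy. It deliberately omits GPTQ's Hessian-based error compensation and AWQ's activation-aware scaling, so it illustrates the shared storage format rather than either specific method.

```python
import numpy as np

def quantize_groupwise_int4(w: np.ndarray, group_size: int = 128):
    """Symmetric group-wise INT4 quantization of a 2-D weight matrix.

    Each row is split into groups of `group_size` columns; every group gets
    its own FP16 scale so its largest magnitude maps into the INT4 range.
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must be divisible by the group size"
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.maximum(np.abs(groups).max(axis=-1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    rows, cols = q.shape
    groups = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (groups * scales.astype(np.float32)).reshape(rows, cols)

w = np.random.randn(16, 256).astype(np.float32)
q, s = quantize_groupwise_int4(w)
print("mean abs reconstruction error:", np.abs(w - dequantize(q, s)).mean())
```

The per-group scales are what keep the reconstruction error small; methods like GPTQ and AWQ then go further by choosing how to round or rescale within each group.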
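SmoothQuant's central idea, trading activation outliers for slightly harder-to-quantize weights, can be written in a few lines. The sketch below follows the published per-channel scaling rule s_j = max|X_j|^α / max|W_j|^(1−α), but it is a simplified illustration, not the authors' implementation.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Migrate activation outliers into the weights, channel by channel.

    X @ W == (X / s) @ (s[:, None] * W), so the layer output is unchanged,
    but X / s has a much flatter range and is easier to quantize to 8 bits.
    """
    act_max = np.abs(x).max(axis=0)                  # per-input-channel activation range
    w_max = np.maximum(np.abs(w).max(axis=1), 1e-8)  # per-input-channel weight range
    s = np.maximum(act_max**alpha / w_max**(1.0 - alpha), 1e-8)
    return x / s, w * s[:, None]

x = np.random.randn(4, 8).astype(np.float32)
x[:, 3] *= 50.0                                      # simulate an outlier activation channel
w = np.random.randn(8, 16).astype(np.float32)
x_s, w_s = smooth(x, w)
print(np.allclose(x @ w, x_s @ w_s, atol=1e-3))      # True: the product is preserved
```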
Assessment Framework
The evaluation pipeline was executed within a multi-node cluster environment, utilizing tools such as Hugging Face's accelerate library and vLLM.
The configuration enabled the evaluation of various models, encompassing the Vicuna, Gemma, and Llama families, with parameter sizes ranging from 2 billion to 405 billion.
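As a rough illustration of what such an evaluation loop might look like, the sketch below loads a quantized checkpoint with vLLM and runs a few prompts greedily; the model name is a hypothetical placeholder, and this is not the paper's actual pipeline.

```python
from vllm import LLM, SamplingParams

# Hypothetical AWQ-quantized checkpoint name; substitute any model vLLM supports.
MODEL = "some-org/Llama-3.1-70B-Instruct-AWQ"

llm = LLM(model=MODEL, quantization="awq", tensor_parallel_size=4)
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding

prompts = [
    "Q: A train covers 60 km in 45 minutes. What is its average speed in km/h? A:",
    "Summarize activation-aware weight quantization in one sentence.",
]
for out in llm.generate(prompts, params):
    print(out.prompt[:60], "->", out.outputs[0].text.strip()[:120])
```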
Findings: Exploring the Range of Performance
The findings of the study offer a detailed perspective on the effects of quantization on the performance of large language models (LLMs).
- Quantized Models vs. Smaller Models: Larger quantized models demonstrated superior performance compared to smaller full-precision models in the majority of benchmarks, with the exception of hallucination detection and instruction-following tasks.
- Impact of Quantization Methods: The analysis indicates that weight-only quantization methods, such as AWQ, preserve accuracy better than approaches that quantize both weights and activations. This trend is particularly evident in larger models such as Llama 3.1 405B.
- Task Difficulty and Accuracy: The findings indicate that the level of task difficulty did not have a substantial impact on the decline in accuracy attributed to quantization, implying a degree of robustness across different levels of complexity.
- Evaluation Limitations of MT-Bench: MT-Bench shows limited ability to differentiate among high-performing LLMs, highlighting the need for more discriminative evaluation metrics.
Implications: Future Directions for Quantized LLMs
Practical Implementations
The results underscore the potential of quantization to enhance the accessibility of powerful large language models for practical applications, particularly in resource-constrained environments.
Deploying capable models efficiently supports applications such as real-time language translation and AI-driven customer support, creating new opportunities in these domains.
Future Research Directions
Beyond these immediate applications, the study also points to open problems that future work should address.
The research identifies specific limitations, including the necessity for enhanced evaluation methodologies, exemplified by MT-Bench.
Subsequent investigations may focus on the creation of more extensive benchmarks and the effects of quantization on additional emergent capabilities of large language models.
In conclusion, this study advocates for a balanced approach to model compression, emphasizing the importance of maintaining performance while reducing resource consumption. The findings suggest that careful consideration of various compression techniques can lead to effective optimization of machine learning models without significant loss of accuracy.
Quantized instruction-tuned large language models offer an effective strategy for balancing the relationship between model size and performance outcomes.
This study highlights the potential of quantization as a means to enhance accessibility to advanced AI capabilities, despite existing challenges, especially in areas such as instruction-following.
The ongoing advancement in artificial intelligence necessitates a thorough examination and enhancement of quantization techniques. These techniques are essential for effectively reducing model sizes while preserving their inherent capabilities.
This study provides a robust framework for engineers and researchers aiming to explore this intricate and beneficial domain.
Feel free to check out more blogs, research paper summaries and resources on AI by visiting our website.