The AutoRound Quantization Algorithm

Weight-Only Quantization for LLMs Across Hardware Platforms

Intel(R) Neural Compressor
Intel Analytics Software
Apr 2, 2024


Wenhua Cheng and Hanwen Chang, Intel Corporation

We recently released AutoRound, an innovative weight-only quantization algorithm designed specifically for low-bit LLM inference. It approaches near-lossless compression for a range of popular models including Gemma-7B, Mistral-7B-v0.1, Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1, Phi-2, LLaMA2, Qwen1.5-7B-Chat, and more. AutoRound consistently outperforms other methods (GPTQ, AWQ, OmniQuant, and HQQ) in many scenarios at W4G128, W4G-1, W3G128, and W2G128. Additionally, AutoRound does not introduce any extra overhead during inference.

Our method builds on prior innovations. It adopts signed gradient descent (SignSGD) to fine-tune the rounding values and the min-max clipping values of the weights in a small number of steps. In this formulation, V is a per-weight rounding perturbation constrained to the range [-0.5, 0.5], while alpha and beta are tunable scales applied to the weights' min and max values, typically constrained to [0.5, 1]. Opting for SignSGD is motivated by the well-defined boundaries of this solution space, which offer several advantages.
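
To make the idea concrete, here is a minimal, illustrative sketch (not the official AutoRound implementation) of tuning a per-element rounding perturbation V and the min-max scales alpha and beta for a single weight group with SignSGD. The helper names, loss, learning rate, and step count are assumptions made for this example.

import torch

def fake_quant(w, v, alpha, beta, bits=4):
    # Asymmetric fake quantization of one weight group with a tunable
    # per-element rounding offset `v` and min/max scales `alpha`, `beta`.
    qmax = 2 ** bits - 1
    w_max = alpha * w.max()
    w_min = beta * w.min()
    scale = (w_max - w_min).clamp(min=1e-9) / qmax
    zp = torch.round(-w_min / scale)
    x = w / scale + v + zp
    # Straight-through estimator so gradients flow through round()/clamp().
    q = x + (torch.clamp(torch.round(x), 0, qmax) - x).detach()
    return scale * (q - zp)

def signsgd_tune(w, x_calib, bits=4, steps=200, lr=5e-3):
    # Tune v, alpha, beta so the quantized output matches the float output.
    v = torch.zeros_like(w, requires_grad=True)
    alpha = torch.ones((), requires_grad=True)
    beta = torch.ones((), requires_grad=True)
    target = x_calib @ w
    for _ in range(steps):
        loss = torch.mean((x_calib @ fake_quant(w, v, alpha, beta, bits) - target) ** 2)
        loss.backward()
        with torch.no_grad():
            for p in (v, alpha, beta):
                p -= lr * p.grad.sign()   # SignSGD: step by the sign of the gradient
                p.grad = None
            v.clamp_(-0.5, 0.5)           # rounding perturbation stays in [-0.5, 0.5]
            alpha.clamp_(0.5, 1.0)        # min/max scales stay in [0.5, 1]
            beta.clamp_(0.5, 1.0)
    return fake_quant(w, v, alpha, beta, bits).detach()

w = torch.randn(128)        # one quantization group of 128 weights
x = torch.randn(256, 128)   # calibration activations for this group
w_q = signsgd_tune(w, x)    # fake-quantized weights after tuning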

Some of the key features of AutoRound are as follows:

  • Device compatibility: It is compatible with a variety of devices for quantization, including Intel Gaudi2, Intel CPUs, and Nvidia GPUs.
  • Wide model support: It is suitable for a diverse range of model families. About 20 model families have been verified so far.
  • Export flexibility: Effortlessly export quantized models to ITREX (Intel Extension for Transformers) [5] and AutoGPTQ formats for seamless deployment on Intel CPU and Nvidia GPU platforms.
  • Quantized models/recipes: Improved performance through model-specific recipes. Several pre-quantized models and recipes have been published.

Comparison with Other Methods

We provide a comparative analysis under a fair setting in Comprehensive Accuracies Data. The evaluation was done on the fake-quantized (qdq) model with lm-eval v0.3. Based on the average accuracies of 11 zero-shot tasks across LLaMA-v1, LLaMA-v2, and Mistral-7B at W4G-1, W4G128, W3G128, and W2G128, our approach outperformed GPTQ in 30 of 32 scenarios, AWQ in 27 of 32, HQQ in 15 of 16, and OmniQuant in 16 of 16. Here W4G128 denotes quantizing weights to 4 bits with a group size of 128, i.e., every 128 weights share a single quantization scale and zero point, while G-1 denotes that all weights in an input channel share the same scale and zero point.
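
For readers unfamiliar with this notation, the short sketch below illustrates the grouping. It is an informal example rather than the library's internal code: the hypothetical helper computes one (scale, zero point) pair per group of 128 weights for G128, and a single pair for the whole channel for G-1.

import torch

def group_quant_params(weight_row, group_size, bits=4):
    # One (scale, zero_point) pair per group along one channel of a weight matrix.
    if group_size == -1:                          # G-1: the whole channel is one group
        groups = weight_row.reshape(1, -1)
    else:                                         # e.g. G128: groups of 128 weights
        groups = weight_row.reshape(-1, group_size)
    qmax = 2 ** bits - 1
    w_min = groups.min(dim=1).values
    w_max = groups.max(dim=1).values
    scale = (w_max - w_min) / qmax
    zero_point = torch.round(-w_min / scale)
    return scale, zero_point

row = torch.randn(4096)                           # one channel with 4096 weights
print(group_quant_params(row, 128)[0].shape)      # W4G128 -> 32 scale/zero-point pairs
print(group_quant_params(row, -1)[0].shape)       # W4G-1  -> 1 scale/zero-point pair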

With 512 calibration samples, the tuning costs of the remaining methods are comparable, while HQQ is much faster and OmniQuant is clearly slower. Note that our performance improves with more tuning steps and model-specific hyperparameters, as demonstrated in the next section.

Quantized Models/Recipes

We quantized several models using more tuning steps and model-specific hyperparameters.

Some have been uploaded to the Hugging Face model hub, while others are still under review. We evaluate most models using the average accuracy of 11 tasks. For Chinese models, we follow the Qwen approach of using the average accuracy of four tasks. All items not labeled 'qdq' are evaluated on a real quantized model with lm-eval v0.4, while the remaining items are evaluated on a simulated qdq model due to certain evaluation challenges encountered with lm-eval v0.4.
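
As a reference for reproducing this kind of measurement, the snippet below is a hedged sketch using the lm-evaluation-harness (lm-eval v0.4) Python API on a local checkpoint; the checkpoint path and the task list are placeholders, not the exact 11-task set used in our tables.

import lm_eval

# Evaluate a local (quantized) checkpoint; the path and tasks are illustrative only.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./tmp_autoround,trust_remote_code=True",
    tasks=["lambada_openai", "hellaswag", "winogrande", "piqa"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)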

Usage

With just a few lines of code, or by using the user-friendly AutoRound examples, you can quickly obtain quantized and compressed models.

Model Quantization

# pip install auto-round
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original float model and tokenizer.
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

from auto_round import AutoRound

# W4G128 asymmetric weight-only quantization.
bits, group_size, sym = 4, 128, False
# device options: "auto", "hpu", "cpu", or "cuda"
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym, device=None)
autoround.quantize()

# Save the quantized model.
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir)

Model Inference

# Inference with Intel Extension for Transformers (ITREX).
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load the quantized model saved in the previous step.
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path, use_fast=True)

# Generate a short continuation from a prompt.
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Model Inference on Nvidia GPU with AutoGPTQ

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the AutoGPTQ-format quantized checkpoint on an Nvidia GPU.
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path, use_fast=True)

# Generate a short continuation from a prompt.
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Concluding Remarks

AutoRound, HQQ, GPTQ, and AWQ have also been implemented in Intel Neural Compressor and are supported in Intel Extension for Transformers to enhance compatibility with Intel devices. For more details, please refer to https://github.com/intel/auto-round.
