Effective Weight-Only Quantization for Large Language Models with Intel Neural Compressor

Quantize Large Language Models with Just a Few Lines of Code

Intel(R) Neural Compressor
Intel Analytics Software
Sep 17, 2023


Mengni Wang, Xin He, Yuwen Zhou, Yiyang Cai, Kaokao Lv, Suyue Chen, Wenhua Cheng and Haihao Shen, Intel Corporation

As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that maintain accuracy while reducing computational costs. Compared to traditional INT8 quantization of both activations and weights, weight-only quantization (WOQ) offers a better tradeoff between performance and accuracy.

To support WOQ, Intel Neural Compressor provides unified APIs for state-of-the-art approaches such as GPTQ [1], AWQ [2], and TEQ [3], as well as the simple yet effective round-to-nearest (RTN) approach.
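To make the core idea concrete, below is a toy sketch of RTN weight-only quantization in plain PyTorch (illustrative only, not Intel Neural Compressor code): weights are rounded to a low-bit grid with one scale per group of input channels, while activations stay in floating point.

    import torch

    def rtn_quantize_weight(w, bits=4, group_size=32):
        """Toy RTN weight-only quantization: per-group symmetric scales, round to nearest.
        Assumes in_features is divisible by group_size."""
        out_features, in_features = w.shape
        wg = w.reshape(out_features, in_features // group_size, group_size)
        max_int = 2 ** (bits - 1) - 1                          # e.g. 7 for 4-bit symmetric
        scale = wg.abs().amax(dim=-1, keepdim=True) / max_int  # one scale per weight group
        q = torch.clamp(torch.round(wg / scale), -max_int - 1, max_int)
        return (q * scale).reshape(out_features, in_features)  # dequantized weights

    w = torch.randn(128, 256)        # a linear layer's FP32 weights
    w_q = rtn_quantize_weight(w)     # weights rounded to the 4-bit grid
    x = torch.randn(8, 256)          # activations remain FP32
    y = x @ w_q.t()                  # the matmul still runs in floating point
    print(f"max weight rounding error: {(w - w_q).abs().max():.4f}")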

Beyond the basic support for the original algorithms, Intel Neural Compressor has made considerable enhancements in quantization productivity (e.g., model coverage and new hardware support), helping customers accelerate LLM inference deployment:

  • AWQ: improved model and architecture coverage with expanded hardware support
  • GPTQ: improved model and architecture coverage with more comprehensive calibration support
  • TEQ: a new approach inspired by AWQ, using a trainable equivalent transformation that searches for the optimal quantization scaling factors (a toy sketch of the equivalence idea follows this list)
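The equivalent-transformation idea behind AWQ and TEQ can be verified in a few lines of PyTorch. The sketch below (illustrative only, not Intel Neural Compressor code) folds per-input-channel scales into the weights and divides them out of the activations, which leaves the layer output mathematically unchanged; quantizing the rescaled weights is what reduces the error.

    import torch

    x = torch.randn(8, 256)        # activations
    w = torch.randn(128, 256)      # linear-layer weights (out_features x in_features)
    s = torch.rand(256) + 0.5      # per-input-channel scales (AWQ searches them, TEQ trains them)

    y_ref = x @ w.t()              # original layer output
    y_eq = (x / s) @ (w * s).t()   # scaled activations with rescaled weights

    print(torch.allclose(y_ref, y_eq, atol=1e-3))  # True: the transformation is equivalent
    # Quantizing (w * s) instead of w lets well-chosen scales absorb weight outliers,
    # which typically lowers weight-only quantization error.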

Intel Neural Compressor provides default quantization APIs for beginners, plus more flexible APIs for advanced users. The following sample code shows how to enable WOQ:
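A minimal sketch, assuming the Intel Neural Compressor 2.x post-training quantization API (PostTrainingQuantConfig and quantization.fit); the weight-only configuration keys shown for the advanced path are our best understanding and should be verified against the documentation:

    from neural_compressor import PostTrainingQuantConfig, quantization

    # Beginner path: default weight-only recipe (RTN), no calibration data required.
    default_conf = PostTrainingQuantConfig(approach="weight_only")

    # Advanced path: choose the low-bit settings and algorithm explicitly.
    advanced_conf = PostTrainingQuantConfig(
        approach="weight_only",
        op_type_dict={
            ".*": {                      # apply to all quantizable ops
                "weight": {
                    "bits": 4,           # 4-bit weights
                    "group_size": 32,    # granularity of the per-group scales
                    "scheme": "sym",     # symmetric quantization
                    "algorithm": "RTN",  # or "GPTQ" / "AWQ" / "TEQ"
                },
            },
        },
    )

    # fp32_model is a loaded FP32 LLM (e.g. a Hugging Face transformers model); the
    # calibration-based algorithms additionally take a calib_dataloader in quantization.fit.
    q_model = quantization.fit(fp32_model, default_conf)
    q_model.save("./saved_int4_model")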

Refer to the documentation for detailed WOQ capabilities.

We validated 20+ LLMs on PyTorch and ONNX Runtime with 4-bit WOQ, and all models reach accuracy comparable to, or even better than, traditional INT8 quantization:

Accuracy results for Llama 2 models

Accuracy and perplexity are measured on Lambada-OpenAI, a popular dataset available in LM-Evaluation-Harness. The table above shows that INT4 accuracy reaches 99% of the FP32 accuracy for all Llama-2 models. Moreover, INT4 models reduce model size by up to 8x, making LLM inference possible on memory-constrained devices (e.g., client systems) and generative AI more accessible to everyone. For details on all the validated models, please refer to this link for PyTorch models and this link for ONNX models.
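As a rough back-of-the-envelope check on the size reduction (illustrative numbers only, assuming a 7B-parameter model, 4-bit weights, and one FP16 scale per group of 32 weights; the headline 8x is the pure 32-bit-to-4-bit ratio before the small scale overhead):

    # Illustrative only: approximate weight storage for a 7B-parameter model.
    params = 7e9
    fp32_gb = params * 32 / 8 / 1e9         # 32-bit weights: 28 GB
    int4_gb = params * 4 / 8 / 1e9          # 4-bit weights alone: 3.5 GB (exactly 8x smaller)
    scales_gb = params / 32 * 16 / 8 / 1e9  # one FP16 scale per group of 32 weights
    print(f"FP32: {fp32_gb:.1f} GB, INT4 + scales: {int4_gb + scales_gb:.2f} GB, "
          f"reduction: {fp32_gb / (int4_gb + scales_gb):.1f}x")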

We recently released Intel Neural Compressor v2.3, offering the WOQ features described above. If you are looking for effective LLM quantization, we encourage you to give it a try. You can also submit pull requests, issues, or questions at https://github.com/intel/neural-compressor. Visit Intel Neural Compressor to learn more and get started.

We are committed to providing state-of-the-art LLM quantization techniques in Intel Neural Compressor and will continue exploring new quantization recipes. The source code for SignRound [4], one of our recent works, will be publicly available soon.

Important note: INT4 Llama-2 ONNX models are available on Hugging Face (direct download links shown below):

References

[1] Frantar, Elias, et al. “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” arXiv preprint arXiv:2210.17323 (2022).

[2] Lin, Ji, et al. “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” arXiv preprint arXiv:2306.00978 (2023).

[3] Cheng, Wenhua, et al. “TEQ: Trainable Equivalent Transformation for Quantization of LLMs”, preprint under review (2023).

[4] Cheng, Wenhua, et al. “Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs.” arXiv preprint arXiv:2309.05516 (2023).
