Quantization on Intel Gaudi Series AI Accelerators
Intel Neural Compressor v3.0 Supports Quantization across Intel Hardware
Tai Huang, Wenhua Cheng, Xin He, Suyue Chen, and Haihao Shen, Intel Corporation
We are proud to announce the release of Intel Neural Compressor v3.0. As a major release, it delivers several key features, including support for Intel CPUs, GPUs, and AI accelerators through a new framework extension API.
FP8 Quantization
Now that Intel Neural Compressor is the official quantization tool for Intel Gaudi series AI accelerators, FP8 quantization has been added in v3.0. It offers significant performance and accuracy improvements for model inference, especially for popular large language models (LLMs). You can refer to this example code.
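On Gaudi, FP8 quantization follows the same prepare/convert flow as the rest of the 3.0 PyTorch extension API. The sketch below is a minimal illustration only: the FP8Config settings shown (such as fp8_config="E4M3") and the calibration loop are assumptions, so refer to the linked example and the Gaudi documentation for a complete recipe.
from neural_compressor.torch.quantization import FP8Config, prepare, convert

# Configure FP8 quantization; E4M3 is a common FP8 format for inference (assumed here).
quant_config = FP8Config(fp8_config="E4M3")

# Insert measurement/observers, run a few calibration batches on the Gaudi (HPU) device,
# then convert the model to FP8 for inference.
model = prepare(model, quant_config)   # `model` is your FP32/BF16 PyTorch model
run_calibration(model)                 # hypothetical user-defined calibration loop
model = convert(model)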
INT4 Model Loading
INT4 weight-only quantization (WOQ) is another popular compression technique for LLMs. Intel Neural Compressor already supports several state-of-the-art WOQ approaches, and many INT4 WOQ models are available on Hugging Face. In v3.0, INT4 WOQ models can be loaded locally or from the Hugging Face model hub for inference on Gaudi. The tool is built into the official Gaudi v1.17 Docker image for easy model quantization and measurement.
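As a rough sketch, loading such a model for Gaudi inference with the 3.0 PyTorch extension API might look like the code below; the model identifier is a placeholder, and the format and device arguments are assumptions to be checked against the loading guide.
from neural_compressor.torch.quantization import load

# Load an INT4 WOQ checkpoint (local path or Hugging Face hub ID) onto the Gaudi (HPU) device.
# The identifier below is only a placeholder.
model = load(
    model_name_or_path="/path/or/hub-id/of/an-int4-woq-model",
    format="huggingface",
    device="hpu",
)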
Framework Extension API
Intel Neural Compressor v3.0 introduces a new framework extension API for quantization, mixed precision, and benchmarking. The new API provides separate interfaces for mainstream deep learning frameworks like PyTorch and TensorFlow, bringing better usability and a more native experience to AI developers who are already familiar with a particular framework.
This version also provides a separate installation package for each framework, so new API users can install just the package for their framework and have its dependencies installed automatically.
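To illustrate the framework-scoped layout, the imports below sketch the two entry points. The PyTorch names match the client-side example later in this post; the TensorFlow names reflect our understanding of the v3.0 interface and may differ slightly, so treat them as assumptions.
# PyTorch extension API (used in the client-side quantization example below).
from neural_compressor.torch.quantization import get_default_rtn_config, prepare, convert

# The TensorFlow extension API mirrors the same framework-scoped layout
# (names here are assumptions; check the v3.0 documentation for the exact interface).
from neural_compressor.tensorflow import StaticQuantConfig, quantize_model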
Accuracy-Aware FP16 Mixed Precision on Intel Xeon 6 Processors
In the 2.x releases, Intel Neural Compressor supported BF16 mixed precision on Intel Xeon processors and provided an accuracy-driven tuning function that reduces accuracy loss by falling back to FP32 when needed. The Intel Xeon 6 processor (codename Granite Rapids) supports the FP16 instruction set architecture (ISA) for Intel Advanced Matrix Extensions (Intel® AMX-FP16), so Intel Neural Compressor v3.0 expands mixed precision support to FP16, letting users easily benefit from this new hardware acceleration for model inference.
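As a rough sketch only, FP16 mixed precision with the 3.0 PyTorch extension API is expected to follow the same prepare/convert pattern as quantization. The MixedPrecisionConfig name and its dtype argument below are assumptions rather than confirmed API, so consult the mixed precision documentation for the exact interface.
from neural_compressor.torch.quantization import MixedPrecisionConfig, prepare, convert

# Assumed configuration: run eligible operators in FP16 via Intel AMX-FP16,
# keeping accuracy-sensitive operators in FP32.
mp_config = MixedPrecisionConfig(dtype="fp16")

model = prepare(model, mp_config)   # `model` is your FP32 PyTorch model
model = convert(model)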
Client-Side Quantization
4-bit WOQ models are especially useful for client-side LLMs because model inference is heavily constrained by system memory and compute capability. However, performing WOQ on the client is constrained by those same limited resources. In v3.0, WOQ is accelerated using a more efficient packing method, and the memory footprint is reduced with layer-wise/block-wise quantization. Furthermore, client-side quantization usability was improved by auto-detecting the system type and adopting a lightweight configuration, e.g.:
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

# Load the model structure without materializing the full weights in memory.
model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# Use the default round-to-nearest (RTN) weight-only quantization configuration,
# then prepare and convert the model.
quant_config = get_default_rtn_config()
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
You can find more details in this guide, which provides different quantization options as well as example code, rough quantization times, and peak memory consumption for typical models.
AutoRound v0.3
We also included the latest version of AutoRound in this release, which adds the following:
- Broader Device Support: Expanded support for CPU, HPU, and CUDA inference in the AutoRound format. Previous 2-bit accuracy issues are also resolved.
- New Recipes and Model Releases: Published numerous recipes on the Low Bit Open LLM Leaderboard showcasing impressive results on Llama 3.1 and other leading models.
- Experimental Features: Introduced several experimental features, including activation quantization and mx_fp with promising accuracy.
- Multi-modal Model Support: Extended capabilities for tuning and inference across several multi-modal models.
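For context, a typical AutoRound tuning flow looks roughly like the sketch below. The model name is a placeholder, and the constructor arguments (bits, group_size) and export format string reflect common AutoRound usage rather than a guaranteed v0.3 interface, so check the AutoRound documentation for details.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Placeholder model; substitute the model you want to quantize.
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune 4-bit weight-only quantization with AutoRound, then export in the AutoRound
# format so the model can be served on CPU, HPU, or CUDA devices.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./Llama-3.1-8B-int4-autoround", format="auto_round")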
Summary
We are delighted to make Intel Neural Compressor v3.0 available to the public. We especially encourage you to try the new FP8 quantization on Intel Gaudi series AI accelerators. You are also welcome to create pull requests or submit issues through the Intel Neural Compressor GitHub repository. We look forward to your feedback and contributions.