
Efficient Quantization with Microscaling Data Types for Large Language Models

New Quantization Recipes Using Intel Neural Compressor

Intel(R) Neural Compressor
4 min read · Mar 1, 2024


Mengni Wang, Xin He, and Haihao Shen, Intel Corporation

Introduction to the Microscaling Data Type

The development of large language models (LLMs) has fueled breakthroughs across fields such as text analysis, language translation, and chatbot technologies. Nevertheless, their increasing power comes with explosive growth in parameter counts, posing obstacles to practical deployment. To balance memory limits against accuracy preservation for AI models, the Microscaling (MX) specification was developed from the well-known Microsoft Floating Point (MSFP) data type [1, 2]:

Definition of MX data type (source [2])
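In brief, per the specification [2], an MX block consists of k elements (k = 32 in the concrete MX formats) that share a single power-of-two scale X, stored as an 8-bit exponent (E8M0), while each element P_i is stored in a narrow element format such as FP8, FP6, FP4, or INT8. The value represented by the i-th element of a block is

v_i = X × P_i,  i = 1, …, k.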

At an equivalent accuracy level, the MX data type occupies a smaller silicon area and incurs a lower energy cost per multiply-accumulate than conventional data types on the same silicon [1].

MX Quantization in Intel Neural Compressor

Intel Neural Compressor (INC) is an open-source model compression tool that provides an easy-to-use interface and powerful model compression techniques such as quantization, pruning, and distillation. We implement different quantization recipes for the MX data type in INC, provide an end-to-end pipeline for MX quantization and evaluation of LLMs, and validate the effectiveness of MX quantization on LLMs to show its benefits.

Currently, INC supports post-training quantization (PTQ) for INT8 and FP8, which compresses models to lower bit widths without retraining. Building upon the groundwork laid by prior work, INC seamlessly applies the MX data type to PTQ, offering carefully crafted recipes that let users quantize LLMs without sacrificing accuracy. Because the memory and computational constraints of LLMs are more severe than those of other neural networks, our exploration focuses on LLMs first. The following table shows the basic MX quantization recipes in INC and highlights the distinctions among the data types. The MX data type replaces the generic floating-point scale with a power-of-two scale, which is more hardware-friendly, and adopts a granularity between per-channel and per-tensor to balance accuracy and memory consumption.

Basic MX quantization recipes of INC

In the table, the exponent (exp) equals torch.floor(torch.log2(amax)), MAX is the representation range of the data type, amax is the maximum absolute value of the per-block tensor, and rmin is the minimum value of the per-block tensor.
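To make the recipe concrete, here is a minimal sketch of per-block fake quantization with a shared power-of-two scale, written in plain PyTorch. It is not INC's implementation: the element format (a symmetric integer grid with range ±MAX) and the mapping of the shared exponent onto that range are illustrative assumptions.

import torch

def mx_fake_quantize(x, block_size=32, MAX=127.0):
    # Quantize-dequantize `x` with one shared power-of-two scale per block.
    # MAX is the representation range of the (illustrative) element type,
    # e.g. 127 for a symmetric 8-bit integer element. Assumes x.numel()
    # is a multiple of block_size.
    orig_shape = x.shape
    xb = x.reshape(-1, block_size)                        # [num_blocks, block_size]
    amax = xb.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    exp = torch.floor(torch.log2(amax))                   # shared exponent, as defined above
    scale = torch.exp2(exp) / MAX                         # power-of-two block scale spread over MAX
    q = torch.clamp(torch.round(xb / scale), -MAX, MAX)   # round onto the low-bit element grid
    return (q * scale).reshape(orig_shape)                # dequantize back to float ("fake" quantization)

Because the scale is restricted to a power of two, block elements with magnitude between 2^exp and amax are clipped to the representable range; this small amount of clipping is the price paid for the hardware-friendly scale.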

Furthermore, we are exploring advanced recipes, such as specifying particular modules to be excluded from MX quantization in order to identify opportunities for improving accuracy.

INC provides a concise API: users can obtain a fake-quantized model with the MX data type in just three lines of code (see the linked example and documentation for the full capability):

from neural_compressor.torch import MXQuantConfig, quantize
quant_config = MXQuantConfig(weight_dtype=args.weight_dtype, act_dtype=args.act_dtype)
user_model = quantize(model=user_model, quant_config=quant_config)
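For context, below is a hedged end-to-end sketch of how the three lines above might be used with a Hugging Face model. The model name and the dtype strings are placeholders; check your INC version's documentation for the exact supported values of weight_dtype and act_dtype.

import torch
from transformers import AutoModelForCausalLM
from neural_compressor.torch import MXQuantConfig, quantize

user_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",            # small placeholder model for illustration
    torch_dtype=torch.float32,
)
quant_config = MXQuantConfig(
    weight_dtype="fp4",             # placeholder element dtype for weights
    act_dtype="fp8_e4m3",           # placeholder element dtype for activations
)
user_model = quantize(model=user_model, quant_config=quant_config)
# The result is a fake-quantized model: tensors are rounded to the MX grid but
# kept in floating point, so it can be evaluated with the usual PyTorch tooling.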

Validated Results

To showcase the accuracy-preserving capabilities of the MX data type, we conducted experiments with various MX quantization recipes on nine popular LLMs. Accuracy was validated on the lambada_openai task with an accuracy target of 99% of the FP32 baseline.
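As an illustration, the lambada_openai accuracy check can be reproduced with the lm-evaluation-harness (pip install lm-eval). The snippet below is a sketch that reuses the fake-quantized user_model from the previous section; the exact result keys may differ across lm-eval versions.

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")   # placeholder tokenizer
lm = HFLM(pretrained=user_model, tokenizer=tokenizer, batch_size=8)

results = evaluator.simple_evaluate(model=lm, tasks=["lambada_openai"])
print(results["results"]["lambada_openai"])   # compare against the FP32 baseline (target: >= 99%)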

Accuracy for Basic Quantization Recipes

To make the comparison fair, we apply the basic quantization recipes across all the data types without sophisticated calibration. The charts below show that, at INT8, the MX data type has a significant accuracy advantage over the conventional data type, and that it serves as a complementary option to FP8.

Accuracy comparison among different data types

MX FP6 achieves accuracy similar to MX FP8, which means we can compress models to even lower bit widths with negligible accuracy loss.

Accuracy comparison among different MX data types

Accuracy for Advanced Quantization Recipes

The benefits of some advanced quantization recipes are shown below. Accuracy is further improved by quantizing weights only and by excluding the last linear layer, which motivates us to explore other potential recipes.

Accuracy comparison among different recipes on databricks/dolly-v2-3b

Conclusions and Future Work

We have simulated MX quantization in INC and validated several recipes on LLMs. In the future, we will explore new MX quantization recipes and their low-bit capabilities. Furthermore, performance benchmarking will be conducted once the hardware and kernels are ready. You can star our repository to stay informed about the latest updates.

We encourage you to use INC to quantize LLMs to get better inference efficiency on Intel platforms, and explore new quantization techniques for new data types on future platforms. Your feedback, suggestions, or questions are always welcome and highly appreciated.

References

[1] Bita Darvish Rouhani et al. (2020), Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point, Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 10271–10281.

[2] OCP Microscaling Formats (MX) Specification
