Effective Post-Training Quantization for Large Language Models

Enhancing the SmoothQuant Approach to Quantization

Intel(R) Neural Compressor
Intel Analytics Software
Apr 8, 2023


Yintong Lu, Xin He, Heng Guo, Wenhua Cheng, Chang Wang, Mengni Wang, and Haihao Shen, Intel Corporation

In this blog, we describe a post-training quantization technique for large language models based on an enhanced SmoothQuant approach, illustrate its usage, and demonstrate its accuracy benefits. The method has been integrated into Intel Neural Compressor, an open-source Python library of popular model compression techniques such as quantization, pruning (sparsity), distillation, and neural architecture search. It is compatible with popular frameworks such as TensorFlow, the Intel Extension for TensorFlow, PyTorch, the Intel Extension for PyTorch, ONNX Runtime, and MXNet.

Large Language Models

Large language models (LLMs) are trained on massive data sets and can have billions of weights. Their advanced network structures and large numbers of parameters enable them to master the intrinsic complexity of natural language. Once trained, an LLM can be fine-tuned for a wide variety of downstream natural language processing (NLP) and natural language generation (NLG) tasks, such as conversational chatbots (e.g., ChatGPT), machine translation, text classification, fraud detection, and sentiment analysis.

LLM Deployment Challenges

LLMs perform very well on NLP and NLG tasks, but training and deploying such large models is complicated by the AI and Memory Wall: compute improves 3.1x every two years, while memory bandwidth improves only 1.4x. Training LLMs also requires distributed systems, which adds a network bandwidth challenge. After training, models are often deployed on systems with limited compute and memory resources, so reducing the size of LLMs via post-training quantization is critical to make low-latency inference possible.

Quantization for LLM

Quantization is a common compression operation that reduces the memory footprint of a model and improves inference performance, which makes LLM deployment easier. Quantization converts a floating-point matrix to an integer matrix:

X_int8 = round(X_fp32 / S) + Z

where X_fp32, S, and Z are the input matrix, scale factor, and integer zero point, respectively.
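As a toy illustration of this mapping, the sketch below quantizes and dequantizes a small tensor with asymmetric per-tensor INT8 quantization (the helper functions are our own, not Intel Neural Compressor APIs):

```python
import torch

def quantize_per_tensor(x_fp32: torch.Tensor):
    """Asymmetric per-tensor INT8 quantization: X_int8 = round(X_fp32 / S) + Z."""
    qmin, qmax = -128, 127
    s = (x_fp32.max() - x_fp32.min()) / (qmax - qmin)   # scale factor S
    z = qmin - torch.round(x_fp32.min() / s)            # integer zero point Z
    x_int8 = torch.clamp(torch.round(x_fp32 / s) + z, qmin, qmax).to(torch.int8)
    return x_int8, s, z

def dequantize(x_int8: torch.Tensor, s: torch.Tensor, z: torch.Tensor):
    """Recover an FP32 approximation: X_fp32 ~ (X_int8 - Z) * S."""
    return (x_int8.to(torch.float32) - z) * s

x = torch.randn(4, 8)
x_q, s, z = quantize_per_tensor(x)
print("max abs reconstruction error:", (x - dequantize(x_q, s, z)).abs().max().item())
```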

Our SmoothQuant documentation explains why per-channel quantization cannot be applied to activations, even though it would lead to lower quantization loss. However, the quantization error of activations plays an important role in the overall accuracy loss of a quantized model. To reduce the quantization loss of activations, several methods have been proposed, e.g., SPIQ, Outlier Suppression, and SmoothQuant. These three methods share a similar idea of shifting the difficulty from activation quantization to weight quantization, but they differ in how much of the difficulty is transferred.
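To see why activation outliers make this hard, consider the toy experiment below (our own illustration): a single outlier channel inflates the shared per-tensor scale and increases the reconstruction error for every other channel.

```python
import torch

def per_tensor_qdq_error(x: torch.Tensor) -> float:
    # Symmetric per-tensor INT8 quantize-dequantize, then measure the mean error.
    s = x.abs().max() / 127
    x_q = torch.clamp(torch.round(x / s), -128, 127)
    return (x - x_q * s).abs().mean().item()

act = torch.randn(16, 64)                 # "well-behaved" activations
print("error without outliers:", per_tensor_qdq_error(act))

act[:, 0] *= 100.0                        # inject one outlier channel
print("error with outliers:   ", per_tensor_qdq_error(act))
```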

Enhancing SmoothQuant

SmoothQuant introduces a hyperparameter α as a smoothing factor to calculate the per-channel scale and balance the quantization difficulty of activation and weight:

s_j = max(|X_j|)^α / max(|W_j|)^(1-α)

where j is the input channel index.

For most models, such as OPT and BLOOM, α = 0.5 is a well-balanced value to split the difficulty of weight and activation quantization. A larger α value could be used on models with more significant activation outliers to migrate more quantization difficulty to weights.
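As a concrete illustration of the transformation, the sketch below computes the per-channel smoothing scale for one linear layer and folds it into the activation and weight (a minimal example of the formula above; smooth_linear is our own helper, not an Intel Neural Compressor API):

```python
import torch

def smooth_linear(x: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Compute s_j = max|X_j|^alpha / max|W_j|^(1-alpha) per input channel j,
    then divide the activation and multiply the weight by s so that
    (X / s) @ (W * s).T equals X @ W.T mathematically while both operands
    become easier to quantize."""
    # x: [tokens, in_features], weight: [out_features, in_features]
    act_max = x.abs().amax(dim=0)                  # max|X_j| over tokens
    w_max = weight.abs().amax(dim=0)               # max|W_j| over output rows
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)  # per-channel smoothing scale
    s = s.clamp(min=1e-5)
    return x / s, weight * s

x = torch.randn(32, 768)
w = torch.randn(3072, 768)
x_s, w_s = smooth_linear(x, w, alpha=0.5)
assert torch.allclose(x @ w.t(), x_s @ w_s.t(), atol=1e-3)  # same FP32 output
```

Because the scale is folded into the weight, the smoothed layer is mathematically equivalent to the original; only its quantization behavior changes.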

The original SmoothQuant splits the quantization difficulty of weights and activations by using a single fixed α value for the entire model. However, because the distributions of activation outliers vary not only across models but also across layers within a model, we propose to determine layer-wise optimal α values automatically using Intel Neural Compressor.

Our method consists of five major steps (the pseudocode is shown below):

  1. Hook the input and output values of all layers using register_forward_hook.
  2. Generate a list of α values from the user-defined α range and step size.
  3. Recalculate the smoothing factor for a given α value and adjust the parameters (weights and activations).
  4. Perform per-channel quantization-dequantization of the weights and per-tensor quantization-dequantization of the inputs to predict the layer-wise output for the given α value.
  5. Calculate the mean-squared loss with respect to the actual FP32 output, restore the adjusted parameters, and save the layer-wise optimal α values.

Multiple criteria (e.g., min, max, and mean) are supported to determine the α value of an input LayerNorm operation of a transformer block. In our experiments, an α range of [0.3, 0.7] with a step_size of 0.05 is found to be well-balanced for the majority of models.
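Expressed as simplified Python, the layer-wise α search loops roughly as follows; this is our condensed rendering of the five steps above, not the actual Intel Neural Compressor implementation, and the helper functions are illustrative only:

```python
import torch

def qdq_weight_per_channel(w: torch.Tensor) -> torch.Tensor:
    # Per-input-channel symmetric INT8 quantize-dequantize of a weight matrix.
    s = w.abs().amax(dim=0, keepdim=True) / 127
    return torch.clamp(torch.round(w / s), -128, 127) * s

def qdq_input_per_tensor(x: torch.Tensor) -> torch.Tensor:
    # Per-tensor symmetric INT8 quantize-dequantize of an activation.
    s = x.abs().max() / 127
    return torch.clamp(torch.round(x / s), -128, 127) * s

def auto_tune_alpha(layers, captured_inputs, alpha_min=0.3, alpha_max=0.7, step=0.05):
    """For each linear layer, pick the alpha whose smoothed quant-dequant output
    is closest (in mean-squared error) to the original FP32 output."""
    best_alphas = {}
    alphas = torch.arange(alpha_min, alpha_max + 1e-9, step)       # step 2
    for name, layer in layers.items():
        x = captured_inputs[name]                                  # step 1: hooked input
        y_ref = x @ layer.weight.t()                               # reference FP32 output
        best_loss, best_alpha = float("inf"), None
        for alpha in alphas:
            # step 3: recompute the smoothing scale for this alpha
            s = x.abs().amax(dim=0).pow(alpha) / layer.weight.abs().amax(dim=0).pow(1 - alpha)
            s = s.clamp(min=1e-5)
            # step 4: quant-dequant the smoothed weight and input
            w_q = qdq_weight_per_channel(layer.weight * s)
            x_q = qdq_input_per_tensor(x / s)
            # step 5: mean-squared loss against the FP32 output
            loss = torch.mean((x_q @ w_q.t() - y_ref) ** 2).item()
            if loss < best_loss:
                best_loss, best_alpha = loss, float(alpha)
        best_alphas[name] = best_alpha      # original weights are left untouched
    return best_alphas

layer = torch.nn.Linear(768, 768, bias=False)
print(auto_tune_alpha({"fc": layer}, {"fc": torch.randn(16, 768)}))
```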

Two notable features of our method are that it is fully automated and that it supports more fusing patterns than the original approach. Sample code for performing SmoothQuant α auto-tuning on the BLOOM-1b7 model is provided below.

Sample Code to Enable Enhanced SmoothQuant
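A minimal sketch of enabling SmoothQuant α auto-tuning on BLOOM-1b7 is shown below. It assumes the Intel Neural Compressor 2.x PostTrainingQuantConfig / quantization.fit API with the smooth_quant recipe; the toy calibration data loader is only illustrative, and the runnable examples in the Intel Neural Compressor GitHub repository show a complete recipe.

```python
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "bigscience/bloom-1b7"
# torchscript=True lets Intel Neural Compressor trace the model with Torch JIT.
model = AutoModelForCausalLM.from_pretrained(model_name, torchscript=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy calibration loader: a handful of tokenized sentences. A real run should
# use a representative calibration dataset instead.
calib_texts = ["Intel Neural Compressor enables INT8 quantization of large language models."] * 8
calib_ids = [tokenizer(t, return_tensors="pt")["input_ids"].squeeze(0) for t in calib_texts]
calib_dataloader = DataLoader(calib_ids, batch_size=1)

# "alpha": "auto" turns on the layer-wise alpha tuning described above.
conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": "auto"}}
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
q_model.save("./bloom-1b7-int8")
```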

The user only needs to pass a model_name and a data loader. Please note that we rely on Torch JIT to analyze the model, so the user should set torchscript to True when loading a Hugging Face model or set return_dict to False. Please refer to the Intel Neural Compressor documentation for more information.

Results

A major advantage of our enhancement is improved accuracy. In evaluations on various popular LLMs, INT8 quantization with SmoothQuant α auto-tuning delivers better last-token prediction accuracy than the original INT8 SmoothQuant, measured against the FP32 baseline.

Table: Accuracy of the FP32 baseline, INT8 with and without SmoothQuant, and INT8 with our enhanced SmoothQuant

As the table shows, our enhancement achieves 5.4% and 1.6% higher accuracy than the default SmoothQuant on the OPT-1.3b and BLOOM-1b7 models, respectively. The quantized models are also up to 4x smaller than their FP32 counterparts, which significantly reduces the memory footprint.

Please see our GitHub repository for more comprehensive results. You are also welcome to submit pull requests or leave comments in GitHub issues. We look forward to your feedback and suggestions.
