8-bit Quantization on PyTorch

Hongze
AI2 Labs
3 min read · Jun 11, 2020

What is Quantization?

Quantization refers to a technique that uses fewer bits than floating-point precision for calculation and storage. A quantized model uses integer tensors instead of floating-point tensors to perform some or all of its operations. This is a more compact model representation and can take advantage of high-performance vector operations on many hardware platforms. PyTorch supports INT8 quantization: compared to FP32, the model size is reduced by 4x and the memory bandwidth requirement is also reduced by 4x. Hardware support for INT8 operations typically makes computation 2–4 times faster than FP32. Quantization is mainly a technique to accelerate inference; quantized operators support only the forward pass.

PyTorch supports multiple quantization approaches for deep learning models. In most cases, the model is trained in FP32 and then converted to an INT8 model. In addition, PyTorch supports quantization-aware training, which models the errors introduced by quantization and performs the forward and backward passes through fake-quantization modules. Note that during this training all calculations are still performed on floating-point numbers. At the end of quantization-aware training, PyTorch provides a conversion tool to convert the trained model to lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and perform calculations with them. These tensors can be used to build models directly and perform all calculations at low precision. PyTorch also provides a high-level API that implements a typical workflow for converting an FP32 model to a low-precision model with minimal loss of accuracy.

How to Quantize Tensors?

PyTorch provides both per-tensor and per-channel asymmetric linear quantization. Per-tensor means that all values in the tensor are scaled in the same way, with a single scale and zero point. Per-channel means that, along a given dimension (usually the channel dimension of the tensor), each slice uses its own scale and zero point, so the scales and zero points can be represented by vectors; this keeps quantization errors smaller.
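
As a minimal sketch (the tensor shape, scales, and zero points below are made up for illustration), the two schemes look like this in PyTorch's low-level API:

```python
import torch

# A small FP32 weight-like tensor to quantize (illustrative values).
x = torch.randn(4, 8)

# Per-tensor: one scale and one zero point shared by every element.
q_per_tensor = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)
print(q_per_tensor.q_scale(), q_per_tensor.q_zero_point())

# Per-channel: one (scale, zero_point) pair per slice along dim 0
# (typically the output-channel dimension of a weight tensor).
scales = torch.tensor([0.05, 0.04, 0.06, 0.05], dtype=torch.double)
zero_points = torch.zeros(4, dtype=torch.long)
q_per_channel = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)
print(q_per_channel.q_per_channel_scales())
print(q_per_channel.q_per_channel_zero_points())
```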

The conversion process from floating point to fixed point uses the following mapping equation:

Q(x, scale, zero_point) = round(x / scale + zero_point)

It is worth noting that the floating-point zero is represented without any loss before and after quantization: it maps exactly to the fixed-point zero point, which ensures that no quantization error is introduced by operations such as zero padding.
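
To make the mapping concrete, here is a small sketch (the scale and zero point are arbitrary values chosen for illustration) showing that 0.0 lands exactly on the zero point:

```python
import torch

# Illustrative parameters: scale = 0.1, zero_point = 10.
x = torch.tensor([-1.0, 0.0, 0.3, 1.0])
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

# Stored integers follow round(x / scale + zero_point):
print(q.int_repr())    # approximately tensor([ 0, 10, 13, 20], dtype=torch.uint8)

# Float 0.0 maps exactly to the zero point (10), so no error is introduced for it:
print(q.dequantize())  # approximately tensor([-1.0, 0.0, 0.3, 1.0])
```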

In order to quantize in PyTorch, we need to be able to represent quantized data with tensors. A quantized tensor stores both the quantized data (represented as int8/uint8/int32) and the quantization parameters, such as the scale and zero point. Quantized tensors support many operations, which makes quantized arithmetic easy, and they can also be serialized.
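
A minimal sketch of what a quantized tensor carries (the tensor values and file name are placeholders):

```python
import torch

x = torch.randn(2, 3)
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=128, dtype=torch.quint8)

# The tensor carries both the integer data and the quantization parameters.
print(q.int_repr())                   # underlying uint8 storage
print(q.q_scale(), q.q_zero_point())  # scale and zero point

# Convert back to FP32 when needed.
y = q.dequantize()

# Quantized tensors can be saved and loaded like ordinary tensors.
torch.save(q, "q_tensor.pt")
q_loaded = torch.load("q_tensor.pt")
```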

How to Quantize Models?

PyTorch provides three approaches to quantizing models.

  1. Dynamic Quantization: This is the simplest form of quantization to apply: the weights are quantized ahead of time, while the activations are quantized dynamically during inference. It is suited to situations where model execution time is dominated by loading weights from memory rather than computing matrix multiplications, which is typically the case for LSTM and Transformer-type models with small batch sizes. Applying dynamic quantization to a whole model takes a single call to torch.quantization.quantize_dynamic() (see the first sketch after this list).
  2. Post-Training Static Quantization: This is the most commonly used form of quantization. The weights are quantized ahead of time, and the scale and zero point for each activation tensor are pre-computed by observing the behavior of the model on representative data during a calibration step. Post-training static quantization is typically used when both memory-bandwidth and compute savings are important, with CNNs being a typical use case (see the second sketch after this list).
  3. Quantization Aware Training: In rare cases where post-training quantization does not provide adequate accuracy, training can be done with simulated quantization using the torch.quantization.FakeQuantize module. Computations take place in FP32, but values are clamped and rounded to simulate the effects of INT8 quantization. The sequence of steps is very similar to static quantization (see the third sketch after this list).
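
Below is a minimal sketch of dynamic quantization (the toy model, layer sizes, and input shape are made up for illustration):

```python
import torch
import torch.nn as nn

# A toy FP32 model; LSTM/Transformer-style layers dominated by weight loading
# benefit most, but nn.Linear is enough to show the API.
model_fp32 = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Quantize the weights of all Linear layers to INT8; activations are
# quantized on the fly during inference.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # module types to quantize
    dtype=torch.qint8,
)

out = model_int8(torch.randn(1, 64))
```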
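
Next, a minimal sketch of post-training static quantization with the eager-mode API (the module M, its layer sizes, the fbgemm backend choice, and the random calibration data are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Quant/DeQuant stubs mark where tensors switch between FP32 and INT8.
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = M().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
prepared = torch.quantization.prepare(model)   # insert observers

# Calibration: run representative data so observers record activation ranges.
for _ in range(10):
    prepared(torch.randn(1, 3, 32, 32))

quantized = torch.quantization.convert(prepared)  # swap in INT8 modules
```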
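
Finally, a minimal sketch of quantization-aware training (again, the toy module, layer sizes, loss, and training loop are placeholders):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = M().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.quantization.prepare_qat(model)   # inserts FakeQuantize modules

# Ordinary FP32 training loop; fake-quant simulates INT8 rounding and clamping.
optimizer = torch.optim.SGD(prepared.parameters(), lr=0.01)
for _ in range(3):
    out = prepared(torch.randn(8, 16))
    loss = out.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

prepared.eval()
quantized = torch.quantization.convert(prepared)   # final INT8 model
```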

In the following articles, I will walk through some examples of how to apply quantization in a real-world project. See you next time!

Tian Hongze | AI Frontier | AI Practitioner | Yoozoo.{AI}