Model Quantization 2: Uniform and Non-Uniform Quantization

Florian June
6 min read · Oct 31, 2023


In the previous article, the basic concepts of model quantization were introduced. This article focuses on two main quantization methods: Uniform Quantization and Non-Uniform Quantization.

We assume that we have well-trained model parameters θ, stored in floating-point precision. The goal of quantization is to reduce both the parameters θ and the intermediate activation values to low precision while minimizing the impact on the model’s generalization/accuracy.

To achieve this, we need to define a quantization operator: Q = g(x)

This quantization operator maps the floating-point value x to a quantized value Q.

As shown in Figure 1, quantization can be divided into Uniform Quantization and Non-Uniform Quantization, based on whether the quantized values Q (marked with the orange bullets) are uniformly spaced.

Figure 1

The left side of Figure 1 represents uniform quantization because the resulting quantized values are uniformly spaced.

The right side of Figure 1 represents non-uniform quantization because the quantized values are not necessarily uniformly spaced.

Uniform Quantization

Let [β, α] be the range of representable real values chosen for quantization and b be the bit-width of the signed integer representation.

Uniform quantization transforms the input value x ∈ [β, α] to lie within [−2^(b−1), 2^(b−1) − 1], where inputs outside the range [β, α] are clipped to the nearest bound.

For uniform transformations, there are only two choices for the transformation function: f(x) = s · x + z and its special case f(x) = s · x, where x, s, z ∈ R.

As shown in Figure 1, these two choices are also called affine and scale quantization, respectively:

Affine Quantization

Affine quantization maps a real value x ∈ R to a b-bit signed integer Q ∈ {−2^(b−1), ..., 2^(b−1) − 1}.

For the affine transformation function f(x) = s · x + z, the scale factor s and zero point z are defined as follows:

s = (2^b − 1) / (α − β)

z = −round(β · s) − 2^(b−1)

where [β, α] denotes the clipping range, the bounded interval to which real values are clipped, and b is the quantization bit width.

The purpose of the zero point is to identify which quantized value corresponds to 0 in the domain of x; this can also be observed on the left side of Figure 1.

The quantize operation is defined as follows:

AffineQuantize(x) = clip(round(s · x + z), −2^(b−1), 2^(b−1) − 1)

Based on the AffineQuantize operation, the AffineDeQuantize operation, which computes an approximation of the original real-valued input, is straightforward:

AffineDeQuantize(Q) = (1 / s) · (Q − z)

With the definitions and formulas provided above, I have written the following program to test affine quantization:

import numpy as np

def get_s_z(b, beta, alpha):
    # Scale factor and zero point for the clipping range [beta, alpha]
    # and a b-bit signed integer representation.
    s = (2 ** b - 1) / (alpha - beta)
    z = -np.round(beta * s) - 2 ** (b - 1)
    return s, z

def quantization(x, b, beta, alpha):
    # AffineQuantize: scale, shift by the zero point, round, and clip to
    # the representable integer range [-2^(b-1), 2^(b-1) - 1].
    s, z = get_s_z(b, beta, alpha)
    return np.clip(np.round(s * x + z), a_min=-2 ** (b - 1), a_max=2 ** (b - 1) - 1)

def dequantization(Q, b, beta, alpha):
    # AffineDeQuantize: approximate recovery of the original real value.
    s, z = get_s_z(b, beta, alpha)
    return (1 / s) * (Q - z)

x = [-1.8, -1.0, 0, 0.5]
Q = [quantization(a, 8, -1.8, 0.5) for a in x]
print(Q)

x_hat = [dequantization(a, 8, -1.8, 0.5) for a in Q]
print(x_hat)

Output is:

(py37) $ python quant.py 
[-128.0, -39.0, 72.0, 127.0]
[-1.803921568627451, -1.0011764705882353, 0.0, 0.49607843137254903]

We can see that the input [-1.8, -1.0, 0, 0.5] is quantized to [-128.0, -39.0, 72.0, 127.0], and dequantizing those values gives approximately [-1.804, -1.001, 0.0, 0.496], a close approximation of the original input. Note also that the real value 0 maps exactly to the zero point z = 72, as described above.

Scale Quantization

Scale quantization performs range mapping with only a scale transformation. It is commonly referred to as symmetric quantization, where the input range and integer range are symmetric around zero. This means that for int8 we use the integer range [−127, 127], opting not to use the value -128 in favor of symmetry. Figure 1b illustrates the mapping of real values to int8 with scale quantization.

For the scale transformation function f(x) = s · x, the scale factor s and the scale quantization operator are defined as follows:

s = (2^(b−1) − 1) / α

ScaleQuantize(x) = clip(round(s · x), −(2^(b−1) − 1), 2^(b−1) − 1)

For a real value x, scale quantization chooses the representable range [−α, α] and produces a b-bit integer value Q.

We can see that scale quantization has no zero point z and that β = −α; it is effectively a special case of affine quantization. Following the affine example above, a small scale-quantization sketch is shown below.
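Here is a minimal sketch of scale quantization based on the formulas above; the function names and the choice α = 1.8 are mine, chosen only for illustration:

import numpy as np

def scale_quantization(x, b, alpha):
    # Symmetric quantization: no zero point, representable range [-alpha, alpha],
    # integer range [-(2^(b-1) - 1), 2^(b-1) - 1] (i.e. [-127, 127] for int8).
    s = (2 ** (b - 1) - 1) / alpha
    return np.clip(np.round(s * x), a_min=-(2 ** (b - 1) - 1), a_max=2 ** (b - 1) - 1)

def scale_dequantization(Q, b, alpha):
    s = (2 ** (b - 1) - 1) / alpha
    return Q / s

x = [-1.8, -1.0, 0, 0.5]
Q = [scale_quantization(a, 8, 1.8) for a in x]
print(Q)  # expected quantized values: -127, -71, 0, 35
print([scale_dequantization(a, 8, 1.8) for a in Q])

Because there is no zero point, the real value 0 maps exactly to the integer 0.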

Non-Uniform Quantization

The formal definition of non-uniform quantization is as follows:

g(x) = q_i, if Δ_i ≤ x < Δ_(i+1)

where q_i represents the discrete quantization levels and Δ_i the quantization steps (thresholds). Specifically, when a real value x falls between the quantization steps Δ_i and Δ_(i+1), the quantizer maps it to the corresponding quantization level q_i. Note that neither the q_i's nor the Δ_i's are necessarily uniformly spaced.
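To make the definition concrete, here is a minimal sketch of a generic non-uniform quantizer; the particular levels q_i and thresholds Δ_i below are arbitrary values I chose purely for illustration:

import numpy as np

# Hypothetical, non-uniformly spaced quantization levels q_i and
# step thresholds delta_i, chosen only for demonstration.
levels = np.array([-1.0, -0.25, 0.0, 0.25, 1.0])
thresholds = np.array([-0.5, -0.1, 0.1, 0.5])

def nonuniform_quantize(x):
    # np.digitize returns, for each value, the index of the interval
    # [delta_i, delta_{i+1}) it falls into; that index selects the level q_i.
    return levels[np.digitize(x, thresholds)]

x = np.array([-1.8, -0.3, 0.05, 0.7])
print(nonuniform_quantize(x))  # projects to the levels -1.0, -0.25, 0.0, 1.0

In practice, the levels and thresholds are chosen (or learned) to match the distribution of the values being quantized.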

A typical rule-based non-uniform quantization method is to use a logarithmic distribution, where the quantization step size and levels increase exponentially rather than linearly. Another popular approach is binary code-based quantization.
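As a sketch of the logarithmic idea, the snippet below snaps each value to the nearest signed power of two, so the step size grows exponentially rather than linearly; the smallest exponent and the handling of zero are illustrative choices, not a standard implementation:

import numpy as np

def log2_quantize(x, min_exponent=-4):
    # Rule-based non-uniform quantization: map each value to the nearest
    # signed power of two, so levels grow exponentially, not linearly.
    x = np.asarray(x, dtype=float)
    sign = np.sign(x)
    magnitude = np.maximum(np.abs(x), 2.0 ** min_exponent)  # avoid log2(0)
    exponent = np.clip(np.round(np.log2(magnitude)), min_exponent, 0)
    return sign * 2.0 ** exponent

x = [-0.8, -0.3, 0.05, 0.6]
print(log2_quantize(x))  # snaps to powers of two: -1.0, -0.25, 0.0625, 0.5

One appeal of power-of-two levels is that multiplying by such a quantized weight can be implemented as a simple bit shift.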

Non-uniform quantization can achieve higher accuracy at a fixed bit width because it can allocate more resolution to important regions of the value distribution and adapt to its dynamic range.

Generally speaking, non-uniform quantization allows us to better capture signal information by non-uniformly allocating bits and discretizing parameter ranges. However, non-uniform quantization schemes are often difficult to efficiently deploy on general-purpose computing hardware such as GPUs and CPUs. Therefore, uniform quantization is currently the de facto method due to its simplicity and effective hardware mapping.

Conclusion

This article introduces two main quantization methods: Uniform Quantization and Non-Uniform Quantization.

Uniform Quantization can be divided into affine and scale, where scale is a special case of affine.

For non-uniform quantization, there are also many related studies, but uniform quantization is currently the de facto method due to its simplicity and effective hardware mapping.

If the opportunity arises, I will cover non-uniform quantization in more detail in a future article.

Furthermore, the latest AI-related content can be found in my newsletter.

Lastly, if there are any errors or omissions in this article, please kindly point them out.

References

A Survey of Quantization Methods for Efficient Neural Network Inference

8-bit Optimizers via Block-wise Quantization

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
