Zero-point quantization: How do we get those formulas?

The motivation behind zero-point quantization and a derivation of its formulas, with a clear interpretation of the “zero-point”

Luis Antonio Vasquez
7 min read · Jan 27, 2024

Quantization techniques make it possible to run neural networks under constraints on memory or computation power, by trading away some precision. Neural network weights are usually trained in memory- and computation-heavy floating point representations. With quantization, we transform the weights into low-precision representations (usually 8-bit), which use significantly less memory.

Zero-point quantization is a technique that transforms the original floating point range into an 8-bit integer range (INT8). If you were to read articles like this one, you would be told that the way to convert a tensor X to INT8 is:

    s = 255 / (x_max - x_min)
    z = round(-128 - s · x_min)

    X_int8 = round(s · X) + z            (quantize)
    X_fp32 ≈ (X_int8 - z) / s            (dequantize)

Formulas for quantizing and dequantizing a tensor, where x_min and x_max are the minimum and maximum components of X

But where do these formulas come from? In this article, I show the motivation behind the zero-point quantization technique and derive the formulas used to calculate it.

Motivation for zero-point quantization

At its core, zero-point quantization uses re-scaling and shifting to project the floating point range of our tensors into the 8-bit range. What are re-scaling and shifting? Well, it is basically the same thing we do to convert between degrees Celsius and degrees Fahrenheit:

    °F = 1.8 · °C + 32

Here, the scaling factor is equal to 1.8 and the shifting factor (or offset) is equal to 32. This is how we convert from the Celsius scale to the Fahrenheit scale.
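As a quick sketch, the conversion can be written in a couple of lines of Python (the function name is my own, just for illustration):

```python
def rescale_and_shift(value, scale, offset):
    # An affine map: multiply by the scaling factor, then add the offset.
    return scale * value + offset

# Celsius -> Fahrenheit uses scale = 1.8 and offset = 32:
print(rescale_and_shift(100.0, 1.8, 32.0))  # -> 212.0 (water boils)
print(rescale_and_shift(0.0, 1.8, 32.0))    # -> 32.0 (water freezes)
```

Quantization will use exactly this shape of transformation, only with a scale and offset computed from the tensor itself.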

In the case of quantization, we start with a tensor whose values we want to project to the 8-bit range. Using 8 bits, we can only store 256 distinct numbers. The INT8 representation typically used for quantization can store the 256 integer numbers between -128 and 127.

(Check: the interval from an integer m to an integer n contains n - m + 1 integers, and 127 - (-128) + 1 = 256.)

The question now becomes, given a tensor X, how do we transform it to fit in the range from -128 to 127?

Remember that a tensor is simply an ordered collection of (floating point) numbers, its components:

    X = (x_1, x_2, …, x_N)

Any collection of numbers has a minimum number and a maximum number. Zero-point quantization considers a new scale, whose bottom is the minimum component of the tensor, and whose top is the maximum component, and projects it to the 8-bit range.


We take the minimum and maximum of the components of the tensor, and use them as the endpoints of a new scale, which we have to project to the integer range from -128 to 127. That is, let:

    x_min = min(X),   x_max = max(X)

To perform this transformation, we need to find a re-scaling factor s and an offset o that map each floating point component x_fp to an integer x_int:

    x_int = s · x_fp + o        (1)

This kind of transformation is colloquially called a linear transformation; if you are really rigorous, the literature calls it an affine transformation.

Deriving the formulas for zero-point quantization

How do we find the scaling factor and offset we need? We can use the fact that the value at the bottom of the scale on the left has to be mapped to the bottom of the scale on the right, and the same applies to the value at the top of the scale. Applying equation (1) to both endpoints gives us the following equations:

    -128 = s · x_min + o        (2)
     127 = s · x_max + o        (3)

If we subtract equation (2) from equation (3), we obtain:

    255 = s · (x_max - x_min)        (4)

That is, rearranging equation (4), we obtain the re-scaling factor:

    s = 255 / (x_max - x_min)        (5)

Now, to obtain the offset, we can use again equation (2):

    o = -128 - s · x_min        (6)

And plugging in the value we just obtained for the re-scaling factor:

    o = -128 - 255 · x_min / (x_max - x_min)        (7)

These look very similar to the initial equations shown at the top of the article. However, we are not quite there yet; there are some details we need to take care of.
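To make the derivation concrete, here is a minimal Python sketch of the result (my own function and variable names), before any rounding is applied:

```python
def scale_and_offset(xs):
    # Re-scaling factor and offset as derived above:
    #   s = 255 / (x_max - x_min)
    #   o = -128 - s * x_min
    x_min, x_max = min(xs), max(xs)
    s = 255.0 / (x_max - x_min)
    o = -128.0 - s * x_min
    return s, o

x = [-0.5, 0.1, 0.75, 2.0]   # a toy "tensor"
s, o = scale_and_offset(x)
print(s * min(x) + o)        # -128.0: the minimum lands on the bottom of INT8
print(s * max(x) + o)        #  127.0: the maximum lands on the top
```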

Dealing with rounding

Notice that in equation (1) we are projecting into the integer range from -128 to 127:

    x_int = s · x_fp + o,   with x_int an integer between -128 and 127

This means that the right-hand side of the equation needs to produce integer numbers. In zero-point quantization, we start by rounding the offset:

    z = round(o)        (8)

Equation (8) guarantees that the offset will be an integer number. Next, we round the rest of the right-hand side:

    x_int = round(s · x_fp + z)        (9)
    x_int = round(s · x_fp) + z        (10)

Here, we can take the step from equation (9) to equation (10) because we have already made sure that the offset z is an integer number, so it can be pulled out of the rounding.

By using rounding, we lose some of the original precision of the floating point representation!

Thus, we have derived the usual formulas for zero-point quantization! Now, what about de-quantization?
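Putting everything together, here is a sketch of the full quantization step in Python (my own naming; real implementations typically also clamp to [-128, 127] to guard against rounding edge cases):

```python
def quantize(xs):
    x_min, x_max = min(xs), max(xs)
    s = 255.0 / (x_max - x_min)        # re-scaling factor
    z = round(-128.0 - s * x_min)      # offset, rounded to an integer
    # x_int = round(s * x) + z, clamped to the INT8 range for safety
    q = [max(-128, min(127, round(s * v) + z)) for v in xs]
    return q, s, z

x = [-0.5, 0.1, 0.75, 2.0]
q, s, z = quantize(x)
print(q)    # every component now fits in INT8
```

The scale s and the integer offset z are returned alongside the quantized values, because they are needed later to de-quantize.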

De-quantization

After quantizing, how do we get back the original floating point values? Well, we can't, because we lost precision when rounding. However, we can obtain an approximation of the original values.

If we re-arrange equation (1), using the rounded offset z:

    x_int ≈ s · x_fp + z

we obtain the formula for de-quantization:

    x_fp ≈ (x_int - z) / s        (11)

Since we are now back in the floating point range, we do not need to care about rounding.
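A minimal round-trip sketch in Python (my own naming), showing how much precision the rounding cost us:

```python
def dequantize(q, s, z):
    # Invert x_int = round(s * x_fp) + z:  x_fp is approximately (x_int - z) / s
    return [(v - z) / s for v in q]

x = [-0.5, 0.1, 0.75, 2.0]
s = 255.0 / (max(x) - min(x))      # re-scaling factor (102.0 here)
z = round(-128.0 - s * min(x))     # rounded offset (-77 here)
q = [round(s * v) + z for v in x]  # quantize
x_hat = dequantize(q, s, z)        # approximate reconstruction
err = max(abs(a - b) for a, b in zip(x, x_hat))
print(err)  # at most 0.5 / s: the precision lost to rounding
```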

What does “zero-point” mean?

If you have read other articles about zero-point quantization, you may have noticed that I have not touched on the subject of the “zero-point”, which you can see in the original formulas at the beginning of this article.

So, what is the zero-point, and why do we care about it?

In several Machine Learning scenarios, the number 0 plays a role that goes beyond computation. For example, during gradient descent, ReLU gates are used to decide which weights of a Neural Network will be updated and which will not. If there is no change for a specific weight, a ReLU gate will output 0 for that weight. If a weight needs to be updated, a ReLU gate will output a value different from 0.

While the Neural Network is stored in floating point representation, checking for 0 is trivial. However, after quantizing, it is not obvious which value in the new range corresponds to the zero in the old range.

The zero-point is the integer number to which the floating point number 0 gets mapped after quantization.

How do we calculate the zero-point? This is fairly easy: we just need to plug in x_fp = 0 in the equation for quantization (equation 1)!

    x_int = s · 0 + o = o

(Because we are taking the 0 in the old floating point range to the new integer range.)

Thus, we obtain the following interpretation: the zero-point is the offset of the transformation!


Replacing the offset by the zero-point in the equations for quantization and de-quantization (equations 10 and 11), we obtain the usual formulas:

    x_int = round(s · x_fp) + z        (quantize)
    x_fp ≈ (x_int - z) / s             (dequantize)

where the zero-point is z = round(o) = round(-128 - s · x_min).
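As a quick check of this interpretation, a tensor that contains the value 0.0 quantizes it exactly onto z (toy example, my own variable names):

```python
x = [-0.5, 0.0, 0.75, 2.0]         # note the 0.0 component
s = 255.0 / (max(x) - min(x))      # re-scaling factor
z = round(-128.0 - s * min(x))     # zero-point (the rounded offset)
q = [round(s * v) + z for v in x]  # quantize every component
print(q[1], z)                     # the float 0.0 maps exactly onto z
```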

Finally, notice that for zero-point quantization, the re-scaling factor and zero-point (the offset) both depend on the tensor (because they are calculated using its minimum and maximum components, as in equations (5) and (7)). Additionally, notice that for de-quantization, we need to have stored somewhere the values of the scale factor and the zero-point obtained during quantization; the quantized tensor alone is not enough to recover them.
