The Power of Quantization in ML: A PyTorch Tutorial Part 1

Ebad Sayed
8 min read · Jul 1, 2024



Large generative AI models, such as LLMs, can be so large that they are difficult to run on consumer-grade hardware. For instance, the smallest LLaMA2 model has 7 billion parameters. If each parameter is 32 bits, storing them requires 7 × 10⁹ parameters × 32 bits ÷ 8 bits per byte = 28 × 10⁹ bytes = 28 GB.
During inference, all parameters need to be loaded into memory, making it challenging to run large models on standard PCs or smartphones. Quantization is a key tool to address this issue.
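As a quick sanity check of that arithmetic, here is a small sketch (the 7-billion figure is the LLaMA2-7B parameter count; the helper name is just for illustration) that computes the storage needed at a few precisions:

def model_size_gb(num_params, bits_per_param):
    # parameters x bits, divided by 8 bits per byte, reported in decimal GB
    return num_params * bits_per_param / 8 / 1e9

n_params = 7e9  # LLaMA2-7B
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(n_params, bits):.1f} GB")
# OUTPUT -->
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB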

How do we decide whether to compress a model to int8 or to float16? One consideration is speed: much like for humans, floating-point arithmetic is generally slower for computers than integer arithmetic.

Quantization involves representing model weights in lower precision. For example, a 3×3 matrix stored in float32, the default data type for most models, allocates 4 bytes per parameter (32-bit precision), resulting in a total memory footprint of 36 bytes. If we quantize the weights to 8-bit precision (int8), each parameter only requires 1 byte, reducing the total storage to just 9 bytes. However, this reduction introduces quantization error. The challenge of state-of-the-art quantization methods is to minimize this error to prevent performance degradation. Using quantization, a model that typically requires 10 GB of storage can be compressed to less than 1 GB, depending on the type of quantization applied.
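To make the 36-byte vs 9-byte comparison concrete, here is a small PyTorch sketch (the values are random placeholders; only the storage sizes matter):

import torch

# A 3x3 weight matrix in float32 vs int8
w_fp32 = torch.randn(3, 3, dtype=torch.float32)
w_int8 = torch.randint(-128, 128, (3, 3), dtype=torch.int8)

print(w_fp32.element_size() * w_fp32.nelement())  # OUTPUT --> 36 (4 bytes x 9 elements)
print(w_int8.element_size() * w_int8.nelement())  # OUTPUT --> 9  (1 byte x 9 elements)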

Advantages of Quantization

Less memory consumption when loading models
Less inference time due to simpler data types
Less energy consumption, because inference takes less computation overall

Data Types and Sizes

Integers

An unsigned integer data type represents non-negative integers. The range of an n-bit unsigned integer is [0, 2ⁿ − 1]. For an 8-bit unsigned integer, the minimum value is 0 and the maximum is 255. The computer allocates a sequence of 8 bits to store it, and decoding works as follows: every bit that is 0 contributes nothing, while every bit that is 1 contributes a power of 2. Specifically, the first bit corresponds to 2⁰, the second bit to 2¹, and so on, with the 8th bit corresponding to 2⁷.
For signed integers (used to represent negative or positive integers), 2's complement representation is used. The range is [−2ⁿ⁻¹, 2ⁿ⁻¹ − 1]; for 8 bits (torch.int8) that is [−128, 127]. Here the bit in the last position (the left-most, most significant bit) carries a negative weight of −2ⁿ⁻¹.
Unsigned interpretation: [1,0,0,0,1,0,0,1] → 2⁰ + 0 + 0 + 2³ + 0 + 0 + 0 + 2⁷ = 1 + 8 + 128 = 137.
Signed (2's complement) interpretation: [1,0,0,0,1,0,0,1] → 2⁰ + 0 + 0 + 2³ + 0 + 0 + 0 + (−2⁷) = 1 + 8 − 128 = −119.
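To verify those two decodings, here is a short pure-Python sketch (the bit list is written with the first entry weighing 2⁰, matching the text above):

# Decode the bit pattern [1,0,0,0,1,0,0,1], where bits[i] has weight 2^i
bits = [1, 0, 0, 0, 1, 0, 0, 1]

unsigned = sum(b * 2**i for i, b in enumerate(bits))   # every weight is positive
signed = unsigned - 2**8 if bits[-1] else unsigned     # 2's complement: the 8th bit weighs -2^7

print(unsigned)  # OUTPUT --> 137
print(signed)    # OUTPUT --> -119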


Here we will use PyTorch's torch.iinfo() method. It is similar to NumPy's np.iinfo() and returns the smallest and largest values that can be represented by a given integer data type.

# Information of 8-bit unsigned integer
torch.iinfo(torch.uint8)
# OUTPUT --> iinfo(min=0, max=255, dtype=uint8)
# Information of 8-bit signed integer
torch.iinfo(torch.int8)
# OUTPUT --> iinfo(min=-128, max=127, dtype=int8)
# Information of 16-bit signed integer
torch.iinfo(torch.int16)
# OUTPUT --> iinfo(min=-32768, max=32767, dtype=int16)
# Information of 32-bit signed integer
torch.iinfo(torch.int32)
# OUTPUT --> iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)
# Information of 64-bit signed integer
torch.iinfo(torch.int64)
# OUTPUT --> iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)

Floating Point

There are three components:
sign :- positive or negative; always 1 bit
exponent (range) :- controls the representable range of the number
fraction (precision) :- controls the precision of the number

Here precision means how finely a value can be represented, e.g., being able to store 0.4999999 rather than just 0.5.
FP32, BF16, FP16, and FP8 are floating-point formats that differ in the number of bits assigned to the exponent and the fraction.

1. Floating Point 32

sign :- 1 bit
exponent (range) :- 8 bit
fraction (precision) :- 23 bit
Total :- 32 bit
For positive values we can represent numbers as small as about 10⁻⁴⁵ (subnormal) and as large as about 3.4 × 10³⁸; for negative values the range is the same with a minus sign in front. There are two formulas to decode the bit sequence: for the very small, so-called subnormal values (E = 0), the value is (−1)^S × F × 2⁻¹²⁶, and for normal values (E ≠ 0) it is (−1)^S × (1 + F) × 2^(E − 127). This data type is very important in ML since most models store their weights in FP32.
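As an illustration of those two decoding formulas, here is a small sketch in plain Python (the function name is hypothetical) that splits a float into its sign, exponent, and fraction bits and reassembles the value:

import struct

def decode_fp32(x):
    # Reinterpret the float's 4 bytes as a 32-bit unsigned integer
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    S = bits >> 31                      # 1 sign bit
    E = (bits >> 23) & 0xFF             # 8 exponent bits
    F = (bits & 0x7FFFFF) / 2**23       # 23 fraction bits, as a value in [0, 1)
    if E == 0:                          # subnormal: (-1)^S * F * 2^-126
        return (-1)**S * F * 2**-126
    return (-1)**S * (1 + F) * 2**(E - 127)   # normal: (-1)^S * (1+F) * 2^(E-127)

print(decode_fp32(0.15625))  # OUTPUT --> 0.15625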

2. Floating Point 16

sign :- 1 bit
exponent (range) :- 5 bit
fraction (precision) :- 10 bit
Total :- 16 bit
Here we have only 5 bits for the exponent and 10 for the fraction, so the smallest positive value is about 6 × 10⁻⁸ (subnormal) and the largest is about 6.5 × 10⁴ (65,504).

3. Brain Floating Point 16

sign :- 1 bit
exponent (range) :- 8 bit
fraction (precision) :- 7 bit
Total :- 16 bit
Here we have 8 bits for the exponent and 7 for the fraction, so the smallest positive value is about 9 × 10⁻⁴¹ (subnormal) and the largest is about 3.4 × 10³⁸. Compared with FP16 we have much more range to store, but the downside is lower precision.

FP32 → best precision → max ~ 3.4 × 10³⁸
FP16 → better precision → max ~ 6.5 × 10⁴
BF16 → good precision → max ~ 3.4 × 10³⁸
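A quick way to see the range difference: a value around 10⁵ overflows FP16 but still fits in BF16 (with some rounding). A minimal sketch:

import torch

x = torch.tensor(100_000.0)
print(x.to(torch.float16))   # OUTPUT --> tensor(inf, dtype=torch.float16)      (exceeds the ~65504 max)
print(x.to(torch.bfloat16))  # OUTPUT --> tensor(99840., dtype=torch.bfloat16)  (in range, but rounded)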

# by default, python stores float data in FP64
value = 1/3
# Let's check the number that we stored till 60 decimal values
format(value, '.60f')
# OUTPUT --> '0.333333333333333314829616256247390992939472198486328125000000'
tensor_fp64 = torch.tensor(value, dtype = torch.float64)
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")

# fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
# fp32 tensor: 0.333333343267440795898437500000000000000000000000000000000000
# fp16 tensor: 0.333251953125000000000000000000000000000000000000000000000000
# bf16 tensor: 0.333984375000000000000000000000000000000000000000000000000000

Observe that the fewer bits we have, the less precise the approximation becomes. As mentioned above, precision is worst for bfloat16: its stored value matches 1/3 only up to about the third decimal place.

# Information of `16-bit floating point`
torch.finfo(torch.float16)
# OUTPUT --> finfo(resolution=0.001, min=-65504, max=65504, eps=0.000976562, smallest_normal=6.10352e-05, tiny=6.10352e-05, dtype=float16)
# Information of `16-bit brain floating point`
torch.finfo(torch.bfloat16)
# OUTPUT --> finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)
# Information of `32-bit floating point`
torch.finfo(torch.float32)
# OUTPUT --> finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=float32)
# Information of `64-bit floating point`
torch.finfo(torch.float64)
# OUTPUT --> finfo(resolution=1e-15, min=-1.79769e+308, max=1.79769e+308, eps=2.22045e-16, smallest_normal=2.22507e-308, tiny=2.22507e-308, dtype=float64)

Downcasting

Downcasting happens when we convert a value from a higher-precision data type to a lower-precision one. The value is converted to the nearest representable value in the lower data type. For example, the FP32 value 0.1 downcast to an 8-bit integer becomes 0, so there is a loss of information.
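For instance, a quick sketch of that loss in PyTorch (casting a float tensor to an integer dtype truncates toward zero, while casting to FP16 merely rounds):

import torch

x = torch.tensor([0.1, 0.9, -1.7], dtype=torch.float32)
print(x.to(torch.int8))     # OUTPUT --> tensor([ 0,  0, -1], dtype=torch.int8)   (fractional part lost)
print(x.to(torch.float16))  # OUTPUT --> tensor([ 0.1000,  0.8999, -1.7002], dtype=torch.float16)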

Advantages

  1. Reduced memory footprint.
    More efficient use of GPU memory.
    Enables the training of larger models.
    Enables larger batch sizes.
  2. Increased computation speed.
    Computation in low precision (FP16, BF16) can be faster than in FP32 since less memory has to be moved.
    The actual speedup depends on the hardware (e.g., Google TPU, NVIDIA A100).

Disadvantages

Less precision: we are using fewer bits, so computations are less precise.

Use case of Downcasting

Mixed precision training (a minimal sketch follows below):
- Do the computation in lower precision (FP16/BF16/FP8).
- Store and update the weights in higher precision (FP32).
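Here is a minimal sketch of that recipe using PyTorch's automatic mixed precision; it assumes a CUDA GPU, and the model, data, and hyperparameters are placeholders:

import torch

model = torch.nn.Linear(512, 10).cuda()             # placeholder model; weights stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                # rescales the loss to avoid FP16 gradient underflow

x = torch.randn(32, 512, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.cross_entropy(model(x), y)   # forward pass runs in FP16

scaler.scale(loss).backward()   # backward pass on the scaled loss
scaler.step(optimizer)          # the weight update itself happens in FP32
scaler.update()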

Let's create a tensor of FP32 data type and downcast it to BF16 using PyTorch's .to() method.

tensor_fp32 = torch.rand(1000, dtype = torch.float32)
tensor_fp32[:5]
# OUTPUT --> tensor([0.9997, 0.9861, 0.8572, 0.2733, 0.2319])
tensor_fp32_to_bf16 = tensor_fp32.to(dtype = torch.bfloat16)
tensor_fp32_to_bf16[:5]
# OUTPUT --> tensor([1.0000, 0.9844, 0.8555, 0.2734, 0.2314], dtype=torch.bfloat16)

We can see that after downcasting the values change slightly but remain very close to the originals. Let's check the impact of downcasting on multiplication using PyTorch's torch.dot() method: first we take the dot product of the FP32 tensor with itself, then of the BF16 tensor with itself.

m_float32 = torch.dot(tensor_fp32, tensor_fp32)
m_float32
# OUTPUT --> tensor(313.5908)
m_bfloat16 = torch.dot(tensor_fp32_to_bf16, tensor_fp32_to_bf16)
m_bfloat16
# OUTPUT --> tensor(314., dtype=torch.bfloat16)

Loading ML Models with Different Data Types

Each layer in a model contains weights that are used during inference to generate predictions. Typically, these weights are stored as matrices of learnable parameters, which can be represented with different precisions. For instance, consider a model with 12 layers, where each layer’s weights have n parameters stored in 32-bit precision. Inspecting the data type of a model is essentially the same as inspecting the data type of the model’s weights.

from transformers import BlipForConditionalGeneration
model_name = "Salesforce/blip-image-captioning-base"
model = BlipForConditionalGeneration.from_pretrained(model_name)
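To confirm the precision the model was loaded in, we can inspect the dtype of its parameters directly:

# Every parameter should report torch.float32, the default
for name, param in list(model.named_parameters())[:3]:
    print(name, param.dtype)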

To get the memory footprint (how much memory is required in MBs, GBs, etc) of the model we can use the model.get_memory_footprint() method.

fp32_mem_footprint = model.get_memory_footprint()
print("Footprint of the fp32 model in bytes: ", fp32_mem_footprint)
print("Footprint of the fp32 model in MBs: ", fp32_mem_footprint/1e+6)
# Footprint of the fp32 model in bytes: 989660400
# Footprint of the fp32 model in MBs: 989.6604

To load the model in a different precision, just pass the torch_dtype parameter to the from_pretrained() function.

model_bf16 = BlipForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.bfloat16)
bf16_mem_footprint = model_bf16.get_memory_footprint()

relative_diff = bf16_mem_footprint / fp32_mem_footprint

print("Footprint of the bf16 model in MBs: ", bf16_mem_footprint/1e+6)
print(f"Relative diff: {relative_diff}")
# Footprint of the bf16 model in MBs: 494.832248
# Relative diff: 0.5000020693967345

We can see that the model’s size is reduced to half the original size.

Comparing the FP32 and BF16 Model’s Performance

from PIL import Image
from IPython.display import display
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained(model_name)

def get_generation(model, processor, image, dtype):
    inputs = processor(image, return_tensors="pt").to(dtype)
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

def load_image(img):
    image = Image.open(img).convert('RGB')
    return image

image = load_image('dinner.jpg')
display(image.resize((500, 350)))
results_fp32 = get_generation(model,processor,image,torch.float32)
print("fp32 Model Results:\n", results_fp32)
# fp32 Model Results: a group of women sitting around a table eating
results_bf16 = get_generation(model_bf16,processor,image,torch.bfloat16)
print("bf16 Model Results:\n", results_bf16)
# bf16 Model Results: a group of women sitting around a table eating

In both cases, the results are quite similar and accurate. In general, the generated tokens can differ because small errors between the FP32 logits and the BF16 logits accumulate across layers. Since generation is auto-regressive (each step uses the results of the previous iteration), these errors can keep building up and eventually change the model's prediction. However, this generally doesn't affect performance significantly. We can use BF16 on the CPU and FP16 on the GPU without issues.

Until now, we have been loading the model in FP32 and then casting it to BF16, which can be problematic in practice, especially in production: it requires loading the larger model first and only then converting it to the desired data type, which is not memory efficient. Instead, we can load the model in the desired dtype from the start. One way to achieve this is torch.set_default_dtype(), which sets PyTorch's default floating-point dtype so that the model is created in the desired data type right away.
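Here is a minimal sketch of that idea with a plain PyTorch module (for Hugging Face models, the torch_dtype argument shown earlier achieves the same effect at load time):

import torch

desired_dtype = torch.bfloat16

torch.set_default_dtype(desired_dtype)   # newly created floating-point parameters default to bf16
model = torch.nn.Linear(4, 4)            # placeholder module; its weights are created directly in bf16
print(model.weight.dtype)                # OUTPUT --> torch.bfloat16

torch.set_default_dtype(torch.float32)   # restore the default so the rest of the code is unaffected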


Ebad Sayed

I am currently a final year undergraduate at IIT Dhanbad, looking to help out aspiring AI/ML enthusiasts with easy AI/ML guides.