The Power of Quantization in ML: A PyTorch Tutorial Part 5

Ebad Sayed
8 min read · Jul 1, 2024



In the previous article we learned how to build our own custom 8-bit quantizer and how to quantize open-source PyTorch models. In this article we will learn how to load quantized weights from the Hugging Face Hub, how to pack and unpack weights, and how to fine-tune a quantized model.

Previous Article: Mastering Quantization Part 4

Load Quantized Weights from HF Hub

It is not efficient to first load the model in its default dtype and then quantize it. In practice, we can quantize the model once on a large instance: if we have access to a big machine, we quantize the model there and push the quantized weights to the cloud (for example the Hugging Face Hub). On our own machine we can then directly load the model in 8-bit or even lower precision.

Memory Efficient Model Loading

You’ll need your own Hugging Face username for this to run; add it to YOUR_HF_USERNAME = "" in the snippet below.

from huggingface_hub import HfApi, create_repo

YOUR_HF_USERNAME = ""
your_repo_id = f"{YOUR_HF_USERNAME}/opt-125m-quantized-dlai"

api = HfApi()

# create_repo(your_repo_id)  # uncomment to create the repo the first time

api.upload_file(
    path_or_fileobj="quantized_state_dict.pth",
    path_in_repo="quantized_state_dict.pth",
    repo_id=your_repo_id
)

Using this method we can push our quantized weights onto the Hugging Face Hub.
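For completeness, here is a minimal sketch of how the quantized_state_dict.pth file uploaded above could be produced on the large machine. It assumes the W8A16LinearLayer and the quantize-and-replace helper from the previous article (referred to here as replace_linear_with_target_and_quantize) are already defined.

# Minimal sketch, run on the large machine; helper names are assumed from the previous article.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Quantize every linear layer except the lm_head, then save only the state dict.
replace_linear_with_target_and_quantize(model, W8A16LinearLayer, ["lm_head"])
torch.save(model.state_dict(), "quantized_state_dict.pth")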

Load the Model on the Meta Device

We will first load the skeleton of the model in order to get its architecture. After this we just need to replace all the linear layers with the quantized layers, without quantizing the model, since we don’t have access to the weights: they all live on the meta device, meaning they are never initialized. Then we can call load_state_dict, which will assign the correct (already quantized) weights. This way we save CPU memory, because we directly load the quantized version of the model.

model_id = "./models/facebook/opt-125m"
config = AutoConfig.from_pretrained(model_id)

with torch.device("meta"):
model = OPTForCausalLM(config)

tokenizer = AutoTokenizer.from_pretrained(model_id)


for param in model.parameters():
print(param)

Here we have loaded the config of the model to get the detailed architecture. The model has been created, but its tensors are not initialized at all, so we have a bunch of meta tensors that take no RAM.
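As a quick sanity check of why this is memory efficient: a tensor created on the meta device carries only shape and dtype metadata, so even a tensor that would need about 4 GB in float32 allocates nothing. A small sketch:

import torch

with torch.device("meta"):
    t = torch.empty(1_000_000_000)  # would be ~4 GB of float32 data on a real device

print(t.device)  # meta
print(t.shape)   # torch.Size([1000000000])
# The tensor has no storage; its values cannot be read until it is materialized.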

replace_linear_with_target(model, W8A16LinearLayer, ["lm_head"])

After replacing the linear layers, we load the state dict.

from huggingface_hub import hf_hub_download
from transformers import pipeline

state_dict_cache_path = hf_hub_download("ybelkada/opt-125m-quantized-dlai", "quantized_state_dict.pth")
state_dict = torch.load(state_dict_cache_path)
model.load_state_dict(state_dict, strict=True, assign=True)
# OUTPUT --> <All keys matched successfully>
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Hello today I am", max_new_tokens=40)
# OUTPUT --> [{'generated_text': 'Hello today I am a student at the University of California, San Diego.\nI am a student at the University of California, San Diego.\nI am a student at the University of California, San Diego.\n'}]


pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Hello today I am giving a course about", max_new_tokens=10)
# OUTPUT --> [{'generated_text': 'Hello today I am giving a course about the history of the world and the history of the'}]

Weights Packing

Assume we want to quantize our model to 4-bit precision and store the weights in a torch tensor. Ideally we would create a tensor and pass dtype=torch.int4, but the problem is that PyTorch has no native support for 4-bit weights.
import torch
tensor = torch.tensor([0,1], dtype=torch.int4) # is not supported!
The only workaround is to save the tensor in 8-bit instead of 4-bit, since int8 is currently the smallest-precision integer dtype available in PyTorch. But this creates overhead for large models: with the naive approach of storing one 4-bit weight per 8-bit element, there is no point in quantizing to 4-bit at all, because every parameter still occupies 8 bits. That is why we need to pack the 4-bit weights into an 8-bit tensor.
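A rough back-of-the-envelope calculation for a 125M-parameter model shows why packing matters (the numbers below are illustrative and count weights only):

num_params = 125_000_000

naive_4bit_in_int8 = num_params * 1   # one 4-bit value per int8 element -> 1 byte each
packed_4bit = num_params // 2         # two 4-bit values per int8 element -> 0.5 byte each

print(naive_4bit_in_int8 / 1e6, "MB")  # 125.0 MB
print(packed_4bit / 1e6, "MB")         # 62.5 MB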

How does packing work?

Consider the tensor given below, which stores 4 values that can each be represented in 2-bit precision but are stored in 8-bit.
import torch
unpacked_tensor = torch.tensor([1,0,3,2], dtype=torch.int8)

Currently this tensor is stored as [00000001 00000000 00000011 00000010]. This is not optimal: we allocate 4 × 8 bits of memory to store weights that can each be encoded in only 2 bits.
To solve this, we can “pack” all these data points into a single 8-bit value, [10110001] → (10, 11, 00, 01). Interpreted as uint8, this value ends up being 177.
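You can verify this value quickly in Python:

print(int("10110001", 2))                  # 177
# Building it up two bits at a time from the values 1, 0, 3, 2:
print(1 << 0 | 0 << 2 | 3 << 4 | 2 << 6)   # 177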

Advantages: it reflects the “true” memory footprint of the quantized weights.

Disadvantages: the number of elements in the unpacked tensor must be a multiple of 8 // nbits, and the weights need to be unpacked before performing any operation.

Packing 2-bit Weights

# Example Tensor: [1, 0, 3, 2]
# 1 0 3 2 - 01 00 11 10

# Starting point of packed int8 Tensor
# [0000 0000]

##### First Iteration Start:
# packed int8 Tensor State: [0000 0000]
# 1 = 0000 0001
# 0000 0001
# No left shifts in the First Iteration
# After bit-wise OR operation between 0000 0000 and 0000 0001:
# packed int8 Tensor State: 0000 0001
##### First Iteration End

##### Second Iteration Start:
# packed int8 Tensor State: [0000 0001]
# 0 = 0000 0000
# 0000 0000
# 2 left shifts:
# [0000 0000] (1 shift)-> 0000 0000 (2 shift)-> 0000 0000
# After bit-wise OR operation between 0000 0001 and 0000 0000:
# packed int8 Tensor State: 0000 0001
##### Second Iteration End

##### Third Iteration Start:
# packed int8 Tensor State: [0000 0001]
# 3 = 0000 0011
# 0000 0011
# 4 left shifts:
# [0000 0011] (1 shift)-> 0000 0110 (2 shift)-> 0000 1100
# 0000 1100 (3 shift)-> 0001 1000 (4 shift)-> 0011 0000
# After bit-wise OR operation between 0000 0001 and 0011 0000:
# packed int8 Tensor State: 0011 0001
##### Third Iteration End

##### Fourth Iteration Start:
# packed int8 Tensor State: [0011 0001]
# 2 = 0000 0010
# 0000 0010
# 6 left shifts:
# [0000 0010] (1 shift)-> 0000 0100 (2 shift)-> 0000 1000
# 0000 1000 (3 shift)-> 0001 0000 (4 shift)-> 0010 0000
# 0010 0000 (5 shift)-> 0100 0000 (6 shift)-> 1000 0000
# After bit-wise OR operation between 0011 0001 and 1000 0000:
# packed int8 Tensor State: 1011 0001
##### Fourth Iteration End

# Final packed int8 Tensor State: [1011 0001]
def pack_weights(uint8tensor, bits):
    if uint8tensor.shape[0] * bits % 8 != 0:
        raise ValueError(f"The input shape needs to be a multiple "
                         f"of {8 // bits} - got {uint8tensor.shape[0]}")

    num_values = uint8tensor.shape[0] * bits // 8

    num_steps = 8 // bits  # ----> 4 for 2-bit weights

    unpacked_idx = 0

    packed_tensor = torch.zeros((num_values), dtype=torch.uint8)

    # 1 0 3 2 - 01 00 11 10 --> for each pair of bits we retrieve the corresponding value

    # [0000 0000] | 0000 0001 ==== packed_tensor |= unpacked value (shifted left by bits*j)

    # 0000 0001 --> result after bitwise OR between [0000 0000] and [0000 0001]

    # 0000 0000 shifted left by 2 --> 0000 0000, OR leaves 0000 0001 unchanged

    # 0000 0011 shifted left by 4 --> 0011 0000, OR gives 0011 0001

    # 0000 0010 shifted left by 6 --> 1000 0000, OR gives 1011 0001

    for i in range(num_values):
        for j in range(num_steps):
            packed_tensor[i] |= uint8tensor[unpacked_idx] << (bits * j)
            unpacked_idx += 1
    return packed_tensor
unpacked_tensor = torch.tensor([1, 0, 3, 2], dtype=torch.uint8)
pack_weights(unpacked_tensor, 2)
# OUTPUT --> tensor([177], dtype=torch.uint8)

unpacked_tensor = torch.tensor([1, 0, 3, 2, 3, 3, 3, 3], dtype=torch.uint8)
pack_weights(unpacked_tensor, 2)
# OUTPUT --> tensor([177, 255], dtype=torch.uint8)

Unpacking 2-Bit Weights

# Example Tensor: [10110001]
# Which was Originally: 1 0 3 2 - 01 00 11 10

# Starting point of unpacked Tensor
# [00000000 00000000 00000000 00000000]

##### First Iteration Start:
# packed int8 Tensor: [10110001]
# You want to extract 01 from [101100 01]
# No right shifts in the First Iteration
# After bit-wise OR operation between 00000000 and 10110001:
# [10110001 00000000 00000000 00000000]
# unpacked Tensor state: [10110001 00000000 00000000 00000000]
##### First Iteration End

##### Second Iteration Start:
# packed int8 Tensor: [10110001]
# You want to extract 00 from [1011 00 01]
# 2 right shifts:
# [10110001] (1 shift)-> 01011000 (2 shift)-> 00101100
# After bit-wise OR operation between 00000000 and 00101100:
# [10110001 00101100 00000000 00000000]
# unpacked Tensor state: [10110001 00101100 00000000 00000000]
##### Second Iteration End

##### Third Iteration Start:
# packed int8 Tensor: [10110001]
# You want to extract 11 from [10 11 0001]
# 4 right shifts:
# [10110001] (1 shift)-> 01011000 (2 shift)-> 00101100
# 00101100 (3 shift)-> 00010110 (4 shift)-> 00001011
# After bit-wise OR operation between 00000000 and 00001011:
# [10110001 00101100 00001011 00000000]
# unpacked Tensor state: [10110001 00101100 00001011 00000000]
##### Third Iteration End

##### Fourth Iteration Start:
# packed int8 Tensor: [10110001]
# You want to extract 10 from [10 110001]
# 6 right shifts:
# [10110001] (1 shift)-> 01011000 (2 shift)-> 00101100
# 00101100 (3 shift)-> 00010110 (4 shift)-> 00001011
# 00001011 (5 shift)-> 00000101 (6 shift)-> 00000010
# After bit-wise OR operation between 00000000 and 00000010:
# [10110001 00101100 00001011 00000010]
# unpacked Tensor state: [10110001 00101100 00001011 00000010]
##### Fourth Iteration End

# Last step: Perform masking (bit-wise AND operation)
# Mask: 00000011
# Bit-wise AND operation between
# unpacked Tensor and 00000011
# [10110001 00101100 00001011 00000010] <- unpacked tensor
# [00000011 00000011 00000011 00000011] <- Mask
# [00000001 00000000 00000011 00000010] <- Result

# Final
# unpacked Tensor state: [00000001 00000000 00000011 00000010]
def unpack_weights(uint8tensor, bits):
    num_values = uint8tensor.shape[0] * 8 // bits

    num_steps = 8 // bits

    unpacked_tensor = torch.zeros((num_values), dtype=torch.uint8)

    unpacked_idx = 0

    # 1 0 3 2 - 01 00 11 10

    # [00000000 00000000 00000000 00000000]
    # [10110001 00101100 00001011 00000010]
    # [00000001 00000000 00000011 00000010]

    # 10110001
    # 00000011

    # 00000001

    # 1: [10110001]
    # 2: [00101100]
    # 3: [00001011]

    mask = 2 ** bits - 1

    for i in range(uint8tensor.shape[0]):
        for j in range(num_steps):
            unpacked_tensor[unpacked_idx] |= uint8tensor[i] >> (bits * j)
            unpacked_idx += 1

    unpacked_tensor &= mask
    return unpacked_tensor
unpacked_tensor = torch.tensor([177, 255], dtype=torch.uint8)
unpack_weights(unpacked_tensor, 2)
# OUTPUT --> tensor([1, 0, 3, 2, 3, 3, 3, 3], dtype=torch.uint8)
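As a quick sanity check, packing followed by unpacking should give back the original tensor:

original = torch.tensor([1, 0, 3, 2, 3, 3, 3, 3], dtype=torch.uint8)
packed = pack_weights(original, 2)      # tensor([177, 255], dtype=torch.uint8)
restored = unpack_weights(packed, 2)
assert torch.equal(original, restored)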

Fine-tuning Quantized LLMs

There are two scenarios to consider: first, fine-tuning a model while it is quantized to achieve the best possible quantized model, and second, adapting a model for specific use cases and applications, such as fine-tuning a large language model (LLM) on a custom dataset. The first scenario is feasible through Quantization Aware Training (QAT), where the model is trained to be more accurate after quantization. This method differs from the Post Training Quantization techniques we previously discussed.
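To give a flavour of the QAT idea, here is a minimal, illustrative sketch of “fake quantization” with a straight-through estimator. This is a simplified toy example, not the exact recipe of any particular QAT library: during training the forward pass sees the rounding error, while gradients flow through as if no rounding had happened.

import torch

def fake_quantize(w, bits=8):
    # Symmetric per-tensor fake quantization: quantize then immediately dequantize,
    # so the forward pass experiences the quantization error but stays in float.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_dq = w_q * scale
    # Straight-through estimator: the forward value is w_dq, but the gradient
    # with respect to w is passed through unchanged.
    return w + (w_dq - w).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quantize(w).sum().backward()
print(w.grad)  # all ones, as if no rounding had happened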

For the second scenario, we can utilize Parameter-Efficient Fine-Tuning (PEFT) methods. PEFT aims to significantly reduce the number of trainable parameters while maintaining performance similar to full fine-tuning. Combining PEFT with quantization (QLoRA) leverages both techniques. With LoRA, we attach extra trainable low-rank weights to the frozen base weights. The “r” parameter, or “rank”, is typically much smaller than the input hidden-state dimension, resulting in very small optimizer states and making the training process much more accessible.

QLoRA enhances this by quantizing the frozen base weights to 4-bit precision and ensuring that the dtype of the activations produced by the quantized layer matches the dtype of the LoRA weights. This approach allows us to combine the benefits of quantization and PEFT for optimal results.
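Below is a minimal sketch of the LoRA idea in plain PyTorch. The class and argument names are hypothetical (this is not the PEFT library API): the base layer stays frozen, and only the two low-rank matrices of rank r are trained. In QLoRA the frozen base layer would be a 4-bit quantized layer, and its output would be cast to the dtype of the LoRA matrices before the two paths are added.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Hypothetical minimal LoRA wrapper around a frozen base linear layer.
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the base weights
        # Low-rank update: (in_features x r) and (r x out_features)
        self.lora_A = nn.Parameter(torch.randn(base_linear.in_features, r) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(r, base_linear.out_features))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path + scaled low-rank path; only lora_A and lora_B get gradients.
        return self.base(x) + (x @ self.lora_A @ self.lora_B) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable parameters instead of 590592

In practice you would use a library such as PEFT rather than hand-rolling this, but the sketch shows where the rank r and the frozen base weights fit in.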
