Model Quantization with 🤗 Hugging Face Transformers and Bitsandbytes Integration
Introduction:
This blog post explores the integration of Hugging Face’s Transformers library with the Bitsandbytes library, which simplifies the process of model quantization, making it more accessible and user-friendly.
What is Model Quantization?
Quantization is a technique used to reduce the precision of numerical values in a model. Instead of using high-precision data types, such as 32-bit floating-point numbers, quantization represents values using lower-precision data types, such as 8-bit integers. This process significantly reduces memory usage and can speed up model execution while maintaining acceptable accuracy.
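To make the arithmetic concrete, here is a minimal sketch of symmetric int8 quantization in plain NumPy (the function names are illustrative, not from any library): each float is mapped to an integer in [-127, 127] using a single per-tensor scale, and dequantization recovers an approximation of the original values.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric quantization: one scale maps the largest magnitude to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the int8 representation.
    return q.astype(np.float32) * scale

weights = np.array([0.12, -0.5, 0.33, 0.91], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)        # int8 values, 1 byte each instead of 4
print(restored) # close to the original floats, within one quantization step
```

Each value now occupies one byte instead of four, at the cost of a small rounding error bounded by half the quantization step. Libraries like Bitsandbytes use more sophisticated schemes (per-block scales, outlier handling), but the core idea is the same.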
How Hugging Face and Bitsandbytes Work Together
Hugging Face’s Transformers library is a go-to choice for working with pre-trained language models. To make the process of model quantization more accessible, Hugging Face has seamlessly integrated with the Bitsandbytes library. This integration simplifies the quantization process and empowers users to achieve efficient models with just a few lines of code.
Install latest accelerate from source:
pip install git+https://github.com/huggingface/accelerate.git
Install latest transformers from source and bitsandbytes:
pip install git+https://github.com/huggingface/transformers.git
pip install bitsandbytes
Loading a Model in 4-bit Quantization
One of the key features of this integration is the ability to load models in 4-bit quantization. You can do this by passing the load_in_4bit=True argument to the .from_pretrained method, which reduces memory usage by approximately fourfold.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
Loading a Model in 8-bit Quantization
If you prefer higher precision at the cost of less compression, you can load a model in 8-bit quantization instead by passing the load_in_8bit=True argument to .from_pretrained. This reduces the memory footprint by approximately half.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
You can check the model's memory footprint in bytes with the get_memory_footprint method:
print(model.get_memory_footprint())
Other Use Cases:
The Hugging Face and Bitsandbytes integration goes beyond basic quantization techniques. Here are some use cases you can explore:
Changing the Compute Data Type
You can modify the data type used during computation by setting bnb_4bit_compute_dtype to a different value, such as torch.bfloat16. This can result in speed improvements in specific scenarios. Here's an example:
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
Using NF4 Data Type
The NF4 data type is designed for weights initialized from a normal distribution. You can use it by specifying bnb_4bit_quant_type="nf4":
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
Nested Quantization for Memory Efficiency
The integration also supports nested quantization (double quantization) for even greater memory efficiency without sacrificing performance. This technique is especially beneficial when fine-tuning large models:
from transformers import BitsAndBytesConfig
double_quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config)
Loading a Quantized Model from the Hub
A quantized model pushed to the Hub can be loaded directly with the from_pretrained method. You can confirm that the saved weights are quantized by checking for the quantization_config attribute in the model configuration:
model = AutoModelForCausalLM.from_pretrained("model_name", device_map="auto")
In this case, you don't need to specify the load_in_8bit=True argument, but you must have both the Bitsandbytes and Accelerate libraries installed.
Exploring Advanced Techniques and Configurations
There are additional techniques and configurations to consider:
Offloading Between CPU and GPU
One advanced use case involves loading a model and distributing its weights between the GPU and CPU, achieved by setting llm_int8_enable_fp32_cpu_offload=True. This is useful when a model is too large to fit entirely on the GPU: the offloaded modules are kept on the CPU in fp32.
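As a sketch of how this might look (the module names below follow BLOOM's architecture and are assumptions for illustration; check your model's named modules before writing a device map):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Enable fp32 CPU offload alongside 8-bit quantization on the GPU.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# A custom device_map places some modules on GPU 0 and offloads the rest
# to the CPU, where they stay in fp32.
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "transformer.h": 0,
    "transformer.ln_f": 0,
    "lm_head": "cpu",
}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)
```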
Adjusting the Outlier Threshold
Experiment with the llm_int8_threshold argument to change the threshold used to detect outlier activations (the default is 6.0). This parameter impacts inference speed and can be fine-tuned to suit your specific use case.
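For example, a configuration with a raised threshold might look like this (10.0 is an arbitrary illustrative value, not a recommendation):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Raise the outlier threshold from the default of 6.0; fewer activations
# are treated as outliers and kept in higher precision.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=10.0,
)
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    quantization_config=quantization_config,
)
```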
Skipping the Conversion of Some Modules
In certain situations, you may want to keep specific modules in their original precision rather than converting them to 8-bit. You can do this using the llm_int8_skip_modules argument.
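For instance, you might keep a model's output head unquantized (the module name "lm_head" here is an assumption; use the actual module names from your model):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Modules listed here are left in their original precision.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],
)
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    quantization_config=quantization_config,
)
```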
Fine-Tuning a Model Loaded in 8-bit
With the support of adapters in the Hugging Face ecosystem, you can fine-tune models loaded in 8-bit quantization, enabling the fine-tuning of large models with ease.
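A rough sketch of this workflow, assuming the PEFT library provides the adapter support mentioned above, might look like the following (the LoRA hyperparameters are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7", device_map="auto", load_in_8bit=True
)

# Prepare the quantized model for training (e.g. casts layer norms to fp32).
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; the 8-bit base weights stay frozen.
lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights are updated during training, which keeps the memory cost of fine-tuning far below that of full-precision training.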
Conclusion
Quantization is a powerful technique for optimizing machine learning models. The integration of Hugging Face’s Transformers library with the Bitsandbytes library makes this technique accessible to a broader audience. Whether you’re looking to reduce memory usage, speed up model execution, or share quantized models with the community, this integration provides the tools and flexibility you need to do so. It’s a significant step towards making efficient machine learning models available to all.
Reference Links:
https://huggingface.co/docs/optimum/concept_guides/quantization
https://huggingface.co/docs/transformers/main_classes/quantization