The Era of 1-bit LLMs

Kevin François
Published in neoxia · Mar 13, 2024

Introduction:

In the realm of artificial intelligence, Large Language Models (LLMs) have emerged as indispensable tools for natural language processing tasks. However, their computational demands and energy consumption have raised concerns regarding sustainability and scalability. In response, methods such as quantization reduce the memory footprint and computational load during inference, enabling deployment on resource-constrained devices. Recently, Microsoft's research team introduced 1-bit LLMs, such as BitNet b1.58, which promise to revolutionize AI by achieving comparable performance with significantly reduced resource requirements.

Insights on Quantization

Quantization involves representing weights, biases, and activations in neural networks using lower-precision data types, such as 8-bit integers, instead of conventional 32-bit floating-point representation. This reduces memory usage and computational demands during inference, enabling deployment on devices with limited resources or reducing overall consumption.

Quantization Types

  1. Float32 to Float16 Quantization: Transitioning from 32-bit floating-point to 16-bit floating-point representation facilitates a straightforward conversion process, but requires compatibility with float16 operations and hardware support.
  2. Float32 to bfloat16 Quantization: A similarly straightforward conversion, but bfloat16 offers a greater dynamic range than float16, enabling efficient representation of weights and activations.
  3. Float32 to Int8 Quantization: Going from float32 to int8 is trickier. Only 256 values can be represented in int8, while float32 covers a vastly wider range. The idea is to find the best way to project a range of float32 values into the int8 space (a minimal sketch follows Figure 1).
Figure 1: Different types of quantization
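To make the float32-to-int8 projection concrete, here is a minimal NumPy sketch of affine (asymmetric) quantization. It is only an illustration: real toolkits additionally handle calibration, per-channel scales, and symmetric schemes.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of float32 values to int8."""
    qmin, qmax = -128, 127
    # Map the observed float range onto the 256 representable int8 values.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Approximate recovery of the original float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, scale, zp)).max())  # small quantization error
```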

Post-Training Quantization (PTQ)

PTQ involves quantizing a model after training completion, typically from 32-bit floating-point to 8-bit integers. Such an approach offers benefits like reduced memory consumption and faster inference but may lead to accuracy loss.

Figure 2: Post-Training Quantization
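As a quick illustration of the idea, PyTorch's dynamic quantization applies PTQ to a trained model in one call, converting the Linear weights to int8 with no retraining. This is only a sketch with a toy model standing in for a trained network.

```python
import torch
import torch.nn as nn

# A toy float32 model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-training dynamic quantization: Linear weights become int8,
# with no retraining and no calibration data required.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller weights
```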

Methods:

Multiple methods can be used to apply PTQ quantization, among which:

GPTQ: Post-Training Quantization for GPT Models: The fundamental concept behind GPTQ is to compress all model weights into a 4-bit quantization format by minimizing the mean squared error introduced for each weight, with the quantization performed layer by layer on GPU.
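For reference, here is a hedged sketch of how 4-bit GPTQ quantization can be applied through the Hugging Face transformers integration. It assumes the auto-gptq backend and a GPU are available, and the model name is only an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # example model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization, calibrated on a small text dataset.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are quantized layer by layer on GPU while the model loads.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
```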

GGUF: GPT-Generated Unified Format: enables LLMs to run efficiently on CPU while offloading select layers to the GPU for improved speed. Unlike GPU-focused methods like GPTQ, GGUF accommodates users lacking GPU hardware. Despite potential differences in inference speed, GGUF remains valuable, particularly with the rise of efficient models like Mistral 7B. Integration is straightforward with the transformers package.
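As one hedged illustration (using the llama-cpp-python bindings rather than transformers, and with a purely hypothetical file name), a GGUF model can be run on CPU while offloading only part of the network to the GPU:

```python
from llama_cpp import Llama

# n_gpu_layers offloads only some layers to the GPU;
# the remaining layers run on CPU, which is the typical GGUF setup.
llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_gpu_layers=20)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```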

Quantization-Aware Training (QAT):

Quantization-Aware Training (QAT) introduces ‘fake’ quantization modules during training, enabling models to adapt to low-precision weights and mitigate quantization errors. Because the loss is computed on the quantized forward pass, fine-tuning teaches the model to compensate for these errors, enhancing accuracy post-quantization.

Figure 3: Quantization-Aware Training
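Below is a minimal sketch of the ‘fake’ quantization idea using a straight-through estimator: the forward pass sees rounded int8-style weights, while gradients flow through unchanged. Production QAT would normally rely on framework tooling (e.g., torch.ao.quantization) rather than this hand-rolled module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(nn.Module):
    """'Fake' int8 quantization: rounds values in the forward pass but lets
    gradients pass through unchanged (straight-through estimator)."""
    def forward(self, x):
        scale = x.abs().max() / 127 + 1e-8
        x_q = torch.clamp(torch.round(x / scale), -128, 127) * scale
        # Forward value is x_q; backward gradient is the identity.
        return x + (x_q - x).detach()

class QATLinear(nn.Module):
    """Linear layer whose weights are fake-quantized during training."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.fq = FakeQuant()
    def forward(self, x):
        # Training sees the quantization error, so the loss can adapt to it.
        return F.linear(x, self.fq(self.linear.weight), self.linear.bias)
```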

Methods

  1. Naive Quantization: Uniformly quantizes all operators, resulting in a uniform accuracy drop. Easy to implement but ignores varying layer sensitivities.
  2. Hybrid Quantization: Balances precision by quantizing some operators to INT8 while keeping others at higher precision. Requires knowledge of sensitivity but offers improved accuracy and latency.
  3. Selective Quantization: Quantizes specific operators to INT8 with diverse calibration methods, accommodating varying sensitivities. Maximizes accuracy and minimizes latency by tailoring quantization parameters.

Incorporating quantization during training optimizes weights, eliminating the need for distinct calibration. Naive quantization is straightforward but sacrifices accuracy, while hybrid and selective quantization effectively balance precision and performance.
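As a hedged sketch of the hybrid/selective idea, PyTorch's eager-mode QAT API lets you quantize the model by default while exempting sensitive layers by clearing their qconfig. Which layers to keep in higher precision depends on a sensitivity analysis not shown here, and the toy model is only illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.train()

# Quantize everything by default...
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
# ...but keep a sensitive layer in full precision by disabling its qconfig.
model[2].qconfig = None

prepared = torch.ao.quantization.prepare_qat(model)
# ... fine-tune `prepared` as usual, then convert to the quantized model:
prepared.eval()
quantized = torch.ao.quantization.convert(prepared)
```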

To sum up

When facing the scarcity of training data and the need for fast quantization, Post-Training Quantization (PTQ) is often chosen, despite lower precision compared with Quantization-Aware Training (QAT). QAT consistently achieves higher accuracy with equivalent bit precision, making it the preferred option. However, QAT requires longer training, spanning hundreds of epochs, and incurs higher computational retraining costs. Yet, this extended training period is often justified for long-term deployment models, where hardware and energy efficiency benefits outweigh retraining expenses.

In addition to these considerations, quantization offers several advantages, including trimmed memory consumption, reduced energy consumption, turbocharged inference, and the ability to embed models in resource-limited devices. However, navigating quantization challenges such as overflow and underflow, selecting between symmetric and affine quantization, and deciding on per-tensor versus per-channel quantization adds complexity to the process. Careful consideration of these factors is essential for effectively implementing quantization techniques.

Binary Quantization:

Once the basic concepts of quantization are understood, let's explore a newer approach: binary quantization. In this method, quantization is pushed to the extreme, with weights reduced to 1 bit. The original BitNet study introduces a 1-bit Transformer architecture tailored for large language models, aiming for efficient scaling in memory and computation. BitNet uses low-precision binary weights and quantized activations while keeping full precision for optimizer states and gradients during training. The breakthrough concept of these 1-bit LLMs is to use ternary parameters {-1, 0, 1}. This approach drastically decreases energy consumption and memory usage while preserving model performance.

BitNet b1.58: A Deep Dive:

Figure 4: (a) The computation flow of BitLinear. (b) The architecture of BitNet, consisting of stacks of attention and FFN blocks, where matrix multiplication is implemented as BitLinear. [7]

BitNet b1.58 is based on the BitNet architecture, which follows the same layout as the Transformer, stacking blocks of self-attention and feed-forward networks. However, compared with the vanilla Transformer, BitNet replaces the conventional PyTorch nn.Linear matrix multiplication with BitLinear, which employs binarized (i.e., 1-bit) model weights. The addition in BitNet b1.58 is a third value, 0, in the quantization, which permits feature filtering within the overall architecture.

The method aims to restrict the weights of the model to only three values: -1, 0, or +1. The basic concept decomposes into two parts:

  1. Scaling: The weight matrix W is first divided by its average absolute value, denoted gamma.
  2. Rounding: After scaling, each value in the weight matrix is rounded to the nearest integer among {-1, 0, +1}.

The scaling factor gamma is the average absolute value of the weight matrix W (with entries denoted Wij), and each weight is divided by gamma before rounding. This ensures the weights are scaled appropriately, helping preserve the relative importance of the different weights in the model. In summary, the method scales the weights by the reciprocal of their average absolute value and then rounds each scaled value to the nearest integer among {-1, 0, +1}.
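Here is a minimal PyTorch sketch of this absmean scaling-and-rounding step, following the formulation in [8]. It covers only the weight ternarization; the actual BitLinear layer also quantizes activations and relies on custom kernels.

```python
import torch

def absmean_ternary_quant(W: torch.Tensor, eps: float = 1e-5):
    """Scale W by its average absolute value, then round to {-1, 0, +1}."""
    gamma = W.abs().mean()                            # average absolute value of W
    W_scaled = W / (gamma + eps)                      # scaling step
    W_ternary = torch.clamp(W_scaled.round(), -1, 1)  # rounding step
    return W_ternary, gamma

W = torch.randn(4, 4)
W_q, gamma = absmean_ternary_quant(W)
print(W_q)          # entries are only -1, 0, or +1
print(W_q * gamma)  # rescaled ternary approximation of W
```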

Figure 5: 1-bit LLMs (e.g., BitNet b1.58) approximation [8]

Performance Evaluation: To assess the efficacy of BitNet b1.58, comprehensive performance evaluations were conducted across various benchmarks and compared with FP16 LLaMA models of various sizes.

Figure 6: Decoding latency (Left) and memory consumption (Right) of BitNet b1.58 varying the model size. [8]

Results indicate that BitNet b1.58 achieves comparable performance to its FP16 counterparts while consuming significantly fewer computational resources. Moreover, BitNet b1.58 demonstrates superior scalability, outperforming larger FP16 LLMs in terms of efficiency and latency.

Energy consumption:

Figure 7: Energy consumption of BitNet b1.58 compared to LLaMA LLM at 7nm process nodes. On the left are the components of arithmetic operation energy. On the right is the end-to-end energy cost across different model sizes. [8]

Large language models have demonstrated remarkable performance, but their increasing size has posed challenges for deployment and raised concerns about their environmental and economic impact due to high energy consumption. The proposed solution permits a significant reduction in energy consumption, as shown in the study conducted by Microsoft (Figure 7).

Conclusion:

BitNet b1.58 exemplifies the transformative potential of 1-bit LLMs in revolutionizing AI applications. By striking a balance between efficiency and performance, BitNet b1.58 offers a compelling solution to the scalability and sustainability challenges facing contemporary AI systems.

Generative AI is seen by many as a major revolution across fields, yet its energy footprint is not to be neglected and will without a doubt be among the biggest challenges to address next. As AI research continues to evolve, the journey towards smarter, more resource-efficient AI is propelled forward, opening new frontiers for innovation and discovery. With BitNet b1.58 leading the charge, the era of 1-bit LLMs promises to redefine the landscape of artificial intelligence, ushering in a new era of efficiency and scalability.

Resources:

  1. https://huggingface.co/docs/optimum/concept_guides/quantization
  2. https://medium.com/@abonia/llm-series-quantization-overview-1b37c560946b
  3. https://medium.com/towards-data-science/which-quantization-method-is-right-for-you-gptq-vs-gguf-vs-awq-c4cd9d77d5be
  4. https://celikmustafa89.medium.com/tutorial-full-integer-quantization-and-hybrid-quantization-35aaad2d6e3
  5. https://iprathore71.medium.com/diving-deeper-into-quantization-realm-9c73e3172a3c
  6. https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34
  7. Wang, Hongyu, Ma, Shuming, Dong, Li, et al. BitNet: Scaling 1-bit Transformers for Large Language Models. arXiv preprint arXiv:2310.11453, 2023.
  8. Ma, Shuming, Wang, Hongyu, Ma, Lingxiao, et al. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv preprint arXiv:2402.17764, 2024.
