Exploring a Bit of LLM: BitNet b1.58

Anirban Bose · Published in The Deep Hub · 6 min read · Mar 26, 2024

Source: https://computer.howstuffworks.com/bytes.htm

Introduction:

In recent times, Large Language Models (LLMs) have emerged as potent tools within the realm of Natural Language Processing (NLP), showcasing impressive prowess across a broad spectrum of tasks such as language generation, translation, and sentiment analysis. These models, often honed on extensive textual corpora, have truly transformed our interaction with and understanding of natural language. Nevertheless, the growing complexity and sheer size of LLMs have ushered in a myriad of challenges that demand careful scrutiny.

At the forefront of concerns surrounding LLMs lies their ever-expanding bulk. As these models scale up to encompass larger datasets and capture increasingly subtle linguistic nuances, they grow in both parameters and layers, resulting in monumental model sizes. This growth in scale presents formidable hurdles in terms of memory utilization, computational demands, and storage requirements. Additionally, the widespread use of 32-bit or even 16-bit floating-point representations exacerbates the issue: higher-precision arithmetic is more expensive, further amplifying the computational and memory burden.

For example:

Consider “Llama-7B,” an LLM with 7 billion parameters, using 32-bit precision for each weight. This translates to (a short Python check follows the list):

  • Total Memory: Size of one weight * Number of weights
  • Total Memory: 4 bytes/weight * 7,000,000,000 weights
  • Total Memory: 28,000,000,000 bytes
  • Total Memory (GB): 28,000,000,000 bytes / 1,073,741,824 bytes/GB ≈ 26.08 GB
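As a quick sanity check, the same arithmetic in a few lines of Python (the parameter count and bytes per weight are the only inputs):

```python
# Back-of-the-envelope memory footprint of a 7B-parameter model at FP32.
params = 7_000_000_000        # Llama-7B parameter count
bytes_per_weight = 4          # 32-bit float = 4 bytes

total_bytes = params * bytes_per_weight
print(f"{total_bytes / 1024**3:.2f} GB")   # ≈ 26.08 GB
```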

1-Bit LLMs:

1-bit LLMs, or 1-bit Large Language Models, represent a groundbreaking approach to tackling the challenges posed by the immense size of traditional LLMs. In essence, these models depart from the conventional practice of storing parameters with high precision, typically utilizing 32-bit or 16-bit floating-point numbers. Instead, they adopt a revolutionary strategy where each parameter is stored using only a single bit of information.

The core issue with traditional LLMs lies in their vast parameter count, often numbering in the billions. Storing each of these parameters with high precision incurs a substantial memory footprint, making deployment on resource-constrained devices, such as mobile phones, exceptionally difficult. By employing 1-bit representations for parameters, 1-bit LLMs drastically reduce the memory requirements, enabling efficient deployment across a broader range of platforms.

BitLinear workflow and BitNet architecture:

Source: https://arxiv.org/pdf/2310.11453.pdf

Idea behind 1-bit LLM: Quantization

Artificial neural networks are composed of activation nodes, the connections between them, and a weight parameter associated with each connection. These weight parameters and activation-node computations are prime candidates for quantization. Executing a neural network on hardware involves an enormous number of multiplication and addition operations. By using lower-bit mathematical operations on quantized parameters and quantizing intermediate calculations, significant computational gains and higher performance can be achieved.

Moreover, quantized neural networks offer notable power efficiency advantages for two main reasons. Firstly, they incur reduced memory access costs by employing lower-bit quantized data, which necessitates less data movement both within and outside the chip. This reduction in memory bandwidth translates to substantial energy savings. Secondly, lower-precision mathematical operations, such as an 8-bit integer multiply as opposed to a 32-bit floating-point multiply, consume less energy and enhance compute efficiency, thereby reducing overall power consumption.

Source: https://www.qualcomm.com/news/onq/2019/03/heres-why-quantization-matters-ai
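To make this concrete, below is a minimal sketch of symmetric 8-bit weight quantization in NumPy. It is purely illustrative of the general idea (per-tensor absmax scaling) and is not the specific scheme BitNet uses:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: the largest magnitude maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original floats."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small quantization error
```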

For example:

The total memory occupied by a 1-bit model of the same size is as follows (a comparison across precisions is sketched after the list):

  • Total Memory: Size of one weight * Number of weights
  • Total Memory: 0.125 bytes/weight * 7,000,000,000 weights
  • Total Memory: 875,000,000 bytes
  • Total Memory (GB): 875,000,000 bytes / 1,073,741,824 bytes/GB ≈ 0.815 GB
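The same arithmetic, generalized across precisions (the 1.58-bit row is only indicative of where BitNet b1.58 would sit; real savings also depend on how ternary weights are packed in memory):

```python
# Weight-memory footprint of a 7B-parameter model at different precisions.
params = 7_000_000_000

def footprint_gb(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1024**3

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8),
                   ("1.58-bit", 1.58), ("1-bit", 1)]:
    print(f"{name:>8}: {footprint_gb(bits):6.2f} GB")
# FP32 ≈ 26.08 GB vs. 1-bit ≈ 0.81 GB — roughly a 32x reduction.
```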

BitNet b1.58:

The latest variant of the 1-bit LLM, in which every parameter is ternary and takes on values in {-1, 0, 1}, is called BitNet b1.58. Like the original BitNet, it replaces the conventional nn.Linear layers with BitLinear layers, but it uses 1.58-bit (ternary) weights and 8-bit activations.

Source: https://arxiv.org/pdf/2402.17764.pdf

What is Pareto improvement?

In neoclassical economic theory, a Pareto improvement is a change in resource allocation that makes at least one person better off without making anyone else worse off. Such improvements can continue until Pareto optimality is reached, where no further change can be made without leaving someone worse off. The BitNet b1.58 paper borrows the term to describe its result: inference cost (memory, latency, energy) goes down without giving up model performance relative to full-precision baselines.

Quantization function:

For the weights to lie in {-1, 0, 1}, an absmean quantization function is used, given by the following formulation:

Source: https://arxiv.org/pdf/2402.17764.pdf

In brief, the function first scales the weight matrix by its average absolute value and then rounds each scaled weight to the nearest integer in the set {-1, 0, 1}, yielding the ternary weights.
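A rough PyTorch sketch of this weight quantization, paired with a toy stand-in for a BitLinear layer, could look as follows. This is only an illustration under simplifying assumptions: the class name is made up, the activation quantization shown is a generic per-tensor absmax scheme, and training-time details (normalization, straight-through estimators) are omitted; it is not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def absmean_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Scale the weight matrix by its mean absolute value, then round and
    clip every entry to the ternary set {-1, 0, +1}."""
    gamma = w.abs().mean()                         # average absolute value
    w_q = (w / (gamma + eps)).round().clamp(-1, 1)
    return w_q, gamma

class BitLinearSketch(nn.Module):
    """Toy stand-in for a BitLinear layer: ternary weights plus 8-bit-style
    activation quantization (illustrative only)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, gamma = absmean_quantize(self.weight)
        # Per-tensor absmax quantization of activations to the int8 range.
        scale = 127.0 / x.abs().max().clamp(min=1e-5)
        x_q = (x * scale).round().clamp(-128, 127) / scale
        return F.linear(x_q, w_q) * gamma          # rescale back to weight scale

x = torch.randn(2, 16)
print(BitLinearSketch(16, 8)(x).shape)             # torch.Size([2, 8])
```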

Highlights:

  • Expansion of Ternary Parameter Scheme: Unlike traditional 1-bit LLMs, BitNet b1.58 introduces an additional value of 0 to the parameter scheme. With three possible values per weight, each parameter carries log₂(3) ≈ 1.58 bits of information, hence the name, while offering finer granularity in representing model weights.
  • Retention of Core Benefits: BitNet b1.58 retains all the core benefits of the original 1-bit BitNet. It preserves the novel computation paradigm in which matrix multiplication requires almost no multiplication operations, so it can be highly optimized (a small sketch after this list illustrates why).
  • Comparable Energy Consumption: Similar to the original 1-bit BitNet, BitNet b1.58 maintains the same energy consumption levels. This consistency underscores its efficiency and sustainability in resource utilization.
  • Enhanced Efficiency: BitNet b1.58 surpasses FP16 LLM baselines in terms of efficiency metrics such as memory consumption, throughput, and latency. This efficiency makes it a compelling choice for deployment in resource-constrained environments.
  • Stronger Modeling Capability: BitNet b1.58 introduces explicit support for feature filtering, leveraging the inclusion of 0 in model weights. This enhancement significantly strengthens its modeling capability, enabling more precise and contextually relevant language processing outcomes.
  • Performance Improvement for 1-bit LLMs: The explicit support for feature filtering in BitNet b1.58 represents a notable advancement for 1-bit LLMs. By facilitating improved performance, it unlocks new possibilities for applications across various domains.
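As a small illustration of two points above (minimal multiplications, and 0 acting as a feature filter), here is a toy NumPy sketch, not taken from the paper, showing that a matrix-vector product with ternary weights needs only additions and subtractions:

```python
import numpy as np

# With ternary weights, each output element is just the sum of inputs with
# weight +1, minus the sum of inputs with weight -1; inputs with weight 0
# are skipped entirely (the "feature filtering" effect).
def ternary_matvec(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    out = np.empty(w.shape[0], dtype=x.dtype)
    for i, row in enumerate(w):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

w = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
x = np.random.randn(8).astype(np.float32)
print(np.allclose(ternary_matvec(w, x), w @ x))  # True
```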

Comparisons:

Here, BitNet b1.58 is compared to an FP16 LLaMA LLM at various model sizes.

Source: https://arxiv.org/pdf/2402.17764.pdf

Potential directions of work:

  1. Mixture-of-Experts (MoE) LLMs are a cost-effective way to reduce computation, but they suffer from high memory consumption and inter-chip communication overhead. A 1.58-bit MoE would mitigate these challenges by shrinking the memory footprint and the overhead of transferring activations, potentially enabling deployment of the whole model on a single chip.
  2. In the LLM era, native support for long sequences is crucial. BitNet b1.58 addresses this by halving activation bits, enabling a doubled context length with the same resources, with potential further compression to 4 bits or lower for 1.58-bit LLMs, a future avenue for exploration.
  3. 1.58-bit LLMs offer significant benefits for edge and mobile devices: reduced memory and energy consumption makes deployment possible in settings that were previously unfeasible and opens up new applications. Additionally, they are well suited to the CPUs these devices rely on, further expanding what such devices can run.
  4. Emerging advancements like Groq highlight the potential of hardware tailored to LLMs, such as LPUs. Efforts can be directed toward designing novel hardware and systems specifically optimized for 1-bit LLMs, so as to capitalize on the potential of BitNet.

Overall, BitNet b1.58 represents a significant milestone in the evolution of large language models, paving the way for more efficient, accurate, and versatile language processing systems. As researchers and developers continue to innovate and explore the potential of 1-Bit versions of LLM, we can expect further advancements that push the boundaries of what’s possible in natural language understanding and generation.

References

https://arxiv.org/pdf/2310.11453.pdf

https://arxiv.org/pdf/2402.17764.pdf
