How to shrink Large Language Models. Quantization and 1-bit LLMs explained

Maddie Lupu
6 min read · Mar 14, 2024


generated by the author with DALL-E

The Memory Challenge

It’s in their name: Large Language Models are LARGE, containing billions of parameters. Think of these parameters as the model’s memory: the greater their number, the better the model’s potential performance. But this sheer size brings a big challenge: running out of memory while trying to train them. Indeed, LLMs demand tons of memory and computational power, and that’s just the tip of the iceberg.

Let’s do some quick math to develop an intuition of the problem:

  • Let’s acknowledge that parameters take up space
  • A single parameter typically takes up 4 bytes when stored as a 32-bit float
  • A billion parameters? That’s 4GB of GPU RAM just to store the model’s weights at full precision. This is already a lot of memory! 😱
  • Now consider the Adam optimizer, gradients, and everything else needed to actually train the beast. This can easily lead to 20 extra bytes of memory per model parameter as shown in the picture below.
Source: deeplearning.ai

In fact, you’ll need roughly 6x the amount of GPU RAM compared to the model weights alone. So, for a 1 billion parameter model at full precision, get ready to spin up 24GB of GPU RAM. 😵‍💫
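As a sanity check, here is a quick back-of-the-envelope version of that math in Python. The 20 extra bytes per parameter is the rough figure quoted above for optimizer states, gradients and temporaries, not an exact accounting.

```python
# Rough memory estimate for a 1-billion-parameter model at full precision.
params = 1_000_000_000           # 1B parameters
bytes_per_weight = 4             # one FP32 weight = 4 bytes
extra_bytes_per_param = 20       # Adam states, gradients, activations, etc. (ballpark)

weights_gb = params * bytes_per_weight / 1e9
training_gb = params * (bytes_per_weight + extra_bytes_per_param) / 1e9

print(f"Weights only:   ~{weights_gb:.0f} GB")    # ~4 GB
print(f"Training total: ~{training_gb:.0f} GB")   # ~24 GB, roughly 6x the weights
```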

Reality Check

To put things in perspective, the average computer has between 2 and 8GB GPU RAM. This makes training even moderately sized LLMs almost impossible on consumer hardware. And even powerful data centre equipment will struggle if you want to train everything on a single processor.

Quantization

What options do you have to reduce the memory required for training?

✨ Quantization ✨

One way to overcome the memory hurdle is to reduce the precision used to store model weights. Instead of the standard 32-bit floating-point format (FP32), you can use:

  • 16-bit floating-point (FP16): cuts memory usage in half, often with minimal impact on accuracy.
  • 8-bit integers (INT8): provides even greater memory savings. However, it can lead to significant accuracy loss.
  • BFLOAT16: a 16-bit format with a similar range to FP32, offering a balance between memory efficiency and precision.

This process is called quantization, and it works by statistically projecting the original 32-bit floating-point numbers into a lower-precision space. This is achieved using scaling factors calculated based on the range and distribution of the original 32-bit values.
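To make that concrete, here is a minimal NumPy sketch of symmetric, scale-based INT8 quantization. The single scale derived from the maximum absolute value is an assumption for illustration; real schemes typically add per-channel scales, zero-points and calibration data.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric quantization: map FP32 values onto the INT8 grid with one scale."""
    scale = np.max(np.abs(x)) / 127.0                        # largest value maps to ±127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate recovery of the original FP32 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(weights)
print(weights)                       # original FP32 weights
print(dequantize_int8(q, scale))     # close, but not identical: precision was lost
```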

📚 Understanding Floating-Point Representation

Computers always have their own way with numbers, don’t they? They don’t store numbers exactly the way we write them, but as patterns of 0s and 1s, with a single bit reserved for the sign: 0 for positive numbers, 1 for negative ones.

In floating-point formats, a number is broken down into:

  • Sign Bit: tells us if the number is positive or negative.
  • Exponent Bits: like scientific notation, these control the overall scale of the number.
  • Mantissa Bits: these hold the digits of the number, determining its precision.
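If you want to see those three pieces for yourself, this small snippet (using Python’s standard struct module) unpacks the 32 bits of Pi into sign, exponent and mantissa:

```python
import math
import struct

# Reinterpret Pi's FP32 encoding as a raw 32-bit integer, then slice the bits.
raw = struct.unpack(">I", struct.pack(">f", math.pi))[0]
bits = f"{raw:032b}"

sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
print(sign)       # 0        -> positive
print(exponent)   # 10000000 -> 128, i.e. a scale of 2^(128-127) = 2^1
print(mantissa)   # 10010010000111111011011 -> 23 bits of precision
```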

To understand how quantization works, let’s take an example and shrink Pi from 32-bit to 16-bit floating-point format.

Squeeze — FP32 to FP16

Noticeably, FP32 has more bits than FP16, hence it is more precise, as you can see in the picture below. To quantize Pi, we essentially squeeze its numerical representation into a smaller space. Imagine Pi shrinking slightly to fit into a smaller shape. This causes it to lose some of its detail and round from 3.1415… to something like 3.1406…

We do lose a tiny bit of detail, but quantization frees up a lot of memory. This precision vs. memory trade-off is often acceptable when training LLMs, allowing you to work with these huge models on more limited hardware.

Shrinking Pi from FP32 to FP16. Source: deeplearning.ai
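You can reproduce the squeeze in a couple of lines with NumPy, here casting Pi down to FP16 and checking how much of it survives:

```python
import numpy as np

pi32 = np.float32(np.pi)
pi16 = np.float16(pi32)                # the FP32 -> FP16 squeeze

print(f"FP32 Pi: {pi32:.7f}")          # 3.1415927
print(f"FP16 Pi: {float(pi16):.7f}")   # 3.1406250 -- some detail is gone
print(pi32.nbytes, "->", pi16.nbytes)  # 4 bytes -> 2 bytes
```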

The BFLOAT16 Squeeze — the better Squeeze 😎

BFLOAT16, developed at Google Brain, has become a popular choice in deep learning, powering the pre-training of LLMs like FLAN-T5. Think of it as a hybrid between FP16 and FP32. BFLOAT16 uses only 16 bits, like FP16, but keeps the full 8-bit exponent of FP32, as shown in the picture below. What’s the effect?

This maintains the wide dynamic range needed for complex models, while cutting down memory use and speeding up calculations. BFLOAT16 is not ideal for integer-heavy calculations, but it shines in deep learning, which relies on floating-point operations.

Shrinking Pi from FP32 to BFLOAT16. Source: deeplearning.ai
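NumPy has no native bfloat16 type, so the sketch below assumes PyTorch is installed. The interesting part is the comparison of representable ranges: BFLOAT16 keeps FP32’s 8-bit exponent, while FP16 does not.

```python
import math
import torch

pi32 = torch.tensor(math.pi, dtype=torch.float32)
pib16 = pi32.to(torch.bfloat16)

print(f"FP32 Pi:     {pi32.item():.7f}")      # 3.1415927
print(f"BFLOAT16 Pi: {pib16.item():.7f}")     # 3.1406250 -- fewer mantissa bits

# Dynamic range: BFLOAT16 goes up to ~3.4e38 like FP32, FP16 caps out at 65504.
print(torch.finfo(torch.bfloat16).max)
print(torch.finfo(torch.float16).max)
```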

INT8 — A squeeze to an OOMPA LOOMPA parameter?

To draw the full picture, let’s see what happens when you quantize Pi from 32-bit into an even lower-precision 8-bit integer space. Imagine squeezing it from its spacious 32-bit space into a narrow 8-bit world. With one bit for the sign and just seven bits left for the magnitude, INT8 gives us a tiny world of whole numbers from -128 to 127. Pi is going to be separated from its sibling decimals and become a plain 3. We have now cut the memory down from 4 bytes to 1 byte, but obviously Pi is now tragically imprecise.

Shrinking Pi from FP32 to INT8. Source: deeplearning.ai
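The naive version of that squeeze is a direct cast, which simply throws away the fractional part (the scale-and-round approach shown earlier is what real quantization schemes use):

```python
import numpy as np

pi32 = np.float32(np.pi)
pi8 = np.int8(pi32)                    # naive cast: the decimals are simply dropped

print(pi32, "->", pi8)                 # 3.1415927 -> 3
print(pi32.nbytes, "->", pi8.nbytes)   # 4 bytes -> 1 byte
```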

1-bit LLMs

A new scaling law 💥

In the recent research paper The Era of 1-bit LLMs, the authors introduce a promising approach for reducing the cost of LLMs even further while maintaining their performance.

What’s the catch? Vanilla LLMs are trained with 16-bit floating-point values (FP16 or BF16), and the heavy lifting happens in matrix multiplication. So, the vast computation cost comes from floating-point multiplication and addition operations. In contrast, 1-bit LLMs involve only integer addition, which saves a ton of memory and money.
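To see why this matters, here is a toy sketch in plain Python (not the paper’s implementation): once weights can only be -1, 0, or 1, a dot product needs no multiplications at all, just additions and subtractions of the activations.

```python
def ternary_dot(weights, activations):
    """Dot product with weights restricted to {-1, 0, 1}: no multiplications needed."""
    total = 0.0
    for w, a in zip(weights, activations):
        if w == 1:
            total += a       # add the activation
        elif w == -1:
            total -= a       # subtract the activation
        # w == 0: skip the activation entirely
    return total

print(ternary_dot([1, 0, -1, 1], [0.5, 2.0, 1.5, -0.25]))   # 0.5 - 1.5 + (-0.25) = -1.25
```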

The new girl — BitNet b1.58

The paper introduces BitNet b1.58 which, as its name suggests, takes 1.58 bits per parameter. Mind you, that’s almost 2 bits, not 1! But let’s go on. The authors added an additional value of 0 to the original 1-bit BitNet, resulting in 1.58 bits (log₂(3) ≈ 1.58). A parameter, or weight, of the model is represented in a ternary system, which essentially means it can take one of these 3 values: {-1, 0, 1}.

Multiplications are replaced with additions for BitNet. Source: The Era of 1-bit LLMs paper
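The paper describes an “absmean” quantization function for getting the weights into this ternary set: scale each weight matrix by its mean absolute value, then round and clip to {-1, 0, 1}. A minimal NumPy sketch of that idea (not the authors’ code) looks like this:

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight matrix to {-1, 0, 1} using the absmean scale from the paper."""
    scale = np.mean(np.abs(w)) + eps                   # gamma: mean absolute weight
    w_q = np.clip(np.round(w / scale), -1, 1)          # ternary weights
    return w_q.astype(np.int8), scale

w = np.random.randn(4, 4).astype(np.float32)
w_q, scale = absmean_ternary(w)
print(w_q)               # entries are only -1, 0 or 1
print(np.unique(w_q))
```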

The results of this new variant look quite promising: the authors show that it matches a full-precision (FP16 or BF16) Transformer LLM in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption.

Performance vs cost, BitNet b1.58 vs FP16/BF16. Source: The Era of 1-bit LLMs paper

To summarise, The Era of 1-bit LLMs paper defines a new scaling law and a new way to train LLMs that are both highly performant and cost-effective. Furthermore, it may drive the design of specific hardware optimized for 1-bit LLMs.

Quick recap

We have looked into the concept of quantization and why it matters: it allows LLMs to be stored at lower precision while keeping high performance, making them more cost-efficient. The Era of 1-bit LLMs takes this to the extreme, drastically reducing costs and resource requirements. This approach facilitates easier LLM deployment and calls for the development of hardware tailored to the unique requirements of 1-bit LLMs.

Give yourself a pat on the back for reading!

Enjoyed the read? Show some love with a few claps below 🙌
