What are 1-bit LLMs?

The Era of 1-bit LLMs with BitNet b1.58

Mehul Gupta
Data Science in your pocket

https://arxiv.org/abs/2402.17764

The Generative AI world is racing ahead, and the newest addition to this fast-evolving space is 1-bit LLMs. It may be hard to believe, but this idea could change a lot and help eliminate some of the biggest challenges associated with LLMs, especially their huge size.

In a typical scenario (not always), the weights of any Machine Learning model, be it an LLM or Logistic Regression, are stored as 32-bit or 16-bit floating-point numbers.

This is the root cause of why we can’t run models like GPT or other bigger models on local systems or in production: these models have a very large number of weights, and storing every weight at high precision leads to a huge overall size.

So assume we have an LLM, “MehulGPT”, which has 7B parameters (similar to Mistral-7B or Llama-7B) and uses 32-bit precision (4 bytes per weight). This model will occupy (a quick Python check follows the arithmetic):

  • Total Memory = Size of one weight * Number of weights
    Total Memory = 4 bytes * 7,000,000,000
    Total Memory = 28,000,000,000 bytes
  • Converting this to gigabytes (GB), we get:
    Total Memory = 28,000,000,000 bytes / 1024³ bytes per GB
    Total Memory ≈ 26.08 GB
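
If you want to verify the arithmetic yourself, here is a tiny Python sketch (“MehulGPT” and the 7B figure are just the hypothetical example from above):

    num_weights = 7_000_000_000        # 7B parameters
    bytes_per_weight = 4               # 32-bit float = 4 bytes

    total_bytes = num_weights * bytes_per_weight
    total_gb = total_bytes / 1024**3   # bytes -> GB (dividing by 1024³)

    print(f"{total_bytes:,} bytes ≈ {total_gb:.2f} GB")   # ≈ 26.08 GB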

This is huge, and it effectively rules out many devices, including mobile phones, since they don’t have the storage or hardware capacity to run such models.

So how do we enable LLMs on smaller devices and mobile phones?

1-bit LLMs

In a 1-bit LLM, only 1 bit (i.e. 0 or 1) is used to store each weight parameter, compared to 32/16 bits in traditional LLMs. This reduces the overall size by a big percentage, enabling even smaller devices to run LLMs. Let’s assume a 1-bit variant of “MehulGPT”. This time the memory occupied is (again verified with a short snippet below):

  • Total Memory = Size of one weight * Number of weights
    Total Memory = 0.125 bytes * 7,000,000,000
    Total Memory = 875,000,000 bytes
  • Converting this to gigabytes (GB), we get:
    Total Memory = 875,000,000 bytes / 1024³ bytes per GB
    Total Memory ≈ 0.815 GB

1 bit = 0.125 Bytes
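
The same check for the hypothetical 1-bit variant (1 bit = 1/8 of a byte):

    num_weights = 7_000_000_000
    bytes_per_weight = 1 / 8                     # 1 bit = 0.125 bytes

    total_bytes = num_weights * bytes_per_weight
    total_gb = total_bytes / 1024**3

    print(f"{total_bytes:,.0f} bytes ≈ {total_gb:.3f} GB")   # ≈ 0.815 GB, roughly 32x smaller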

Hence a lot of computational and storage resources are saved.

Is this similar to quantization?

For those who don’t know quantization: it is a method of reducing model size by decreasing the precision of the weights, say from 32 bits to 8 bits, thereby shrinking the size by a factor of 4. The fewer bits used, the smaller the model, but performance takes a hit as well.

1-bit LLMs are similar to the idea of quantization, but with a difference. In the case of quantization, we decrease the precision (so if a weight value was 2.34567890656373…, it may get reduced to something like 2.3456).
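
To make that concrete, here is a minimal sketch of symmetric 8-bit quantization (the values and the single per-tensor scale are illustrative, not how any particular library implements it):

    import numpy as np

    weights = np.array([2.34567890656373, -0.873, 1.5, -3.1], dtype=np.float32)

    scale = np.max(np.abs(weights)) / 127                   # one scale for the whole tensor
    quantized = np.round(weights / scale).astype(np.int8)   # stored as 8-bit integers
    dequantized = quantized.astype(np.float32) * scale      # approximate originals

    print(quantized)     # small integers in [-127, 127]
    print(dequantized)   # close to, but not exactly, the original values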

In a 1-bit LLM, every weight is represented by a binary value (0 or 1) and nothing else, giving an even smaller model. Some major architectural changes are made so that performance doesn’t take a hit compared to traditional LLMs.

BitNet b1.58

The first-of-its-kind 1-bit LLM, BitNet b1.58, actually uses 1.58 bits per weight (and is therefore not an exact 1-bit LLM), where each weight can take 3 possible values (-1, 0, 1). The 1.58 comes from log₂(3) ≈ 1.58, the number of bits needed to encode three states.

For the 1.58-bit scheme:

  • Weights take only the values -1, 0, and 1.
  • Because the values are just -1, 0, and 1, no multiplication is required (see the sketch below).
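
Here is a rough sketch (not the paper’s actual kernel) of why a dot product with ternary weights needs only additions and subtractions:

    def ternary_dot(activations, weights):
        """Dot product where every weight is -1, 0, or +1."""
        total = 0.0
        for a, w in zip(activations, weights):
            if w == 1:
                total += a      # +1: just add the activation
            elif w == -1:
                total -= a      # -1: just subtract it
            # 0: the term is skipped entirely
        return total

    print(ternary_dot([0.5, -1.2, 3.0, 0.7], [1, 0, -1, 1]))   # ≈ -1.8 (0.5 - 3.0 + 0.7)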

As claimed in the paper:

BitNet b1.58 matches 16-bit floating-point LLM baselines in perplexity and end-task performance.

It provides faster processing speeds and uses less GPU memory compared to traditional models.

The model minimizes multiplication operations for matrix multiplication, enhancing optimization and efficiency.

It includes a quantization function for system-level optimization and integrates components like RMSNorm and SwiGLU, similar to LLaMA.
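
Purely as a hedged illustration of the “quantization function” part: the paper describes an absmean scheme that scales weights by their mean absolute value, then rounds and clips them to {-1, 0, 1}. The sketch below is my own reading of that description (the function name is made up), not the released implementation:

    import numpy as np

    def absmean_ternary_quantize(W, eps=1e-5):
        """Scale by mean |W|, then round and clip to {-1, 0, +1}."""
        gamma = np.mean(np.abs(W)) + eps                 # per-tensor scale
        W_ternary = np.clip(np.round(W / gamma), -1, 1)
        return W_ternary, gamma

    W = np.random.randn(4, 4).astype(np.float32)
    W_q, gamma = absmean_ternary_quantize(W)
    print(W_q)   # entries are only -1.0, 0.0, or 1.0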

Note: I’m avoiding the jargon mentioned above for the moment, as explaining it would require another post.

The model hasn’t been made public yet, and hence hasn’t been tested by the wider community. But it looks quite promising, and if the claims hold, we are in for a treat !!
