
Quantizing Large Language Models on Your Laptop

Layer-Wise Low-Bit Weight-Only Quantization

Intel(R) Neural Compressor
Oct 23, 2023


Heng Guo, Yiyang Cai, Wenhua Cheng, and Haihao Shen, Intel Corporation

Large language models (LLMs) are trained on extensive datasets and may encompass billions of parameters. Their sophisticated network architectures, coupled with their substantial parameter counts, allow them to effectively comprehend the intricacies of natural language. After the initial training phase, an LLM can be fine-tuned for a diverse array of downstream applications in natural language processing and generation, such as conversational chatbots (e.g., ChatGPT), machine translation, text classification, fraud detection, and sentiment analysis.

The difficulty of working with LLMs stems from the AI and memory wall: computational capability improves by a factor of roughly 3.1x every two years, while memory bandwidth improves by only 1.4x over the same period. Furthermore, training LLMs requires distributed systems, which are subject to network bandwidth limitations. When these models are eventually deployed, they are often placed on systems with constrained compute and memory. Therefore, reducing LLM size through post-training quantization is crucial to enabling low-latency inference.

Weight-Only Quantization

Compared to conventional quantization schemes like W8A8, weight-only quantization is usually a better tradeoff between computational performance and model accuracy because, as we will see below, memory bandwidth is the main bottleneck in LLM deployment.

Broadly speaking, invoking a model involves two steps. The first is moving the model from memory to cache, where memory bandwidth (B) and parameter count (P) are the key factors; for FP32 weights (four bytes per parameter), the theoretical time cost is P*4/B. The second step is computation, in which the device's compute capacity (C), measured in floating-point operations per second (FLOPS), and the forward-pass FLOPs (F) are critical. The theoretical computational cost is F/C.
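
To make this concrete, here is a small back-of-the-envelope sketch comparing the two costs for a 13B-parameter FP32 model. The hardware numbers are illustrative assumptions, not measurements:

    # Back-of-the-envelope estimate of memory-bound vs. compute-bound time per token.
    # All hardware numbers below are illustrative assumptions, not measurements.
    P = 13e9             # parameters
    bytes_per_param = 4  # FP32
    B = 50e9             # memory bandwidth in bytes/s (~50 GB/s, typical laptop)
    C = 1e12             # compute capacity in FLOPS (~1 TFLOPS)
    F = 2 * P            # forward FLOPs per generated token, roughly 2*P for a decoder LLM

    t_memory = P * bytes_per_param / B   # time to stream the weights from memory
    t_compute = F / C                    # time to do the math

    print(f"memory-bound:  {t_memory:.3f} s per token")
    print(f"compute-bound: {t_compute:.3f} s per token")
    # With these numbers the memory term dominates by a wide margin, which is why
    # shrinking the weights (weight-only quantization) directly reduces latency.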

The most famous application of LLMs is text generation, which predicts the next token/word in a sequence based on the input/context. F is roughly proportional to P, but the C/B ratio of modern devices can be as high as 100x, which makes memory bandwidth the bottleneck in this scenario. Because activation quantization can reduce model accuracy, weight-only quantization is preferred for text generation.

Round-to-nearest (RTN) is the most straightforward way to quantize weights using scale maps. However, when the number of bits is small, the MSE loss is larger than expected, so a group size is introduced to reduce the number of elements that share a scale and thereby improve accuracy. RTN requires no calibration dataset and is a very fast quantization method. It converts the weights into a uniformly distributed integer data type, though some algorithms (e.g., QLoRA) propose a non-uniform NF4 data type and argue for its theoretical optimality.
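
As a rough illustration of what group-wise RTN does, here is a minimal sketch (not the Intel Neural Compressor implementation; symmetric 4-bit quantization with one scale per group is assumed):

    import torch

    def rtn_quantize(weight: torch.Tensor, bits: int = 4, group_size: int = 128):
        """Toy symmetric round-to-nearest quantization with per-group scales.

        weight: 2-D tensor (out_features, in_features); in_features must be
        divisible by group_size in this simplified sketch.
        """
        qmax = 2 ** (bits - 1) - 1                            # e.g. 7 for signed INT4
        out_f, in_f = weight.shape
        w = weight.reshape(out_f, in_f // group_size, group_size)
        scale = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-8)
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        dequant = (q * scale).reshape(out_f, in_f)            # what the model uses at runtime
        return q.reshape(out_f, in_f), scale, dequant

    w = torch.randn(4096, 4096)
    q, scale, w_hat = rtn_quantize(w)
    print("max abs error:", (w - w_hat).abs().max().item())

Smaller groups mean more scales to store but lower quantization error per group; group_size=128 is a common middle ground.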

GPTQ is a new one-shot weight quantization method based on approximate second-order information that is both accurate and efficient. The weights of each column are updated based on the fixed-scale pseudo-quantization error and the inverse of the Hessian matrix calculated from the activations. The updated columns sharing the same scale may generate a new max/min value, so the scale is saved for restoration.
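
The core column-by-column update can be sketched as follows. This is a much-simplified illustration of the idea described above, not the actual GPTQ or Intel Neural Compressor code; blocking, the Cholesky decomposition of the dampened inverse Hessian, and group-wise scale handling are omitted, and quantize_col is a hypothetical placeholder:

    import torch

    def gptq_style_update(W: torch.Tensor, Hinv: torch.Tensor, quantize_col):
        """Quantize W column by column, propagating the weighted error forward.

        W:    (rows, cols) weight matrix (a copy is modified).
        Hinv: (cols, cols) upper-triangular factor derived from the inverse Hessian
              of the calibration activations (assumed given here).
        quantize_col: function mapping a column to its fixed-scale quantized version.
        """
        W = W.clone()
        Q = torch.empty_like(W)
        for j in range(W.shape[1]):
            q = quantize_col(W[:, j])                  # pseudo-quantize one column
            Q[:, j] = q
            err = (W[:, j] - q) / Hinv[j, j]           # error weighted by second-order info
            W[:, j:] -= torch.outer(err, Hinv[j, j:])  # update the not-yet-quantized columns
        return Q

In the full algorithm, columns are processed in blocks for efficiency and Hinv comes from a Cholesky decomposition of the dampened inverse Hessian.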

Intel Neural Compressor integrates these popular weight-only quantization algorithms.
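
For example, the algorithm, bit width, and group size can be selected through the quantization configuration. The snippet below is a sketch based on the Intel Neural Compressor 2.x weight-only API; exact option names may differ between releases:

    from neural_compressor import PostTrainingQuantConfig

    # Request 4-bit, group-size-128 weight-only quantization with the GPTQ algorithm.
    conf = PostTrainingQuantConfig(
        approach="weight_only",
        op_type_dict={
            ".*": {  # apply to all supported op types
                "weight": {
                    "bits": 4,
                    "group_size": 128,
                    "scheme": "sym",
                    "algorithm": "GPTQ",  # or "RTN"
                },
            },
        },
    )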

Layer-Wise Quantization

Layer-wise quantization (LWQ) can greatly reduce the memory footprint of LLMs, usually by 80–90%, which means that users can quantize LLMs even on a single CPU, GPU, or other memory-constrained device (Figure 1). We will use a laptop in the demonstration below.

Figure 1. The process of layer-wise quantization. Gray indicates empty parameters and blue represents parameters that need to be quantized. Each rectangle inside the model represents one layer.

Different from other methods, we load an empty shell model that contains only the structure information and no initial parameters. Compared to loading the whole model, the memory required to read this shell model is therefore very low, usually only a few hundred MB. For most LLMs, the weights are stored in one or several large binary files. To maximize memory savings, we rewrote the load function so that a single specified weight tensor can be loaded from these binary files or checkpoints. During quantization, for each layer in the model, in forward order, we load only the required parameters from disk into the empty layer, register pre-forward and forward hooks, and then use PTQ or weight-only quantization to quantize the weights. Afterwards, the results are stored to disk and the layer is reset to empty again, releasing its memory. This process is repeated until all model layers have been quantized, as sketched below.
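
Conceptually, the loop looks roughly like the following. This is an illustrative sketch only; the helpers load_layer_weights, quantize_weights, and empty_layer are hypothetical placeholders, not Intel Neural Compressor APIs:

    import torch

    def layer_wise_quantize(shell_model, checkpoint_path, output_dir):
        # Illustrative layer-wise quantization loop; helpers are hypothetical.
        for name, layer in shell_model.named_modules():
            if not isinstance(layer, torch.nn.Linear):  # only weight-bearing layers
                continue
            # 1. Read just this layer's tensor from the checkpoint on disk.
            weight = load_layer_weights(checkpoint_path, name)
            layer.weight = torch.nn.Parameter(weight)
            # 2. Quantize the now-populated layer (RTN, GPTQ, ...).
            q_weight, scale = quantize_weights(layer.weight, bits=4, group_size=128)
            # 3. Persist the result and return the layer to its empty state,
            #    so peak memory stays close to the size of a single layer.
            torch.save({"weight": q_weight, "scale": scale}, f"{output_dir}/{name}.pt")
            empty_layer(layer)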

Experiments on a Laptop

Although quantization compresses model size, memory consumption during quantization itself is high. For traditional quantization methods, working on a memory-constrained device like a laptop is practically impossible. Our LWQ method, however, makes it possible. We will demonstrate on a laptop with an 11th Gen Intel Core i7-1185G7 (3.0 GHz) and 16.0 GB of memory. The following code shows how to use Intel Neural Compressor to leverage LWQ:

from copy import deepcopy

from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_shell
from neural_compressor.utils.pytorch import load

# Load an empty "shell" model: structure only, no weights in memory yet.
# model_name_or_path points to the FP32 model checkpoint.
fp32_model = load_shell(model_name_or_path, torchscript=True)

# Configure weight-only quantization with layer-wise loading enabled.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={
        "layer_wise_quant": True,
        "rtn_args": {"enable_full_range": True},
    },
)

# Quantize layer by layer; eval_func is a dummy placeholder because we
# evaluate accuracy separately.
q_model = quantization.fit(
    fp32_model,
    conf,
    calib_dataloader=eval_dataloader,
    eval_func=lambda x: 0.1,
)

# Save the quantized model, then reload it layer-wise for inference.
output_dir = "./saved_model"
q_model.save(output_dir)
q_model = load(output_dir, deepcopy(fp32_model), weight_only=True, layer_wise=True)

We applied normal GPTQ and GPTQ with LWQ to Llama-13B on our laptop. The Llama-13B parameter file is about 38 GB, much larger than the available memory, so the process died while attempting to load the model (Figure 2). LWQ reads parameters layer by layer, and only when they are actually needed during quantization, so the maximum memory footprint is approximately the size of the largest layer (Figure 3).

Figure 2. Running normal GPTQ on the laptop runs out of memory
Figure 3. Running GPTQ with LWQ on the laptop completes without issue

We evaluated the accuracy (Table 1) and memory usage (Table 2 and Figure 4) of RTN with LWQ on a few models.

Table 1. Evaluating accuracy
Table 2. Comparing memory usage
Figure 4. RTN GPT-J-6B with and without LWQ memory consumption

LWQ optimizes the parameter-reading process but has no impact on the parameters or the quantization algorithms themselves, so the results with and without LWQ should be identical. Our experiments confirm that LWQ greatly reduces memory consumption, by approximately 90% for LLMs, without loss of accuracy.

Summary

We have released the source code for Intel Neural Compressor. We encourage you to try it out on your laptop and explore other Intel AI tools and optimizations as part of your AI workflows. Please add a star to the Intel Neural Compressor repository if you find it useful. You are also welcome to create pull requests or submit issues to the repository. Feel free to contact us if you have any questions.
