Understanding Compression of Large Language Models (LLMs)

Sasirekha Cota
7 min read · Nov 22, 2023


Large Language Models (LLMs), part of the artificial intelligence (AI) toolkit, have proven to be remarkably accurate, versatile and effective at interacting with humans in "natural language". LLMs today can handle question answering, translate between languages, create new content (poems!), perform sentiment analysis and much more. In other words, LLMs have revolutionized Natural Language Processing (NLP), both understanding and generation, enabling AI to mimic humans.

According to the scaling laws, the larger the model, the better it performs. Today LLMs typically have over 100 billion parameters, and the latest ones have even touched a trillion parameters. But applying these pre-trained LLMs in real life, for inference, is impeded by their computational requirements (multiple GPUs and many GBs of memory), which rules out their use on mobile devices (where the most impact is expected). Add to this the prohibitive cost and the environmental concerns (energy and water usage, as well as the resulting carbon emissions).

Model compression addresses these issues. It aims to make LLMs usable on mobile and other resource-constrained devices at reasonable cost, and the faster inference it enables makes real-life deployments feasible.

Model Compression Techniques

While there are quite a few model compression techniques (including the ones that predate LLMs), the key ones are:

1. Pruning — Pruning is a powerful technique that reduces the size or complexity of the model by removing unnecessary or redundant components.

2. Quantization — Quantization is the process that reduces the precision of the model’s parameters to significantly reduce the size of the model.

3. Knowledge Distillation — Distillation is the process of training a smaller model (a student) to mimic the behavior of a larger model (a teacher).

It has also been found that judiciously combining these techniques leads to smaller models that generalize better.

Pruning

Pruning is the process of identifying unnecessary parameters (those that have little impact or are redundant) and removing them. Sparsity, one form of pruning, replaces near-zero values with zero so that the matrix can be stored in a condensed form (only the non-zero values and their indices), which takes up less space than a full, dense matrix.
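
As a rough sketch of the idea, the snippet below (plain PyTorch; the 0.5 threshold and the matrix size are arbitrary, chosen only for illustration) zeroes out near-zero weights and stores the result in a sparse COO format that keeps only the non-zero values and their indices:

```python
import torch

# Toy weight matrix standing in for one layer of an LLM.
weights = torch.randn(1024, 1024)

# Zero out "near-zero" entries. The threshold is arbitrary here;
# real pipelines choose it per layer (or per tensor) based on impact.
threshold = 0.5
pruned = torch.where(weights.abs() < threshold,
                     torch.zeros_like(weights), weights)

# Condensed representation: only non-zero values and their indices.
sparse = pruned.to_sparse()

density = sparse.values().numel() / pruned.numel()
print(f"fraction of weights kept: {density:.2%}")
```

The space saving only materializes when the density is low enough that storing values plus indices beats storing the full dense matrix.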

There are two types of Pruning:

  1. Structured Pruning — Structured pruning reduces the model size by removing entire structural components, such as neurons, channels or layers. This leads to significant reductions in model size while keeping the overall LLM structure intact. Compared to unstructured pruning, structured pruning offers more control and scales better to larger models.
  2. Unstructured Pruning — Unstructured pruning is a simpler technique that targets individual weights by applying a threshold and zeroing out parameters below it. It does not consider the overall LLM structure, and the resulting irregular sparse model requires specialized techniques (and hardware support) to realize actual gains. A minimal sketch of both types follows this list.
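
Below is a minimal PyTorch sketch of the two types using torch.nn.utils.prune on a single linear layer; the layer size and the 50%/30% pruning amounts are arbitrary and for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)   # stand-in for one LLM sub-layer

# Unstructured: zero out the 50% of individual weights with the
# smallest L1 magnitude (irregular sparsity pattern).
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Structured: remove 30% of entire output neurons (rows of the weight
# matrix), ranked by their L2 norm, keeping the layout regular.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

# Fold the pruning masks into the weight tensor permanently.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean()
print(f"fraction of zeroed weights: {sparsity:.2%}")
```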

Unstructured pruning often requires additional fine-tuning (retraining) to regain accuracy, and for massive models with billions of parameters this can become inefficient and time-consuming. Techniques such as iterative fine-tuning during pruning (to minimize the retraining steps), combining parameter-efficient fine-tuning (PEFT) with pruning, and SparseGPT are used to address this issue.

SparseGPT uses a one-shot pruning strategy to eliminate the need for retraining. It frames pruning as a sparse regression task and uses an approximate sparse regression solver (i.e., it does not try to find the exact solution, only one that is good enough). This makes SparseGPT very efficient.

SparseGPT achieves significant unstructured sparsity on large GPT-family models, up to 60% on OPT-175B and BLOOM-176B (higher than the sparsity achieved with structured pruning), with minimal increase in perplexity (a measure of a model's ability to predict the next word in a sequence).

Quantization

Quantization is a technique that reduces the precision of the model's weights (and sometimes also the activations) to significantly reduce the size of the model, leading to lower storage and bandwidth requirements and lower computational complexity. The typical 32-bit floating-point representation of the parameters is converted to lower precision, such as 16-bit floating point, 8-bit integers or floats, or even 4-bit integers.
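
To make the idea concrete, here is a minimal sketch of symmetric, per-tensor int8 quantization in plain PyTorch (not the scheme of any particular library): the float32 weights are mapped to 8-bit integers plus a single scale factor, and can be approximately reconstructed by multiplying back.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = w.abs().max() / 127.0          # largest magnitude maps to 127
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale               # approximate reconstruction

w = torch.randn(4096, 4096)                # toy float32 weight matrix
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32; the price is a small error.
error = (w - dequantize(q, scale)).abs().mean()
print(f"mean absolute reconstruction error: {error:.5f}")
```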

Although reducing precision results in some loss of accuracy, it has been shown that, with careful calibration, substantial model compression is achievable with minimal accuracy degradation.

To understand the types of quantization (those that involve retraining and those that do not), it is important to remember that the only values stored in a pre-trained model are the weights and biases of the neurons. The activation values (intermediate outputs) computed during the forward pass are not persisted.

There are two types of Quantization techniques:

1. Post-training Quantization (PTQ) — PTQ quantizes the parameters after the LLM training phase is complete. The objective is to reduce the model size without altering the LLM architecture and without retraining. PTQ is quick and easy but introduces a certain degree of precision loss. PTQ can be of several types (ref: Post-training quantization | TensorFlow Lite):

a. Full Integer Quantization — this method enables compatibility with integer-only hardware by quantizing all the model parameters to integers. Because the range (min, max) of every tensor must be estimated, you need to provide a representative dataset for calibration. Unlike the weights and biases, which are constant, the model inputs and activations (outputs of intermediate layers) are variable, so a few inference cycles are run first to capture their ranges. It reduces the model size by 4x and provides a 3x+ speedup.

b. Dynamic Range Quantization — statically quantizes the weights to integers at conversion time and dynamically quantizes the activations based on their range during inference. This method does not need a representative dataset for calibration. It reduces the model size by 4x and provides a 2x-3x speedup.

c. Float16 Quantization — you reduce the model size by quantizing the weights to float16. While it causes minimal loss of accuracy compared to integer quantization, it only reduces the model size by half and has higher latency. When run on a CPU, the values are dequantized back to float32, so this approach is more suitable for GPUs.

2. Quantization-aware Training (QAT) — QAT is a more complex type of quantization that integrates the quantization objective into the model training process. In QAT, the LLM adapts to low-precision representations during training, which leads to better accuracy compared to PTQ. In addition to quantizing weights and activations, QAT can also quantize the key-value (KV) cache, which stores the attention keys and values of previously processed tokens so the LLM can efficiently reuse them when processing new input.
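
The core trick behind QAT is "fake quantization": during training, the weights (and activations) pass through a quantize-dequantize step so the model learns to tolerate the rounding error, while gradients flow through unchanged via a straight-through estimator. Below is a minimal sketch of that idea in plain PyTorch, not a production QAT recipe (frameworks such as torch.ao.quantization provide complete ones):

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize x, letting gradients pass straight through."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    x_q = torch.round(x / scale).clamp(-qmax, qmax) * scale
    # Straight-through estimator: forward uses x_q, backward sees identity.
    return x + (x_q - x).detach()

# In a QAT training step, the weights are fake-quantized before use,
# so the training loss reflects low-precision behaviour.
w = torch.randn(256, 256, requires_grad=True)
x = torch.randn(32, 256)
out = x @ fake_quantize(w).t()
out.sum().backward()            # gradients still reach w despite rounding
print(w.grad.abs().mean())
```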

Quantization can reduce the precision of the weights only, or of both weights and activations (also referred to as "full quantization"). Quantizing the weights alone is straightforward, as they are fixed after training. But the activations then remain at higher precision, and since GPUs cannot directly multiply tensors of different precisions, the weights have to be dequantized back to the higher precision at compute time.

While the obvious answer seems to be "quantize the activations too", the challenge is that activation vectors typically contain outliers that widen their range, making them difficult to represent at lower precision. One option is dynamic quantization, where quantization parameters are computed on the fly during inference and certain activations are kept at a higher precision than others. Dynamic quantization leads to better accuracy but is computationally expensive. Mixed-precision quantization, which exploits model redundancy by assigning lower bit-widths to the less important layers, tries to balance efficiency and accuracy.
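
As one concrete example of dynamic quantization, PyTorch provides a one-line API that quantizes the weights of selected layer types to int8 ahead of time and quantizes activations on the fly at inference. The model below is a toy stand-in; applying the same call to a real network follows the same pattern:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer feed-forward block.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Weights of nn.Linear layers become int8; activations are quantized
# dynamically, per batch, at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    y = quantized(torch.randn(1, 1024))
print(y.shape)
```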

Knowledge Distillation

Knowledge distillation (KD) is a technique aimed at transferring knowledge from a large, complex model (the "teacher") to a smaller, simpler model (the "student") with a smaller footprint. The student is trained to mirror the teacher by using an additional loss function that measures the discrepancy between their outputs (in addition to the original loss function computed against the ground-truth labels). A current trend is ensemble KD, where multiple teachers are used to train the student.
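
Concretely, the classic distillation loss from Hinton et al. mixes a soft term (the KL divergence between the teacher's and student's temperature-softened output distributions) with the usual hard-label cross-entropy. A minimal sketch in PyTorch follows; the temperature and mixing weight are arbitrary choices here:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-target KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)       # rescale gradients (Hinton et al.)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a batch of 8 examples over a 100-class output space.
# For ensemble KD, teacher_logits could be an average over several teachers.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)     # produced by the frozen teacher
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```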

KD was introduced in the paper "Distilling the Knowledge in a Neural Network" by Hinton et al. in 2015. Since then, KD has been successfully applied in NLP, computer vision and speech recognition. One example is DistilBERT, a smaller, faster and lighter version of BERT created using knowledge distillation, which retained 93–97% of BERT's NLP capabilities while being 40% smaller and 60% faster.

Ensemble knowledge distillation has proven to be quite effective and is being used for various tasks, such as:

1. compressing large image classification models, such as VGGNet and ResNet, into smaller models that can be deployed on mobile devices;

2. compressing large language models, such as BERT and GPT-2, into smaller models that can be used for a variety of NLP tasks;

3. compressing large speech recognition models into smaller models that can be deployed on embedded devices.

There are two types of Knowledge Distillation techniques:

1. Black-box KD — the student model has access only to the predictions made by the teacher model. This approach is challenging because the inner workings of the teacher model are not accessible, so the learning is indirect and depends entirely on the teacher's outputs (predictions or logits).

2. White-box KD — in addition to the predictions, the student model has access to the inner workings of the teacher model (its parameters, activations and intermediate representations), leading to more effective knowledge transfer. Compared to black-box KD, white-box KD delivers better performance but is more complex and difficult to implement.
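
As an illustration of the white-box case, the student can also be trained to match the teacher's intermediate hidden states, not just its logits. The hedged sketch below adds an MSE term over one hidden layer, with a learned projection to bridge a (hypothetical) difference in hidden sizes; all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_hidden, student_hidden = 1024, 512   # hypothetical sizes

# Learned projection so the student's smaller hidden states can be
# compared against the teacher's (a common white-box KD trick).
projection = nn.Linear(student_hidden, teacher_hidden)

def white_box_loss(s_logits, t_logits, s_hid, t_hid, beta=1.0):
    logit_term = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )
    hidden_term = F.mse_loss(projection(s_hid), t_hid)
    return logit_term + beta * hidden_term

# Toy tensors standing in for one training batch.
s_logits = torch.randn(8, 100, requires_grad=True)
t_logits = torch.randn(8, 100)
s_hid = torch.randn(8, student_hidden, requires_grad=True)
t_hid = torch.randn(8, teacher_hidden)
loss = white_box_loss(s_logits, t_logits, s_hid, t_hid)
loss.backward()
```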

It is important to note that many state-of-the-art LLMs have restrictive licenses that prohibit using their outputs to train other LLMs. Open-source LLMs, limited-use licenses and synthetic data are the alternatives for sourcing teacher models.

Conclusion

Large language models (LLMs) are becoming larger and more complex, making real-time applications and deployment on smaller devices difficult. Model compression techniques help minimize LLM size and computing requirements, with some impact on accuracy and generalization capability. Distillation moves knowledge from a large LLM to a smaller one, quantization decreases parameter precision, and pruning removes unnecessary or redundant parameters. Judiciously combining these techniques has been found to yield smaller models that generalize better.
