
GPU Poors, Model Compression is Your Best Ally

A comprehensive guide to squeezing Large AI Models onto your humble hardware.

Alvaro Fernandez
5 min read · Mar 7, 2024


Last week, we witnessed a major breakthrough in AI, The Era of 1-bit LLMs, which pushes the limits of LLM quantization. Inspired by BitNet, a 1-bit model architecture, researchers showed that a model whose weights take just three values ({-1, 0, 1}) can achieve compelling performance on multiple tasks. That makes this an excellent moment to summarize the main directions in model compression techniques.

LLMs have become a mainstay in our daily lives thanks to their remarkable performance and ability to streamline our work. However, they come with a huge downside: the need for resources. To put things in perspective, an LLM with 200 billion parameters requires around 400 GB of storage at half precision (16 bits). That means having around 8 A100 GPUs at inference time just to get a summary of your favourite text, which you can't be bothered to read. That is why I propose the best solution: model compression. By squeezing those bloated models down to a fraction of their original size, you can get similar performance without needing a supercomputer in your basement. Faster inference, less memory, and a lower carbon footprint.

Model Compression Outline | Made with Excalidraw

Pruning

Model pruning emerged in the late 1980s. Janowsky (1989) described weight clipping as a way of reducing the dimensionality of a model based on the magnitude of its weights. Later, Optimal Brain Surgeon proposed an iterative, but computationally impractical, method for reducing the complexity of neural networks. In general, pruning can be categorized into unstructured, semi-structured, and structured approaches.

Unstructured pruning consists of removing individual weights or nodes from a neural network without considering the overall structure of the model. We simply apply a threshold to identify and eliminate redundant parameters. Magnitude pruning, which removes the weights with the smallest absolute values, has traditionally been the go-to approach and shows decent performance. However, it has been shown that outlier weights play a huge role at inference time, especially in large language models (LLMs), which is where the two state-of-the-art methods for unstructured pruning, WANDA and SparseGPT, improve on a purely magnitude-based criterion. Unstructured pruning also compromises the model's structure, which can make efficient computation and storage challenging: the resulting irregular sparsity patterns often require specialized hardware or software techniques, such as sparse matrix operations or compressed data formats, to fully realize the benefits of compression during inference.
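To make magnitude pruning concrete, here is a minimal sketch in PyTorch. It only illustrates the thresholding idea, not the WANDA or SparseGPT algorithms, and the layer size and sparsity level are arbitrary choices:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the weights with the smallest absolute values.

    `sparsity` is the fraction of weights to remove (e.g. 0.5 drops half).
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # Threshold = k-th smallest absolute value; everything at or below it is dropped.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

# Example: prune a random linear layer's weight matrix to ~50% sparsity.
layer = torch.nn.Linear(512, 512)
pruned = magnitude_prune(layer.weight.data, sparsity=0.5)
print(f"sparsity: {(pruned == 0).float().mean():.2%}")
```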

Hence, NVIDIA researchers came up with semi-structured pruning patterns. This approach removes N out of every M contiguous weights, in patterns like 2:4 and 4:8. Although the regular sparsity pattern can be exploited by NVIDIA's sparse tensor cores to improve speed, the model's accuracy does not always hold up. On the fully structured side, algorithms such as LLM Surgeon or LLM-Pruner have shown the ability to remove entire structural components, such as rows, columns, or attention heads, with minimal degradation in accuracy.
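Below is a toy sketch of what a 2:4 pattern looks like on a dense weight matrix. In practice, the speedup comes from NVIDIA's sparse tensor core kernels and compressed storage formats, not from a dense mask like this one:

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every contiguous group of 4."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "number of columns must be divisible by the group size"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the two largest |w| in each group; everything else gets zeroed.
    topk = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, topk, 1.0)
    return (groups * mask).reshape(rows, cols)

weight = torch.randn(8, 16)
sparse = prune_2_of_4(weight)
print((sparse != 0).sum(dim=-1))  # exactly 8 non-zeros in each row of 16
```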

Knowledge Distillation

Knowledge distillation is a valuable technique that aims to transfer the knowledge and capabilities of a large, complex model (referred to as the teacher model) to a smaller and more efficient model (known as the student model). This process involves transforming the comprehensive knowledge representation of the teacher model into a streamlined and effective format that can be learned by the student model.

In knowledge distillation for large language models (LLMs), there are two main approaches: white-box knowledge distillation and black-box knowledge distillation. White-box distillation allows the student model access to the teacher LLM’s internal parameters and representations for deeper knowledge transfer. A classic example is DistilBERT, which is a 40% smaller student model of BERT that retains 97% of BERT’s language comprehension capabilities while being 60% faster during inference.

On the other hand, black-box knowledge distillation relies solely on the predictions made by the teacher LLM. This approach has shown promising results in techniques like In-Context Learning distillation and Chain-of-Thought Prompting. One potential downside of knowledge distillation is that you still need access to the large teacher model to train the student, which may not be feasible due to resource constraints.

A key concept in knowledge distillation is KL divergence, which measures the difference between the probability distributions of the teacher and student model predictions. The loss function during student training aims to minimize this divergence, aligning the student’s predictions with the teacher’s. This mathematical concept plays a crucial role in the knowledge transfer process.
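As an illustration, here is the classic Hinton-style distillation loss that blends the KL divergence term with ordinary cross-entropy. The temperature and alpha values are arbitrary choices for the sketch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend cross-entropy on the labels with a KL term that pulls the
    student's softened distribution toward the teacher's."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# Toy example: a batch of 4 samples over 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```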

Quantization

Model quantization has emerged as a widely adopted technique in the domain of model compression, aimed at alleviating the substantial storage and computational demands of large language models (LLMs). Unlike traditional floating-point number representations, quantization converts the model parameters to integers or other discrete forms. This transformation significantly reduces the storage requirements and computational complexity of LLMs, while carefully designed quantization techniques can achieve substantial model compression with only minimal accuracy degradation.

Quantization approaches for LLMs can be broadly categorized into two main strategies: quantization-aware training (QAT) and post-training quantization (PTQ). QAT integrates the quantization objective into the model’s training or fine-tuning process, enabling the LLM to adapt to low-precision representations during training. This adaptation aims to preserve higher performance even after quantization. Techniques like PEQA or QLoRA fall under the QAT category, with LLM-QAT achieving remarkable results by distilling large LLaMA models down to just 4-bit quantized weights and key-value caches.
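The core trick most QAT methods build on is "fake quantization": rounding in the forward pass while letting gradients flow straight through. This is a minimal sketch of that trick, not the exact recipe of PEQA, QLoRA, or LLM-QAT:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate int8 rounding in the forward pass, but pass gradients
    straight through so the model can adapt to quantization during training."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 127.0
        return torch.round(w / scale).clamp(-127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through estimator

w = torch.randn(64, 64, requires_grad=True)
loss = FakeQuant.apply(w).pow(2).mean()
loss.backward()  # gradients flow as if no rounding had happened
```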

On the other hand, PTQ involves quantizing the parameters of an LLM after its training phase is complete. The primary objective of PTQ is to reduce the storage and computational complexity of the LLM without requiring architectural modifications or extensive retraining. While PTQ introduces some precision loss due to quantization, it offers a straightforward way to enhance the efficiency of an LLM without significant alterations or training efforts. Approaches like LLM.int8() and GPTQ fall under the PTQ category, exploring weight-only quantization, mixed-precision techniques, and layer-wise quantization strategies.
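Roughly speaking, PTQ methods start from a round-to-nearest baseline like the absmax int8 scheme sketched below and then add refinements, such as outlier handling in LLM.int8() or layer-wise error compensation in GPTQ. The matrix size here is arbitrary:

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Round-to-nearest absmax quantization, with one scale per output row."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)
error = (dequantize(q, scale) - w).abs().mean()
print(f"stored in int8 (4x smaller than fp32), mean abs error: {error:.5f}")
```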

Recent research efforts have focused on pushing the limits of quantization for LLMs, aiming to achieve higher compression rates while minimizing accuracy degradation. With the sudden appearance of 1.58-bit quantization, we have unlocked a potential opportunity to build new hardware that can run inference without full-precision matrix multiplications, since multiplying by weights in {-1, 0, 1} reduces to additions and subtractions.
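As a rough sketch of the idea, ternary ("1.58-bit") quantization maps every weight to {-1, 0, 1}. The absmean scaling below follows the spirit of the BitNet b1.58 recipe, though the actual training procedure in the paper is more involved:

```python
import torch

def ternary_quantize(weight: torch.Tensor, eps: float = 1e-5):
    """Map every weight to {-1, 0, 1} using an absmean scale (sketch only)."""
    scale = weight.abs().mean() + eps
    q = torch.clamp(torch.round(weight / scale), -1, 1)
    return q, scale

w = torch.randn(256, 256)
q, scale = ternary_quantize(w)
print(torch.unique(q))  # only the values -1., 0., 1. remain
# Multiplying activations by q now needs only additions and subtractions.
```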

Low-Rank Factorization

The main idea consists of finding a factorization of the weight matrix W into two matrices U and V with lower dimensionality. In this way, we can approximate the original weights with matrices that have fewer parameters, thus reducing computational costs.
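A quick way to see the parameter savings is a truncated SVD of a weight matrix. The matrix size and rank below are arbitrary, and trained weights usually compress far better than the random matrix used here:

```python
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Approximate W (m x n) as U_r @ V_r with U_r: m x r, V_r: r x n."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]  # fold singular values into U
    V_r = Vh[:rank, :]
    return U_r, V_r

W = torch.randn(1024, 1024)
U_r, V_r = low_rank_factorize(W, rank=64)
print(f"parameters: {W.numel()} -> {U_r.numel() + V_r.numel()}")  # 1,048,576 -> 131,072
print(f"relative error: {(W - U_r @ V_r).norm() / W.norm():.3f}")
```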

This idea has been applied to fine-tuning LLMs efficiently (PEFT). For instance, LoRA freezes the original weights and learns a low-rank update for them, tuning these smaller matrices on the desired task and producing an adapter that can be merged back into the model. Alternatively, TensorGPT stores large embeddings in a low-rank tensor format and allows the use of GPTs on edge devices.
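To illustrate, here is a simplified LoRA-style adapter wrapped around a frozen linear layer. The rank, scaling, and layer size are arbitrary, and production implementations (e.g., the peft library) add dropout, weight merging, and more:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # 12,288 vs 590,592 in the base layer
```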

Conclusion

The balance between Large Language Model performance and model size has long been studied, and although compressing a model might harm its performance, it can be exactly what makes the model usable in more constrained settings such as robotics. And when it comes to a model designed for writing legal contracts, does it truly add value for it to know about physics, at least beyond a certain point?
