Compression Techniques for LLMs
A survey
Aug 30, 2023
I wrote several articles in The Kaitchup about LLM quantization, but quantization is not the only technique that can reduce model size. The main families are the following, each illustrated with a toy code sketch after the list:
- Quantization: converting the model weights to a lower precision.
- Pruning: removing redundant parameters, with little to no effect on model performance.
- Low-rank factorization: approximating a weight matrix by decomposing it into two or more smaller matrices with significantly lower dimensions.
- Knowledge distillation: training a smaller student model to reproduce the behavior of a larger teacher model. For instance, fine-tuning an LLM on ChatGPT outputs is a form of knowledge distillation.
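To make these concrete, here is a minimal sketch of each technique in PyTorch. Starting with quantization: the snippet below implements simple absmax 8-bit quantization of a weight tensor. It only illustrates the core idea, not how production methods like GPTQ work, and the helper names are my own.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Absmax quantization: map float weights to int8 in [-127, 127].
    scale = w.abs().max() / 127
    q = torch.round(w / scale).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original float weights.
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)               # a toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())      # the quantization error stays small
```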
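Pruning can be demonstrated with PyTorch's built-in pruning utilities. This toy example zeroes out the 50% of weights with the smallest magnitudes; the layer size and the pruning amount are arbitrary, and real LLM pruning methods (structured, one-shot, etc.) are more involved.

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(64, 64)
# Unstructured magnitude pruning: zero the 50% smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean()
print(f"sparsity: {sparsity:.0%}")  # roughly half the weights are now zero
```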
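Low-rank factorization is easy to show with a truncated SVD: keep only the top singular values and replace one big matrix with two thin ones. The matrix size and the rank of 64 below are arbitrary choices for illustration.

```python
import torch

def low_rank_factorize(w: torch.Tensor, rank: int):
    # Truncated SVD: keep only the top-`rank` singular components.
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape: (out_dim, rank)
    B = Vh[:rank, :]             # shape: (rank, in_dim)
    return A, B                  # w is approximated by A @ B

w = torch.randn(512, 512)
A, B = low_rank_factorize(w, rank=64)
# Storage drops from 512*512 to 2*512*64 parameters.
print((w - A @ B).norm() / w.norm())  # relative approximation error
```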
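Finally, a minimal knowledge distillation step: the student is trained to match the teacher's softened output distribution with a KL-divergence loss. The toy linear models, temperature, and learning rate here are placeholders, not a training recipe.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for a large teacher LLM and a small student model.
teacher = torch.nn.Linear(128, 1000)
student = torch.nn.Linear(128, 1000)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0                               # softmax temperature

x = torch.randn(8, 128)               # a dummy batch of inputs
with torch.no_grad():
    teacher_logits = teacher(x)       # the teacher is frozen

student_logits = student(x)
# KL divergence between the softened teacher and student distributions.
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss.backward()
optimizer.step()
```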
If you haven’t read my articles on applying quantization to LLMs such as Llama 2 for fine-tuning and inference, I recommend the following ones: