Compression Techniques for LLMs
A survey
Aug 30, 2023
I wrote several articles in The Kaitchup about LLM quantization, but quantization is not the only technique that can reduce model size. The main families are the following, each illustrated with a toy code sketch after the list:
- Quantization: converting the model weights to a lower precision.
- Pruning: removing redundant parameters, with little to no effect on model performance.
- Low-rank factorization: approximating a weight matrix by decomposing it into two or more smaller matrices with significantly lower dimensions.
- Knowledge distillation: training a smaller student model to reproduce the behavior of a larger teacher model. For instance, fine-tuning an LLM on ChatGPT outputs is a form of knowledge distillation.
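To make these concrete, here is a minimal sketch of each technique in PyTorch. Starting with quantization: the snippet below implements simple absmax 8-bit quantization of a weight tensor. It only illustrates the core idea, not how production methods like GPTQ work, and the helper names are my own.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Absmax quantization: map float weights to int8 in [-127, 127].
    scale = w.abs().max() / 127
    q = torch.round(w / scale).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original float weights.
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)               # a toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())      # the quantization error stays small
```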
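Pruning can be demonstrated with PyTorch's built-in pruning utilities. This toy example zeroes out the 50% of weights with the smallest magnitudes; the layer size and the pruning amount are arbitrary, and real LLM pruning methods (structured, one-shot, etc.) are more involved.

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(64, 64)
# Unstructured magnitude pruning: zero the 50% smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean()
print(f"sparsity: {sparsity:.0%}")  # roughly half the weights are now zero
```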
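Low-rank factorization is easy to show with a truncated SVD: keep only the top singular values and replace one big matrix with two thin ones. The matrix size and the rank of 64 below are arbitrary choices for illustration.

```python
import torch

def low_rank_factorize(w: torch.Tensor, rank: int):
    # Truncated SVD: keep only the top-`rank` singular components.
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape: (out_dim, rank)
    B = Vh[:rank, :]             # shape: (rank, in_dim)
    return A, B                  # w is approximated by A @ B

w = torch.randn(512, 512)
A, B = low_rank_factorize(w, rank=64)
# Storage drops from 512*512 to 2*512*64 parameters.
print((w - A @ B).norm() / w.norm())  # relative approximation error
```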
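Finally, a minimal knowledge distillation step: the student is trained to match the teacher's softened output distribution with a KL-divergence loss. The toy linear models, temperature, and learning rate here are placeholders, not a training recipe.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for a large teacher LLM and a small student model.
teacher = torch.nn.Linear(128, 1000)
student = torch.nn.Linear(128, 1000)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0                               # softmax temperature

x = torch.randn(8, 128)               # a dummy batch of inputs
with torch.no_grad():
    teacher_logits = teacher(x)       # the teacher is frozen

student_logits = student(x)
# KL divergence between the softened teacher and student distributions.
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss.backward()
optimizer.step()
```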
If you haven’t read my articles on applying quantization to LLMs such as Llama 2 for fine-tuning and inference, I recommend the following ones: