Compression Techniques for LLMs

A survey

Benjamin Marie
Aug 30, 2023

I wrote several articles in The Kaitchup about LLM quantization, but quantization is not the only technique that can reduce model size. The main compression techniques are:

  • Quantization: Convert the model weights to a lower precision, e.g., from 16-bit floats to 8-bit integers (see the first sketch after this list).
  • Pruning: Remove redundant parameters, with little to no effect on the model's performance.
  • Low-rank factorization: Approximate a weight matrix by decomposing it into two or more smaller matrices with significantly lower dimensions (see the second sketch below).
  • Knowledge distillation: Transfer the knowledge of a large teacher model to a smaller, more efficient student model. For instance, when an LLM is trained on ChatGPT outputs, this is knowledge distillation.
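
To make the first idea concrete, here is a minimal sketch of symmetric "absmax" quantization to int8 in PyTorch. It illustrates the principle only; the function names are mine, and production methods such as GPTQ or AWQ are far more refined (per-group scales, 4-bit formats, calibration data).

```python
import torch

def quantize_absmax_int8(w: torch.Tensor):
    """Symmetric absmax quantization: one float scale per tensor, weights stored as int8."""
    scale = w.abs().max() / 127.0               # the largest magnitude maps to 127
    w_q = torch.round(w / scale).to(torch.int8)
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor from the int8 weights."""
    return w_q.to(torch.float32) * scale

w = torch.randn(1024, 1024)                     # a toy float32 weight matrix
w_q, scale = quantize_absmax_int8(w)
w_hat = dequantize(w_q, scale)

# int8 storage is 4x smaller than float32 (1 byte vs. 4 bytes per weight).
print(f"max reconstruction error: {(w - w_hat).abs().max().item():.5f}")
```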

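Low-rank factorization can be sketched just as briefly with a truncated SVD. Again, this is only an illustration under my own assumptions: the rank of 64 is arbitrary, and the approximation only pays off when the weight matrix is close to low rank, which a random matrix is not.

```python
import torch

def low_rank_factorize(w: torch.Tensor, rank: int):
    """Approximate W (m x n) as A @ B, with A (m x rank) and B (rank x n), via truncated SVD."""
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    A = U[:, :rank] * S[:rank]                  # fold the singular values into the left factor
    B = Vh[:rank, :]
    return A, B

w = torch.randn(1024, 1024)                     # a toy weight matrix
A, B = low_rank_factorize(w, rank=64)

# Storage drops from 1024*1024 ≈ 1.05M parameters to 2*1024*64 ≈ 0.13M (8x smaller).
rel_err = torch.linalg.norm(w - A @ B) / torch.linalg.norm(w)
print(f"relative approximation error: {rel_err.item():.3f}")
```
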
Zhu et al. wrote a survey, "A Survey on Model Compression for Large Language Models", explaining these methods.

Figure by Zhu et al.
