Andrei Apostolin (MantisNLP) — series overview:

- "Knowledge Distillation — Techniques for Efficient Inference of LLMs (IV/IV)" (Nov 8, 2023): Welcome to the fourth and final part of our blog series on techniques for efficient inference of LLMs. In this segment, we explore…
- "FlashAttention — Techniques for Efficient Inference of LLMs (III/IV)" (Nov 1, 2023): Last time in this series we discussed pruning (removing useless weights in a network) and paged attention (optimizing memory access)…
- "Techniques for Efficient Inference of LLMs (II/IV)" (Oct 18, 2023): Last time we talked about quantization, a compression technique used to reduce the bitwidth of neural networks by representing the weights…
- "Techniques for Efficient Inference of LLMs (I/IV)" (Oct 4, 2023): Introduction