Andrei Apostolin · MantisNLP

- Knowledge Distillation — Techniques for Efficient Inference of LLMs (IV/IV) (6 min read, Nov 8, 2023): Welcome to the fourth and final part of our blog series on techniques for efficient inference of LLMs. In this segment, we explore…
- FlashAttention — Techniques for Efficient Inference of LLMs (III/IV) (6 min read, Nov 1, 2023): Last time in this series we discussed pruning (removing useless weights in a network) and paged attention (optimizing memory access)…
- Techniques for Efficient Inference of LLMs (II/IV) (10 min read, Oct 18, 2023): Last time we talked about quantization, a compression technique used to reduce the bitwidth of neural networks by representing the weights…
- Techniques for Efficient Inference of LLMs (I/IV) (8 min read, Oct 4, 2023): Introduction