Pierre Lienhart in Towards Data Science
The AQLM Quantization Algorithm, Explained
In this blog post, we cover the AQLM quantization algorithm, which sets a new state of the art for compressing LLMs down to 2 bits!
Mar 13

Pierre Lienhart
LLM Inference Series: 5. Dissecting model performance
In this post, we look deeper into the different types of bottlenecks that affect model latency and explain what arithmetic intensity is.
Feb 23

Pierre Lienhart
LLM Inference Series: 4. KV caching, a deeper look
In this post, we will look at how large the KV cache, a common optimization for LLM inference, can grow, and at common mitigation strategies.
Jan 15
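As a quick illustration of why KV cache size matters, here is a minimal sketch of the standard sizing formula (2 tensors per layer, K and V, each of shape heads × head_dim per token). The function name and the Llama-2-7B-like configuration values are illustrative assumptions, not taken from the post itself.

```python
# Hedged sketch: estimating KV cache memory for a decoder-only Transformer.
# Size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes/elem.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    # bytes_per_elem=2 assumes fp16/bf16 storage
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128.
size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB for one 4096-token sequence in fp16
```

Even at batch size 1, the cache reaches gigabytes at long context lengths, which is why the mitigation strategies the post discusses (e.g. fewer KV heads, quantized caches) matter.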

Pierre Lienhart
LLM Inference Series: 3. KV caching unveiled
In this post, we introduce the KV caching optimization for LLM inference: where it comes from and what it changes.
Dec 22, 2023

Pierre Lienhart
LLM Inference Series: 2. The two-phase process behind LLMs’ responses
After a quick reminder on the Transformer architecture, this post covers the text generation algorithm used by Transformer decoder models.
Dec 22, 2023

Pierre Lienhart
LLM Inference Series: 1. Introduction
In this post, I outline this deep-dive series on the specifics and challenges of hosting LLMs for inference.
Dec 22, 2023