Published in TDS Archive
The AQLM Quantization Algorithm, Explained
In this blog post, we cover the AQLM quantization algorithm, which sets a new state of the art for compressing LLMs down to 2 bits!
Mar 13, 2024
LLM Inference Series: 5. Dissecting model performance
In this post, we look deeper into the different types of bottlenecks that affect model latency and explain what arithmetic intensity is.
Feb 2, 2024
LLM Inference Series: 4. KV caching, a deeper look
In this post, we look at how large the KV cache, a common optimization for LLM inference, can grow, and at common mitigation strategies.
Jan 15, 2024
LLM Inference Series: 3. KV caching unveiled
In this post, we introduce the KV caching optimization for LLM inference, where it comes from, and what it changes.
Dec 22, 2023
LLM Inference Series: 2. The two-phase process behind LLMs’ responses
After a quick reminder on the Transformer architecture, this post covers the text generation algorithm used by Transformer decoder models.
Dec 22, 2023
LLM Inference Series: 1. Introduction
In this post, I introduce the outline of this deep dive series about the specifics and challenges of hosting LLMs for inference.
Dec 22, 2023