Building LLM Applications: Serving LLMs (Part 9)
Learn Large Language Models (LLMs) through the lens of a Retrieval Augmented Generation (RAG) application.
Posts in this Series
- Introduction
- Data Preparation
- Sentence Transformers
- Vector Database
- Search & Retrieval
- LLM
- Open-Source RAG
- Evaluation
- Serving LLMs (this post)
- Advanced RAG
Table of Contents
· 1. Run LLMs locally
∘ 1.1. Open-source LLMs
· 2. Load LLMs Efficiently
∘ 2.1. HuggingFace
∘ 2.2. LangChain
∘ 2.3. Llama.cpp
∘ 2.4. Llamafile
∘ 2.5. Ollama
∘ 2.6. GPT4ALL
∘ 2.7. Sharding
∘ 2.8. Quantize with Bitsandbytes
∘ 2.9. Pre-Quantization (GPTQ vs. AWQ vs. GGUF)
· 3. Inference Optimization
· 4. Understanding LLM inference
∘ 4.1. Prefill phase or processing the input
∘ 4.2. Decode phase or generating the output
∘ 4.3. Request batching
∘ 4.4. Continuous batching
∘ 4.5. PagedAttention: A Memory-Centric Solution
∘ 4.6. Key-value caching
∘ 4.6.1. LLM memory requirement
· 5. Scaling up LLMs with model parallelization
∘ 5.1. Pipeline parallelism
∘ 5.2. Tensor parallelism
∘ 5.3. Sequence parallelism
· 6. Optimizing the attention mechanism
∘ 6.1. Multi-head attention
∘ 6.2. Multi-query attention
∘ 6.3. Grouped-query attention
∘ 6.4. Flash attention
∘ 6.5. Efficient management…