CMU & Meta’s TriForce: Turbocharging Long Sequence Generation with 2.31× Speed Boost on A100 GPU
Large language models (LLMs) with long-context capabilities, such as GPT-4 and Gemini, are finding increasingly versatile applications in domains like chatbots, vision generation, and financial analysis. However, their efficacy is hampered by inefficient use of computational resources and a substantial memory footprint, particularly when generating long sequences.
Addressing these challenges, a research team from Carnegie Mellon University and Meta AI introduces TriForce in the new paper TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding. TriForce is a hierarchical speculative decoding system tailored for scalable long-sequence generation: it not only achieves remarkable speedups for models such as Llama2-7B-128K, reaching up to 2.31× on an A100 GPU, but also scales to even lengthier contexts.
The researchers identified three crucial insights that guided the development of TriForce:
- Hierarchical Speculation for Dual Memory Bottlenecks: Recognizing two primary memory bottlenecks, model weights and the key-value (KV) cache, the team observed that as context length grows, the KV cache gradually becomes the dominant one. This led them to employ hierarchical speculation, addressing the two bottlenecks sequentially with different draft models.
- Leveraging Attention Sparsity for Speculative Decoding: Identifying significant redundancy within the KV cache, the researchers found that a small portion of it suffices to achieve a high acceptance rate. They therefore use a partial KV cache as the draft cache for self-speculation, capitalizing on attention sparsity (a minimal sketch of this retrieval step follows the list).
- Exploiting Contextual Locality for Drafting Efficiency: Observing that adjacent tokens tend to attend to similar information in the long context, the team leveraged this contextual locality to reuse the retrieved cache across consecutive decoding steps, improving drafting efficiency.
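The retrieval-based drafting step can be pictured with a short sketch. The code below is a minimal illustration under stated assumptions, not TriForce's actual implementation: it splits one attention head's KV cache into chunks, scores each chunk by the current query against the chunk's mean key, and keeps only the top-scoring chunks as the draft cache. The function and parameter names (select_draft_cache, chunk_size, budget) are hypothetical.

```python
import torch

def select_draft_cache(keys, values, query, chunk_size=128, budget=4096):
    """Pick the KV-cache chunks most relevant to the current query.

    keys, values: [seq_len, d] full KV cache of one attention head
    query:        [d] query vector of the latest decoded token
    Returns a budget-sized partial cache to use as the draft cache.
    """
    seq_len, d = keys.shape
    n_chunks = seq_len // chunk_size
    # Represent each contiguous chunk by the mean of its key vectors.
    chunk_keys = keys[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(dim=1)
    # Score chunks by the query's affinity to the chunk means; keep the best.
    scores = chunk_keys @ query
    k = min(budget // chunk_size, n_chunks)
    top = torch.topk(scores, k=k).indices.sort().values
    # Expand chunk indices back to token positions and gather the partial cache.
    idx = (top[:, None] * chunk_size + torch.arange(chunk_size)).flatten()
    return keys[idx], values[idx]
```

Thanks to contextual locality, the chunks retrieved for one token tend to stay relevant for a run of subsequent tokens, so the selection can be reused across many decoding steps rather than recomputed per token.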
Building upon these insights, TriForce employs retrieval-based drafting and hierarchical speculation to effectively tackle the identified bottlenecks. It uses the original model weights together with a dynamic, retrieval-based sparse KV cache as a draft model; this draft serves as the intermediate layer of the hierarchy and is itself speculated by a smaller model to reduce drafting latency.
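To make the hierarchy concrete, here is a minimal, runnable sketch of two-level speculation. Toy categorical "models" stand in for the small draft model, the retrieval draft, and the full target model; the acceptance rule is standard speculative sampling (the bonus token sampled on full acceptance is omitted for brevity), and all names and gamma values are illustrative assumptions rather than the paper's code.

```python
import torch

VOCAB = 100

def toy_model(seed):
    """Stand-in for an LLM: maps the last token to next-token probabilities."""
    g = torch.Generator().manual_seed(seed)
    table = torch.softmax(torch.randn(VOCAB, VOCAB, generator=g), dim=-1)
    return lambda seq: table[seq[-1] % VOCAB]

def draft_tokens(model, seq, gamma):
    """Sample gamma draft tokens, recording each proposal distribution."""
    drafted, qs = [], []
    for _ in range(gamma):
        q = model(seq + drafted)
        t = torch.multinomial(q, 1).item()
        drafted.append(t)
        qs.append(q)
    return drafted, qs

def verify(model, seq, drafted, qs):
    """Speculative-sampling verification: accept token t with probability
    min(1, p(t)/q(t)); on rejection, resample from the residual and stop."""
    accepted = []
    for t, q in zip(drafted, qs):
        p = model(seq + accepted)
        if torch.rand(1).item() < min(1.0, (p[t] / q[t]).item()):
            accepted.append(t)
        else:
            resid = torch.clamp(p - q, min=0)
            accepted.append(torch.multinomial(resid / resid.sum(), 1).item())
            break
    return accepted

tiny, retrieval_draft, full = toy_model(0), toy_model(1), toy_model(2)
prefix = [1, 2, 3]

# Level 1: the tiny model drafts; the retrieval draft (target weights plus
# partial KV cache) verifies, hiding the model-weight bottleneck.
drafted, qs = draft_tokens(tiny, prefix, gamma=8)
mid = verify(retrieval_draft, prefix, drafted, qs)

# Level 2: the full model with the complete KV cache verifies the surviving
# tokens. For simplicity, we recompute the retrieval draft's probabilities
# for those tokens as the proposal distributions.
mid_qs = [retrieval_draft(prefix + mid[:i]) for i in range(len(mid))]
final = verify(full, prefix, mid, mid_qs)
print(final)
```

Because the full model with the complete KV cache has the final say at the top of the hierarchy, the output distribution matches ordinary auto-regressive decoding, which is what makes the acceleration lossless.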
TriForce’s performance speaks volumes: it achieves notable speedups for Llama2-7B-128K, up to 2.31× on an A100 GPU, and scales to even longer contexts. In an offloading setting on two RTX 4090 GPUs, TriForce generates a token every 0.108s, only half the latency of the auto-regressive baseline running on an A100, and delivers a 7.78× speedup over the optimized offloading system. Furthermore, TriForce outperforms DeepSpeed-Zero-Inference on a single RTX 4090 GPU by 4.86×. These achievements underscore TriForce’s potential to revolutionize the serving of long-context models for extensive sequence generation.
The paper TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding is on arXiv.
Author: Hecate He | Editor: Chain Zhang