NVIDIA’s Hybrid: Combining Attention and State Space Models for Breakthrough Performance of Small Language Models
Language models (LMs) based on transformers have become the gold standard in natural language processing, thanks to their exceptional performance, parallel processing capabilities, and ability to retain long-term context via key-value (KV) caches. However, these benefits come at a cost: attention scales quadratically with sequence length, and the KV cache grows with every generated token, leading to large memory footprints and significant efficiency challenges. On the other hand, state space models (SSMs), such as Mamba, offer constant per-token computational complexity and a hardware-friendly design, but they struggle with memory recall, which hampers their performance on diverse language tasks.
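To make the memory trade-off concrete, here is a rough, illustrative calculation (not taken from the paper): a transformer's per-layer KV cache grows with sequence length, while an SSM keeps a fixed-size recurrent state. The head counts, dimensions, and dtype size below are hypothetical placeholders chosen only for the comparison.

```python
# Illustrative per-layer memory comparison: KV cache vs. fixed SSM state.
# All sizes below are assumed example values, not Hymba's actual configuration.

def kv_cache_bytes(seq_len, num_heads=8, head_dim=64, dtype_bytes=2):
    # Keys and values are both cached for every past token.
    return 2 * seq_len * num_heads * head_dim * dtype_bytes

def ssm_state_bytes(state_dim=16, hidden_dim=512, dtype_bytes=2):
    # An SSM keeps one fixed-size recurrent state regardless of sequence length.
    return state_dim * hidden_dim * dtype_bytes

for n in (1_024, 8_192, 65_536):
    print(f"seq_len={n:>6}: KV cache {kv_cache_bytes(n) / 1e6:8.2f} MB, "
          f"SSM state {ssm_state_bytes() / 1e6:8.4f} MB")
```

The KV cache scales linearly with the sequence length (and attention compute quadratically), while the SSM state stays constant, which is the efficiency gap the paper targets.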
To address these issues, in a new paper Hymba: A Hybrid-head Architecture for Small Language Models, an NVIDIA research team proposes Hymba, a family of small language models built on a hybrid-head parallel architecture. By blending transformer attention with SSM heads in the same layer, Hymba achieves superior efficiency and performance. Notably, it outperforms the Llama-3.2-3B model with a 1.32% higher average accuracy, while reducing cache size by 11.67× and increasing throughput by 3.49×.
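The core idea is that attention heads and SSM heads process the same input in parallel within a layer and their outputs are fused, rather than stacking the two mechanisms in alternating layers. The sketch below is a minimal, assumed PyTorch illustration of that parallel pattern; the simplified diagonal recurrence standing in for a Mamba-style SSM head, the concatenation-based fusion, and all dimensions are my own placeholders, not Hymba's actual design.

```python
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    """Toy parallel hybrid block: attention heads and a simplified SSM path
    see the same input, and their outputs are fused by a learned projection.
    The fusion scheme and dimensions are illustrative, not Hymba's."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Simplified per-channel linear recurrence standing in for an SSM head.
        self.in_proj = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.rand(d_model))   # per-channel retention
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def ssm_path(self, x):
        # x: (batch, seq, d_model). Sequential scan of h_t = a * h_{t-1} + u_t.
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay)                    # keep the recurrence stable
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):
            h = a * h + u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x):
        # Causal mask so attention heads only look at past tokens.
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        ssm_out = self.ssm_path(x)
        # Fuse the two parallel paths and add a residual connection.
        fused = self.fuse(torch.cat([attn_out, ssm_out], dim=-1))
        return self.norm(x + fused)

block = HybridHeadBlock()
tokens = torch.randn(2, 32, 256)      # (batch, seq_len, d_model)
print(block(tokens).shape)            # torch.Size([2, 32, 256])
```

The intuition behind the parallel arrangement is that the attention path supplies precise recall over the context while the SSM path summarizes it cheaply in a fixed-size state, so the block can shrink or skip much of the KV cache without losing accuracy.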