NVIDIA’s nGPT: Revolutionizing Transformers with Hypersphere Representation
The Transformer architecture, introduced by Vaswani et al. in 2017, serves as the backbone of contemporary language models. Over the years, numerous modifications to this architecture have been proposed to enhance aspects such as training stability, inference efficiency, context length, and robustness.
In a new paper, *nGPT: Normalized Transformer with Representation Learning on the Hypersphere*, an NVIDIA research team proposes the normalized Transformer (nGPT), which consolidates key findings in Transformer research under a unified framework and learns substantially faster, reducing the number of training steps required to reach the same accuracy by a factor of 4 to 20, depending on sequence length.
The researchers summarize their main contributions as follows:
- Hypersphere-Based Normalization: The core advancement of nGPT lies in normalizing all embedding dimensions to reside on a unit hypersphere. This approach ensures consistent dimensionality across matrices and interprets matrix-vector multiplications as cosine similarities within the bounded range of [-1,1]. Notably, this normalization eliminates the need for weight decay by maintaining…
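To make the hypersphere idea concrete, here is a minimal toy sketch in PyTorch. It is not the paper's implementation; the tensor names, sizes, and the `normalize_rows` helper are illustrative assumptions. It only demonstrates the core property described above: once both operands are constrained to unit norm, a matrix-vector product yields cosine similarities bounded in [-1, 1].

```python
import torch

def normalize_rows(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Project each row vector onto the unit hypersphere (L2 norm of 1)."""
    return x / (x.norm(dim=-1, keepdim=True) + eps)

# Toy dimensions, chosen only for illustration.
d_model, vocab = 8, 5

# A hypothetical embedding matrix and hidden-state vector, both normalized.
E = normalize_rows(torch.randn(vocab, d_model))  # each embedding row on the unit sphere
h = normalize_rows(torch.randn(1, d_model))      # normalized hidden state

# Because every row of E and h has unit norm, each entry of the product
# is the cosine similarity between h and an embedding row, so every
# value lies in [-1, 1].
logits = h @ E.T
print(logits)
print(bool(logits.abs().max() <= 1.0 + 1e-6))  # True, up to floating-point error
```

In nGPT this normalization is applied throughout the network rather than to a single projection as in this sketch, which is what keeps all matrix-vector interactions on the bounded cosine scale the authors describe.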