From Dense to Dynamic: NVIDIA’s Innovations in Upcycling LLMs to Sparse MoE

Sparse Mixture of Experts (MoE) models are gaining traction because they can improve accuracy without a proportional increase in compute. Meanwhile, enormous computational resources have already been invested in training dense Large Language Models (LLMs), which use a single MLP in each transformer layer. A promising way to boost the capacity of such pre-trained models is to upcycle them into sparse MoE models, expanding the architecture without training from scratch. Scalable upcycling methods, however, remain an active area of research.
In a new paper Upcycling Large Language Models into Mixture of Experts, an NVIDIA research team introduces a new “virtual group” initialization technique to facilitate the transition of dense models into fine-grained MoE structures. They also propose a weight scaling method that delivers a 1.5% improvement in model loss for the upcycled MoE models.
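The summary above does not spell out where the scaling is applied, but the general idea can be sketched: after upcycling, the expert weights are rescaled by a constant factor. The helper below is hypothetical; the name scale_expert_weights, the assumption that experts live in an experts ModuleList, and the choice to scale each expert's final linear projection are illustrative, and the appropriate value of scale should be taken from the paper.

```python
import torch
import torch.nn as nn


def scale_expert_weights(moe_layer: nn.Module, scale: float) -> None:
    """Hypothetical helper: rescale each expert's output projection in place.

    Assumes `moe_layer.experts` is an nn.ModuleList of expert MLPs; the
    value of `scale` is a placeholder, not the factor derived in the paper.
    """
    with torch.no_grad():
        for expert in moe_layer.experts:
            # Treat the last nn.Linear in the expert as its output projection.
            linears = [m for m in expert.modules() if isinstance(m, nn.Linear)]
            out_proj = linears[-1]
            out_proj.weight.mul_(scale)
            if out_proj.bias is not None:
                out_proj.bias.mul_(scale)
```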
The core idea behind upcycling is to harness the knowledge embedded in pre-trained dense language models and convert them into large MoE architectures, reducing both training time and computational expense. This transformation maximizes the utility of dense checkpoints while expanding the model’s capacity. To achieve this, the researchers devised the “virtual group” initialization technique, which ensures that every MLP shard is distinctly represented in the router’s top-K selection when transitioning from a dense model to a fine-grained MoE configuration.
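To make the high-level recipe concrete, here is a minimal PyTorch sketch of coarse-grained upcycling, assuming the dense MLP is an nn.Module and that routing uses a plain softmax-then-top-K scheme. The class name, the defaults (8 experts, top-2), and the loop-based dispatch are illustrative choices; the sketch does not implement the fine-grained virtual-group initialization described above, which additionally shards the dense MLP and constrains the initial top-K selection.

```python
import copy

import torch
import torch.nn as nn


class UpcycledMoELayer(nn.Module):
    """Sketch of upcycling: every expert starts as a copy of the pre-trained
    dense MLP, and a freshly initialized router learns to dispatch tokens."""

    def __init__(self, dense_mlp: nn.Module, hidden_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Copy the dense MLP weights into each expert so the upcycled layer
        # initially behaves like the dense model it was converted from.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts)
        )
        # The router is new and trained from scratch.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        probs = torch.softmax(self.router(x), dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Send each token to its top-k experts and combine the expert
        # outputs, weighted by the router probabilities.
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = topk_idx[:, slot] == expert_id
                if mask.any():
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

In the fine-grained setting targeted by the paper, each expert would instead hold a shard of the original MLP, and the virtual-group initialization would guarantee that every shard appears in the router’s initial top-K selection.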
Their findings show that upcycling outperforms continued dense training at an equivalent compute budget, as demonstrated on both 2-billion and 15-billion parameter models. Depending on the target inference cost and the FLOPs available for upcycling, architectures such as E8G1T2, which use more FLOPs, can deliver higher accuracy than their iso-FLOP dense counterparts.
Additionally, the research highlights the need for distinct hyperparameter settings during…