From Dense to Dynamic: NVIDIA’s Innovations in Upcycling LLMs to Sparse MoE
Sparse Mixture of Experts (MoE) models are gaining traction because they improve accuracy without a proportional increase in computational cost. Traditionally, significant compute has been invested in training dense Large Language Models (LLMs), which use a single MLP block in each transformer layer. A promising way to boost the capacity of such pre-trained models is to upcycle them into sparse MoE models, expanding the architecture without restarting training from scratch. However, methods for upcycling at scale remain an open research question.
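To make the recipe concrete, here is a minimal PyTorch sketch of the basic upcycling step that this line of work builds on: every expert is initialized as a copy of the pre-trained dense MLP, and a router is trained from scratch to distribute tokens among the experts. The class name, arguments, and routing details below are illustrative assumptions, not code from the paper.

```python
import copy
import torch
from torch import nn


class UpcycledMoELayer(nn.Module):
    """Hypothetical sketch: a sparse MoE layer initialized from a dense MLP.

    Each expert starts as an exact copy of the pre-trained MLP, so the
    upcycled model inherits the dense model's knowledge; only the router
    starts from random initialization.
    """

    def __init__(self, dense_mlp: nn.Module, d_model: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Replicate the pre-trained dense MLP into every expert.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_mlp) for _ in range(num_experts)]
        )
        # The router is new and trained from scratch.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                           # (tokens, experts)
        probs = torch.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # softmax-then-top-K
        out = torch.zeros_like(x)
        # Dispatch each token to its selected experts and mix by gate value.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += top_p[mask, k:k + 1] * expert(x[mask])
        return out


# Example usage with an illustrative dense MLP (shapes are assumptions).
dense_mlp = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
)
moe_layer = UpcycledMoELayer(dense_mlp, d_model=1024, num_experts=8, top_k=2)
```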
In a new paper Upcycling Large Language Models into Mixture of Experts, an NVIDIA research team introduces a new “virtual group” initialization technique to facilitate the transition of dense models into fine-grained MoE structures. They also propose a weight scaling method that delivers a 1.5% improvement in model loss for the upcycled MoE models.
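The exact virtual-group initialization and weight scaling formulas are best taken from the paper itself; the fragment below only sketches the general intuition behind rescaling, under the assumption of softmax-then-top-K routing: because each selected expert receives a gate value below one, the combined expert output at initialization is smaller in magnitude than the dense MLP output it replaces, so a constant rescaling of the expert weights can compensate. The helper name, the choice of which weights to scale, and the suggested factor are all assumptions for illustration, building on the sketch above.

```python
import torch
from torch import nn


def rescale_expert_outputs(experts: nn.ModuleList, scale: float) -> None:
    """Hypothetical helper (not the paper's method): scale each expert's final
    Linear projection so that the gated sum of expert outputs at initialization
    roughly matches the magnitude of the original dense MLP. With uniform
    softmax routing over E experts and top-K selection, a factor of about
    E / K would restore the scale (an illustrative assumption).
    """
    for expert in experts:
        # Assumption: the last nn.Linear in each expert is its output projection.
        last_linear = [m for m in expert.modules() if isinstance(m, nn.Linear)][-1]
        with torch.no_grad():
            last_linear.weight.mul_(scale)
            if last_linear.bias is not None:
                last_linear.bias.mul_(scale)


# Example usage with the sketch above: 8 experts, top-2 routing.
rescale_expert_outputs(moe_layer.experts, scale=8 / 2)
```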
The core idea behind upcycling is to harness the knowledge embedded in pre-trained dense language models and convert them into large MoE architectures, reducing both training time and computational expense. This…