Nov 28, 2021

Scaling with Sparsity: How Switch Transformers Scale to a Trillion Parameters Using Sparsely-Activated Expert Models

Modern deep learning models, especially those used in large-scale natural language processing (NLP), aim to achieve better performance by increasing the size (or parameter count) of the model, given sufficiently large training datasets. …