Two minutes NLP — Scaling Transformers with Sparsity

How sparsity can reduce the computational complexity of dense Transformers and make predictions faster

Fabio Chiusano
NLPlanet
5 min read · Mar 22, 2022


Hello fellow NLP enthusiasts! Today I want to share with you an interesting article that investigates ways to make large language models more manageable for our everyday use cases. This matters because not everybody has the resources to train these large models, or even to run predictions with them. Studies like this may eventually bring us to the point where we can host a model like GPT-3 on a laptop. Enjoy! 😄

Nowadays, large language models achieve impressive results on many NLP tasks, mainly thanks to the rise of Transformer models like BERT, T5, and GPT-3. Transformers have shown that they model natural language better and better as their number of parameters grows.

Now, can you think of a problem with this trend?

The benefits of this progress are undercut by the huge costs such models incur. These models are very expensive to train (and, sometimes, even to fine-tune) and they require specialized hardware for making fast predictions as well, which limits their real-world use cases. With the growing popularity and size of these models, it is increasingly valuable to make them scale efficiently.

Sparsity to the rescue!

The paper “Sparse is Enough in Scaling Transformers” addresses these problems and proposes a new family of Transformers that fully leverages sparsity whenever possible, called Scaling Transformers.

Scaling Transformers are really interesting because they allow language models to scale efficiently and to perform unbatched decoding much faster than the standard Transformer as the model size grows. To put it into perspective:

  • Let’s call d the model dimension of a Transformer (the width of its layers).
  • A standard dense Transformer needs roughly d^2 computations to decode a token.
  • A sparse Scaling Transformer needs roughly d^1.5 computations instead.

If this doesn’t look like such a big improvement to you, consider that d grows together with model size, so the gap widens as models scale up: indeed, experiments showed that Scaling Transformers bring a nearly 20x prediction speedup for a single token (from 3.690s to 0.183s) with respect to a dense Transformer with 17B parameters. Keep in mind that these speedups are for unbatched predictions.
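To get a feel for what that change of exponent means, here is a tiny back-of-the-envelope comparison of my own (d stands for the model width, and the operation counts are schematic; real speedups also depend on attention, memory access, and batching):

```python
# Back-of-the-envelope comparison of the per-token decoding cost discussed above:
# roughly d^2 for the dense layers vs. roughly d^1.5 for the sparse ones.
# The absolute numbers are illustrative; only the ratio matters.

for d in (1_024, 4_096, 16_384, 65_536):
    dense = d ** 2      # ~d^2 multiply-adds in the dense feedforward/QKV layers
    sparse = d ** 1.5   # ~d^1.5 with the sparse layer variants
    print(f"d={d:>6}  dense≈{dense:.1e}  sparse≈{sparse:.1e}  ratio≈{dense / sparse:.0f}x")
```

The theoretical ratio grows like the square root of d, which is why the measured speedup gets larger as models scale up.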

Moreover, it turns out that Scaling Transformers achieve the same log-perplexity as the standard dense Transformer on pretraining, with the same number of parameters: this fact gives the paper its title, “Sparse is Enough”, i.e. a sparse model is able to match the performance of a dense model with the same number of parameters. Not a trivial outcome, I would say!

Let’s dig into a few details. Where can sparsity be implemented into Transformers?

Sparsity in Transformers

The paper proposes Scaling Transformers with a separate sparse mechanism for the query, key, value, and output layers (QKV layers for short) and combines it with sparse feedforward blocks to get a fully sparse Transformer architecture.
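To make the feedforward part more concrete, here is a minimal NumPy sketch of the sparse feedforward idea. It is my own simplified illustration rather than the paper’s exact implementation (the real controller is trained end-to-end with a discretization trick, and the QKV layers are sparsified with a separate mechanism): the hidden units are split into blocks, a cheap low-rank controller picks one unit per block, and only the corresponding columns and rows of the weight matrices are computed.

```python
import numpy as np

# Minimal sketch of a sparse feedforward block (illustrative shapes and names).
d_model, d_ff, block = 512, 2048, 32        # 2048 hidden units, in blocks of 32
n_blocks = d_ff // block
low_rank = 64

rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)            # activations for a single token
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
C1 = rng.standard_normal((d_model, low_rank)) * 0.02   # low-rank controller
C2 = rng.standard_normal((low_rank, d_ff)) * 0.02

# The controller cheaply scores every hidden unit and keeps one winner per block.
scores = (x @ C1 @ C2).reshape(n_blocks, block)
winners = scores.argmax(axis=-1)
cols = winners + np.arange(n_blocks) * block           # indices into the d_ff axis

# Only the selected columns of W1 and rows of W2 are touched:
# ~d_model * n_blocks multiply-adds instead of ~d_model * d_ff.
h = np.maximum(x @ W1[:, cols], 0.0)                   # ReLU on the active units
y = h @ W2[cols, :]
print(y.shape, f"computed {n_blocks} of {d_ff} hidden units")
```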

Log-perplexity of Scaling Transformers (equivalent to T5-large with approximately 800M parameters) on the C4 dataset with the proposed sparsity mechanisms (FF, QKV, FF+QKV) is similar to that of the baseline dense model. Image from https://arxiv.org/pdf/2111.12763.pdf.
Decoding speed (in seconds) of a single token. For a Transformer model equivalent to T5-large (approximately 800M parameters), Scaling Transformers with the proposed sparsity mechanisms (FF+QKV) achieve up to a 2x decoding speedup compared to the baseline dense model, and a 20x speedup for the 17B-parameter model. Image from https://arxiv.org/pdf/2111.12763.pdf.

The above gains from sparsification are substantial. However, they shrink when decoding longer sequences, because decoding time becomes dominated by the attention operations.
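To see why, here is a rough, order-of-magnitude sketch of my own (not the paper’s exact accounting): when decoding token by token, attending over a cache of L previous tokens costs on the order of L·d per layer, while the dense projections and feedforward cost on the order of d², so once the d² term is brought down to roughly d^1.5, the attention term takes over for long sequences.

```python
# Rough per-token, per-layer operation counts (illustrative accounting, not the paper's):
# attending over L cached tokens ~ L * d, dense matmuls ~ d^2, sparse matmuls ~ d^1.5.

d = 4_096
for L in (512, 4_096, 32_768):
    attention = L * d
    dense_matmuls = d ** 2
    sparse_matmuls = d ** 1.5
    print(f"L={L:>6}  attention≈{attention:.1e}  dense≈{dense_matmuls:.1e}  sparse≈{sparse_matmuls:.1e}")
```

With the sparse layers in place, the L·d attention term quickly becomes the bottleneck as the sequence grows.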

Luckily, a number of methods have already been proposed to address this problem in Transformers, such as LSH (Locality-Sensitive Hashing) attention for handling long sequences and reversible layers for memory efficiency.

The authors implemented them to further improve Scaling Transformers, obtaining the Terraformer model. Terraformer was then pre-trained on C4 and fine-tuned on summarization of arXiv articles, yielding results competitive with the state-of-the-art BigBird-Pegasus.
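As a side note, here is a tiny sketch of the reversible-residual idea mentioned above (in the spirit of Reformer-style reversible layers; F and G are placeholders I made up for the attention and feedforward sub-layers): because the block is invertible, its input activations do not need to be stored for backpropagation and can be recomputed from the outputs, which is where the memory savings come from.

```python
import numpy as np

# Toy reversible residual block: inputs can be reconstructed exactly from outputs,
# so intermediate activations need not be kept in memory during training.
rng = np.random.default_rng(0)
d = 8
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
F = lambda x: np.tanh(x @ A)   # placeholder for the attention sub-layer
G = lambda x: np.tanh(x @ B)   # placeholder for the feedforward sub-layer

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # Reconstruct the inputs from the outputs alone.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True
```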

Decoding a single token with the 17B-parameter Terraformer is 37x faster than with a dense baseline model, requiring less than 100ms/token for inference. Here attention-sparsity = 64, ff-sparsity = 256, and loss-sparsity = 4. Image from https://arxiv.org/pdf/2111.12763.pdf.

Other techniques for improving Transformer efficiency

The paper also has an interesting overview of other techniques used to make Transformers more efficient. I report some excerpts here, as I think they can be a useful reference for those not yet familiar with Transformer efficiency techniques.

  • Model compression. Model pruning makes matrices smaller by removing unneeded weights after or during training.
  • Model distillation. Model distillation consists of training a small model (the student) on the outputs of a previously trained large model (the teacher). Several natural language models used for mobile inference rely on distillation to speed up inference from pre-trained large models.
  • Sparse attention. Sparse attention-based approaches have made the attention layer more efficient, especially for long sequences, by incorporating additional combinatorial mechanisms or selecting a subset of tokens this layer attends to.
  • Sparse feedforward. The key idea is to partition the feed-forward layer into parts (called experts) and retrieve only one part per token, which reduces the complexity of the feedforward block. These speedups are mostly measured in training speed, and the method focuses on feedforward blocks. Mixture of experts approaches have been shown to achieve computational efficiency in training, scaling up to a trillion parameters (see the toy routing sketch right after this list).
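To illustrate that last point, here is a toy sketch of top-1 expert routing (an illustrative example in the spirit of Switch-Transformer-style routing, not code from the paper; all sizes and names are assumptions): a router picks one expert per token, so each token only pays for one small feedforward network instead of the full dense layer.

```python
import numpy as np

# Toy top-1 mixture-of-experts routing: one expert per token.
rng = np.random.default_rng(0)
n_tokens, d_model, d_expert, n_experts = 4, 64, 128, 8

x = rng.standard_normal((n_tokens, d_model))
W_router = rng.standard_normal((d_model, n_experts)) * 0.02
# One (W_in, W_out) pair per expert.
experts = [(rng.standard_normal((d_model, d_expert)) * 0.02,
            rng.standard_normal((d_expert, d_model)) * 0.02)
           for _ in range(n_experts)]

chosen = (x @ W_router).argmax(axis=-1)          # one expert id per token
y = np.empty_like(x)
for t, e in enumerate(chosen):
    W_in, W_out = experts[e]
    y[t] = np.maximum(x[t] @ W_in, 0.0) @ W_out  # only the chosen expert runs
print(chosen, y.shape)
```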

Conclusions and next steps

In this article, we saw that sparse models match the performance of their dense counterparts while being many times faster at inference. And, when scaling the models up, the benefits of sparsity become even larger.

However, the current results have a number of limitations. For one, the practical speedups are seen only at inference time, not during training. More research is still needed to truly democratize large language models. Nonetheless, this paper can be considered a first step on the way to sustainable large models.

Possible next steps are:

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
