Longformer — The Long-Document Transformer 📝

Viktor
Published in DAIR.AI
Apr 30, 2020

Processing longer forms of text with BERT-like models requires us to rethink the attention mechanism in more than one way. How can we reduce the computational cost of the attention calculations, which grows quadratically with sequence length? Do we really need all tokens to attend to every other one in the sequence? These questions, and more, will be answered in this paper summary!

The fact that Transformer-based language models are computationally expensive to both train and use should come as no surprise. This is partly due to an ever-growing number of parameters but also the intrinsic cost of their attention mechanism.

The attention mechanism allows the models to enrich each token representation with information from anywhere else in the sequence, which is at the core of Transformer-based models’ success. Put simply, processing a sequence of n tokens requires on the order of n² attention calculations for each attention head during a forward pass.
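To make the quadratic growth concrete, here is a minimal PyTorch sketch (the sequence length and head dimension below are arbitrary example values) of what full self-attention has to materialise for a single head:

```python
import torch

n, d = 4096, 64          # example sequence length and per-head dimension
q = torch.randn(n, d)    # queries for a single attention head
k = torch.randn(n, d)    # keys for a single attention head

# Full self-attention scores every query against every key, so both
# compute and memory for this matrix grow with n * n.
scores = q @ k.t()       # shape (4096, 4096): about 16.8M entries per head
print(scores.shape)
```

Doubling the sequence length quadruples the size of this score matrix, which is exactly the growth the Longformer attention patterns below are designed to avoid.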

BERT addresses this by enforcing a hard limit of 512 tokens, which is more than enough to process the overwhelming majority of sequences in most benchmark datasets. But what if we want to work with longer forms of text, moving away from sentences and into the realm of documents? That would require us to rethink the attention mechanism.

An initial thought, which might seem obvious, is that the value a token brings to another in the attention mechanism diminishes the further apart they are. It should, therefore, make sense to limit the attention window each token has access to. This seed of an idea is what has been explored by Beltagy et al. in Longformer: The Long-Document Transformer. Their contribution and experimental findings are summarised below!

Contribution

Longformer introduces an attention mechanism that grows linearly with sequence length through a sliding window of size w. This limits each token to attending only a subset of all tokens: the local ones thought to bring the most value. While this attention pattern might seem limited, it still allows a multi-layer transformer network to have a receptive field that covers the entirety of the sequence.

The authors also introduce a dilated sliding window attention pattern to allow the receptive field to cover an even larger range. Here, the w attended positions are separated by gaps of d empty spaces. I think you can see how this achieves the intended effect.
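For intuition only, here is a rough PyTorch sketch of what a sliding-window mask with optional dilation looks like; the real implementation never builds a dense n x n matrix (see the CUDA kernel discussion below), and the window and dilation values are made-up examples:

```python
import torch

def sliding_window_mask(n, window, dilation=1):
    """True where query position i may attend key position j: a band of
    `window` positions around the diagonal, with `dilation` - 1 empty
    spaces between consecutive attended keys."""
    i = torch.arange(n).unsqueeze(1)   # query positions, as a column
    j = torch.arange(n).unsqueeze(0)   # key positions, as a row
    offset = j - i
    half = (window // 2) * dilation
    return (offset.abs() <= half) & (offset % dilation == 0)

plain = sliding_window_mask(n=16, window=4)               # contiguous band
dilated = sliding_window_mask(n=16, window=4, dilation=2) # same band with gaps
```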

An issue the observant reader will notice is how these attention windows affect special tokens such as [CLS] and [SEP]. The [CLS] token is supposed to aggregate the entire sequence into a single representation to allow for classification. This becomes a bit weird when it cannot attend to all tokens directly, even though it might be able to reach them indirectly through its receptive field. The authors address this by introducing task-specific global attention on special tokens such as this one. This attention is symmetric: every token in the sequence can attend to the special token, just as it can attend to all of them.

A great illustration of the three attention patterns described above.
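For reference, the Hugging Face transformers implementation of Longformer exposes this task-specific global attention through a separate global_attention_mask argument. A minimal usage sketch, assuming the allenai/longformer-base-4096 checkpoint:

```python
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt")

# 0 = local (sliding window) attention, 1 = global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1   # the leading <s> / [CLS] token attends globally

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```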

Another nicety the authors mention as a key contribution is a custom CUDA kernel for these attention patterns, which, due to their banded nature, are different enough that they cannot simply be implemented efficiently with existing libraries. The image below illustrates the computational speed (left) as well as the above-discussed memory savings (right) that the novel attention patterns bring for longer sequences. The comparison is performed between full self-attention, a naive “for loop” implementation of their banded matrix multiplications, and their optimized version.
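To get a feel for what the naive variant computes, here is an illustrative (and deliberately slow) Python loop over the banded part of the query-key product; it is not the authors' kernel, just a sketch showing that only n x (2w + 1) scores are ever produced:

```python
import torch

def banded_scores_loop(q, k, w):
    """Compute attention scores only inside a window of +/- w positions.
    Memory is n x (2w + 1) instead of n x n; slots outside the sequence
    stay at -inf so a later softmax ignores them."""
    n, d = q.shape
    scores = torch.full((n, 2 * w + 1), float("-inf"))
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores[i, lo - i + w : hi - i + w] = q[i] @ k[lo:hi].t()
    return scores

scores = banded_scores_loop(torch.randn(4096, 64), torch.randn(4096, 64), w=256)
print(scores.shape)   # torch.Size([4096, 513])
```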

How these novel attention patterns are deployed within the Longformer model is a research question in itself. It would make sense for lower layers to learn local features and for later layers to combine these into higher-level representations of the sequence, similarly to how computer vision models learn their representations. The authors share this logic and gradually increase the attention window size for higher layers. Only a couple of the highest layers are configured with dilated sliding window attention.

These design choices provide a balance between efficiency (smaller attention window allows for faster computation) and performance (larger attention window allows for greater representational power).

To get a bit ahead of myself: the ablation studies in the paper show that this approach achieves higher performance than both using a constant window size throughout and the reverse configuration, where lower layers get larger windows that taper off towards the later ones.
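The exact per-layer values are not reproduced here, but the schedule could look roughly like the following; the numbers are made up purely for illustration:

```python
n_layers = 12

# Smaller local windows in the lower layers, larger ones higher up ...
attention_window = [32 * (layer + 1) for layer in range(n_layers)]  # 32, 64, ..., 384

# ... and dilation only in a couple of the top layers.
attention_dilation = [1] * (n_layers - 2) + [2, 2]
```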

Evaluation

The authors performed two distinct experiments to evaluate the Longformer: one to evaluate its language modelling capabilities and a second to evaluate how well suited it is to the common pretraining-finetuning process. It is the capability of performing both of these tasks that sets Longformer apart from some of its competition.

Autoregressive language modelling

Autoregressive language modelling, also known as left-to-right language modelling, is the task of predicting the next token, either a word or a character, given the left context. How well a model performs this task is evaluated with a metric called Bits Per Character (BPC): the average negative log-likelihood of the correct characters, measured in base two. If you really want to dig deep into this topic, I suggest this article as a good starting point.
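As a quick reference, BPC can be computed directly from the probabilities the model assigns to the ground-truth characters; a small sketch:

```python
import math

def bits_per_character(correct_char_probs):
    """Average negative log-likelihood of the correct characters, in base 2."""
    return -sum(math.log2(p) for p in correct_char_probs) / len(correct_char_probs)

# A model that assigns probability 0.5 to every correct character scores 1.0 BPC.
print(bits_per_character([0.5] * 100))
```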

Two models are created to allow for valuable comparisons to competitors: a large 30-layer and a smaller 12-layer Longformer, both with a hidden dimension of 512. The evaluation was performed by running the model over sequences of 32,256 characters, where performance was measured on the last 512 characters; this is in line with previous work. The authors achieve state-of-the-art results with their smaller model on both datasets, 1.10 and 1.00 BPC on text8 and enwik8 respectively, when comparing to models of similar size.

The 30-layer model achieves performance comparable to the true state of the art, even against larger models such as Transformer-XL (102M vs 277M parameters). What is worth noting here is the fact that Longformer can also be used for MLM pretraining, which is not possible with all models used in this comparison.

Pretraining and finetuning

Pretraining refers to the training scheme where a model is initially trained on a base task over a large, general dataset and then fine-tuned on the specific task and dataset. The general task for this NLP model is Masked Language Modelling (MLM), as popularised by BERT. MLM is computationally expensive, which is why, to speed up their pretraining process, the authors initialize their model with weights from an already pretrained RoBERTa model (of the same dimensions, of course). This highlights an important fact: the attention patterns are simple enough to be dropped into existing model architectures!
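One practical detail this implies: RoBERTa only has 512 learned position embeddings, so they have to be extended to cover the longer sequences. A simple way to do this, in the spirit of the paper, is to copy them repeatedly up to the new maximum length (the shapes below are illustrative):

```python
import torch

def extend_position_embeddings(pos_emb, new_max_pos=4096):
    """Tile learned position embeddings of shape (512, hidden) up to
    `new_max_pos` rows so the added positions start from a sensible init."""
    old_max_pos, dim = pos_emb.shape
    extended = torch.empty(new_max_pos, dim)
    for start in range(0, new_max_pos, old_max_pos):
        end = min(start + old_max_pos, new_max_pos)
        extended[start:end] = pos_emb[: end - start]
    return extended

extended = extend_position_embeddings(torch.randn(512, 768))
print(extended.shape)   # torch.Size([4096, 768])
```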

Here, Longformer is evaluated in two distinct scenarios. The first is to answer whether the attention patterns can act as a replacement for standard self-attention. This is achieved by comparing against RoBERTa, which deals with sequences longer than 512 tokens by breaking them up into manageable pieces, processing each separately, and then concatenating the embedded tokens for further processing.

Compared to RoBERTa-base, which was used for weight initialization, Longformer achieves higher performance across all six benchmark tasks. The improvement is most notable on Hyperpartisan, a small dataset of documents with an average length of 705 wordpiece tokens.

The second scenario used to evaluate Longformer's capabilities is a comparison with state-of-the-art models on QA datasets. Some of its competitors employ task-specific architectures and training processes which, while achieving good results, are cumbersome to design and hard to adapt to other datasets or tasks. This comparison allows us to answer whether such methods could be left to bleeding-edge research, leaving us with a simple model that performs well enough in most cases.

To our benefit, that is essentially what is found! Longformer-large achieves state-of-the-art results on both WikiHop and TriviaQA by a significant margin (3.6 and 4 points respectively). On HotpotQA, Longformer-large achieves results comparable to both larger and more complex models, which answers the question this evaluation aimed to address.

Conclusion

This article has summarised the motivations, contributions, and experimental findings of Longformer: The Long-Document Transformer. It becomes clear that the attention patterns introduced by this work are versatile enough to be introduced into already existing Transformer architectures, while at the same time being able to outperform the competition on some tasks. Even task-specific architectures and training schemes are surpassed, or at least matched, by this much simpler approach.

If you found this summary helpful in understanding the broader picture of this particular research paper, please consider reading my other articles! I’ve already written a bunch and more will definitely be added. I think you might find this one interesting 👋🏼🤖
