Longformer: A Transformer for Long Form Documents

RISHABH TRIPATHI
Published in Analytics Vidhya · Jan 5, 2021 · 6 min read
  • Problems associated with the vanilla Transformer on long form documents:
  1. It is not robust enough to encode the context of long passages with distant connections. One workaround is to divide the document into chunks and encode each chunk separately, but this severs any connection between tokens in different chunks, since a token in one chunk never attends tokens in another chunk. Hence, we need a way to pass the whole document to the model so that even long-range information remains connected.
  2. Self-attention is quadratic in memory and compute, O(n²), because each token attends every token in the previous layer. As the sequence length n grows, the memory requirement grows quadratically.
  • How to solve these problems?

We need a way to reduce the memory requirement to linear complexity. Drawing on the idea behind CNNs (which slide a filter/kernel over the input and extract information with linear complexity), the authors of the paper replaced the expensive full attention pattern with a Sliding Window Attention pattern (explained later in this article). This also lets us encode the whole long form document while retaining context throughout. Longformer uses two more kinds of attention patterns: Dilated Sliding Window and Global + Sliding Window. We will see how these patterns prove useful in this architecture, but first let me take you through each of these attention patterns and its benefits:

1. Quadratic Attention:

This is the attention pattern followed by the vanilla Transformer, where each token attends every other token. If we take both the length and breadth of this attention field to be n (the number of tokens), then every single token attends every other token, resulting in O(n²) memory requirements.
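As a quick illustration of why this blows up, here is a minimal NumPy sketch showing that full self-attention materialises an n × n score matrix. The sequence length and head dimension are illustrative values, not taken from the paper:

```python
import numpy as np

# Illustrative sizes: n tokens, d-dimensional queries/keys (not the paper's values).
n, d = 4096, 64
Q = np.random.randn(n, d).astype(np.float32)
K = np.random.randn(n, d).astype(np.float32)

# Every token attends every other token, so the score matrix is n x n.
scores = Q @ K.T / np.sqrt(d)        # shape: (n, n)
print(scores.shape)                  # (4096, 4096)
print(scores.nbytes / 1e6, "MB")     # ~67 MB for a single head of a single layer
```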

2. Sliding Window Attention:

This is the attention pattern adopted in the Longformer architecture. A window of size w slides over the sequence of length n: each token attends itself and the other tokens within that window in the previous layer, and the window slides to the end of the sequence, costing O(n × w), which is linear in n for a fixed window size. Tokens in the upper layers are still able to capture longer-range context. The window size used in the paper is 512, but for ease of explanation we assume a smaller window in this article.

In the figure above, with window size 3, the token x4 in the first hidden layer attends the tokens x3, x4, x5 from the previous (input) layer. Similarly, the token x4 in the second hidden layer attends the tokens within its window in the first hidden layer, and those tokens themselves attend x2, x3, x4, x5, x6. So tokens in the upper layers learn a longer context from the input, and that is the trick! If we keep increasing the number of layers, the tokens in the final layer of the Transformer attend a very long context while memory requirements stay linear.

This kind of attention pattern is useful for encoding local information from the input within the window. Since we want to capture all the information coming directly from the input, it is wise to extract the local information with the Sliding Window Attention pattern in the initial layers of the Transformer.
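Here is a minimal NumPy sketch of what a sliding window attention mask looks like. The function name and the tiny sizes are mine, just for illustration; the real Longformer kernel never builds the full n × n matrix, it only stores roughly n × w scores:

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean mask where token i may attend token j only if |i - j| <= w // 2.
    Illustrative helper only; the actual implementation works in banded form."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = sliding_window_mask(n=8, w=3)
print(mask.astype(int))
# Each row has at most w ones, so the useful entries grow as O(n * w): linear in n.
```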

3. Dilated Sliding Window Attention:

In the Dilated Sliding Window Attention pattern, each token in a layer attends itself and a few more tokens in the previous layer, leaving a consistent gap between them (gap = 1 in the right-hand image above). The token x4 in the first hidden layer attends the tokens x4, x2, x6 in the previous layer, skipping x3 and x5. This pattern captures a wider range of information over a relatively longer context, and is therefore used in the upper (later) layers of the long-document Transformer.
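A tiny helper (hypothetical, written only to make the dilation pattern concrete) shows that with window size 3 and gap 1, token 4 ends up attending positions 2, 4 and 6, exactly as in the example above:

```python
def dilated_window_positions(i: int, w: int, gap: int, n: int) -> list[int]:
    """Positions attended by token i with window size w and dilation gap `gap`.
    With gap = 1 the token skips every other neighbour, widening its reach
    without attending more positions. Illustrative helper, not the paper's kernel."""
    half = w // 2
    offsets = [k * (gap + 1) for k in range(-half, half + 1)]
    return [i + off for off in offsets if 0 <= i + off < n]

# Token 4 with window 3 and gap 1 attends positions 2, 4, 6 (skipping 3 and 5).
print(dilated_window_positions(i=4, w=3, gap=1, n=10))   # [2, 4, 6]
```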

4. Global + Sliding Window Attention:

This attention pattern mixes global attention with sliding window attention: global attention is computed on a few special tokens, such as the [CLS] token, which attends (and is attended by) every token across the sequence length n.
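For a concrete picture, here is a short sketch using the Hugging Face transformers implementation of Longformer, where global_attention_mask marks which tokens get global attention. I mark only the [CLS] token at position 0; the checkpoint name is the one released by the authors, and the input text is a placeholder:

```python
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt")

# 0 = local (sliding window) attention, 1 = global attention.
# Here only the [CLS] token at position 0 attends, and is attended by, every token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```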

Longformer can be utilized to perform:

  • Autoregressive Modeling (learning left-to-right context):

For autoregressive language modelling, the size of the sliding attention window is increased with depth. As suggested earlier, the lower layers use the sliding window attention pattern while the later layers use the dilated sliding window pattern, so the model learns distant information without compromising the local context. Training with this objective is done in 5 phases. With every phase, the input sequence length is increased and the learning rate is halved, starting from a sequence length of 2,048 in the first phase and reaching 23,040 in the final phase (a rough sketch of this schedule is shown below).

For evaluation, the dataset was split into sequences of length 32,256 each, and the model was evaluated with a step size of 512.
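Below is a rough sketch of the staged training schedule described above. The intermediate sequence lengths and the starting learning rate are placeholders; only the first and last lengths and the halving rule come from the paper:

```python
# Placeholder schedule: intermediate lengths and the starting learning rate are assumed.
start_lr = 1e-4                                   # assumed value, not from the paper
seq_lengths = [2048, 4096, 8192, 16384, 23040]    # first and last match the paper

for phase, seq_len in enumerate(seq_lengths):
    lr = start_lr / (2 ** phase)                  # learning rate halves every phase
    print(f"phase {phase + 1}: sequence length = {seq_len:>6}, learning rate = {lr:.2e}")
```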

  • Pretraining using MLM (Masked Language Modeling) objective:

Pretraining Longformer from scratch is really expensive, so the authors suggest starting from a pretrained RoBERTa checkpoint and continuing pretraining as Longformer. The positional embeddings are RoBERTa's absolute positional embeddings; the only difference is that they are extended by copying them multiple times until they cover the longer sequence length. Pretraining is done with the MLM (Masked Language Model) objective, with the same setup as RoBERTa (including weights, number of layers, etc.), and the attention window size is kept at 512.
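The positional-embedding trick can be sketched in a few lines of PyTorch. This is simplified; the actual conversion also has to handle RoBERTa's padding-offset convention, which I ignore here, and the target length of 4,096 is just an example:

```python
import torch
from transformers import RobertaModel

roberta = RobertaModel.from_pretrained("roberta-base")
# Learned absolute position embeddings, shape (514, 768) for roberta-base.
old_pos = roberta.embeddings.position_embeddings.weight.data

max_len = 4096                                  # example target length
copies = -(-max_len // old_pos.size(0))         # ceiling division
new_pos = old_pos.repeat(copies, 1)[:max_len]   # copy until the table is long enough
print(new_pos.shape)                            # torch.Size([4096, 768])
```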

  • Fine Tuning:

The model is fine-tuned on several tasks such as document classification, coreference resolution, and question answering. For details, I suggest referring to the original paper, as this article is getting long.
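For a flavour of what fine-tuning looks like in code, here is a minimal document classification sketch with Hugging Face's LongformerForSequenceClassification; the label count and input text are illustrative, and QA or coreference would use a different task head:

```python
from transformers import LongformerForSequenceClassification, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2   # illustrative binary task
)

inputs = tokenizer("A long document to classify ...", return_tensors="pt", truncation=True)
outputs = model(**inputs)        # [CLS] gets global attention by default for this head
print(outputs.logits.shape)      # torch.Size([1, 2])
```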

  • Longformer Encoder Decoder (LED):

This adds a decoder on top of the Longformer encoder so that the model can handle generation tasks like summarization. Recall the BART summarizer, which could only handle passages that were not too long. Pretraining a Longformer for generation tasks would be very expensive, so LED is initialized from a BART checkpoint with the same settings (number of layers and weights); the only difference is that BART's positional embeddings are extended from 1K to 16K tokens. This summarizer has outperformed other models such as BigBird on the long document summarization task.
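As a usage sketch, the LED checkpoint released by the authors can be loaded directly from Hugging Face transformers; the generation arguments and the input document below are illustrative, not the paper's settings:

```python
from transformers import LEDForConditionalGeneration, LEDTokenizerFast

tokenizer = LEDTokenizerFast.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

document = "A very long report ..."   # placeholder, up to 16K tokens
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=16384)

# Illustrative generation settings for a short abstractive summary.
summary_ids = model.generate(inputs["input_ids"], max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```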

I have explained the mechanism of the architecture and why this idea works so well for long form documents. If you are interested in the exact figures of the results and the datasets used for training and testing, I suggest referring to the paper; it will be easier to follow once you finish this article. I have tried to explain it as lucidly as I can, hoping it helped you, and I thank you for your patience. Wishing you the best in your journey, thank you and until next time!
