
E17 : Unlimiformer

Praveen Thenraj
Research Papers Summarized
6 min read · Jan 31, 2024


Using a retrieval technique to select the top-k encoder hidden states attended to during cross-attention in the decoder lets transformers handle input texts of effectively unlimited length.

Paper Name : Unlimiformer: Long-Range Transformers with Unlimited Length Input

Paper URL : https://arxiv.org/pdf/2305.01625.pdf

Authors : Carnegie Mellon University - Amanda Bertsch, Uri Alon, Graham Neubig, Matthew R. Gormley

Conference : NeurIPS 2023

Please find the annotated paper here.

Problem Statement :

  • At present, the maximum context window (sequence length) that most transformers can handle is 512 or 1024 tokens, depending on the architecture.
  • Certain long-range transformers such as Longformer and PRIMERA are specifically pre-trained for longer context windows of 2048 to 16384 tokens.
  • Still, some real-world applications like summarisation and question answering involve context windows of more than 16384 tokens.
  • Increasing the input context window means the decoder has to attend to every token’s hidden states from the last encoder layer during cross-attention. This in turn increases the computational cost of pre-training and increases latency at runtime.

Solution :

  • Instead of attending to all the tokens during cross-attention, attending only to the most important tokens from the encoder should increase the number of tokens that transformers can handle.
  • The solution should not require pre-training the existing architectures from scratch, as doing so would incur huge computational costs.

Approach :

  • The input sequence is split into multiple overlapping chunks.
  • Each chunk is then encoded using the model’s encoder.
Unlimiformer architecture
  • In the final encoder layer, only the hidden states of the middle half of the tokens in each chunk are kept, on the assumption that these middle tokens have attended to context on both the left and right halves of their chunk.
  • The hidden states of these tokens from all chunks are indexed in a datastore using a kNN index (see the sketch after this list).
  • During decoding, the standard attention formula is reformulated so that the dot product of the decoder query with the encoder tokens’ hidden states can be used to retrieve the top-k tokens (keys) from the kNN index (see the regrouped formula after this list).
Attention Reformulation in the paper Unlimiformer
  • The approach does not require pre-training LMs from scratch; it can even be applied directly at test time, which avoids the computational cost of re-training existing architectures.
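To make the chunking, indexing, and retrieval steps above concrete, here is a minimal NumPy sketch of the idea. The stand-in random "encoder", the toy chunk sizes, and the brute-force inner-product search (used here in place of the FAISS-style kNN index described in the paper) are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch of Unlimiformer-style indexing and retrieval.
# A random stand-in "encoder" and a brute-force inner-product search
# replace the real encoder and the kNN index used in the paper.
import numpy as np

rng = np.random.default_rng(0)
d_model, chunk_len, stride = 16, 8, 4      # 50% overlap between chunks (toy sizes)

def encode_chunk(token_ids):
    """Stand-in for an encoder forward pass: one hidden state per token."""
    return rng.normal(size=(len(token_ids), d_model)).astype(np.float32)

# 1) Split a long input into overlapping chunks and encode each chunk.
tokens = list(range(100))                  # pretend token ids of a long document
kept_states = []
for start in range(0, len(tokens) - chunk_len + 1, stride):
    hidden = encode_chunk(tokens[start:start + chunk_len])
    # 2) Keep only the middle half of each chunk, whose tokens saw
    #    context on both sides within the chunk.
    lo, hi = chunk_len // 4, 3 * chunk_len // 4
    kept_states.append(hidden[lo:hi])
datastore = np.concatenate(kept_states)    # (num_indexed_tokens, d_model)

# 3) At decoding time, a (projected) decoder query retrieves the top-k
#    encoder states; only these are attended to during cross-attention.
def topk_keys(query, k=4):
    scores = datastore @ query             # inner-product search (index stand-in)
    idx = np.argsort(scores)[::-1][:k]
    return idx, datastore[idx]

query = rng.normal(size=d_model).astype(np.float32)
idx, keys = topk_keys(query)
print("retrieved indexed-token positions:", idx)
```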
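For reference, the attention reformulation referred to above is, roughly as described in the paper, a regrouping of the cross-attention dot product so that a single index over the raw encoder hidden states can serve every attention head and decoder layer:

```latex
% h_d : decoder hidden state, h_e : encoder hidden state,
% W_q, W_k : the query/key projections of one attention head.
Q K^\top = (h_d W_q)(h_e W_k)^\top = (h_d W_q W_k^\top)\, h_e^\top
```

The datastore therefore only needs to store each token’s unprojected encoder state once, and the per-head projection is folded into the query vector before the kNN search.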

Experimental Setup :

  • Models used - BART (base, 139M), PRIMERA (447M), SLED
  • Datasets evaluated - GovReport, SummScreen, BookSum
    1. GovReport - executive summary of government reports
    2. SummScreen - summaries of TV shows like F.R.I.E.N.D.S
    3. BookSum - summaries of books
  • Metrics used for evaluation - ROUGE 1/2/L and Entity Mention Recall (EntMent)
  • Performance of Unlimiformer was evaluated under two training settings: low-cost training and long-range training.
  • Low-cost training involves :
    1. test-time Unlimiformer : standard fine-tuning, with Unlimiformer applied only at test time.
    2. early stop with Unlimiformer : standard fine-tuning, with Unlimiformer also used during validation (for early stopping) and at test time.
    3. train chunked + test Unlimiformer : training samples longer than the model’s maximum input are split into non-overlapping chunks for training, while Unlimiformer is used on the validation set (with early stopping) and at test time.
  • Long-range training applies Unlimiformer during training itself and involves (a toy sketch of these schedules follows this list) :
    1. random-encoded training : selecting the top-k keys randomly during cross-attention
    2. retrieval training : selecting the top-k keys using the kNN index during cross-attention
    3. alternating training : alternating between random and retrieval selection in successive training batches
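The difference between these schedules is easiest to see in code. Below is a toy, hypothetical sketch of the three long-range training modes; the random datastore, the toy queries, and the absence of a real model and loss are simplifications, not the authors’ training code.

```python
# Hypothetical sketch of the long-range training schedules above.
# The datastore and queries are random toys; no real model is trained.
import numpy as np

rng = np.random.default_rng(0)
datastore = rng.normal(size=(1000, 16)).astype(np.float32)  # indexed encoder states

def select_keys(strategy, query, k=16):
    if strategy == "random":            # random-encoded training
        idx = rng.choice(len(datastore), size=k, replace=False)
    else:                               # retrieval training (kNN by dot product)
        scores = datastore @ query
        idx = np.argsort(scores)[::-1][:k]
    return datastore[idx]

def train(num_steps=4, mode="alternating", k=16):
    for step in range(num_steps):
        query = rng.normal(size=16).astype(np.float32)       # toy decoder query
        if mode == "alternating":       # switch selection strategy every batch
            strategy = "random" if step % 2 == 0 else "retrieval"
        else:
            strategy = mode             # "random" or "retrieval" throughout
        keys = select_keys(strategy, query, k)
        # Only the selected keys would take part in cross-attention here.
        print(f"step {step}: {strategy:9s} -> attending to {len(keys)} keys")

train()
```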

Observations :

  • In the low-cost training setting on GovReport and SummScreen, introducing Unlimiformer only at test time improved the performance of BART (base) compared to standard fine-tuning, where long inputs are truncated to the first n tokens the model can accept.
BART (base) with Unlimiformer applied only at test time, or additionally at validation time with early stopping, outperforms standard fine-tuning where inputs are truncated to the model's maximum input length
  • Results also show that using Unlimiformer only at test time enhances the performance of an already long-range pre-trained model such as PRIMERA, which can handle 4096 tokens.
  • In the long-range training setting on GovReport and SummScreen, BART (base) trained with Unlimiformer outperforms even models trained specifically for long inputs, such as PRIMERA with standard fine-tuning and Memorizing Transformers.
  • Long-range trained models combined with Unlimiformer show the strongest results, thanks to the combined advantages of long-range training and the Unlimiformer architecture.
  • When evaluated on BookSum, BART (base) with Unlimiformer outperformed the standard fine-tuning approach under both the low-cost and long-range training settings.
  • Using Unlimiformer just at test time doubled Entity Mention Recall (EntMent).
Number of Tokens Vs Entity Mention Recall
  • Using Unlimiformer with long-range pre-trained models like PRIMERA clearly enhanced their performance.
Both BART (base) and PRIMERA combined with Unlimiformer, whether applied only at test time or throughout training, outperform standard fine-tuning techniques that truncate the input
  • Results clearly indicate that attending to the entire context of long input sequences increases inference time only sublinearly.
Token size vs Inference time
  • A study of whether the entire context is really needed for tasks like summarisation showed that Entity Mention Recall increases as the context window grows, underlining the value of using the entire input rather than truncating it to a fixed length as traditional transformer architectures do.
  • Unlimiformer retrieved tokens almost uniformly from different parts of the document, further evidence that attending to the entire input, rather than a truncated prefix, improves such conditional generation tasks.
Token position in the document vs Retrieval % in that location

Limitations :

  • Unlimiformer has been evaluated only on English datasets.
  • A multi-GPU configuration would be required at training time to accommodate the embeddings of long-range contexts.
  • Creating a datastore and index on the CPU to store the input token vectors increases test-time latency.

Conclusion :

  • Unlimiformer can work on documents containing even 500k tokens without truncating the inputs to the fixed length the underlying model can handle.
  • In fact, there is no hard limit on the length of the documents Unlimiformer can process, which effectively gives it unlimited context length.
  • Approaches like Unlimiformer attend to the entire document, capturing information spread across it, and thus enhance the quality of the output generated by both small and long-range pre-trained models without requiring additional training.

