RULER: Benchmark to evaluate long-context modeling capabilities of language models

SACHIN KUMAR
Apr 16, 2024


Introduction

In the paper [1], the authors propose RULER, a new benchmark to evaluate the long-context modeling capabilities of language models. It contains four task categories that test behaviors beyond simple retrieval from context:

i) Retrieval: extends the needle-in-a-haystack (NIAH) [2] test to evaluate retrieval capability with diverse types and quantities of needles.

ii) Multi-hop Tracing: [1] proposes variable tracking, a minimal proxy task for coreference chain resolution, to check the behavior of tracing entities with multi-hop connections.

iii) Aggregation: [1] proposes common/frequent words extraction, proxy tasks for summarization, to test the ability to aggregate relevant information that spans long-range context.

iv) Question Answering: adds distracting information to the input of existing short-context QA datasets to evaluate question-answering capability at various context sizes.

Motivation and Limitations of Existing Benchmarks

  • Recent works have mostly focused on retrieval-based synthetic tasks ([2]; [3]; [4]; [5]), with a few on other types of long-context usage, including various types of reasoning [6] and long-range discourse modeling [7].
  • A simple retrieval-based test is indicative of only a superficial form of long-context understanding.
  • Despite achieving nearly perfect accuracy in the vanilla Needle-in-a-Haystack (NIAH) test, all models exhibit large performance drops as the context length increases.

Comparison with existing benchmarks

  • RULER consists solely of synthetic tasks, offering the flexibility to control sequence length and task complexity.
  • Synthetic input in RULER reduces reliance on parametric knowledge, which interferes with the utilization of long-context input in realistic tasks.
  • The following table compares existing long-context benchmarks with RULER, where the “realistic” type refers to human-annotated data and the “synthetic” type refers to auto-generated data.

Table: comparison between existing long-context benchmarks and RULER

RULER Benchmark

RULER comprises tasks across four categories: retrieval, multi-hop tracing, aggregation, and question answering, with all tasks configurable for varying length and complexity.

i) Retrieval: Needle-in-a-haystack (NIAH)

The NIAH tasks can be divided into four variants:

a) Single NIAH(S-NIAH):

  • The vanilla NIAH test, where a single “needle” (“the special magic number for XXX is: YYY”) needs to be retrieved from the “haystack”.
  • The query/key/value can take the form of words, numbers (7 digits), or UUIDs (32 digits).
  • The “haystack” can be repeated noise sentences (“The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.” used as noise) or Paul Graham essays.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.
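
To make the construction concrete, below is a minimal Python sketch of how an S-NIAH prompt could be assembled. This is my own illustration, not the paper’s released generator; the noise sentence and needle template follow the bullets above, while the function name, example key, and value are placeholders.

```python
import random

# Noise sentence taken from the task description above; everything else
# (function and argument names, the example key/value) is illustrative only.
NOISE = ("The grass is green. The sky is blue. The sun is yellow. "
         "Here we go. There and back again. ")

def make_s_niah(num_noise_sentences: int, key: str = "san-francisco",
                value: str = "4200317") -> dict:
    """Build a single-needle example: one key/value fact hidden in repeated noise."""
    needle = f"The special magic number for {key} is: {value}."
    haystack = [NOISE] * num_noise_sentences
    # Place the needle at a random depth in the haystack.
    haystack.insert(random.randrange(len(haystack) + 1), needle + " ")
    query = f"What is the special magic number for {key} mentioned in the context?"
    return {"context": "".join(haystack), "query": query, "answer": value}

example = make_s_niah(num_noise_sentences=200)
print(example["query"], "->", example["answer"])
```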

b) Multi-keys NIAH (MK-NIAH):

  • Multiple “needles” are inserted into the “haystack”, and only one of them needs to be retrieved.
  • The additional “needles” are hard distractors.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.

c) Multi-values NIAH (MV-NIAH):

  • Multiple “needles” sharing the same key are inserted into the “haystack”.
  • All values associated with the same key need to be retrieved.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.

d) Multi-queries NIAH (MQ-NIAH):

  • Multiple “needles” are inserted into the “haystack”.
  • All “needles” with distinct keys need to be retrieved.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.
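
The multi-key, multi-value, and multi-query variants follow the same recipe and differ only in how many needles are inserted and how many of them the query asks about. Below is a hedged, generalized sketch; the names and templates are mine, not the paper’s code.

```python
import random
import uuid

NOISE = "The grass is green. The sky is blue. The sun is yellow. "

def make_multi_niah(num_keys: int, values_per_key: int, num_queried_keys: int,
                    noise_sentences: int = 200) -> dict:
    """Generalized multi-needle NIAH:
       MK-NIAH: num_keys > 1, values_per_key = 1, num_queried_keys = 1
       MV-NIAH: num_keys = 1, values_per_key > 1, num_queried_keys = 1
       MQ-NIAH: num_keys > 1, values_per_key = 1, num_queried_keys > 1"""
    keys = [f"key-{uuid.uuid4().hex[:8]}" for _ in range(num_keys)]
    needles, answers = [], {}
    for k in keys:
        vals = [str(random.randint(1_000_000, 9_999_999)) for _ in range(values_per_key)]
        answers[k] = vals
        needles += [f"The special magic number for {k} is: {v}. " for v in vals]
    haystack = [NOISE] * noise_sentences
    for n in needles:  # scatter every needle at a random depth
        haystack.insert(random.randrange(len(haystack) + 1), n)
    queried = random.sample(keys, num_queried_keys)
    return {"context": "".join(haystack),
            "queried_keys": queried,
            "answers": {k: answers[k] for k in queried}}

mk_example = make_multi_niah(num_keys=4, values_per_key=1, num_queried_keys=1)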

ii) Multi-hop Tracing: Variable Tracking (VT)

  • VT emulates a minimal coreference chain resolution task.
  • The task checks the behavior of tracking relevant co-occurrence patterns and drawing skipped connections within long input.
  • A variable X1 is initialized with a value V, followed by a linear chain of variable name binding statements (e.g., X2 = X1, X3 = X2, …), which are inserted at various positions of the input; the model must return all variable names that ultimately point to the value V.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.
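
A minimal sketch of how such a variable-tracking chain could be generated (my own illustration of the description above, not the paper’s generator):

```python
import random

NOISE = "The grass is green. The sky is blue. The sun is yellow. "

def make_variable_tracking(chain_length: int, noise_sentences: int = 100) -> dict:
    """Build one chain X1 = V, X2 = X1, ..., scattered through noise.
    The model must list every variable name that ultimately holds the value V."""
    names = [f"X{i}" for i in range(1, chain_length + 1)]
    value = str(random.randint(10000, 99999))
    statements = [f"VAR {names[0]} = {value}"]
    statements += [f"VAR {names[i]} = VAR {names[i-1]}" for i in range(1, chain_length)]
    haystack = [NOISE] * noise_sentences
    for s in statements:  # insert assignments at random positions in the input
        haystack.insert(random.randrange(len(haystack) + 1), s + ". ")
    query = f"Which variables are assigned the value {value}?"
    return {"context": "".join(haystack), "query": query, "answer": names}

vt = make_variable_tracking(chain_length=5)
print(vt["query"], "->", vt["answer"])
```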

iii) Aggregation

a) Common Words Extraction(CWE)

  • Words are sampled from discrete uniform distributions, with the number of common words fixed while the number of uncommon words increases with the sequence length; the model must return the most common words in the context.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.
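
A rough sketch of how a CWE example could be constructed; the vocabulary, default counts, and prompt wording are placeholders of mine, not the paper’s exact configuration.

```python
import random

def make_cwe(num_common: int = 10, common_freq: int = 30,
             num_uncommon: int = 300) -> dict:
    """Common Words Extraction: each of the `num_common` common words repeats
    `common_freq` times, uncommon words appear once, and the model must return
    the common words. Longer contexts simply use more uncommon words."""
    vocab = [f"word{i}" for i in range(100_000)]        # placeholder vocabulary
    picked = random.sample(vocab, num_common + num_uncommon)
    common, uncommon = picked[:num_common], picked[num_common:]
    words = common * common_freq + uncommon
    random.shuffle(words)
    query = f"What are the {num_common} most common words in the above list?"
    return {"context": " ".join(words), "query": query, "answer": common}

cwe = make_cwe()
```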

b) Frequent Words Extraction (FWE)

  • Words are sampled from a Zeta (Zipfian) distribution.
  • Let N be the total number of words, determined by the context size; the frequency of the k-th ranked word (the k-th most frequently appearing word) is N · k^(−α) / ζ(α), where ζ(α) is the Zeta function. The top-ranked word is designated as noise, and the model must return the most frequent words that follow it.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.
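
To illustrate the sampling, here is a rough sketch using NumPy’s Zipf sampler, whose probability mass k^(−α) / ζ(α) matches the formula above; the word templates, the α value, and the answer size are assumptions of mine, not the paper’s settings.

```python
import numpy as np

def make_fwe(num_words: int, alpha: float = 2.0, answer_k: int = 3) -> dict:
    """Frequent Words Extraction: word ranks are drawn from a Zeta/Zipf
    distribution, so the k-th ranked word appears roughly N * k**(-alpha) / zeta(alpha)
    times. Rank 1 is designated as noise; the model must return the next
    `answer_k` most frequent words."""
    rng = np.random.default_rng(seed=0)
    ranks = rng.zipf(alpha, size=num_words)                  # P(rank = k) = k**(-alpha) / zeta(alpha)
    vocab = {int(r): f"word-{int(r)}" for r in np.unique(ranks)}  # placeholder synthetic words
    words = [vocab[int(r)] for r in ranks]
    rng.shuffle(words)
    # Ground truth: ranks 2 .. answer_k + 1 (rank 1 is the noise word).
    answer = [vocab[r] for r in range(2, 2 + answer_k) if r in vocab]
    query = (f"What are the {answer_k} most frequent words in the context, "
             f"ignoring the single most frequent one?")
    return {"context": " ".join(words), "query": query, "answer": answer}

fwe = make_fwe(num_words=2000)
print(fwe["answer"])
```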

iv) Question Answering (QA)

  • This category is a real-world adaptation of NIAH, where the question serves as the query, the golden paragraphs are the “needles”, and the distracting paragraphs form the “haystack”.
  • In the paper’s example figure, queries and keys are highlighted in purple, values in orange, and distractors in gray.
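
A minimal sketch of this construction; the prompt template and field names are mine, while the golden and distracting paragraphs are assumed to come from an existing short-context QA dataset as described above.

```python
import random

def make_long_context_qa(question: str, golden_paragraphs: list[str],
                         distractor_paragraphs: list[str], answer: str) -> dict:
    """Embed the golden paragraph(s) among distractors drawn from the same
    short-context QA dataset, then ask the original question over the whole input."""
    paragraphs = distractor_paragraphs[:]                # copy so the caller's list is untouched
    for golden in golden_paragraphs:                     # scatter the golden paragraphs
        paragraphs.insert(random.randrange(len(paragraphs) + 1), golden)
    context = "\n\n".join(f"Document {i + 1}: {p}" for i, p in enumerate(paragraphs))
    prompt = (f"Answer the question based on the documents below.\n\n{context}\n\n"
              f"Question: {question}")
    return {"prompt": prompt, "answer": answer}
```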

Experiments and Results

i) Experiment Setup

a) Models & Inference setup

  • 10 long-context LLMs were selected, including 9 open-source models and one closed-source model (GPT-4), covering diverse model sizes (6B to 8x7B with an MoE architecture) and claimed context lengths (32K to 1M).

b) Task configurations

  • For each task, each model was evaluated with 500 examples generated for each length from the series (4K, 8K, 16K, 32K, 64K, 128K), while complying with each model’s necessary chat template; a rough sketch of the resulting evaluation grid follows.
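
For illustration, the evaluation grid implied above; the task count comes from the results section, and the assumption that “k” means multiples of 1,024 is mine.

```python
# 13 tasks x 6 context lengths x 500 generated examples per configuration.
CONTEXT_LENGTHS = [4_096, 8_192, 16_384, 32_768, 65_536, 131_072]   # 4K .. 128K
EXAMPLES_PER_CONFIG = 500
NUM_TASKS = 13

total_examples_per_model = NUM_TASKS * len(CONTEXT_LENGTHS) * EXAMPLES_PER_CONFIG
print(total_examples_per_model)   # 39000 prompts evaluated per model
```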

c) Effective Context Size

  • Large performance degradation was observed in all models as the input length increases in RULER.
  • To determine the maximum context size a model can effectively handle, each model was graded against a fixed threshold; passing the threshold at a given length indicates satisfactory performance at that length (a minimal sketch follows).
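
A minimal sketch of such threshold-based grading; the 85.6% bar (Llama2-7B performance at 4K) is quoted from the results section below, while the data layout and the rule that every shorter length must also pass are my reading of the method.

```python
THRESHOLD = 85.6  # Llama2-7B performance at 4K, used as the passing bar (see Results)

def effective_context_length(scores_by_length: dict) -> int | None:
    """Return the largest evaluated length at which the model still passes the
    threshold, requiring it to also pass at every shorter evaluated length."""
    effective = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] >= THRESHOLD:
            effective = length
        else:
            break
    return effective

# Hypothetical scores for one model at lengths 4K .. 128K:
scores = {4_096: 93.0, 8_192: 90.1, 16_384: 87.5, 32_768: 84.0, 65_536: 70.2, 131_072: 50.0}
print(effective_context_length(scores))  # -> 16384
```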

d) Model Ranking Criteria

  • While the threshold-based grading reveals the discrepancy between claimed and effective length, it lacks details for fine-grained model comparisons.
  • A weighted average score was used to aggregate model performance across various context sizes.
  • Models were ranked under two weighting schemes, wAvg. (inc) and wAvg. (dec), where the weight linearly increases or decreases with sequence length, respectively (sketched below).
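
A sketch of the two weighting schemes as I understand them; the exact normalization in the paper may differ, and the scores are made-up numbers for illustration.

```python
def weighted_avg(scores_by_length: dict, increasing: bool = True) -> float:
    """wAvg. (inc): weights grow linearly with the index of the context length,
    emphasizing long sequences; wAvg. (dec): weights shrink linearly,
    emphasizing short sequences."""
    lengths = sorted(scores_by_length)
    n = len(lengths)
    weights = list(range(1, n + 1)) if increasing else list(range(n, 0, -1))
    return sum(w * scores_by_length[L] for w, L in zip(weights, lengths)) / sum(weights)

scores = {4_096: 93.0, 8_192: 90.1, 16_384: 87.5, 32_768: 84.0, 65_536: 70.2, 131_072: 50.0}
print(round(weighted_avg(scores, increasing=True), 1))   # emphasizes performance at 128K
print(round(weighted_avg(scores, increasing=False), 1))  # emphasizes performance at 4K
```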

ii) Results

  • Long-context performance (%) of the selected models, evaluated at lengths from 4K to 128K, is outlined in the paper’s results table.
  • Each score is computed by averaging accuracy over the 13 tasks in RULER, and performance exceeding the Llama2-7B performance at 4K (85.6%) is underlined in that table.
  • While these models all claim effective contexts of 32K tokens or greater, none of them maintains performance above the Llama2-7B baseline at its claimed length, except for Mixtral, which achieves moderate performance at double its claimed 32K context size.
  • All models exhibit large degradation on RULER as sequence length increases.
  • The best-performing model on RULER is GPT-4, which has the highest performance at a length of 4K and demonstrates the least, though non-marginal, degradation (15.4 points) when extending the context to 128K.
  • The top three ranked open-source models, Command-R, Yi-34B, and Mixtral, all use a large base frequency in RoPE and have more parameters than the other models.

Model Analysis

i) Effect of training context length

  • The paper’s figure shows that larger training context sizes overall lead to better performance, but the ranking can be inconsistent for long sequences, as evaluated on the LargeWorldModel (LWM) series.
  • The model trained with a 1M context size (LWM-1M) is worse than the one trained with 512K when evaluated at a length of 256K, likely due to insufficient training for adjusting to the new base frequency in RoPE.
  • Abrupt performance drops were observed when models need to extrapolate to unseen lengths (e.g., LWM-128K given an input of 256K).

ii) Effect of model size

  • To ablate the effect of model size, Yi-34B-200k, Yi-9B-200k, and Yi-6B-200k were evaluated, all trained up to the same context length using the same data blend.
  • The paper’s figure shows that the 34B model is significantly better than the 6B model on RULER, both in performance at a length of 4K and in relative degradation, suggesting the benefit of scaling model size for better long-context modeling.

iii) Effect of architecture

  • The effective context length was evaluated for two models with non-Transformer architectures: RWKV-v5 [8] and Mamba-2.8B-slimpj [9].
  • The paper’s figure shows that both models degrade significantly when the context size is extended to 8K, and both underperform the Transformer baseline Llama2-7B by large margins up to the 4K length, beyond which Llama2 itself shows poor length extrapolation.

Conclusion

  • RULER, a synthetic benchmark containing diverse task categories (retrieval, multi-hop tracing, aggregation, and question answering), provides a flexible and comprehensive evaluation of LLMs’ long-context capabilities.
  • Despite achieving nearly perfect results in the widely used needle-in-a-haystack test, all models fail to maintain their performance on the other tasks of RULER as input length increases.
  • RULER is challenging for even the top-ranked open-source models as we increase task complexity.

References:

  1. Hsieh et al. RULER: What’s the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654, 2024.
  2. Needle In A Haystack: Pressure Testing LLMs. GitHub, 2023. URL: https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main
  3. Mohtashami et al. Random-Access Infinite Context Length for Transformers. In Workshop on Efficient Systems for Foundation Models @ ICML, 2023.
  4. Liu et al. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the ACL, 12:157–173, 2024.
  5. Li et al. How Long Can Open-Source LLMs Truly Promise on Context Length? 2023. URL: https://lmsys.org/blog/2023-06-29-longchat
  6. Tay et al. Long Range Arena: A Benchmark for Efficient Transformers. In ICLR, 2021.
  7. Sun et al. ChapterBreak: A Challenge Dataset for Long-Range Language Models. In Proceedings of NAACL-HLT, 2022.
  8. Peng et al. RWKV: Reinventing RNNs for the Transformer Era. In EMNLP, 2023.
  9. Gu and Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2023.
