Evaluating the 10M Context Length of Gemini 1.5: How good is it?

SACHIN KUMAR
5 min read · Feb 19, 2024


Gemini 1.5 identifying the specific location of a core automatic differentiation method in a 746,152-token JAX codebase.

Gemini 1.5 now holds the bragging rights for the largest context window (10M tokens) among all available large language models, surpassing even Anthropic's Claude 2.1 and its 200K-token context window.

But a longer context does not always mean better performance, especially across multiple modalities, given that Gemini 1.5 is a multimodal model. So, to understand whether such a large context window comes without hurting core multimodal capabilities, it makes sense to take a deep dive into how the model was evaluated over long contexts.

Going over Gemini 1.5's technical report [1], the long-context evaluations can be grouped into two high-level categories:

Qualitative evaluations

Testing model capabilities on tasks for which no quantitative benchmarks exist, via interactions such as the following:

With the entire text of Les Misérables in the prompt (1,382 pages, 732k tokens), Gemini 1.5 Pro is able to identify and locate a famous scene from a hand-drawn sketch.
When prompted with the 45-minute Buster Keaton movie “Sherlock Jr.” (1924) (2,674 frames at 1 FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. The model can also identify a scene in the movie from a hand-drawn sketch.

Quantitative evaluations

Evaluations on synthetic and real-world tasks with well-defined metrics

a) Perplexity over long sequences:

The objective here was the model's ability to make use of long context to improve next-token prediction.

To evaluate this, negative log-likelihood (NLL) was recorded for tokens at different positions in input sequences drawn from held-out text (i.e., text not used in training). A lower NLL implies better prediction; higher NLL is expected at the beginning of a sequence, where there is little context to condition on.

As can be seen from the graph above, Gemini 1.5 keeps improving its predictions up to 1M tokens for long documents and up to 10M tokens for code.
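To make this concrete, here is a minimal sketch, not the report's code, of how per-position NLL can be measured for a held-out document with an open-source causal LM from Hugging Face; the model name and file path below are placeholders.

```python
# Minimal sketch: per-position NLL of a held-out document under a causal LM.
# "gpt2" and the file path are placeholders, not what the report used.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def per_position_nll(text: str) -> torch.Tensor:
    """NLL of each token given all preceding tokens in the sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids   # (1, seq_len)
    with torch.no_grad():
        logits = model(ids).logits                          # (1, seq_len, vocab)
    # The token at position t is predicted from the logits at position t-1.
    return F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

nll = per_position_nll(open("held_out_document.txt").read())
# Lower NLL late in the sequence suggests the model is actually using the context.
print(nll[:10].mean(), nll[-10:].mean())
```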

b) Needle-in-a-haystack evaluation:

Long-context recall was tested with the recently introduced needle-in-a-haystack evaluation (Kamradt, 2023 [2]), which measures a model's ability to retrieve a piece of text (the “needle”) inserted at various positions in a long sequence (the “haystack”).

These tests were performed for text, audio and video: pieces of text, audio or video were inserted at linearly spaced intervals from the beginning to the end of the context, and the model was tasked with retrieving them.
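As an illustration of the setup, here is a hedged sketch in the spirit of Kamradt's repository [2], not the report's actual harness: the needle is inserted at linearly spaced depths and the model is queried at each depth. The needle text, filler file and model call are all placeholders.

```python
# Sketch of a text needle-in-a-haystack test: insert a "needle" at linearly
# spaced depths of a long filler document and check whether the model retrieves it.
import numpy as np

NEEDLE = "The special magic number mentioned in this document is 74213."
QUESTION = "What is the special magic number mentioned in the document?"

def build_haystack(filler: str, depth_fraction: float) -> str:
    """Insert the needle at the given relative depth of the filler text."""
    cut = int(len(filler) * depth_fraction)
    return filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]

filler = open("long_filler_corpus.txt").read()   # placeholder haystack text
for depth in np.linspace(0.0, 1.0, num=11):      # 0%, 10%, ..., 100% depth
    prompt = build_haystack(filler, depth) + "\n\n" + QUESTION
    # answer = call_long_context_model(prompt)   # hypothetical model/API call
    # recall at this depth = "74213" in answer
```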

Text haystack eval: Gemini 1.5 Pro compared with GPT-4 Turbo for the text needle-in-a-haystack task
Video haystack eval: Gemini 1.5 Pro compared with GPT-4V for the video needle-in-a-haystack task
Audio haystack eval: Gemini 1.5 Pro compared with a combination of Whisper and GPT-4 Turbo for the audio needle-in-a-haystack task
Retrieval performance of the “multiple needles-in-haystack” task

As can be inferred from the graphs and figures above, Gemini 1.5 Pro achieved higher recall than GPT-4 Turbo at shorter context lengths, with only a very small decrease in recall towards 1M tokens.

c) In-context language learning:

In-context learning over long contexts was evaluated on the Machine Translation from One Book (MTOB) benchmark (Tanzer et al., 2023 [3]). MTOB measures the ability to learn to translate between English and Kalamang, a language with fewer than 200 speakers, no web presence, and very few learning resources (around 250k tokens of material in total).

Quantitative results for Kalamang↔English translation on MTOB. Human evaluation scores are on a scale of 0 to 6, with 6 being an excellent translation; automatic metrics (BLEURT/chrF) are shown in parentheses.

As the human evaluation results above show, Gemini 1.5 outperforms GPT-4 Turbo and Claude 2.1 by a large margin.
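For reference, the chrF score reported in parentheses can be computed with the sacrebleu library; the sentence pair below is an invented example rather than actual MTOB data.

```python
# Computing chrF (one of the automatic metrics shown in parentheses) with sacrebleu.
import sacrebleu

hypotheses = ["The man walked to the village."]           # model translations (made up)
references = [["The man went to the village on foot."]]   # one reference stream (made up)

chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"chrF: {chrf.score:.1f}")
```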

d) Long-Document Question Answering

For this evaluation, questions were created from the book Les Misérables by Victor Hugo, to test the model's ability to answer them correctly when the entire 1,462-page book (i.e., 710K tokens) is provided as input.

To evaluate the model's responses, a set of 100 questions was created and a human evaluation was performed following the Attributable to Identified Sources (AIS) protocol (Rashkin et al., 2021 [4]).

Evaluating long-document QA under three settings: 0-shot with no context provided, retrieval with passages totalling up to 4k SentencePiece tokens, and the entire book provided as context.

AutoAIS is an LLM-based metric used to assess the factual accuracy of generated responses by checking their alignment with the source material. On both the automated and human evaluations, Gemini 1.5 outperformed Claude 2.1 at retrieving relevant answers from the long context provided as input.
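Roughly, the three settings compared above can be sketched as follows. This is a hedged illustration, not the report's harness; `count_tokens` and the pre-ranked `passages` list stand in for whatever tokenizer and retriever were actually used.

```python
# Sketch of the three long-document QA context settings: 0-shot, retrieved
# passages up to a ~4k-token budget, and the entire book as context.
def build_prompt(question: str, setting: str, book_text: str,
                 passages=None, count_tokens=None, token_budget: int = 4096) -> str:
    if setting == "0-shot":
        context = ""                                   # no supporting text at all
    elif setting == "retrieved":
        chosen, used = [], 0
        for p in passages:                             # passages pre-ranked by relevance
            cost = count_tokens(p)                     # hypothetical tokenizer helper
            if used + cost > token_budget:
                break
            chosen.append(p)
            used += cost
        context = "\n\n".join(chosen)
    elif setting == "full-book":
        context = book_text                            # the entire ~710K-token book
    else:
        raise ValueError(f"unknown setting: {setting}")
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```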

e) Long-context Audio

Word error rate (WER) for various models on 15-minute videos.

Without the added complexity of extra input segmentation and pre-processing, Gemini 1.5 Pro can transcribe 15-minute videos more accurately than other models, achieving a WER of 5.6%.
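As a quick reminder of what the metric measures, WER can be computed with the jiwer library; the transcripts below are toy placeholders, not model output.

```python
# Word error rate = (substitutions + deletions + insertions) / reference word count.
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"   # 2 substitutions

print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")        # ~22.2% for this toy pair
```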

f) Long-context Video QA

Comparison between GPT-4V and Gemini 1.5 Pro on 1H-VideoQA, sampling video frames at one per second and then linearly subsampling 16 or 150 frames

Gemini 1.5 Pro achieves state-of-the-art accuracy on this task, outperforming GPT-4V.
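The frame-sampling scheme in the caption, decoding at 1 frame per second and then linearly subsampling a fixed number of frames, could look roughly like this. OpenCV is assumed for decoding, and the video path and frame counts are placeholders.

```python
# Sketch: decode a video at ~1 frame per second, then linearly subsample
# a fixed number of frames (e.g. 16 or 150) to feed a vision-language model.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 16):
    cap = cv2.VideoCapture(path)
    fps = max(int(round(cap.get(cv2.CAP_PROP_FPS))), 1)
    one_fps_frames, i = [], 0
    ok, frame = cap.read()
    while ok:
        if i % fps == 0:                 # keep roughly one frame per second
            one_fps_frames.append(frame)
        ok, frame = cap.read()
        i += 1
    cap.release()
    # Linearly subsample num_frames indices across the 1-FPS sequence.
    idx = np.linspace(0, len(one_fps_frames) - 1, num=num_frames).round().astype(int)
    return [one_fps_frames[j] for j in idx]

frames_16 = sample_frames("hour_long_video.mp4", num_frames=16)  # placeholder path
```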

Conclusion

Based on the evaluations across the various modalities described above, it can be concluded that Gemini 1.5 performs significantly better over long contexts than other comparable long-context models across a range of tasks.

However, given the limitations of existing benchmarks in both length and complexity, and with human labeling being costly and time-intensive, there is a limit to how closely these evaluations reflect how good Gemini 1.5 really is at long-context tasks. This creates a need for more sophisticated evaluation benchmarks, datasets and methods to establish how good the 10M context length of Gemini 1.5 actually is, and how it translates into better value and performance for LLM applications across various use cases.

References:

[1] Gemini 1.5 Pro technical report: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
[2] G. Kamradt, LLMTest_NeedleInAHaystack, https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/README.md
[3] Tanzer et al., A Benchmark for Learning to Translate a New Language from One Grammar Book, arXiv:2309.16575
[4] Rashkin et al., Measuring attribution in natural language generation models, CoRR, abs/2112.12870
