Why Gemini 1.5 (and other large context models) are bullish for RAG

Chia Jeng Yang
Published in WhyHow.AI
6 min read · Feb 18, 2024

The introduction of Gemini 1.5, boasting a 1 million token context window, has sparked discussions in the AI community, with some predicting a negative impact on Retrieval-Augmented Generation (RAG). I argue that Gemini 1.5 represents a significant positive development for RAG, as it highlights RAG's core strength: non-black-box ways to optimize for cost, accuracy and latency that relying on Gemini 1.5 alone cannot match. I also point out clear limitations in Gemini's performance as stated in Google's own technical paper.

The first obvious drawbacks Gemini 1.5 suffers from are cost and latency, both of which are, in practice, obstacles to building enterprise-grade retrieval systems.

Cost

Latency

Most arguments against this view would fairly point out that the long-term end state of LLMs is to drive cost and latency down to a de minimis level, which I accept. However, for enterprises adopting LLMs for information retrieval today, a brute-force full-context-window approach with Gemini 1.5 is unlikely to be what brings applications to production.
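As a rough illustration of the cost point, here is a back-of-envelope sketch. The per-token price and context sizes below are hypothetical placeholders, not Gemini's actual pricing; the point is only the ratio between a brute-force full-context call and a curated, RAG-style call.

```python
# Back-of-envelope comparison of per-query input cost: brute-force full
# context vs. a curated (RAG-style) context. All numbers are hypothetical.

PRICE_PER_MILLION_INPUT_TOKENS = 7.00  # hypothetical $ per 1M input tokens

def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million

full_context_tokens = 1_000_000   # shove the whole corpus into the window
curated_context_tokens = 4_000    # retrieve only the most relevant chunks

print(f"Full context:    ${input_cost(full_context_tokens):.2f} per query")
print(f"Curated context: ${input_cost(curated_context_tokens):.4f} per query")
print(f"Ratio: {full_context_tokens / curated_context_tokens:.0f}x")
```

Whatever the actual per-token price ends up being, the ratio between the two approaches is what compounds across every query an enterprise system serves.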

Accuracy

Despite the headline news that Gemini 1.5 can function well with the 1M-token context window, Google's own paper shows quantitative issues once that many tokens are in context. The image below is the multiple 'Needle in a Haystack' test, which asks the model to retrieve a few given sentences from a long stretch of text. The recall rate is the rate at which the model was able to spot the given sentences within the provided information.

Chris Bartholomew of Vectorize.IO says it well: “From the chart you can see that although Gemini 1.5 is better than GPT-4 Turbo (within the overlapping context window range) and Gemini can maintain its recall capabilities all the way to 1M tokens, the average recall hovers around 60%. The context window may be filled with many relevant facts, but 40% or more of them are “lost” to the model. If you want to make sure the model is actually using the context you are sending it, you are best off curating it first and only sending the most relevant context. In other words, doing traditional RAG.”

A 40% failure rate in spotting sentences that are already in the context is not reliable enough for systems in production, particularly when you need accurate answers for humans to trust and adopt the system. The stark difference between Gemini 1.5's success on the single-needle test (spot a single sentence) and the multiple-needle test (spot multiple sentences) implies that real-world questions, which tend to be more complicated than 'find this sentence for me', will struggle with large-context-window recall, even with Gemini.
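For concreteness, here is a minimal sketch of how a multi-needle recall score like the one in the chart might be computed. The filler text, needle sentences, and the model-call function are placeholders, not Google's actual evaluation harness.

```python
# Minimal sketch of a multi-needle recall metric: scatter several known
# "needle" sentences into a long haystack, ask the model to repeat them
# back, and score the fraction it recovers. `call_model`, `filler_sentences`
# and `needles` are hypothetical placeholders for your own setup.

import random

def build_haystack(filler_sentences: list[str], needles: list[str]) -> str:
    """Scatter the needle sentences at random positions in the filler text."""
    sentences = filler_sentences[:]
    for needle in needles:
        sentences.insert(random.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

def multi_needle_recall(model_output: str, needles: list[str]) -> float:
    """Fraction of needles the model reproduced (exact substring match)."""
    found = sum(1 for needle in needles if needle in model_output)
    return found / len(needles)

# Hypothetical usage:
# haystack = build_haystack(filler_sentences, needles)
# output = call_model(haystack + "\n\nList every inserted sentence above.")
# print(multi_needle_recall(output, needles))  # e.g. 0.6 == 60% recall
```

A roughly 60% score on this kind of metric is what the quote above is describing: the facts are in the window, but a large fraction of them never make it into the answer.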

Giving an LLM more irrelevant information never helps, and we are a long way away from LLMs being able to guarantee that they can process all the information thrown at them. It is also not clear that these benchmarks generalize across the range of real-world use cases.

Gemini’s performance

I am optimistic that Google and Gemini will end up among the top contenders in the foundation model space over time. However, the Gemini 1.5 technical paper, strangely, only compares itself to GPT-4 on the Needle in a Haystack test.

For other benchmarks such as text, vision and audio, it is interesting to note that the paper does not compare Gemini 1.5 to GPT-4 but only to older versions of Gemini. This gives rise to the suspicion that the GPT-4 comparisons were either not performed or not shared because they were unflattering.

Why this is bullish for RAG

What is dead may never die, and Gemini can't kill RAG when complex RAG still requires better systems to get to production: Long context windows, like those in Gemini 1.5, widen the precision margin for RAG systems, allowing retrieval to be less precise while processing remains meaningful. This makes it easier to get RAG into production, addresses the current lack of infrastructure, and speeds up development, ultimately enhancing RAG applications. This is great because production RAG still doesn't really exist. We need more infrastructure like Gemini to support production RAG, not less.

Optimizing at the end-state: Let us imagine a world where a model with truly reliable large context windows has emerged, and every company has adopted it. How would these models compete? They would want to reduce costs and decrease latency. The most common refrain we have heard from the developers we work with has been "We want GPT-4 performance at the cost of GPT-3.5." Model cost is a real obstacle to production that developers are actively working to optimize.

How would they reduce costs and decrease latency? By minimizing the amount of information sent into the context window. This is essentially RAG. RAG is fundamentally an information-optimization process that reduces the amount of irrelevant information sent to the model. Even the most powerful models today are more accurate when they are fed more relevant information. RAG is likely to be a lasting mechanic, especially for enterprise-ready systems, and especially for complex RAG systems.
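In code, that optimization step is just the familiar retrieve-then-generate pattern. The sketch below assumes a generic embedding function and an in-memory list of chunks; the `embed` and `generate` callables are placeholders for whatever embedding model and LLM you actually use, and any vector store would play the same role as the similarity search here.

```python
# Sketch of RAG as an information-optimization step: instead of sending the
# whole corpus, embed the query, keep only the top-k most relevant chunks,
# and send those as context. `embed` and `generate` are placeholders.

import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                 chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    # Cosine similarity between the query and every chunk embedding.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]

def answer(query: str, chunks: list[str], chunk_vecs: np.ndarray,
           embed, generate, k: int = 5) -> str:
    """Curate the context first, then call the model with only what matters."""
    context = "\n\n".join(top_k_chunks(embed(query), chunk_vecs, chunks, k))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")
```

The model only ever sees the few thousand tokens that survive the similarity cut, which is exactly the cost, latency and accuracy lever the brute-force approach gives up.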

Empowering troubleshooting, not black boxes: Replacing a workflow system with a monolithic model reduces the number of ways to tweak the system if the LLM output is not exactly what you want.

A common refrain among developers building LLM-enabled software has been to reduce the number of touchpoints the LLM has within a workflow. Instead, the preference has been to break an LLM workflow down into discrete tasks that can be individually audited and troubleshot. This enables developers to more easily understand how and why a particular reasoning or LLM workflow has failed at a particular step of the process, and to experiment with fixes only at those discrete points.
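A minimal sketch of what that decomposition might look like in practice: each stage is a plain function whose inputs and outputs are logged, so a bad answer can be traced to a bad retrieval, a bad rerank, or a bad generation rather than to one opaque call. The `retrieve`, `rerank` and `generate` functions are placeholders for your own components, not a specific library's API.

```python
# Sketch of an LLM workflow broken into discrete, auditable steps. Each
# step logs its output so failures can be localized and fixed at that step,
# instead of re-prompting one monolithic call.

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag_pipeline")

def run_pipeline(query: str, retrieve, rerank, generate) -> str:
    candidates = retrieve(query)          # step 1: recall candidate chunks
    log.info("retrieved %d candidates", len(candidates))

    context = rerank(query, candidates)   # step 2: keep only the best few
    log.info("reranked down to %d chunks", len(context))

    answer = generate(query, context)     # step 3: generate from curated context
    log.info("generated %d-char answer", len(answer))
    return answer

# If the answer is wrong, inspect the logged retrieval and rerank outputs
# first; prompt changes are only needed when the context itself was right.
```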

With a monolithic model where all information is thrown into the context window and an output is given, there is no real way to troubleshoot the system except through the 'black box' of prompting. This is functionally equivalent to throwing things at a wall repeatedly to see what sticks, and it is not a reliable way to build enterprise-grade products.

A non-monolithic system, like RAG, gives developers levers to understand errors as they emerge and to tweak the information provided to the LLM in response.


WhyHow.AI is building tools to help developers bring more determinism and control to their RAG pipelines using graph structures. If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG, we’d love to chat at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism and knowledge graphs in RAG on our newly-created Discord.
