OpenAI’s GPT-4o vs. Gemini 1.5 ⭐ Context Memory Evaluation

Needle in the Haystack Evaluation — OpenAI vs. Google

Lars Wiik
7 min read · May 19, 2024
Google vs. OpenAI — “Needle in the Haystack”

A Large Language Model’s (LLM) ability to find and understand detailed information within large context windows has become essential.

The Needle in the Haystack test is a crucial benchmark for assessing large language models on such tasks.

In this article, I will present my independent analysis measuring context-based understanding of the top-tier LLMs from OpenAI and Google.

Which LLM should you use for long-context tasks?

What is a “Needle in the Haystack” Test? 🕵️‍♂️

A “Needle in the Haystack” test for large language models (LLMs) involves placing a specific piece of information (the “needle”) within an extensive chunk of unrelated text (the “haystack”).

The LLM is then tasked with answering a query that requires extracting the needle.

Such a test is used to evaluate an LLM’s proficiency in context comprehension and information retrieval from long contexts.

Successfully replying to the query showcases a detailed understanding of the context, which is crucial for developing applications around context-based LLMs.
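To make this concrete, here is a toy illustration in Python. The haystack text, the needle, and the query below are all made up for this example and are not part of the actual evaluation:

# Toy illustration of a needle-in-the-haystack setup (all strings are made up).
haystack = (
    "The village market opened at nine and closed at noon. "
    "Farmers traded grain, wool, and stories about the weather. "
    "The secret passcode for the storage room is 7482. "  # <-- the "needle"
    "By evening, the square was empty and the lanterns were lit."
)

query = "What is the secret passcode for the storage room according to the context?"

# An LLM given `haystack` as context should answer the query with "7482".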

Integrating custom knowledge into LLMs is becoming increasingly popular through so-called Retrieval-Augmented Generation (RAG) systems.

If you want to read more about RAG systems, you can check out one of my previous articles.

RAG article: https://medium.com/@lars.chr.wiik/a-straightforward-guide-to-retrieval-augmented-generation-rag-0031bccece7f

Pushing the trend of long context windows even further, Google recently announced that the Gemini model can accept 1 million tokens in a single query!

Image by ChatGPT showcasing an LLM finding the needle in a haystack

Dataset 🔢

I developed a script designed to create “needle-in-the-haystack” datasets. This script enables me to input two key elements:

  1. Context (Haystack): This is the text in which the unique information is inserted.
  2. Unique Information (Needle): This is the specific piece of information hidden within the large context that needs to be identified.

The dataset generation process works as follows:

  • Starting Point Selection: The script begins by randomly choosing a starting point within the large text. This starting point falls somewhere between the 10th and 40th percentile of the entire text.
  • Needle Placement: The unique information (needle) is then inserted within the haystack. Its placement within the haystack is also randomized but is constrained to fall between the 20th and 80th percentile of the haystack’s length.

LLMs are generally known to recall information most accurately at the START and END of the prompt.

Paper: See the paper from Stanford: “Lost in the Middle: How Language Models Use Long Contexts”.

This algorithm strategically places the needle within a specific percentile range of the context. This is to ensure that the evaluation captures the model’s capability to recognize and extract data from within the full scope of the text, and not just from the more easily remembered edges of the prompt.

Here is a code snippet of the dataset generation algorithm:

import random

def create_one_needle(num_chars: int, needle_line: str, lines: list[str]) -> list[str]:
    # The start position is a random place between the 10th and 40th percentile of the text
    rnd_place = random.randint(10, 40) / 100
    start_position = int(len(lines) * rnd_place)

    # The needle is placed between the 20th and 80th percentile of the haystack
    needle_rnd_place = random.randint(20, 80) / 100

    lines_selected = []
    placed = False
    chars_used = 0
    for line in lines[start_position:]:
        lines_selected.append(line)
        chars_used += len(line)

        # Place the needle once enough characters have been accumulated
        if not placed and chars_used > num_chars * needle_rnd_place:
            lines_selected.append(needle_line)
            placed = True

        # Stop once the haystack has reached the desired character length
        if chars_used > num_chars:
            break

    return lines_selected

Evaluation Method 🧠

For the haystack, I used a book I loved as a child — Harry Potter.

And for the needle, I chose a fictive phone number belonging to Lars Wiik.

I created 100 haystacks for each context length — including character lengths of 1000, 2000, 4000, 8000, 12000, and 16000.
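As a rough sketch, the dataset generation could look something like the following, reusing the create_one_needle function from above. The file path and the exact needle sentence are placeholders made up for illustration:

# Sketch of the dataset generation (file path and needle text are placeholders).
CONTEXT_LENGTHS = [1000, 2000, 4000, 8000, 12000, 16000]  # character lengths
NEEDLE_LINE = "The fictive phone number of Lars Wiik is 123-456-789.\n"
HAYSTACKS_PER_LENGTH = 100

# Load the book used as the haystack source and split it into lines.
with open("harry_potter.txt", encoding="utf-8") as f:  # placeholder path
    book_lines = f.readlines()

datasets: dict[int, list[str]] = {}
for num_chars in CONTEXT_LENGTHS:
    haystacks = []
    for _ in range(HAYSTACKS_PER_LENGTH):
        lines_selected = create_one_needle(num_chars, NEEDLE_LINE, book_lines)
        haystacks.append("".join(lines_selected))
    datasets[num_chars] = haystacks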

Here is an example of one of the haystacks with 1000 characters.

Example of a haystack with 1000 characters with a needle (yellow) placed at the 80th percentile

The different LLMs were then tasked with returning the fictive phone number belonging to Lars Wiik. Each reply was labeled according to whether or not it included the fictive phone number.

The prompt I used looks as follows:

def create_needle_prompt(needle_text: str) -> str:
    prompt = f'''
##### INSTRUCTION #####
What is the fictive phone number to Lars Wiik according to the context?
Only provide me what I want, nothing else.
You can only respond with at max 20 words.


##### CONTEXT #####
{needle_text}
'''
    return prompt
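
The scoring itself then reduces to a substring check over the model’s reply. Below is a minimal sketch of the evaluation loop, assuming a hypothetical call_llm helper that wraps whichever client (OpenAI or Vertex AI) is used for a given model:

# Minimal sketch of the evaluation loop; call_llm is a hypothetical helper
# that sends a prompt to the named model and returns its text response.
def evaluate_model(model_name: str, haystacks: list[str], phone_number: str) -> float:
    correct = 0
    for haystack in haystacks:
        prompt = create_needle_prompt(haystack)
        reply = call_llm(model_name, prompt)
        # A reply is labeled correct if it contains the fictive phone number.
        if phone_number in reply:
            correct += 1
    return correct / len(haystacks)  # accuracy for this context length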

Performance Results 📊

The following models were included in the evaluation:

  • gpt-4o-2024-05-13
  • gpt-4-turbo-2024-04-09
  • gpt-4-0613
  • gpt-3.5-turbo-0125
  • gemini-1.5-pro-preview-0514
  • gemini-1.5-flash-preview-0514
  • gemini-1.0-pro-002

The evaluation consists of running each model through 100 different haystacks at each of the context lengths 1k, 2k, 4k, 8k, 12k, and 16k.

Below is a line plot of the resulting accuracy graph:

Graph showcasing LLM performance in the “Needle in the Haystack” task

Note: You cannot see gpt-4o and gpt-4-0613 because they are hidden behind gpt-4-turbo-2024-04-09 with 100% accuracy!

The longer the context window, the harder it is to extract a specific piece of information, because there is more noise. Performance is therefore expected to decrease with larger context windows.

As the graph shows, there is a clear distinction between OpenAI’s models and Google’s models in terms of performance.

Google’s models performed below my expectations, especially after their recent event (Google I/O 2024), where they spoke warmly about Gemini’s memory and context understanding. All of Google’s models seem to plateau around 50% accuracy after 8k context length.

OpenAI’s models, on the other hand, perform noticeably well in this test, with gpt-4o, gpt-4-turbo-2024-04-09, and gpt-4-0613 as the top-performing models.

It should also be noted that gpt-3.5-turbo-0125 performs better than all Gemini models!

To validate that there was no trivial error in the evaluation, I stored all replies so I could go back and inspect what the LLMs actually responded with.

Here are some of the responses from Gemini 1.5:

The provided context does not contain a phone number for Lars Wiik.

There is no mention of Lars Wiik or his phone number.

The provided text does not contain Lars Wiik's phone number.

The provided text does not mention Lars Wiik or his phone number.

There is no mention of Lars Wiik or his phone number.

The text does not provide Lars Wiik's phone number.

The text provided does not contain a fictive phone number for Lars Wiik.

I'm sorry, but the fictive phone number to Lars Wiik is not mentioned in the context you provided.

The Gemini model struggles to find the fictive phone number within the story of Harry Potter.

I have uploaded 10 random prompts from the Gemini 1.5 runs with a 4k context window so anyone can reproduce the results. Copy the full prompt into whatever tool you use to run Gemini 1.5: Link to reproduce.

Image of reproducing the Gemini 1.5 results in Vertex AI

Here are some of the responses from OpenAI’s gpt-3.5-turbo-0125:

N/A

N/A

There is no fictive phone number to Lars Wiik in the provided context.

N/A

Platform nine and three-quarters.

No phone number provided for Lars Wiik.

Funnily enough, the LLM once replied with “Platform nine and three-quarters” 😄

Disclaimer: It should be said that a dataset with 100 haystacks per context length is fairly small, and you should run your own tests for your specific use case to get a better estimate of which model performs best. Performance may also vary based on the use case.

Conclusion 💡

In conclusion, the “Needle in the Haystack” evaluation can be used to measure large language models' comprehension and information retrieval abilities when using long contexts.

In this analysis, we observed a performance disparity between OpenAI’s models and Google’s Gemini series — where OpenAI’s gpt-4, gpt-4o, and gpt-4-turbo scored the highest.

Despite Google’s recent enhancements with Gemini’s ability to handle up to 1 million tokens, it appears that OpenAI models have shown a more consistent ability to accurately retrieve specific information from large texts.

Note that for users and developers, the choice of model would likely depend on the specific needs of their application.

Thanks for reading!

Follow to receive similar content in the future!

And do not hesitate to reach out if you have any questions!

Through my articles, I share cutting-edge insights into LLMs and AI, offer practical tips and tricks, and provide in-depth analyses based on my real-world experience. Additionally, I do custom LLM performance analyses, a topic I find extremely fascinating and important in this day and age.

My content is for anyone interested in AI and LLMs — Whether you’re a professional or an enthusiast!

Follow me if this sounds interesting!
