Google Gemini 1.5 Test … Success or Failure?

IvL · May 18, 2024

1M tokens = 32 x 32k (Gemini 1.5 => 32 x Gemini 1.0)

Early 1990s…

Manager: We need to set up a new SQL DB!

Software Engineer: which do you prefer — Red or Yellow?

Manager: I heard Yellow was the best SQL DB performer at the last seminar!

1. Gemini broke the limit!

Many people have said that the key issue with AI is the context window limit. Once that is resolved, AI will handle everything.

Finally, Gemini breaks the limit of 32k tokens. Hooray! AI rules the world! Does it?

OK, let’s test it.

In the last article, “Drop of RAG with LLM: will it make Search better?”, we tested startups such as AI Search, Perplexity, Algolia, and AskCSV, and most of them showed rather poor results.

Before testing a system as complex as Gemini 1.5, an LLM-based AI, we need to understand how it might have been built.

Let’s put a straw man next to Gemini 1.5’s primary architecture picture; it will help us. As you can see, there are many data threads that are processed and bound together into one. So, what could it be? Maybe they just took:

[Gemini 1.0 (limit 32k)] * [32 shards] == [Gemini 1.5 (limit 1M)] ?

So, the theoretical architecture could be the following:

Everybody is happy! So, the solution is simple: shard the input into 32k-token chunks, run Gemini 1.0 on each, then push the 32 responses into one more Gemini 1.0 call to merge them, and here is your answer.
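
To make this straw man concrete, here is a minimal sketch of the hypothesized map-reduce design. Everything in it is my own guesswork: the function names are placeholders, the 4-characters-per-token rule is an approximation, and nothing here is confirmed about Gemini’s real internals.

```python
def call_gemini_1_0(prompt: str) -> str:
    """Placeholder for a single 32k-context model call (hypothetical)."""
    raise NotImplementedError

def shard(text: str, shard_tokens: int = 32_000, n_shards: int = 32) -> list[str]:
    """Naively split the input into up to 32 chunks of ~32k tokens each,
    approximating 1 token ~ 4 characters."""
    chunk_chars = shard_tokens * 4
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    return chunks[:n_shards]

def answer_1m(context: str, query: str) -> str:
    # Map: ask the same question against every 32k shard independently.
    partials = [call_gemini_1_0(f"{chunk}\n\nQuestion: {query}")
                for chunk in shard(context)]
    # Reduce: merge the partial answers with one more 32k call.
    merged = "Combine these partial answers:\n" + "\n---\n".join(partials)
    return call_gemini_1_0(f"{merged}\n\nQuestion: {query}")
```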

BUT, if that is the architecture, there could be issues: losing the connections between shards, misreading separate parts out of context, or simply too much separation between parts, with each part filtered incorrectly before the merge.

I recently checked other articles about Gemini 1.5 testing. Authors continue to test modern AI with fifth-grader puzzles or “How do I prepare an apple pie?”. They will not catch the real issues through such basic testing.

To make a correct test, you first need to construct the theoretical background of the experiment and build a hypothesis: where will it break? Only after that do you try it in practice, to prove you were right.

In this article, I will skip the boring part of building the hypothesis and show how Gemini 1.5 can be partially broken by the small mistakes already identified in Gemini 1.0. That does not guarantee we have identified its architecture, but it shows we are very close.

In the modern world of AI IT, it is extremely important for investors to have deep and very complex AI testing, because otherwise they will simply select the “Yellow one.” Their judgment is based on news that rewrote other news, which in turn traces back to the press release published by the product company itself.

If you want to skip the details, the final results table is in Section 6. Just be aware: do not forget about the price.

2. The Experiment: test data details

Initially, I planned to do complex testing, but I decided to start simple… When Gemini began failing on a simple test case, I chose not to overengineer. Thanks to Kaggle (https://www.kaggle.com/datasets), we have sample CSV tables and JSON files from our favorite companies: Disney and Coursera.

Disney movies (CSV, 0.03MB): just 30k tokens. It’s straightforward: just name, date, and total gross.

  1. Simple Query to test: “main character is an animal”
  2. Complex Query to test: “Top 3 movies where main character is an animal before 01/1990”

Coursera (CSV, 2MB): 200,000 tokens, complex structure with title, rating, description, and teachers.

  1. Simple Query to test: “payment improvement in healthcare”
  2. Complex Query to test: “java with rating above 4.5 and review num > 5k”

Why this data? Because if you convert a CSV file to plain text, you get context with high density and low connectivity between chunks. In simple words, that breaks the 32x sharding approach because, out of the 32 shard responses, you typically need just one.
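
Here is a toy illustration of that point, still assuming the 32-shard straw man above (the CSV rows are made up for demonstration): the rows that actually answer a query land in a single shard, while the other ~31 shards see no signal at all, yet each of them would still produce an “answer” in the map step that the final merge has to suppress.

```python
import random

# Toy CSV in the spirit of the Disney dataset (made-up rows).
rows = [f"Movie {i},{1930 + i % 90},{random.randint(1, 900)}M" for i in range(2000)]
text = "title,year,gross\n" + "\n".join(rows)

def shards_containing(text: str, needle: str, n_shards: int = 32) -> list[int]:
    """Indexes of the equal-size shards that contain the search term at all."""
    size = len(text) // n_shards + 1
    return [i for i in range(n_shards) if needle in text[i * size:(i + 1) * size]]

# A specific row usually lives in exactly one of the 32 shards
# (a shard boundary may even cut it in half).
print(shards_containing(text, "Movie 123,"))  # e.g. [1]
```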

As a baseline, we will use the winner from the last testing experiment, “Drop of RAG with LLM: will it make Search better?”.

We have already shown that Perplexity and Algolia perform poorly in complex test cases.

So we use AI Search to find the correct answers and compare them with the answers from Gemini 1.0 and 1.5.
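
For anyone who wants to reproduce the Gemini side of the tests below, a minimal sketch using Google’s google-generativeai Python SDK looks roughly like this. The file name is hypothetical, and the preview model names from this article may need to be swapped for whatever models your account exposes:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key from Google AI Studio

# Hypothetical local copy of the Kaggle Disney CSV.
with open("disney_movies.csv", encoding="utf-8") as f:
    table = f.read()

# Model name as referenced in this article; adjust if your endpoint differs.
model = genai.GenerativeModel("gemini-1.5-pro-preview-0514")
response = model.generate_content(
    table + "\n\nTop 3 movies where main character is an animal before 01/1990"
)
print(response.text)
```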

3. Test 1: Disney movies collection

3.1. AI Search (https://www.table-search.com/):

Query: main character is an animal

Response: correcto, 1.0 point for that

Time: 0.9 sec

Query: Top 3 movies where main character is an animal before 01/1990:

Response: Perfect match, though AI Search usually does not limit the response: it gave us 4 results while we asked for the top 3. So: 0.8 points.

Time: 1.1 sec

3.2. Gemini 1.0-pro-002:

Query: Top 3 movies where main character is an animal:

Response: LGTM? Hmm, double-checking… What is Toy Story 2 doing here? This minor issue in Gemini 1.0 will blow up into a failure once multiplied 32x while building a Gemini 1.5 answer. 0.7 points.

Time: 10 sec

Query: Top 3 movies where main character is an animal before 01/1990:

Response: Pinocchio in the top spot, which is definitely not a good answer. The other 2 are OK, but in Gemini 1.5 such failures combine into critical ones. So 0.5 points for the task.

Time: 8 sec

3.3. Gemini 1.5-flash-preview-0514:

Query: Top 3 movies where main character is an animal:

Response: Pinocchio is a very popular animal in the Gemini world. 0.5 points for that.

Time: 5 sec

Query: Top 3 movies where main character is an animal before 01/1990:

Response: Pinocchio again; have you been talking to Gemini 1.0? So 0.5 points.

Time: 9 sec

3.4. Gemini 1.5-pro-preview-0514:

Query: Top 3 movies where main character is an animal:

Response: What was that? Disney, can you please make a movie about an animal? Gemini 1.5 Pro is waiting… 0.0 for that.

Time: 9 sec

Query: Top 3 movies where main character is an animal before 01/1990:

Response: Song of the South: that is even worse than Pinocchio, though Pinocchio is also on the list. Sorry, 0.1 points for that!

Time: 12 sec

4. Test 2: Coursera

4.1. AI Search (https://www.table-search.com/):

Query: Course from the list below about healthcare data processing:

Response: Correct: “Health Information Literacy for Data Analytics” is the key course. 1.0 points.

Time: ~2 sec

Query: java with rating above 4.5 and review num > 5k

Response: All correct. 1.0 point.

Time: 1.3 sec

4.2. Gemini 1.0-pro-002:

Query: Course from the list below about healthcare data processing:

Response: Gemini 1.0 cannot take more than 32k tokens, so… end of story. This is exactly where the push for the 32x approach comes from.

Time: N/A

4.3. Gemini 1.5-flash-preview-0514:

Query: Course from the list below about healthcare data processing:

Response: Close, but not right. Diagnosis is result generation, not data processing as requested. So 0.4 points, because it did not find the key course: “Health Information Literacy for Data Analytics”.

Time: 12 sec

Query: java with rating above 4.5 and review num > 5k

Response: I want my money back. 0.0 for this test.

Time: 20 sec

4.4. Gemini 1.5-pro-preview-0514:

Query: Course from the list below about healthcare data processing:

Response: No luck, no course, no healthcare. Gemini, no comment: 0.0 for that. The hypothesis works as expected.

Time: 20 sec

Query: java with rating above 4.5 and review num > 5k

Response: Unfortunately, Gemini, you are still too far from understanding 212,311 tokens. 0.0 points.

Time: 12 sec

5. Price?

Nobody usually talks about the price of AI, but let’s take a glance at the extremely high price of Gemini 1.5:

Pay attention: the price is per character, not per token. So in the Coursera test, we had 200,000+ tokens = 1,046,216 characters, which cost me 1,046,216 x $0.00125 / 1,000 ≈ $1.31.
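
The same arithmetic as a two-line sanity check (the character count and the per-1,000-character price are the figures quoted above):

```python
chars = 1_046_216             # Coursera CSV rendered as plain text
price_per_1k_chars = 0.00125  # USD per 1,000 input characters, as quoted above
print(f"${chars * price_per_1k_chars / 1_000:.2f}")  # -> $1.31
```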

If you are searching for an apple pie recipe in a 1M-token recipe book, it will work (and cost you $1.3 before you even buy the apples).

$1+ per request! Good luck building a scaled solution on top of that. Just imagine 1M clients visiting your service :)

6. Final results and conclusion

It should not be surprising that OpenAI has not rushed toward a 1M context: there are much more complex problems to solve than just sharding a 32k context 32 times.

Conclusion: for Search in real-world enterprise systems, we need something more precise and affordable than an apple pie from Gemini.

As usual, thanks to AI Search for the best search over the provided data: https://www.table-search.com/
