Gemini claims superiority over ChatGPT: I tried to replicate their findings

Guilherme Baptista
5 min read · Dec 25, 2023

As I experimented with Gemini, I did not feel that superiority and grew skeptical of the claims, which led me to the question: Why?

If your first thought is, “Are they lying?” that’s understandable; however, I encourage you to steer away from that line of thinking. I trust the integrity of those developing Gemini, and questioning it is not what this is about.

Wait, should we believe their claims without questioning?

Absolutely not; that’s the beauty of science: someone makes a breakthrough and publishes it, and others start to verify the findings through replication. We’re not questioning people’s integrity; we are advancing science by pursuing the truth.

I will analyze and try to reproduce the MMLU benchmark, which Gemini’s website uses to demonstrate its superiority.

Two abstract, side-facing head profiles, one blue and one green, with pixelated trails behind them.
Creation of DALL·E 3 for this story.

Analyzing MMLU Benchmark Results

This is the technical report about Gemini’s performance: Gemini: A Family of Highly Capable Multimodal Models.

According to the report, Gemini outperforms GPT in most tasks, as shown in the table on page 7:

In the MMLU benchmark:

  • Gemini Ultra leads with 90.04% accuracy using CoT@32 and achieves 83.7% with 5-shot.
  • GPT-4 is close behind, scoring 87.29% with CoT@32 (via the API) and 86.4% with 5-shot.
  • Gemini Pro follows with 79.13% using CoT@8 and 71.8% with 5-shot.
  • GPT-3.5 reaches 70% with 5-shot.

Let’s dissect the MMLU results. This benchmark evaluates text models’ multitask accuracy across topics such as mathematics, history, computer science, and law; high accuracy requires broad knowledge and strong problem-solving skills. See the paper: Measuring Massive Multitask Language Understanding.

The test details are on GitHub and can be downloaded from the Berkeley website. The questions follow this format:

Which part of the human digestive system is primarily responsible for water absorption?

A) The stomach.
B) The small intestine.
C) The large intestine.
D) The esophagus.
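
As a reference for later steps, here is a minimal sketch of reading one such question in Ruby, assuming the per-subject CSV files in the Berkeley download follow the usual layout (question, four options, answer letter, no header row):

require "csv"

# Assumes the Berkeley archive unpacked into data/, with per-subject files such as
# data/test/anatomy_test.csv laid out as: question, option A, B, C, D, answer letter.
question, a, b, c, d, answer = CSV.read("data/test/anatomy_test.csv").first

prompt = <<~PROMPT
  #{question}

  A) #{a}
  B) #{b}
  C) #{c}
  D) #{d}
PROMPT

puts prompt
puts "Expected answer: #{answer}"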

Let’s explain some terms before we start the test.

Prompt engineering techniques used to improve answer accuracy:

  • CoT: Chain-of-Thought prompts a model to explain its reasoning step by step.
  • 5-shot: Few-Shot provides a model with examples and their expected answers before asking a question.
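
To make the 5-shot idea concrete, here is a minimal sketch; the shot format is illustrative, not necessarily the one used for the reported numbers:

# Illustrative 5-shot prompt: prepend five solved examples before the real question.
def few_shot_prompt(examples, question)
  shots = examples.first(5).map do |example|
    "#{example[:question]}\nAnswer: #{example[:answer]}"
  end

  (shots + ["#{question}\nAnswer:"]).join("\n\n")
end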

Origin of results:

  • reported: They used numbers from other sources instead of conducting the test themselves.
  • via API: Results were self-collected via API.

Let’s also be precise about the asterisks attached to each term; I’ll sketch the CoT@k selection rule in code right after these definitions:

CoT@8 and CoT@32:

The model produces a chain of thought with k = 8 or 32 samples, if there is a consensus above a threshold (chosen based on the validation split), it selects this answer, otherwise it reverts to a greedy sample. Further analysis in Appendix 9.1.

via API:

Results self-collected via the API in Nov, 2023.
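
As promised, here is the CoT@8/CoT@32 selection rule quoted above in rough pseudocode; sample, greedy, and the threshold value are placeholders rather than the report’s actual implementation:

# Rough sketch of the CoT@k selection rule described in the report.
# `sample` and `greedy` are placeholders for calls that return a single
# answer letter from the model (sampled vs. greedy decoding).
def cot_at_k(question, k:, threshold:, sample:, greedy:)
  answers = Array.new(k) { sample.call(question) }          # k sampled chains of thought
  letter, votes = answers.tally.max_by { |_answer, count| count }

  if votes.to_f / k >= threshold
    letter                 # consensus above the threshold: keep the majority answer
  else
    greedy.call(question)  # otherwise revert to a single greedy sample
  end
end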

Designing the Test

Here are some premises:

  • Gemini Ultra should be compared with GPT-4;
  • Gemini Pro should be compared with GPT-3.5;
  • The prompts for both models should be identical;
  • Results should be reproducible and open to challenge;
  • Evaluate both using APIs accessible to end users.

Reproducing the results might be challenging because:

  • I don’t have access to Gemini Ultra, only Gemini Pro;
  • Google reported results for Gemini Pro CoT@8 vs. GPT-3.5 5-shot;
  • I don’t understand why the site compared Gemini Ultra’s CoT@32 with GPT-4’s 5-shot instead of GPT-4’s CoT@32, which seems more reasonable. Also, it’s unclear why only GPT-4 has CoT@32 results, not GPT-3.5.

Let’s do our best given what we have:

Gemini Pro will be compared to GPT-3.5, using the gemini-pro version through Vertex AI’s API and the gpt-3.5-turbo-1106 version through the OpenAI Platform’s API. We will also add gpt-4-1106-preview to compare against its reported numbers.

We will use a Chain-of-Thought approach to assess their performance on the MMLU test, with a prompt split into three turns:

Initial:

Which part of the human digestive system is primarily responsible for water absorption?

A) The stomach.
B) The small intestine.
C) The large intestine.
D) The esophagus.

Begin by applying relevant knowledge from anatomy. Analyze each option, considering the principles, facts, and logic specific to anatomy. Provide a detailed analysis for each option.

Follow-up:

Based on your analysis and reasoning, which option seems most justifiable and the correct answer?

Final:

Give me only the letter of the option with the correct answer.

We will use Nano Bots for Ruby to test the models on 1,760 MMLU dataset questions, scoring each answer 1 if correct and 0 if incorrect or unanswered. GPT-4 Turbo (gpt-4-1106-preview) will score the models’ answers.
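
The actual runs go through Nano Bots for Ruby; purely as a simplified stand-in rather than the project’s evaluation code, the three-turn exchange and the scoring loop against the OpenAI chat completions endpoint could look roughly like this (Gemini Pro calls go through Vertex AI analogously, and grade is a trivial placeholder for the GPT-4 Turbo grader):

require "net/http"
require "json"
require "uri"

OPENAI_URI = URI("https://api.openai.com/v1/chat/completions")

# Sends the conversation so far and returns the assistant's reply.
def chat(messages, model)
  request = Net::HTTP::Post.new(
    OPENAI_URI,
    "Content-Type"  => "application/json",
    "Authorization" => "Bearer #{ENV.fetch('OPENAI_API_KEY')}"
  )
  request.body = { model: model, messages: messages }.to_json

  response = Net::HTTP.start(OPENAI_URI.host, OPENAI_URI.port, use_ssl: true) do |http|
    http.request(request)
  end

  JSON.parse(response.body).dig("choices", 0, "message", "content")
end

# Runs the three-turn Chain-of-Thought exchange for one question and
# returns the model's final single-letter answer.
def answer_letter(initial_prompt, model)
  messages = [{ role: "user", content: initial_prompt }]
  messages << { role: "assistant", content: chat(messages, model) }

  messages << { role: "user", content: "Based on your analysis and reasoning, which option seems most justifiable and the correct answer?" }
  messages << { role: "assistant", content: chat(messages, model) }

  messages << { role: "user", content: "Give me only the letter of the option with the correct answer." }
  chat(messages, model)
end

# Trivial stand-in for the GPT-4 Turbo grader used in the article:
# accept the reply when it starts with the expected letter.
def grade(expected, reply)
  reply.to_s.strip.upcase.start_with?(expected.to_s.strip.upcase)
end

# Scores each answer 1 when correct and 0 otherwise, then returns accuracy (%).
def accuracy(questions, model)
  scores = questions.map do |question|
    reply = answer_letter(question[:prompt], model)
    grade(question[:answer], reply) ? 1 : 0
  end

  100.0 * scores.sum / scores.size
end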

Results

Reported vs. Reproduced Results

The horizontal bar chart displays reported and reproduced results for Gemini Pro, GPT-3.5, and GPT-4.

  • Gemini Pro: The reproduced result, at 63.98%, is lower than the reported result of 79.13%.
  • GPT-3.5: The reproduced result, at 63.75%, is lower than the reported result of 70.00%.
  • GPT-4: At 87.29%, the reported result is the highest of the three, while the reproduced result is close at 85.91%.

Gemini Pro and GPT-3.5 both performed below their reported numbers and were nearly identical, differing by 0.23 percentage points rather than the reported 9.13. GPT-4’s performance nearly matched the reported figure, with a difference of 1.38 percentage points.

You can examine the data, analyze the details, and review the evaluation code.

Intellectual Honesty

Reasons not to take my results at face value and to challenge them:

The model versions in the APIs (gemini-pro, gpt-3.5-turbo-1106, and gpt-4-1106-preview) may differ from those that produced the reported results.

The applied prompt engineering technique, Chain-of-Thought with two follow-up prompts, might differ from the CoT@8 and CoT@32 methods outlined in the report.

I trust GPT-4 to grade the evaluations, as in AlpacaEval, but this approach might have flaws. Human peer review would provide more reliable scoring.

The tested models might be contaminated with data from the MMLU dataset. Data contamination happens when models learn from “leaked” test data, skewing results and leading them to replicate memorized answers rather than reason over new data.

The sample of 1,760 questions may not be enough for drawing conclusions, considering the full MMLU set contains 13,709 questions (a rough sampling-error estimate follows these caveats).

I might have messed up or missed something.
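
On the sample-size caveat: assuming the 1,760 questions behave like a random sample of the full set, the binomial 95% margin of error around a ~64% score is roughly ±2.2 percentage points, enough to blur the 0.23-point difference between Gemini Pro and GPT-3.5, though far from the 15-point gap to Gemini Pro’s reported number:

# Back-of-the-envelope sampling error, assuming the 1,760 questions
# behave like a random sample of the full MMLU set.
n = 1760
accuracy = 0.64                                   # observed accuracy (~64%)
standard_error = Math.sqrt(accuracy * (1 - accuracy) / n)
margin = 1.96 * standard_error                    # 95% confidence half-width

puts format("±%.1f percentage points", margin * 100)   # => ±2.2 percentage points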

Conclusions

My MMLU reproduction matches GPT-4’s reported results but contradicts those reported for GPT-3.5 and Gemini Pro, including the performance gap between them. I’m still waiting for Gemini Ultra access to check its numbers. It would be interesting to see replications of the other benchmarks for the sake of science.

Regardless, why do I feel GPT-3.5 is better when benchmarks show similar performance to Gemini Pro? Well, these benchmarks may not target my specific needs.

I’m designing a new, complementary benchmark to show where I feel Gemini falls short compared to GPT, although I cannot scientifically demonstrate it yet: the LBPE Score.

I’ll share more about it soon.
