GPT-4o vs. GPT-4 vs. Gemini 1.5 ⭐ — Performance Analysis
Measuring English Language Understanding of OpenAI’s New Flagship Model
OpenAI’s recent unveiling of GPT-4o has set the stage for a new era in AI language models and how we interact with them.
The most impressive part was the live interaction with ChatGPT, including support for conversational interruptions.
Despite some hiccups during the live demo, I cannot help but be amazed at what the team has accomplished.
Best of all, right after the demo, OpenAI allowed access to the GPT-4o API.
In this article, I will present my independent analysis measuring the classification abilities of GPT-4o vs. GPT-4 vs. Google’s Gemini and Unicorn models using an English dataset I created.
Which of these models is the strongest at English understanding?
What’s New with GPT-4o?
At the forefront is the concept of an Omni model, designed to comprehend and process text, audio, and video seamlessly.
The focus of OpenAI appears to have shifted toward democratizing GPT-4-level intelligence, making it accessible even to free users.
OpenAI also announced that GPT-4o offers improved quality and speed across more than 50 languages, at a lower price, promising a more inclusive and globally accessible AI experience.
They also mentioned that paid subscribers would get five times the capacity compared with non-paid users.
Furthermore, they will release a desktop version of ChatGPT to facilitate real-time reasoning across audio, vision, and text interfaces for the masses.
How to use the GPT-4o API
The new GPT-4o model follows the existing chat-completion API from OpenAI, making it backward compatible and simple to use.
from openai import OpenAI

OPENAI_API_KEY = "<your-api-key>"

def openai_chat_resolve(response, strip_tokens=None) -> str:
    # Extract the assistant's reply from a chat-completion response,
    # optionally stripping unwanted tokens.
    if strip_tokens is None:
        strip_tokens = []
    if response and response.choices and len(response.choices) > 0:
        content = response.choices[0].message.content
        if content is not None and content.strip() != '':
            content = content.strip()
            for token in strip_tokens:
                content = content.replace(token, '')
            return content
    raise Exception(f'Cannot resolve response: {response}')

def openai_chat_request(prompt: str, model_name: str, temperature=0.0):
    # Send a single-turn chat request and return the raw response object.
    message = {'role': 'user', 'content': prompt}
    client = OpenAI(api_key=OPENAI_API_KEY)
    return client.chat.completions.create(
        model=model_name,
        messages=[message],
        temperature=temperature,
    )

response = openai_chat_request(prompt="Hello!", model_name="gpt-4o-2024-05-13")
answer = openai_chat_resolve(response)
print(answer)
GPT-4o is also available using the ChatGPT interface:
Official Evaluation
OpenAI’s blog post includes evaluation scores of known datasets, such as MMLU and HumanEval.
As we can derive from the graph, GPT-4o’s performance can be classified as state-of-the-art in this space — which sounds very promising considering the new model is cheaper and faster.
However, during the last year, I have seen multiple models that claim to have state-of-the-art language performance across known datasets.
In reality, some of these models have been partially trained (or overfit) on these open datasets, resulting in unrealistically high leaderboard scores. See this paper if you are interested.
Therefore, it is important to do independent analyses of the performance of these models using lesser-known datasets — such as the one that I created 😄
My Evaluation Dataset 🔢
As I have explained in previous articles, I have created a topic dataset that we can use to measure classification performance across different LLMs.
The dataset consists of 200 sentences categorized under 50 topics, some of which are closely related, intentionally making the classification task harder.
I manually created and labeled the entire dataset in English.
I then used GPT-4 (gpt-4-0613) to translate the dataset into multiple languages.
However, this evaluation uses only the English version of the dataset, meaning that the results should not be affected by potential biases originating from using the same language model for dataset creation and topic prediction.
Go and check out the dataset for yourself: topic dataset.
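Conceptually, the dataset is a list of sentence/topic pairs. As a minimal sketch, assuming it were stored as a two-column CSV of sentences and their labeled topics (the actual file layout may differ), loading it could look like this:

```python
import csv

def load_topic_dataset(path: str):
    # Assumed layout: a CSV with 'sentence' and 'topic' columns.
    with open(path, newline='', encoding='utf-8') as f:
        return [(row['sentence'], row['topic']) for row in csv.DictReader(f)]
```

With 200 sentences spread across 50 topics, each topic covers only four sentences on average, so a single confusion between related topics already moves the error rate noticeably.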
Performance Results 📊
I decided to evaluate the following models:
- GPT-4o: gpt-4o-2024-05-13
- GPT-4: gpt-4-0613
- GPT-4-Turbo: gpt-4-turbo-2024-04-09
- Gemini 1.5 Pro: gemini-1.5-pro-preview-0409
- Gemini 1.0: gemini-1.0-pro-002
- Palm 2 Unicorn: text-unicorn@001
The task given to the language models is to match each sentence in the dataset with the correct topic.
This allows us to calculate an accuracy score, and hence an error rate, for each model.
Since the models mostly classify correctly, I am plotting the error rate for each model.
Remember that a lower error rate indicates better model performance.
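The error rate here is simply the fraction of misclassified sentences. A minimal sketch (the function name and example numbers are illustrative, not taken from the actual evaluation code):

```python
def error_rate(predictions, labels):
    # Fraction of sentences assigned the wrong topic.
    mistakes = sum(p != l for p, l in zip(predictions, labels))
    return mistakes / len(labels)

# Example: 2 mistakes out of 200 predictions -> 1% error rate.
preds = ["sports"] * 198 + ["food", "food"]
gold = ["sports"] * 198 + ["music", "travel"]
print(error_rate(preds, gold))  # 0.01
```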
As we can derive from the graph, GPT-4o has the lowest error rate of all the models with only 2 mistakes.
We can also see that Palm 2 Unicorn, GPT-4, and Gemini 1.5 were close to GPT-4o — showcasing their strong performance.
Interestingly, GPT-4 Turbo performs similarly to gpt-4-0613. Check out OpenAI’s model page for additional info on their models.
Lastly, Gemini 1.0 is lagging behind, which should be expected given its price range.
Multilingual
I recently published another article measuring the multilingual capabilities of GPT-4o against other LLMs such as Claude Opus and Gemini 1.5.
Link to article: https://medium.com/@lars.chr.wiik/claude-opus-vs-gpt-4o-vs-gemini-1-5-multilingual-performance-1b092b920a40
Needle in the Haystack
I have written another article evaluating how GPT-4o and Gemini 1.5 remember their context using the “needle in the haystack” framework. Check it out using the link below!
OpenAI’s GPT-4o vs. Gemini 1.5 ⭐ Context Memory Evaluation
https://medium.com/@lars.chr.wiik/openais-gpt-4o-vs-gemini-1-5-context-memory-evaluation-1f2da3e15526
Conclusion 💡
This analysis using a uniquely crafted English dataset reveals insights into the state-of-the-art capabilities of these advanced language models.
GPT-4o, OpenAI’s latest offering, stands out with the lowest error rate among the tested models, which affirms OpenAI’s claims regarding its performance.
The AI community and users alike must continue performing independent evaluations using diverse datasets, as these help in providing a clearer picture of a model’s practical effectiveness, beyond what is suggested by standardized benchmarks alone.
Note that the dataset is fairly small, so results may vary with different datasets. This evaluation used the English dataset only; a multilingual comparison will have to wait for another time.
Thanks for reading!
Follow to receive similar content in the future!
And do not hesitate to reach out if you have any questions!
Through my articles, I share cutting-edge insights into LLMs and AI, offer practical tips and tricks, and provide in-depth analyses based on my real-world experience. Additionally, I do custom LLM performance analyses, a topic I find extremely fascinating and important in this day and age.
My content is for anyone interested in AI and LLMs — Whether you’re a professional or an enthusiast!
Follow me if this sounds interesting!
Connect with me:
- 🌐 LinkedIn: https://www.linkedin.com/in/lars-christian-wiik
- 𝕏: https://x.com/LarsChrWiik (I just created an account and will start posting here as well)
- Medium: https://medium.com/@lars.chr.wiik/about