Claude 3 Opus vs. GPT-4o vs. Gemini 1.5 ⭐ — Multilingual Performance

Performance Analysis of Leading LLMs

Lars Wiik
7 min read · May 24, 2024
Radar Chart of OpenAI’s top performing models — (Red GPT-4o, blue GPT-4-turbo, green GPT-4)

In this article, I analyze the multilingual performance of OpenAI’s GPT-4o against Anthropic’s Claude 3 Opus and Google’s Gemini 1.5.

I will present how each LLM performs in various languages such as Spanish, German, French, Portuguese, and Russian, as well as more niche languages.

Note: If you are interested in these types of evaluations, consider following me to receive similar analysis in the future!

Model Overview and Pricing 💰

GPT-4o

GPT-4o (“o” for “omni”) is the newest model released by OpenAI. The name reflects its ability to handle various content forms — text, audio, and video.

It excels first and foremost in speed, and its fast token prediction is designed to bring AI to the masses.

In addition to its speed, GPT-4o has also shown remarkable performance in complex tasks and reasoning capabilities.

Additionally, OpenAI will release a desktop application where users can interact with the model in real-time through audio.

GPT-4o is currently priced at $5.00 / 1M tokens, which translates to:

  • $1.25 / 1M characters

From OpenAI: 1 token ~= 4 chars in English [source]
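The token-to-character conversion above is simple enough to sketch in a few lines of Python, using OpenAI's rule of thumb of roughly 4 English characters per token (the function name is mine, for illustration):

```python
# Convert per-token pricing into approximate per-character pricing,
# using OpenAI's rule of thumb of ~4 English characters per token.
def price_per_million_chars(price_per_million_tokens: float,
                            chars_per_token: float = 4.0) -> float:
    return price_per_million_tokens / chars_per_token

# GPT-4o: $5.00 per 1M tokens
print(price_per_million_chars(5.00))  # → 1.25
```

The same function with `chars_per_token=3.5` reproduces the Claude 3 Opus figure further down.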

Gemini 1.5

Google’s Gemini 1.5, the latest in the Gemini series, is built from the ground up as a multimodal model capable of processing words, images, videos, audio, and code.

It integrates seamlessly into Google’s ecosystem such as Gmail and the rest of G Suite— and we will likely see AI features in every Google product soon.

Google is also known for providing scalable and reliable services — extremely important when building products around LLMs.

Google has dropped Gemini 1.5's price for contexts up to 128k tokens to $3.50 / 1M tokens, which translates to characters as:

  • $1.25 / 1M characters

Official pricing: $0.00125 / 1k characters [source]

Claude 3 Opus

Anthropic’s Claude 3 Opus focuses on safety and alignment while delivering competitive language performance.

With low hallucination rates, Claude 3 Opus is proficient in English and European languages, with ongoing improvements in Asian and niche languages.

It excels in processing very long documents accurately — making it ideal for RAG applications if you want optimal performance.

However, as a by-product of its high performance, it is considered expensive and somewhat slow.

Claude 3 Opus is currently priced at $15 / 1 million tokens, which translates to around:

  • $4.30 / 1M characters

For Claude, a token approximately represents 3.5 English characters [source]

Scalable applications on top of closed-sourced LLM APIs can be expensive to maintain.

Keep these prices in mind when we analyze the models' performance!

Image by ChatGPT — Illustration of an LLM API sucking money out of a business

Evaluation Framework

For the evaluation framework, I used a dataset described in my previous articles — the Topic Dataset.

For readers who are unfamiliar with it, here is a recap:

The dataset consists of 200 sentences in each language, categorized under 50 different topics (some of which are closely related).

I manually created the English dataset and used GPT-4 to translate the dataset into multiple languages.

The task given to the language models is to match each sentence with the correct topic, which allows for an accuracy measurement per language.
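The evaluation loop can be sketched as follows. Note that the exact prompt wording and the API wrapper (`ask_model` here) are my own placeholders — the article does not specify them:

```python
# Sketch of the topic-classification evaluation: each model must assign one
# of 50 topics to each of 200 sentences per language, and accuracy is
# computed per language. `ask_model` is a hypothetical LLM API wrapper.
def evaluate_language(ask_model, sentences, topics):
    """sentences: list of (text, true_topic) pairs; topics: list of topic names."""
    correct = 0
    for text, true_topic in sentences:
        prompt = (
            "Assign the sentence to exactly one topic from the list.\n"
            f"Topics: {', '.join(topics)}\n"
            f"Sentence: {text}\nTopic:"
        )
        prediction = ask_model(prompt).strip()
        if prediction == true_topic:
            correct += 1
    return correct / len(sentences)
```

Running this once per language, per model, yields the accuracy scores plotted below.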

GPT-4o vs. GPT-4 vs. GPT-4 Turbo 📊

Firstly, I wanted to compare OpenAI’s most prominent models — as I have read numerous complaints regarding GPT-4o’s performance.

Given that GPT-4o is half the price of GPT-4 Turbo and six times cheaper than GPT-4, this comparison should provide valuable insight.

I decided to select the most prominent European languages and some more niche languages. If you are not familiar with language codes, see the table below:

Language Code to Language Map

The first step was to run the evaluation framework on all OpenAI’s models to gather accuracy scores for each language code.

I then created a Radar Chart to visualize each LLM’s performance for each language code. I personally believe that a Radar Chart is the most visually pleasing way to present performance differences like these.
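A radar chart like the ones below can be built with matplotlib's polar projection. The accuracy numbers here are purely illustrative, not the article's actual results:

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Illustrative accuracy scores per language code (not the real results).
scores = {
    "GPT-4o":      {"ES": 0.990, "DE": 0.990, "FR": 0.995, "PT": 0.980, "RU": 0.995, "FI": 0.990},
    "GPT-4-turbo": {"ES": 0.985, "DE": 0.985, "FR": 0.990, "PT": 0.985, "RU": 0.980, "FI": 0.975},
}

labels = list(next(iter(scores.values())).keys())
angles = [2 * math.pi * i / len(labels) for i in range(len(labels))]

fig, ax = plt.subplots(subplot_kw={"polar": True})
for model, accs in scores.items():
    values = [accs[lang] for lang in labels]
    # close the polygon by repeating the first point
    ax.plot(angles + angles[:1], values + values[:1], label=model)
ax.set_xticks(angles)
ax.set_xticklabels(labels)
ax.set_ylim(0.95, 1.0)  # start the radial scale at 95%, as in the charts below
ax.legend()
fig.savefig("radar.png")
```

Starting the radial axis at 95% instead of 0% is what makes the small gaps between these high-scoring models visible at all.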

Radar Chart of OpenAI’s top performing models. Note that the scale starts at 95% and goes to 100%.

Just to recap how a radar chart works: a better-performing model stretches out toward the edges, while weaker models stay closer to the middle.

As we can derive from the graph, GPT-4o is generally further out than GPT-4 and GPT-4-turbo — indicating better overall performance.

Portuguese is the only language where GPT-4o underperforms in this test. However, with such a small dataset, the gap is not statistically significant and could be due to random variation or specific quirks of the dataset.
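To see why a small gap on 200 samples is not significant, a quick two-proportion z-test (normal approximation, my own sketch — the article does not run one) makes the point:

```python
import math

def two_proportion_z(correct_a, correct_b, n):
    """Normal-approximation z-statistic comparing two accuracies,
    each measured on the same number of samples n."""
    p_a, p_b = correct_a / n, correct_b / n
    p_pool = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    return (p_a - p_b) / se

# e.g. 196/200 vs. 199/200 correct (98.0% vs. 99.5% accuracy)
z = two_proportion_z(196, 199, 200)
print(abs(z) < 1.96)  # → True: not significant at the 5% level
```

In other words, even a 1.5-point accuracy gap on this dataset size is within the range that random variation alone can produce.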

Interestingly, we see noticeable performance gains for GPT-4o in Russian and Finnish.

Note: Throughout my professional Machine Learning Engineering career, I have repeatedly seen issues optimizing NLP tasks for Finnish, as it is a somewhat niche language — but GPT-4o finally seems to break this pattern!

Claude 3 Opus vs. GPT-4o vs. Gemini 1.5 📊

After the comparison between OpenAI’s top LLMs, I selected Claude 3 Opus and Gemini 1.5 to see how they stack up against GPT-4o.

These models truly showcase state-of-the-art language understanding in various tasks — and have all showcased strong multilingual capabilities.

I employed the same evaluation framework used in the previous comparison.

This framework tests each model across various language codes, allowing us to generate a detailed performance profile for each language.

Let’s see where GPT-4o stands among its competitors.

Radar Chart of GPT-4o vs. Gemini 1.5 vs. Claude 3 Opus. Note that the scale starts at 95% and goes to 100%.

As the graph shows, all models perform quite well. Note again that the graph's scale starts at 95% accuracy and goes to 100%.

This means all three LLMs score between 97.5% and 100% in all languages — showcasing excellent multilingual language capabilities.

However, the graph shows a consistent trend: Anthropic’s Claude 3 Opus is slightly ahead in most languages. It underperforms Gemini in only two languages and is never beaten by GPT-4o.

Is Anthropic’s Claude 3 Opus the strongest LLM at the moment?

And is it worth paying roughly three to four times more (by the per-character prices above) to use Claude 3 Opus instead of GPT-4o or Gemini 1.5?

Please let me know your thoughts!

Disclaimer: As mentioned before, the dataset is relatively small. The results should only be interpreted as an indication of each model's performance, which may vary based on the use case.

Conclusion

In this multilingual evaluation of OpenAI’s GPT-4o, Anthropic’s Claude 3 Opus, and Google’s Gemini 1.5, several key insights emerge.

GPT-4o stands out for its remarkable performance across a broad range of languages, consistently outperforming GPT-4 and GPT-4 Turbo. This is especially noteworthy given its significantly lower cost.

Gemini 1.5 showcased performance on par with GPT-4. Its competitive pricing and scalability make it a strong contender, particularly for those already embedded in Google’s suite of products.

Claude 3 Opus showcases superior performance in most languages. However, this comes at a higher cost, which might be a consideration for businesses when balancing performance and budget.

The choice between these models should factor in cost, specific language requirements, and broader ecosystem integration needs.

As the landscape of language models continues to evolve, it will be interesting to see how these models develop further and whether brand-new models come along and challenge their current capabilities and market dominance.

Thanks for reading!

Follow to receive similar content in the future!

And do not hesitate to reach out if you have any questions!

(And if you like what you just read, please consider showing your support as it means the world to me 😊)

Through my articles, I share cutting-edge insights into LLMs and AI, offer practical tips and tricks, and provide in-depth analyses based on my real-world experience. Additionally, I do custom LLM performance analyses, a topic I find extremely fascinating and important in this day and age.

My content is for anyone interested in AI and LLMs — whether you’re a professional or an enthusiast!

Follow me if this sounds interesting!


Lars Wiik

MSc in AI — LLM Engineer ⭐ — Curious Thinker and Constant Learner