Claude 3.5 Sonnet vs. GPT-4o: Which One Is Better?

Harshita Katiyar
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
7 min read · Jul 31, 2024

In November 2022, OpenAI launched ChatGPT, a model that revolutionized how we search for and interact with information. The following year, in March, an American startup founded by ex-OpenAI employees, Anthropic, launched its own AI model, Claude. Since then, both companies have been competing to offer customers the best features and experience through their AI models. Recently, OpenAI launched GPT-4o, a model that handles file, voice, and video data remarkably well. Anthropic, in turn, launched Claude 3.5 Sonnet, which it claims is its most advanced model yet and capable of handling complex tasks. In this article, we compare the features of Claude 3.5 Sonnet and GPT-4o, and their outputs for the same inputs, to determine which is better for you.

Capabilities and Features

GPT-4o

GPT-4o is the latest LLM launched by OpenAI. The “o” stands for omni, Latin for “all.” The model can take voice, images, videos, and files as input and respond accordingly. It accepts voice input and can reply in different characters’ voices, complete with tones and emotions. Its response latency is close to that of a human conversation, averaging about 0.32 seconds, compared with roughly 2.8 seconds for earlier voice modes. It also lets users generate written content such as articles, blogs, product descriptions, and code in different programming languages, as well as perform data analysis, produce charts, and more. In addition, GPT-4o can analyze images and videos, which lets it act as a language translator, personal assistant, virtual teacher, or shopping assistant, with potential uses in medicine, engineering, the military, and beyond. In voice mode, GPT-4o can use the user’s camera to get a real-time view and respond accordingly. It can also look at your computer screen, describe what is shown, and answer questions about the content displayed there.

For example, a user can share their screen with the model, open VS Code, and prompt it to act as a coding assistant that helps solve coding problems. Alternatively, you can enable the camera so the model acts as a fitness trainer and tells you whether you are performing an exercise correctly.

The model has distinctive features such as data analysis, a code interpreter, and real-time web browsing, which set it apart from its competitors. It also offers a plethora of GPTs, which are tailored versions of ChatGPT.
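
To make the multimodal input described above more concrete, here is a minimal sketch of sending GPT-4o a text question plus an image through the OpenAI Python SDK. The image URL, prompt, and token limit are placeholder assumptions for illustration, not values from this article.

```python
# Minimal sketch (illustrative only): asking GPT-4o a question about an image
# using the OpenAI Python SDK. The image URL and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what this chart shows in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sales-chart.png"},
                },
            ],
        }
    ],
    max_tokens=200,
)

print(response.choices[0].message.content)
```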

Claude 3.5 Sonnet

Claude 3.5 Sonnet is the AI model launched by Anthropic and the latest member of the third generation of the Claude family. It has set a high bar, outperforming many AI models on various evaluations while keeping hallucinations and incorrect information to a minimum. While it doesn’t support voice and video input like GPT-4o, it can still perform all the core tasks, such as text generation, code generation in different programming languages, brainstorming ideas, and more. According to Anthropic, Claude 3.5 Sonnet is one of the strongest vision models on the market and can analyze charts and graphs, transcribe text from images, and much more. Claude also offers an advanced feature, “Artifacts,” a dedicated panel alongside the conversation that lets users view code snippets, text documents, or website designs and edit the output in real time.

For example, users can combine vision and Artifacts in their workflows. A user can sketch a rough prototype of a website’s design on paper, attach a photo of it in Claude 3.5 Sonnet, and prompt the model to build a website based on the prototype. The generated code and the rendered design appear in Artifacts, where users can edit both to match their requirements and even publish the project live on the Internet.
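
As a rough illustration of the vision workflow above, here is a minimal sketch using the Anthropic Python SDK to send an image of a hand-drawn wireframe along with a prompt. The file name, prompt wording, and token limit are assumptions made for this example; the model string is the Claude 3.5 Sonnet snapshot identifier.

```python
# Minimal sketch (illustrative only): sending a hand-drawn wireframe image to
# Claude 3.5 Sonnet via the Anthropic Python SDK and asking for an HTML prototype.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Assumed local file name; replace with your own sketch.
with open("wireframe.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Turn this wireframe into a single-page HTML/CSS prototype.",
                },
            ],
        }
    ],
)

print(message.content[0].text)
```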

Head-to-Head Comparison

In this section, we compare the two LLMs on factors such as complex reasoning and code generation, examine how they handle complex tasks, and see which model comes out ahead.

  • Graduate-Level Reasoning (GPQA, Diamond)
    This factor evaluates a model’s ability to handle complex, graduate-level reasoning tasks. Here the models are compared on the GPQA benchmark, a set of 448 expert-written questions across several fields. The questions are “Google-proof,” meaning the answers cannot simply be looked up online. Claude 3.5 Sonnet scores 59.4%, while GPT-4o scores 53.6%. The scores are relatively close, but Claude looks like the better option for tasks that require advanced analytical thinking, such as research analysis, complex problem solving, and high-level academic work.
  • Undergraduate-Level Knowledge (MMLU)
    MMLU, short for Massive Multitask Language Understanding, is a benchmark that measures a model’s general knowledge across a wide range of subjects at an undergraduate level. Claude 3.5 Sonnet scores 88.3% here, and GPT-4o scores 88.7%. This shows that both LLMs have been trained across many domains and understand them well, making either model well suited for general-knowledge tasks, basic tutoring across multiple subjects, and similar uses.
  • Code (HumanEval)
    HumanEval is a benchmark that evaluates a model’s ability to generate, understand, and debug code (a minimal sketch of how HumanEval-style scoring works appears after this list). Claude 3.5 Sonnet achieves 92%, while GPT-4o scores 90.2%. Claude 3.5 Sonnet’s results are particularly impressive here because it pairs strong code generation with a better coding environment, Artifacts, which lets users view, edit, and preview generated code in a side panel. Since the launch of Claude 3.5 Sonnet, people have been building tools, websites, and simple games with it and sharing them across the internet. GPT-4o also scores well, but it has no comparable coding environment in its interface, so developers have to copy the generated code elsewhere to run and iterate on it, which adds friction.
  • Reasoning Over Text (DROP, F1 score)
    DROP (Discrete Reasoning Over Paragraphs) is a benchmark that measures a model’s ability to reason over complex textual information. Claude 3.5 Sonnet scores 87.1% here, while GPT-4o scores 83.4%. This suggests Claude 3.5 Sonnet is the more effective choice for tasks that involve detailed text analysis, document review, complex question answering, and the like.
  • Math Problem Solving (MATH)
    This test evaluates a model’s ability to solve a variety of mathematical problems. Claude 3.5 Sonnet scores 71.1%, while GPT-4o scores 76.6%. These scores make GPT-4o the better model for mathematical problem solving, useful for computations in areas such as financial modeling, scientific calculations, and advanced data analysis.
  • Multilingual Math (MGSM)
    This factor measures a model’s ability to solve mathematical problems posed in multiple languages. Both models score close to each other: GPT-4o at 90.5% and Claude 3.5 Sonnet at 91.6%. Both perform excellently, with Claude slightly ahead. This capability is particularly helpful for educational applications or any scenario where mathematical reasoning needs to cross language barriers.
  • Visual Question Answering (MMMU, val)
    This factor measures the model’s ability to analyze information presented in images. GPT-4o edges out Claude 3.5 Sonnet on this benchmark, 69.1% to 68.3%. On the other hand, when reading and answering questions about text in documents, Claude 3.5 Sonnet scores 95.2% compared to GPT-4o’s 92.1%.
  • Image Generation
    Image generation is the ability to produce images from text. GPT-4o is integrated with DALL·E and can generate excellent images from text prompts, while Claude 3.5 Sonnet cannot create images at all. This capability also helps GPT-4o with tasks like mocking up website designs and producing visual references.
  • Knowledge Cutoff
    Both models are trained on data up to a specific cutoff date. Claude 3.5 Sonnet is trained on data up to April 2024, while GPT-4o’s training data has an earlier cutoff (reported as October 2023). GPT-4o’s real advantage is real-time web browsing, which lets it pull in current information beyond its training data.

Pros of GPT-4o:

  • Handles voice, images, and video input.
  • Real-time web browsing capability.
  • Faster response time (0.32 seconds average).
  • Superior in math problem-solving.
  • Can generate images using DALL·E.

Cons of GPT-4o:

  • Slightly lower performance in graduate-level reasoning.
  • No built-in coding environment.
  • A lower score in document visual Q&A.
  • Slightly behind in code generation capabilities.
  • Less effective in detailed text analysis.

Pros of Claude 3.5 Sonnet:

  • Excels in graduate-level reasoning.
  • Superior code generation and built-in “Artifacts” feature.
  • Better performance in detailed text analysis.
  • A higher score in document visual Q&A.
  • Slightly better in multilingual math.

Cons of Claude 3.5 Sonnet:

  • Cannot handle voice or video input.
  • No image generation capability.
  • Slightly lower performance in visual question-answering.
  • Cannot access real-time web information.
  • Weaker in math problem-solving.

Conclusion

GPT-4o and Claude 3.5 Sonnet demonstrate impressive capabilities across various tasks, each with its strengths. GPT-4o excels in multimodal inputs, real-time information access, and image generation, making it versatile for diverse applications. Claude 3.5 Sonnet shines in complex reasoning, code generation, and detailed text analysis, offering superior performance in specific academic and professional contexts. The choice between these models depends on the specific use case and required features. We can expect further improvements and specialized models catering to different needs as AI technology advances.
