GPT-4o vs Llama 3.1 vs Claude 3.5
With the release of Llama 3.1, the internet is buzzing with posts claiming it beats GPT-4o in most benchmarks, suggesting that open source has finally caught up with closed source.
Here are a few key points:
- Llama 3.1 405B is still lagging behind GPT-4o and Claude 3.5 Sonnet in the most important tasks.
- Public benchmarks are so contaminated that they’re no longer a reliable measure of progress.
- There are other reliable ways to measure the models, especially their reasoning capabilities.
- A small, focused team can rival and even outdo a much larger one. You don’t always need a massive budget (it’s not all about more GPUs).
Among the leading models are Llama 3.1, GPT-4o, and Claude 3.5. Each model brings unique capabilities and improvements, showcasing the ongoing advancements in AI tech. In this analysis, we’ll dive into these three big models, focusing on their strengths, architectures, and potential uses.
Llama 3.1: Open Source Innovation
Llama 3.1, developed by Meta, is a big leap forward for the open-source AI community. A standout feature is its expanded context length of 128K tokens, which allows for deeper understanding and processing of long text. Llama 3.1 405B, the largest model in the series, offers unmatched flexibility and top-tier capabilities, competing with the best closed-source models.
The model’s architecture is based on a standard decoder-only transformer with tweaks for scalability and stability. This, combined with iterative post-training processes, boosts its performance across various tasks. Llama 3.1 is especially notable for its support of eight languages and its ability to handle complex tasks like synthetic data generation and model distillation, which is a first at this scale for open-source AI.
In terms of its ecosystem, Meta has teamed up with big names like AWS, NVIDIA, and Google Cloud, making sure Llama 3.1 is accessible and integrable across multiple platforms. This openness spurs innovation, letting developers customize models to meet their needs, do additional fine-tuning, and deploy in various environments without data-sharing hassles.
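To make that concrete, here is a minimal sketch of running one of the smaller Llama 3.1 instruction-tuned checkpoints locally with Hugging Face Transformers. The checkpoint ID and generation settings are only examples, and gated access to the weights plus enough GPU memory are assumed; the 405B model would more realistically be served through one of the partner platforms.

```python
# Minimal local-inference sketch with Hugging Face Transformers.
# Assumes you have accepted Meta's license on the Hub and have enough
# GPU memory for the 8B checkpoint (the 70B/405B variants need far more).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "In two sentences, what does a 128K context window enable?"},
]
outputs = generator(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```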
GPT-4o: Versatility and Depth
GPT-4o, OpenAI’s flagship model in the GPT-4 family, is designed to balance versatility and depth in language understanding and generation. The model produces coherent, contextually accurate text for a range of applications, from creative writing to technical documentation.
GPT-4o’s architecture builds on the strengths of earlier models, using extensive pre-training on diverse datasets followed by fine-tuning for specific tasks. This approach lets the model grasp nuanced language and adapt smoothly to various contexts. Its strong performance in benchmarks and real-world applications underscores its robustness and reliability as a general-purpose language model.
A key feature of GPT-4o is its integration with different tools and APIs, boosting its functionality in practical applications. Whether supporting customer service, creating content, or solving complex problems, GPT-4o delivers a seamless user experience with high accuracy and efficiency.
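As an illustration, here is a minimal sketch of calling the model through OpenAI’s Python SDK. The prompt and parameters are only examples, and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal chat-completion sketch with the official openai SDK (v1+).
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": "Draft a short, friendly reply to a customer asking about a delayed order."},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)
```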
Claude 3.5: Speed and Precision
Claude 3.5, developed by Anthropic, aims to set a new benchmark for intelligence, focusing on speed and precision. The Claude 3.5 Sonnet model, part of this series, surpasses its predecessors and competitors in key areas like graduate-level reasoning, coding skills, and handling complex instructions.
Claude 3.5 Sonnet operates at twice the speed of Claude 3 Opus, Anthropic’s previous top-tier model, making it well suited for tasks that need quick responses, like context-sensitive customer support and multi-step workflows. The model also excels in visual reasoning, outperforming previous versions on standard vision benchmarks and effectively managing tasks that involve interpreting charts and graphs.
Anthropic has focused on beefing up the safety and privacy features of Claude 3.5, incorporating thorough testing and feedback from external experts. The model’s deployment includes strong safety measures, ensuring it is less prone to misuse and more reliable in critical applications.
Algorithmic Reasoning
Example and Recipe
Here is a simple example to explain this idea: long division. It’s an algorithm so easy that people usually get the hang of it in elementary school. You just follow the steps and do some addition, subtraction, and multiplication with numbers under 100. Even the weakest LLMs know what long division is (just ask them) and how to multiply small numbers (just check it). So, if they mess up this algorithm, it’s because they can’t think through the steps correctly. The real question isn’t whether the model can get the right answer; any calculator can do that. The question is whether a model can stay focused and follow a long chain of steps accurately.
I suggest trying this prompt: “Use the long division algorithm to divide 14578 by 5576.” I encourage others to ask their favorite LLMs this and pay special attention to the remainder values. Long division is just one example. You can come up with many other problems (even non-math ones) as long as they need step-by-step reasoning and have lots of variations to avoid memorization.
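To turn this into a repeatable test, here is a minimal sketch of how such an evaluation could be automated. The query_llm callable is a hypothetical stand-in for whichever client or provider you use, and the required "quotient=... remainder=..." output format is my own convention for easy grading; only the final answer is checked, not the intermediate steps.

```python
import random

def make_problem(rng: random.Random) -> tuple[int, int]:
    """Pick a random dividend/divisor pair so the model can't rely on memorization."""
    dividend = rng.randint(10_000, 99_999)
    divisor = rng.randint(1_000, 9_999)
    return dividend, divisor

def build_prompt(dividend: int, divisor: int) -> str:
    return (
        f"Use the long division algorithm to divide {dividend} by {divisor}. "
        "Show every step, then state the final answer on the last line as "
        "'quotient=<q> remainder=<r>'."
    )

def check_answer(reply: str, dividend: int, divisor: int) -> bool:
    """Grade only the final quotient/remainder against Python's divmod."""
    q_true, r_true = divmod(dividend, divisor)
    last_line = reply.strip().splitlines()[-1].lower()
    return f"quotient={q_true}" in last_line and f"remainder={r_true}" in last_line

def run_eval(query_llm, n_trials: int = 100, seed: int = 0) -> float:
    """query_llm is a hypothetical callable: prompt (str) -> model reply (str)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        dividend, divisor = make_problem(rng)
        reply = query_llm(build_prompt(dividend, divisor))
        correct += check_answer(reply, dividend, divisor)
    return correct / n_trials
```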
I invite others to create their own questions using these two rules (step-by-step reasoning, enough variation to rule out memorization) and share their results. Here is how the models scored on my batch of such questions:
- GPT-4o: 67%
- Claude 3.5 Sonnet: 71%
- Llama 3.1 405B: 54%
- Mistral Large 2: 61%
A few takeaways:
- Even GPT-4o scores below 70%, which isn’t great. These are easy algorithmic reasoning questions. For harder ones, most models score 0%. Surprised? These models still have a long way to go. Check out some papers by Subbarao Kambhampati and his team on other reasoning flaws in LLMs.
- Claude 3.5 Sonnet does slightly better than GPT-4o, which matches other reports. I’m really impressed by Anthropic’s work. I just wish they were more supportive of the open-source community.
- Llama 3.1 doesn’t do as well as the top closed-source models, which goes against recent online chatter.
Comparison Approach
Besides their biggest model, Llama 3.1 405B (which we looked at above), Meta also gave their older 70B model a performance boost and a 128K context window.
This analysis mainly compares Llama 3.1 70B with GPT-4o mini, Claude 3 Haiku, and Gemini 1.5 Flash, looking at standard benchmarks and community data.
Cost Comparison
Since Llama 3.1 70B is open-sourced, you have lots of options for running it. You can run it locally or use a hosted version from different providers. Running these open-source models used to be one of the cheapest options, but closed-source models are dropping their prices too.
For example, OpenAI launched a powerful but affordable model (GPT-4o mini) that costs $0.15 per 1M input tokens and $0.60 per 1M output tokens, which is super cheap for a proprietary model.
Also, Claude 3 Haiku goes for $0.25/$1.25 per 1M input/output tokens, and Gemini 1.5 Flash for $0.35/$1.05, which is still pretty low-priced even compared to running Llama 3.1 70B.
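To put these numbers in perspective, here is a small back-of-the-envelope sketch of cost per 1,000 requests at the prices quoted above. The hosted Llama 3.1 70B rate is a placeholder, since provider prices vary; swap in whatever your provider actually charges.

```python
# Back-of-the-envelope cost per 1,000 requests, using the prices quoted above
# (USD per 1M tokens). The Llama 3.1 70B entry is a placeholder: hosted rates
# differ by provider, so replace it with your own.
PRICES = {                        # (input, output) per 1M tokens
    "gpt-4o-mini":      (0.15, 0.60),
    "claude-3-haiku":   (0.25, 1.25),
    "gemini-1.5-flash": (0.35, 1.05),
    "llama-3.1-70b":    (0.90, 0.90),  # placeholder hosted rate
}

def cost_per_1k_requests(model: str, input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    """Estimate USD cost for 1,000 requests with the given token sizes."""
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * 1_000

for model in PRICES:
    print(f"{model:18s} ${cost_per_1k_requests(model):.2f} per 1,000 requests")
```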
Speed Comparison
Open-source models are super speedy with providers like Groq and Fireworks.
Llama 3.1 70B can crank out about 250 tokens per second, which is pretty awesome. GPT-4o mini isn’t lagging as much as before and can manage 103 tokens per second. The other two models are quicker than GPT-4o mini, with Claude 3 Haiku pushing out 128 tokens per second and Gemini 1.5 Flash hitting 166.
Latency Comparison
On latency (time to first token), GPT-4o mini comes in at 0.56 seconds, Claude 3 Haiku is a bit quicker at 0.52 seconds, and Gemini 1.5 Flash takes 1.05 seconds. With Llama 3.1 70B, you’ve got at least four providers that can match or even beat the latency of similar proprietary models.
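A rough way to combine throughput and latency is: end-to-end response time ≈ time to first token + output tokens / tokens per second. Here is a small sketch using the figures quoted above; the Llama 3.1 70B latency is an assumed example since it depends on the provider you pick, and all of these numbers drift over time.

```python
# Rough end-to-end response time: time-to-first-token + tokens / throughput.
# Figures are the ones quoted above; they vary by provider and over time.
SPEEDS = {                          # (latency in seconds, tokens per second)
    "llama-3.1-70b (Groq)": (0.45, 250),   # latency here is an assumed example
    "gpt-4o-mini":          (0.56, 103),
    "claude-3-haiku":       (0.52, 128),
    "gemini-1.5-flash":     (1.05, 166),
}

def response_time(model: str, output_tokens: int = 500) -> float:
    """Estimate seconds until a reply of the given length is fully generated."""
    latency, tps = SPEEDS[model]
    return latency + output_tokens / tps

for model in SPEEDS:
    print(f"{model:22s} ~{response_time(model):.1f}s for a 500-token reply")
```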
Reported Capabilities — Standard Benchmarks
When new models come out, we get excited to learn about their capabilities from benchmark data shared in the technical reports. Check out the image below to see how Llama 3.1 70B stacks up on standard benchmarks compared to the top five proprietary models and one other open-source model.
According to testing by Vellum:
- Math Riddles: GPT-4o mini achieved 86% accuracy, leading the pack. Gemini 1.5 Flash came in second with 71% accuracy, followed by Llama 3.1 70b at 64%. Claude 3 Haiku struggled with only 29% accuracy.
- Customer Ticket Classification: GPT-4o mini demonstrated the highest accuracy (72%) and precision (89%), rarely mislabeling negatives as positives but missing some actual positives (lower recall). Claude 3 Haiku scored the highest F1 at 75%, indicating a more balanced precision/recall trade-off, which is beneficial for tasks like spam detection.
- Reasoning Tasks: GPT-4o mini excelled with 63% accuracy, while Claude 3 Haiku had the lowest accuracy at 38%.
- Cost of Open-Source Models: Using open-source models through providers isn’t necessarily the most cost-effective option anymore. Models like GPT-4o mini are more affordable, costing $0.15 per 1M input tokens and $0.60 per 1M output tokens.
- Speed and Latency: Open-source models maintain advantages in speed and low latency, especially when used with providers like Groq or FireworksAI. Running Llama 70b enables various multi-agent workflows that were previously limited.
The Llama 3.1 70B model shows a 15% improvement in math tasks over its previous version, a 12% decline in reasoning tasks, and no change in customer ticket classification.
References
Source for throughput & latency: artificialanalysis.ai
Source for standard benchmarks: https://ai.meta.com/blog/meta-llama-3-1/