GPT-4 vs Llama 2: The Battle of the Language Models

Hassan Raza
3 min read · Jul 26, 2023


In the evolving world of artificial intelligence (AI), things change rapidly. As regular users, we’ve noticed changes in the responses given by large language models like GPT-4, which has sparked a broader conversation. According to a recent study, this shift might not be just a figment of our imagination; the performance of GPT-4 seems to have degraded over time. Let’s delve deeper into this phenomenon, comparing the models’ strengths and weaknesses.

The Arrival of Llama 2

Meta, in partnership with Microsoft, recently released a new family of large language models named Llama 2. The largest model has 70 billion parameters and a context length of 4,096 tokens. Though less powerful than GPT-4 or Google’s PaLM 2, its distinguishing feature is a license that permits commercial use.

You can download Llama 2 immediately and start using it, or tinker with it on Hugging Face. Notably, if your app has fewer than 700 million monthly active users, you can self-host the model and use it commercially, getting capabilities approaching GPT-4’s at a lower cost than OpenAI’s API. Since the release came courtesy of Meta and Microsoft, you can, unsurprisingly, also run and fine-tune it on Azure.
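If you want to try self-hosting, here is a minimal sketch using the Hugging Face transformers library. It assumes you have installed transformers and accelerate, been granted access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint (the smallest chat-tuned variant), and have a GPU with enough memory to run it; the exact model ID and generation settings are illustrative, not prescriptive.

```python
# A minimal sketch of self-hosting Llama 2 with Hugging Face Transformers.
# Assumes: `pip install transformers accelerate` and approved access to the
# gated meta-llama repository on Hugging Face (access is requested from Meta).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # smallest chat-tuned variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single consumer GPU
    device_map="auto",          # let accelerate place layers on available devices
)

prompt = "Give me three alternative phrasings of Murphy's Law."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The 13B and 70B chat checkpoints can be swapped in the same way if you have the hardware; the license terms are identical across sizes.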

Comparative Analysis: GPT-4 vs Llama 2 vs Google’s Generative AI

To assess the capabilities of these models, they were presented with a challenge: to provide alternative expressions for Murphy’s Law: “Anything that can go wrong will go wrong.”

ChatGPT’s response was quite succinct but possibly the most useful. Google’s answer was shorter, generated faster, and provided additional context and links to the web. Llama 2, on the other hand, gave the most verbose and well-written response, but lacked the sophistication of GPT’s answers. The poetry comparison was particularly insightful: OpenAI’s GPT outperformed Llama in the artistic realm, generating verses with a higher poetic resonance. Similarly, in coding, Llama fell short, especially on complex programming requirements.

It’s important to note that this comparison may not be entirely fair, considering GPT-4 is a closed and paid platform. Nevertheless, for those looking for a capable open-source baseline, Llama 2 appears to be a viable choice.

Safety, Performance Degradation, and the ‘Lobotomized’ AI

In Llama 2’s accompanying release paper, ‘safety’ is mentioned 299 times. The researchers made the model safer through reinforcement learning from human feedback, in which humans rank its outputs so it learns to refrain from harmful responses, a process critics have described as ‘lobotomizing’ the AI.

While it’s a bit disappointing that we haven’t yet witnessed the singularity or a total AI takeover, what we can observe is AI growing more sophisticated while increasingly complex guardrails are placed around the models. In short, AI isn’t so much ‘getting dumber’ as becoming safer and harder to manipulate, which ironically can make it appear less intelligent.

For example, when asked about building a high-yield nuclear weapon for home defense, Llama refused, declaring it both ‘highly regulated’ and ‘morally reprehensible.’ The model insists it holds no personal opinions or beliefs, yet this reply seems to belie that claim.

In Conclusion

The recent decline in traffic to the ChatGPT site, coupled with studies showing a decrease in performance over time, does raise questions about the future of these large language models. The debate about the efficacy, safety, and applicability of such models is ongoing, and as it evolves, I’ll be here to bring you the latest updates.
