A Battle of the LLMs: Finding Your Model Match

Alex Abramov
CyberArk Engineering
7 min read · Sep 27, 2023
Image created by DALL-E

Welcome to the world of language models, where the magic of understanding and generating human language comes to life. These models excel at capturing the nuances of patterns, semantics, and context from extensive reservoirs of text, forming the core of groundbreaking conversational experiences seen in ChatGPT and Bard.

ChatGPT, driven by GPT-3.5 and GPT-4, crafts natural dialogues while Bard dances with LaMDA for creative expression. Anthropic’s Claude 2 adds to this tapestry of innovation, hinting at a future brimming with linguistic possibilities. In this blog post, we’ll delve into a variety of these remarkable language models, exploring their capabilities and applications in depth.

ChatGPT vs. Bard vs. Claude 2: How Are They Different?

Let’s examine two simple examples to gain a broader understanding of the differences among large language models (LLMs).

I posed the question to ChatGPT, Bard, and Claude 2: “Who is the prime minister of Israel?”

[Image: ChatGPT’s response to “Who is the prime minister of Israel?”]

ChatGPT responded with Naftali Bennett, reflecting its data cutoff in late 2021.

Meanwhile, Claude believes it is Yair Lapid, since its training data extends only to late 2022. Finally, Bard, with access to the most recent data, answered Benjamin Netanyahu, who currently holds the position. So far, it is evident that Bard’s access to up-to-date data can be highly valuable in certain scenarios.

In the second scenario, I tasked the chatbots with simulating professional smartphone sellers from 2020. My objective was for the models to recommend suitable devices to a customer visiting the store who specifically wanted a phone with extended battery life.

Here are the outcomes:

ChatGPT recommended several high-capacity battery devices, including the Samsung Galaxy S20 Ultra, Google Pixel 4a, iPhone 11 Pro Max, and iPhone SE (2020).

When Claude took the stage, it suggested devices with smaller battery capacities, such as the iPhone 11 and Samsung Galaxy S20 FE.

[Image: Bard’s response to the smartphone assessment]

Lastly, Bard generated a well-organized list of results accompanied by images. However, it needed some additional clarification in the instructions, possibly because it leaned toward newer devices (like the 2022 iPhone 14 Pro Max) despite the 2020 setting.

The Role of Tokens in LLMs

Alright, we’ve explored two simple examples. Are we ready to select the ideal fit for our use case? Not quite yet. Before delving into metrics, distinctions, and evaluations, we need to grasp the concept of tokens. Tokens are the fundamental text units that LLMs consume as input. For instance, a 750-word document comprises roughly 1,000 tokens, and a single word can be split into several tokens: “eating”, for example, may break down into “eat” and “ing”. Since prices and limits are usually quoted per 1,000 tokens, understanding this concept is crucial before proceeding.
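
To make this concrete, here is a minimal sketch of counting tokens in Python with OpenAI’s tiktoken library. I’m assuming the cl100k_base encoding used by the GPT-3.5/GPT-4 family; other vendors tokenize differently, so treat the counts as approximations for non-OpenAI models.

```python
import tiktoken  # pip install tiktoken

# Encoding used by the GPT-3.5 / GPT-4 family; other models use different tokenizers.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("I am eating an apple")
print(len(tokens))                        # number of tokens in the sentence
print([enc.decode([t]) for t in tokens])  # the individual token pieces
```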

Choosing the Right Model for the Task

Several factors need to be considered when choosing a model, including:

  • Usage: Will I be working through a chat interface like ChatGPT, or calling the model via an API?
  • Token limit: What kinds of prompts will I be sending? Will I be working with large documents?
  • Cost: What are the prices per 1k tokens, or for subscriptions? (A rough estimate is sketched just after this list.)
  • Processing time: Is it essential for responses to be generated instantly?

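To get a feel for the cost question, here is a rough back-of-the-envelope sketch. The per-1k-token prices below are placeholders I made up for illustration, not quotes from any provider; plug in the rates from your provider’s pricing page.

```python
# Hypothetical per-1k-token prices in USD, for illustration only;
# check your provider's pricing page for real numbers.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.004

def estimate_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single request from its token counts."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT)

# Example: a 750-word document (~1,000 tokens) summarized into ~200 tokens,
# 10,000 times a month.
print(f"${estimate_request_cost(1_000, 200) * 10_000:,.2f} per month")  # $38.00 per month
```
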
As we continue our exploration of Large Language Models, we’ll now introduce and compare Llama 2, Jurassic, Titan, and PaLM 2, showcasing their unique strengths alongside GPT, LaMDA, and Claude 2 in various contexts.

We start with three use cases that put these basic factors in context:

  • Content writing. Consider your needs in terms of creativity and quality. Opt for GPT-3.5 or GPT-4 for more demanding work and Llama 2 for straightforward social media content.
  • Chatbots. Factor in user count and complexity. Llama 2 suits limited request volumes and budgets, while GPT-3.5 handles heavier demand. Reserve GPT-4 for expert domains like healthcare diagnostics.
  • Personal use. For creative tasks, GPT is a great choice. For longer texts, consider Claude 2 and its large context window. If up-to-date information is your main concern, Bard is the more accurate option.

Comparing and Evaluating Models

In addition to token limit, cost, and processing time, the following metrics can provide us with insights:

  • Model size: The total number of parameters in the model.
  • General knowledge: Awareness of a wide range of information.
  • Logical reasoning: Ability to infer logical relationships.
  • Coding abilities: Capability to generate and assist with coding.
  • Availability: Where the model is accessible and deployable (Azure, AWS, OpenAI, etc.).

Comparing by model size

As a rule of thumb, larger models handle complex tasks better.

  • GPT-4: around 1T to 1.7T parameters
  • PaLM 2: 540B
  • Jurassic 1: 178B
  • GPT-3.5: around 154B to 175B
  • Claude 2: 137B

B = billion; T = trillion (1,000 billion).

Comparing models by token limit

Models with a larger token limit excel at processing lengthier texts. For instance, imagine a situation where you must formulate questions based on multiple extensive documents. Claude excels in this context, efficiently and swiftly processing documents with its remarkable 100k token limit.

  • Claude 2: 100k tokens
  • GPT-4: 32k
  • GPT-3.5: 16k
  • Jurassic 2: 8k
  • PaLM 2: 8k
  • Llama 2: 4k
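
If a document exceeds a model’s token limit, a common workaround is to split it into chunks that each fit the context window. Here is a minimal sketch, again assuming tiktoken and the cl100k_base encoding; real pipelines usually also reserve room for the prompt and the response.

```python
import tiktoken

def split_into_chunks(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into consecutive pieces of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# Each chunk can then be sent to the model separately (e.g., for summarization).
```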

Evaluating models by processing time

In scenarios where processing time is crucial, models like AI21’s Jurassic and Google’s PaLM 2 demonstrate impressive efficiency. Most models I tested generated rapid responses.

As a Border Collie owner, I assessed processing time and word output by having the models generate a sub-100-word blog post using the prompt, “A 100-word blog post on Border Collies.”

  • Jurassic was the fastest, generating 65 words in 2.28 seconds.
  • Claude proved the slowest, taking 6.62 seconds for 103 words.
  • Other models fell in between, ranging from 3 to 6 seconds, producing over 120 words.

After shifting to a 500-word topic on Border Collies, the results were:

  • PaLM 2 (5.62 seconds) outpaced Jurassic (9.89 seconds), yielding 473 versus 336 words.
  • GPT-3.5, Titan, and Claude took over 20 seconds.
  • Only Claude 2 kept below 500 words.

For a 1,000-word attempt, Jurassic led again but with a mere 196 words. PaLM managed 508 words in 8.29 seconds. GPT-3.5, Titan, and Claude needed over 20 seconds, yielding around 550 words each.
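
If you want to reproduce this kind of measurement, here is a minimal sketch. The generate() function is a hypothetical placeholder standing in for whichever model API you are testing; the timing and word-counting parts are the point.

```python
import time

def generate(prompt: str) -> str:
    # Placeholder: swap in a real call to whichever model API you are testing
    # (OpenAI, Bedrock, Vertex AI, ...). A canned string keeps the sketch runnable.
    return "Border Collies are energetic, intelligent herding dogs. " * 10

def measure(prompt: str) -> None:
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    # Rough word count via whitespace split.
    print(f"{elapsed:.2f} seconds, {len(text.split())} words")

measure("A 100-word blog post on Border Collies.")
```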

Evaluating models by logical reasoning

I crafted a prompt for assessing logical reasoning, featuring two statements and two conclusions. From these elements, the models had to select the correct answer out of four options. While some models struggled to pick the correct answer, GPT-3.5, GPT-4, and Llama 2 responded accurately, and the latter two (GPT-4 and Llama 2) also explained the underlying relationships.

Example:

Statements: 1. All mammals have lungs. 2. Dolphins are mammals.

Conclusions: 1. Dolphins have lungs. 2. All animals with lungs are mammals.

Options:

A. Both conclusions are true.
B. Both conclusions are false.
C. Only conclusion 1 is true.
D. Only conclusion 2 is true.

(The correct answer is C: dolphins do have lungs, but not every animal with lungs is a mammal; birds, for example, have lungs too.)

Evaluating models by general knowledge

In the general knowledge quiz, I posed questions such as “Pink Ladies and Granny Smiths are types of what fruit?” (apples) and “What is the only flag that does not have four sides?” (the flag of Nepal). GPT stood out from the rest with the most precise responses, while Titan and Jurassic performed the least accurately.

Evaluating models by coding assessment

The HumanEval benchmark comprises hand-written Python programming problems; a model’s score is the percentage of problems for which its generated code passes the accompanying unit tests.
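
Scores like these are usually reported as pass@1, i.e., the share of problems solved in a single attempt. For reference, here is my own sketch of the standard pass@k estimator from the original HumanEval paper, where n solutions are sampled per problem and c of them pass the tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    solutions passes, given c passing solutions out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of samples that pass on a single attempt.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```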

Here are the outcomes:

  • Claude 2: 71.2
  • GPT-4: 67
  • GPT-3.5: 48.1
  • Llama 2: 29.9
  • PaLM 2: 26.2

Furthermore, I tackled a brief coding question where the models were tasked with crafting a method to count words in a string.

Here are two examples:

Llama’s initial solution was solid. However, it then tried to get clever: in one of its own examples it counted seven words in a text that contained only six, and its second method, which relied on counting punctuation, did not work well either.

On the other hand, Claude’s response was notably clear and straightforward. It included explanations and examples before presenting the solution.
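
For reference, here is a minimal version of the kind of solution the models were asked to produce. This is my own sketch, not a reproduction of either model’s output; it sidesteps the punctuation pitfall that tripped up Llama by extracting word-like sequences instead of splitting on spaces.

```python
import re

def count_words(text: str) -> int:
    """Count words in a string, ignoring standalone punctuation."""
    # Extract word-like sequences (letters, digits, apostrophes) so stray
    # punctuation is never counted as a word.
    return len(re.findall(r"[A-Za-z0-9']+", text))

print(count_words("Border Collies are smart, energetic dogs!"))  # 6
```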

Evaluating models by availability

This aspect is critical, since different models are accessible from different platforms.

GPT is a natural fit if you are already using Azure: you can deploy the model through the Azure OpenAI Service.

Claude, Titan, and Jurassic are suitable for those utilizing AWS, as they are accessible through AWS Bedrock. It’s important to note that Bedrock is currently in private beta and not yet in production.
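
As an illustration, here is a minimal sketch of calling Claude 2 through Bedrock with boto3. It assumes your AWS credentials are configured, your account has been granted Bedrock model access, and that "anthropic.claude-v2" is the model ID available in your region; check the Bedrock console for the exact IDs and request formats.

```python
import json
import boto3

# Bedrock runtime client; region must be one where Bedrock is available.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: Who is the prime minister of Israel?\n\nAssistant:",
    "max_tokens_to_sample": 200,
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",  # assumed Claude 2 model ID
    body=body,
    contentType="application/json",
    accept="application/json",
)

print(json.loads(response["body"].read())["completion"])
```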

Llama 2 is available on both Azure ML and AWS SageMaker. Additionally, some models are available directly from their creators’ own platforms, such as GPT from OpenAI and Jurassic from AI21.

Recap of all model evaluations

Let’s consolidate everything:

  • GPT-4 (OpenAI): token limit 32k; processing time slow; logical reasoning 5 (best); general knowledge 6 (best); coding exam score 67; available in OpenAI, Azure
  • GPT-3.5 (OpenAI): token limit 16k; processing time fast; logical reasoning 3; general knowledge 5; coding exam score 48.1; available in OpenAI, Azure
  • Llama 2 (Meta): token limit 4k; processing time fast; logical reasoning 4; general knowledge 4; coding exam score 29.9; available in Azure, AWS, Google Cloud
  • PaLM 2 (Google): token limit 8k; processing time very fast; logical reasoning 2; general knowledge 4; coding exam score 26.2; available in Google Cloud
  • Claude 2 (Anthropic): token limit 100k; processing time fast; logical reasoning 2; general knowledge 3; coding exam score 71.2; available in Anthropic, AWS
  • Jurassic 2 Ultra (AI21): token limit 8k; processing time very fast; logical reasoning 2; general knowledge 2; coding exam score ?; available in AI21, AWS
  • Titan (AWS): token limit 8k; processing time fast; logical reasoning 1; general knowledge 1; coding exam score ?; available in AWS
LLMs comparison

GPT-4 emerges as the top-performing model for logical reasoning and general knowledge. PaLM 2 and Jurassic exhibit exceptional speed. Claude 2 boasts an incredible 100k token limit and, with the highest HumanEval score, is the strongest choice for coding tasks.

There Isn’t One Best Model for All

From my perspective, there is no definitive “best model.” Instead, our selection should be driven by the model that best aligns with our needs. It’s vital to stay conscious of pricing, as improper usage can lead to substantial costs. Fortunately, ChatGPT, Claude, and Bard are all accessible without charge. Personally, I find utility in using all three.


Alex Abramov
CyberArk Engineering

Software Architect at CyberArk's AI Center of Excellence