Analysis of Anthropic’s Claude 3 Sonnet on Amazon Bedrock

Vaishnavi R
Published in Version 1
Mar 12, 2024 · 7 min read

Anthropic launched the Claude 3 model family on March 4th, 2024, featuring three state-of-the-art models in ascending order of capability: Haiku, Sonnet, and Opus. Benchmark analyses of this new lineup suggest that Opus, the largest model in the series, is likely the first model to surpass OpenAI’s GPT-4.

Claude 3 Benchmark

Brief Overview of Claude 3 models

Claude 3 model family

Claude 3 Haiku:

The fastest and most cost-effective model in the family, Haiku swiftly handles simple questions and requests, enabling smooth, human-like interactions.

Claude 3 Sonnet:

Sonnet performs tasks twice as quickly as both Claude 2 and Claude 2.1. It excels in both speed and intelligence, making it well-suited for enterprise use cases.

Claude 3 Opus:

Opus sets a new standard, surpassing other models like OpenAI’s GPT-4 in the domains of reasoning, math, and coding. It excels in handling highly complex tasks, establishing a new benchmark for intelligence within its family.

Model Availability

You can access the Claude Sonnet model for free via Claude.ai. To use the Opus model, you need to subscribe to Claude Pro, which costs $20 per month + tax. Opus and Sonnet are widely available in 159 countries. Claude Haiku will be available soon.

Key features

  • Multilingual capabilities: Claude 3 models show improved fluency in non-English languages such as Spanish and Japanese, making them better suited for translation and for producing content for global audiences.
  • Vision and image processing: All Claude 3 models can process and analyze visual input, extracting insights from documents, interpreting web UIs, generating metadata for image catalogs, and more. See the vision page to learn more.

Training Process

Claude was trained with a focus on being helpful, harmless, and honest. Anthropic employed a technique called ‘Constitutional AI’ (CAI) to align Claude with human values during reinforcement learning, explicitly specifying rules and principles based on sources such as the UN Declaration of Human Rights. Claude follows the RLAIF (Reinforcement Learning from Artificial Intelligence Feedback) approach, not RLHF.

Check out this blog: Analysis of Claude: An AI Assistant by Anthropic for more information.

Does Amazon Bedrock retain users’ data?

  • Amazon Bedrock follows the AWS shared responsibility model for data protection.
  • It doesn’t use your prompts and continuations to train any AWS models or distribute them to third parties.
  • Model providers upload their models to an escrow account, which the Amazon Bedrock inference account can call.
  • Model providers don’t access Amazon Bedrock logs or customer prompts.
  • Amazon Bedrock doesn’t store or log your data in its service logs.

To learn more, see the AWS documentation: docs.aws.amazon.com

Claude 3 Sonnet on Amazon Bedrock

Amazon Bedrock is a fully managed service that makes foundation models from leading AI companies available through a single API, helping developers build generative AI applications.

At the time of writing, Amazon Bedrock offers access only to Claude 3 Sonnet; Opus and Haiku are not yet available on the platform.
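
For reference, here is a minimal sketch of how Claude 3 Sonnet can be invoked on Amazon Bedrock with the boto3 SDK and the Anthropic Messages API request format (the region and prompt below are placeholder assumptions):

```python
import json
import boto3

# Create a Bedrock runtime client (the region is a placeholder; use your own).
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Claude 3 models on Bedrock use the Anthropic Messages API request body.
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Summarize the key features of the Claude 3 model family."}
    ],
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```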

Tested Use Cases

1. Summarization of large documents and question answering:

Claude Sonnet effectively summarized a given PDF of approximately 1880 words, highlighting key points and accurately answering related questions.

However, when asked to summarize a book titled ‘The Quantum Structure of Space and Time,’ which had 295 pages and about 100,000 words, the Sonnet model on Bedrock kept running for over 5 minutes without returning any response.

Output: Claude Sonnet on Amazon Bedrock
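
As a rough, reproducible sketch of this kind of test (not the exact setup used above), the snippet below extracts text from a PDF with the pypdf library and asks Claude 3 Sonnet on Bedrock to summarize it. The file name is a placeholder, and very long documents would need to be split into chunks that fit comfortably within the context window:

```python
import json
import boto3
from pypdf import PdfReader

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Extract plain text from the PDF (file name is a placeholder).
reader = PdfReader("sample_document.pdf")
document_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Ask Claude 3 Sonnet on Bedrock to summarize the extracted text.
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": f"Summarize the key points of the following document:\n\n{document_text}",
        }
    ],
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
)
print(json.loads(response["body"].read())["content"][0]["text"])
```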

2. Code generation from natural language prompt:

To evaluate Claude Sonnet’s code-generation ability from natural language prompts, the following prompt was given, with ‘Maximum Length’ set to the maximum allowed 4096 output tokens.

Prompt: “Write a high-quality Python script for the following task, something a very skilled Python expert would write. Make sure to include any imports required. After generating the code check your work carefully to make sure there are no mistakes, errors, or inconsistencies. If there are errors, list those errors in <error> tags, then generate a new version with those errors fixed. If there are no errors, write “CHECKED: NO ERRORS” in <error> tags.

Here is the task:
<task> Write a Python program for a Customer Relationship Management (CRM) system. The system should include the following features:
1. Add a new customer with details such as name, email, and contact number.
2. Update customer information, allowing changes to name, email, or contact number.
3. Display a list of all customers with their details.
4. Implement a search functionality to find a customer by name or email.
5. Generate a report summarizing customer details and interactions.
Double-check your work to ensure no errors or inconsistencies.</task>”

Output: Claude Sonnet on Amazon Bedrock

The Sonnet model successfully generated the code with precision and completeness. The code included all the essential components outlined in the prompt, showcasing proficiency in class design, methods, and CSV file handling. Additionally, the model included an example usage section. Finally, it generated a well-structured customer report, saved as “customer_report.csv”.

Output: GPT-4 Turbo

Similarly, the GPT-4 Turbo model also generated well-structured and functional code.
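
For readers curious about the shape of such a solution, below is a compact, hypothetical sketch of the kind of CRM structure the prompt asks for. It is illustrative only and is not the output produced by either model:

```python
import csv
from dataclasses import dataclass, asdict, field


@dataclass
class Customer:
    name: str
    email: str
    contact_number: str
    interactions: list = field(default_factory=list)


class CRM:
    def __init__(self):
        self.customers = []

    def add_customer(self, name, email, contact_number):
        # Feature 1: add a new customer with name, email, and contact number.
        self.customers.append(Customer(name, email, contact_number))

    def update_customer(self, email, **changes):
        # Feature 2: update name, email, or contact number for a matching customer.
        for customer in self.customers:
            if customer.email == email:
                for key, value in changes.items():
                    setattr(customer, key, value)
                return True
        return False

    def display_customers(self):
        # Feature 3: list all customers with their details.
        for customer in self.customers:
            print(f"{customer.name} | {customer.email} | {customer.contact_number}")

    def search(self, term):
        # Feature 4: find customers by name or email.
        term = term.lower()
        return [c for c in self.customers
                if term in c.name.lower() or term in c.email.lower()]

    def generate_report(self, path="customer_report.csv"):
        # Feature 5: write a CSV summary of customer details and interaction counts.
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=["name", "email", "contact_number", "interactions"])
            writer.writeheader()
            for customer in self.customers:
                row = asdict(customer)
                row["interactions"] = len(customer.interactions)
                writer.writerow(row)


# Example usage
crm = CRM()
crm.add_customer("Ada Lovelace", "ada@example.com", "555-0100")
crm.update_customer("ada@example.com", contact_number="555-0199")
crm.display_customers()
print(crm.search("ada"))
crm.generate_report()
```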

3. Code conversion:

To test Claude Sonnet’s code conversion capabilities, a prompt was provided containing around 450 lines of Java code with the task of converting it into equivalent Python code.

Output: Claude Sonnet on Amazon Bedrock

The Claude Sonnet model successfully converted all the code into Python and subsequently provided an explanation detailing the functionalities incorporated in the converted code.

However, the GPT-4 Turbo model in Azure OpenAI Studio was able to convert only 130 lines, including comments, before the conversion process halted.

Output: GPT-4 Turbo
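
A test like this can be reproduced by reading the Java source from a file and wrapping it in a conversion prompt, sent to the model in the same way as the earlier snippets; the file name and prompt wording here are assumptions:

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Read the Java source to be converted (file name is a placeholder).
with open("LegacyService.java") as f:
    java_code = f.read()

prompt = (
    "Convert the following Java code into equivalent, idiomatic Python code. "
    "Preserve all functionality and add brief comments where behaviour differs.\n\n"
    f"{java_code}"
)

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4096,
    "messages": [{"role": "user", "content": prompt}],
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
)
print(json.loads(response["body"].read())["content"][0]["text"])
```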

4. Claude Vision Capabilities:

Anthropic stated that Claude 3 models have sophisticated vision capabilities. They can process a wide range of visual formats, including photos, charts, graphs, and technical diagrams.

The Sonnet model accurately generated a description for an image featuring a scenic view. However, it provided incorrect information when asked to describe an image containing graphs.
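
As a sketch of how image input can be supplied to Claude 3 Sonnet on Bedrock, images are passed as base64-encoded content blocks in the Messages API request (the file name below is a placeholder):

```python
import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Load and base64-encode the image (file name is a placeholder).
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Image and text are sent together as content blocks in a single user message.
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "Describe what this image shows."},
            ],
        }
    ],
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
)
print(json.loads(response["body"].read())["content"][0]["text"])
```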

Tokens and Pricing

The Claude 3 models all have a context window size of 200k, which is about 150k words. However, the Claude Sonnet model, despite having this large context window, has a maximum output length limit of only 4096 tokens. These tokens include words, punctuation, spaces, and more, roughly equivalent to around 3000 words in everyday language.

GPT-4 Turbo (1106 preview), by comparison, operates with a 128k context window, with a similar restriction on its maximum response length, which is capped at 4096 tokens.
Refer to this to learn more about ‘Tokens’.

The costs below compare each model’s price for prompts and completions per 1,000 tokens.

For prompts:
* Claude Haiku is the most cost-effective at $0.0003 per 1000 tokens, making it the most affordable option for initiating requests.
* Claude Sonnet follows at $0.0030.
* Claude Opus is priced at $0.0150 for prompts, higher than GPT-4 Turbo at $0.0100, making Opus currently the costliest option for prompts.

For completions:
* Again, Claude Haiku stands out as the most economical choice at $0.001 per 1000 tokens.
* Claude Sonnet is moderately priced at $0.015.
* Claude Opus is the priciest for completions at $0.075, while GPT-4 Turbo is priced at $0.0300.

For businesses, the choice of model can affect budget considerations, with Claude Haiku offering an economical option for many applications. GPT-4 Turbo (128k) provides a balance between cost and capability, making it suitable for businesses seeking a middle-ground solution. Claude Sonnet, while robust in performance, comes at a moderate price, and Opus is the most expensive of all.
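
To make the budget impact concrete, here is a quick back-of-the-envelope comparison using the per-1,000-token prices listed above; the 2,000-token prompt and 1,000-token completion workload is a hypothetical example:

```python
# Prices per 1,000 tokens (prompt, completion), taken from the lists above.
PRICES = {
    "Claude Haiku": (0.0003, 0.001),
    "Claude Sonnet": (0.003, 0.015),
    "Claude Opus": (0.015, 0.075),
    "GPT-4 Turbo": (0.010, 0.030),
}

def request_cost(model, prompt_tokens, completion_tokens):
    # Cost = (prompt tokens / 1000) * prompt price + (completion tokens / 1000) * completion price.
    prompt_price, completion_price = PRICES[model]
    return (prompt_tokens / 1000) * prompt_price + (completion_tokens / 1000) * completion_price

# Hypothetical workload: 2,000 prompt tokens and 1,000 completion tokens per request.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2000, 1000):.4f} per request")
```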

Conclusion

In conclusion, Amazon Bedrock ensures robust data protection by strictly preventing model providers from accessing user data or service logs, and prioritizing privacy and confidentiality in its secure environment.

Claude Sonnet demonstrated proficiency in summarizing shorter texts, providing accurate insights, and answering questions effectively. However, it faced challenges summarizing a longer book, indicating potential limitations with larger content processing.
Notably, Claude Sonnet’s code-generation and conversion abilities were impressive, with the code-conversion test in particular outperforming GPT-4 Turbo in Azure OpenAI Studio. Its vision capabilities were accurate for scenic images but faltered with graphs.

Token-wise, all Claude 3 models share a 200k context window, and Sonnet’s maximum output is 4096 tokens, roughly 3000 words. GPT-4 Turbo has a 128k context window with the same 4096-token output limit.

Considering overall affordability, Claude Haiku is the cheapest option for both prompts and completions. GPT-4 Turbo (128k) falls between Claude Haiku and Claude Sonnet in terms of cost.

Businesses should weigh all these factors based on their specific needs and financial considerations when integrating language models into their applications and workflows.

About the Author
Vaishnavi R is a Data Scientist at the Version 1 AI Labs.
