AWS Bedrock’s Claude-2-100k vs. Azure OpenAI’s GPT-4-32k: A Comparative Analysis

Vaishnavi R
Published in Version 1 · 9 min read · Oct 20, 2023
(Image by Bing Image Creator)

If you’ve been following AI news recently, you’ve probably heard about OpenAI’s popular GPT-4 model and Claude-2, a large language model by Anthropic, both of which are attracting a lot of attention. It’s worth noting that the latest version of GPT-4 offers a context window of 32,000 tokens, while the latest iteration of the Claude model stands out with an impressive context window of 100,000 tokens.

Introduction:

Anthropic and Amazon share a commitment to the safe training and deployment of advanced foundation models. AWS offers a generative AI service called Bedrock, which includes Anthropic’s Claude models; Claude-2 is the latest model Anthropic has released to date.

Companies across many fields use Anthropic’s models through Amazon Bedrock in their projects. The service is available to customers on the AWS platform and is accessed via an API, which makes it easier for businesses to build and scale AI-powered applications.

In previous articles, we looked at how the Claude model responds to complex math problems, its knowledge of geography, and its ability to perform sentiment analysis. You can check the links below to learn more about the analysis done on different versions of the Claude model.

What is AWS Bedrock?

AWS Bedrock is a service that provides access to foundation models via APIs, including models from leading AI start-ups, so you can experiment with and build on them easily.

Note:
Foundation model (FM): a large model pre-trained on vast amounts of data that can be adapted to a wide range of tasks. The Large Language Models (LLMs) that power generative AI are one type of foundation model.

This means you can choose the right model for your business without having to worry about setting up and managing infrastructure. With AWS Bedrock, you can pick a base foundation model and then fine-tune it securely on your own data.

Your data is encrypted and never leaves your Amazon VPC (Virtual Private Cloud). This ensures that your data remains confidential.
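As an illustration of the API access described above, here is a minimal sketch of how a Claude-2 prompt could be sent through the Bedrock runtime API with boto3. The request-body fields follow Bedrock’s conventions for Anthropic models; the final call is shown commented out because it requires AWS credentials and Bedrock model access.

```python
import json

def build_claude_body(prompt: str, max_tokens: int = 512) -> str:
    """Build the JSON request body Bedrock expects for Anthropic models.

    Claude text prompts use the "\n\nHuman: ... \n\nAssistant:" convention.
    """
    return json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": 0.5,
    })

body = build_claude_body("Convert this Java method to Python: ...")

# Sending the request requires AWS credentials and Bedrock access:
# import boto3
# client = boto3.client("bedrock-runtime", region_name="us-east-1")
# response = client.invoke_model(modelId="anthropic.claude-v2", body=body)
# completion = json.loads(response["body"].read())["completion"]
```

The same pattern applies to the other Bedrock models; only the `modelId` and the body schema change per provider.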

To read more about AWS Bedrock, check out this blog —

GPT-4 in Azure OpenAI Studio

Both OpenAI and Microsoft provide API services for OpenAI’s GPT models. Through Microsoft’s Azure OpenAI service, you can access GPT-4 models.

(Azure Open AI Studio Interface)

Tested Use-Cases

1] Code conversion — Java & Python

In our evaluation, Claude-2-100k in AWS Bedrock and GPT-4-32k from Azure OpenAI demonstrated similar levels of capability in code conversion, although they exhibited distinct behaviour.

Java to Python Conversion:

  • Claude-2-100k generated 19 lines of Python code when provided with 409 lines of Java code. However, it stopped prematurely. This suggests that Claude-2-100k might have limitations in handling large codebases or may need further fine-tuning for this specific task.
  • GPT-4-32k, on the other hand, produced code in chunks for each class in Python. This behaviour can be beneficial as it allows users to assess and integrate the generated code in parts. However, it also stopped prematurely.
(Code conversion using Claude-2–100k in AWS Bedrock)
(Code conversion using GPT-4–32k in Azure OpenAI)

Python to Java Conversion:

  • When Claude-2–100k was given 350 lines of Python code to get equivalent Java code, it produced only 28 lines of Java code before stopping.
  • GPT-4–32k, once again, displayed a similar pattern by giving only a few lines of code and missing some functionalities in the conversion from Python to Java.
(Python to Java conversion using Claude-2–100k in AWS Bedrock)
(Python to Java using GPT-4–32k in Azure OpenAI)

This evaluation showed that both models share similar limitations. When using these models for code conversion, users should be prepared to review and refine the generated code to ensure correctness, completeness, and adherence to their specific coding standards.
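One practical way to work around the premature truncation both models showed is to split a large source file into smaller units, such as one class at a time, and convert each chunk in its own prompt. Below is a rough sketch of such a splitter for Java source; the regex is a simplified heuristic covering common top-level class declarations, not a real parser.

```python
import re

def split_java_classes(source: str) -> list[str]:
    """Split Java source into chunks, one per top-level class declaration.

    Heuristic only: matches lines beginning with an optional modifier
    followed by 'class <Name>'. A production tool should use a Java parser.
    """
    pattern = r"(?m)^(?:public\s+)?(?:final\s+|abstract\s+)?class\s+\w+"
    starts = [m.start() for m in re.finditer(pattern, source)]
    if not starts:
        return [source]
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(source)
        chunks.append(source[start:end].strip())
    return chunks
```

Each chunk can then be sent as its own conversion prompt, keeping every request comfortably inside the model’s context window and making truncated answers easier to spot and retry.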

Note:
When the Claude model was used via its official website, it successfully converted nearly 100 lines of code from Python to Java and approximately 138 lines of code from Java to Python.

2] Code generation from natural language prompt

To evaluate the code generation capabilities of GPT-4–32k and Claude-2–100k, we provided them with a specific prompt: “Provide a Java program that involves all the functionalities of a Tourism Agency.”

Claude-2–100k’s Response:

Claude-2–100k presented a basic structure of the program, including classes such as Booking and Customer, and specified the required packages.

(Screenshot of Claude-2–100k in AWS Bedrock)

GPT-4–32k’s Response:

GPT-4–32k’s response was similar to Claude-2–100k’s: it provided a simple representation of the system with basic functionalities and only outlined the structure of the program.

(Screenshot of GPT-4–32k in Azure OpenAI)

Neither GPT-4–32k nor Claude-2–100k provided the full code for the Tourism Agency program. This shows that the two models display a comparable level of proficiency in generating code from natural language.

3] Book summarization & question answering

Claude-2’s impressive context window of 100,000 tokens (approximately 75,000 words) means it can process a substantial amount of text, equivalent to hundreds of pages.

To evaluate this capability, we conducted a test using the book ‘Quantum Physics for Dummies,’ which contains approximately 59,950 words across 338 pages. We then instructed Claude with the following prompt:

“You are an expert in writing summaries. Read the book provided below and write a summary. <book>…</book>”

(Screenshot of Claude-2–100k in AWS Bedrock)

Claude-2–100k took more than 10 minutes to generate a response, and the summary provided seemed to be more like generic information about what quantum physics is and the concepts included in it, rather than a precise summary of the book.

On the other hand, Claude-2–100k displayed impressive question-answering accuracy when tested with specific questions related to the book. For instance, when asked,

“From the above book, can you name what are the contents explained in Chapter 14?”

Claude-2–100k provided an accurate response.

(Screenshot of Claude-2–100k in AWS Bedrock)

GPT-4 provides a token limit of 32,000 tokens, which is roughly equivalent to 24,000 words. While GPT-4–32k is a powerful language model, it does have a significantly lower token limit compared to Claude-2–100k.

This limitation becomes apparent when the prompt’s token size exceeds 32,000 tokens, as GPT-4 will show a “Token limit error.”

(Screenshot of GPT-4–32k in Azure OpenAI)
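Because GPT-4-32k rejects oversized prompts outright, it is worth estimating the token count before sending a long document. A rough rule of thumb, consistent with the word counts above, is about 0.75 words per token; the sketch below applies it as a pre-check, with the reserve for the model’s reply chosen as an illustrative value. For precise counts, a real tokenizer (such as tiktoken) should be used instead.

```python
# Rough rule of thumb: 1 token ≈ 0.75 words, so tokens ≈ words / 0.75.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(text: str) -> int:
    """Crude token estimate from whitespace-separated word count."""
    return int(len(text.split()) / WORDS_PER_TOKEN)

def fits_context(text: str, context_limit: int, reserve_for_output: int = 1000) -> bool:
    """Check whether a prompt plausibly fits, leaving room for the reply."""
    return estimate_tokens(text) + reserve_for_output <= context_limit

# A stand-in for the tested book's approximate word count (59,950 words).
book = "word " * 59950
print(fits_context(book, 32_000))   # too large for GPT-4-32k
print(fits_context(book, 100_000))  # fits Claude-2-100k
```

Running this kind of check up front avoids paying for a request that is guaranteed to fail with a token limit error.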

4] Document summarization and analysis

Next, both models were assessed with a small, six-page document.

(Screenshot of Claude-2–100k in AWS Bedrock)
(Screenshot of GPT-4–32k in Azure OpenAI)

Both Claude-2–100k and GPT-4–32k showcased their remarkable capabilities by accurately identifying crucial details from the research paper. These details included the title, author, published date, journal, and issue of the paper. Moreover, both models were able to provide accurate explanations of the paper’s background ideas and new contributions.

5] Data analysis

We also compared Claude-2–100k and GPT-4–32k on data analysis tasks using a dataset of CO2 emissions per capita. Here is how they performed:

  • Initial Data Analysis and Summary:
    Both Claude-2–100k and GPT-4–32k provided accurate and specific summaries of the CO2 emissions dataset when prompted with “Analyse the below data and provide a summary.”
    They both demonstrated their ability to understand and summarize data effectively.
(Screenshot of Claude-2–100k in AWS Bedrock)
(Screenshot of GPT-4–32k in Azure OpenAI)
  • Response Time:
    Claude-2–100k had a longer response time compared to GPT-4–32k. GPT-4–32k was notably quicker in responding to questions. This could be a critical factor when dealing with real-time or time-sensitive data analysis.
  • Question Answering:
    Both models excelled in answering a series of questions related to the dataset. Questions like “Which countries have the highest CO2 emissions per capita in 2020?” & “How have CO2 emissions per capita changed globally from 2006 to 2020?” were answered accurately by both models.
(Claude-2–100k: Which countries have the highest CO2 emissions per capita in 2020?)
(GPT-4–32k: Which countries have the highest CO2 emissions per capita in 2020?)
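Answers like these are straightforward to cross-check programmatically. Here is a hedged sketch using pandas on a tiny, made-up sample of a per-capita emissions table; the column names and figures are illustrative only, not the real dataset.

```python
import pandas as pd

# Illustrative sample only; real figures come from the actual dataset.
df = pd.DataFrame({
    "country": ["Qatar", "USA", "India", "Qatar", "USA", "India"],
    "year":    [2006, 2006, 2006, 2020, 2020, 2020],
    "co2_per_capita": [60.0, 19.0, 1.1, 37.0, 14.2, 1.8],
})

# Which countries have the highest CO2 emissions per capita in 2020?
top_2020 = (df[df["year"] == 2020]
            .sort_values("co2_per_capita", ascending=False))

# How have CO2 emissions per capita changed globally from 2006 to 2020?
global_mean = df.groupby("year")["co2_per_capita"].mean()

print(top_2020[["country", "co2_per_capita"]])
print(global_mean)
```

Comparing a model’s narrative answer against a query like this is a quick way to catch hallucinated figures in data-analysis workflows.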

6] Checking Math Ability

The two models showed clear differences in performance on math-related questions between Claude-2–100k on AWS Bedrock and GPT-4-32k from Azure OpenAI.

Claude-2–100k provided an incorrect answer to a simple differential equation question, but gave correct answers to simple algebra and number series questions.

(Screenshot of Claude-2–100k in AWS Bedrock)

On the other hand, GPT-4–32k provided correct answers to all math questions.

(GPT-4–32k: Differential Equation Question)
(GPT-4–32k: Algebra Question)
(GPT-4–32k: Number Series Question)

So, in these tests, GPT-4–32k performed better than Claude-2–100k both at solving math problems and at explaining its solutions.
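Model answers to questions like these can also be verified mechanically rather than by eye. Below is a sketch using SymPy on a hypothetical example of the kind of question tested (dy/dx = y, not necessarily the exact question asked): solve the ODE symbolically, then substitute a model’s proposed answer back into the equation to confirm it.

```python
import sympy as sp

x = sp.symbols("x")
y = sp.Function("y")

# A simple ODE of the kind the models were quizzed on: dy/dx = y.
ode = sp.Eq(y(x).diff(x), y(x))
solution = sp.dsolve(ode, y(x))  # general solution: y(x) = C1*exp(x)
print(solution)

# Substitute a model's proposed answer back into the ODE to check it.
candidate = sp.exp(x)  # a particular solution (C1 = 1)
residual = sp.simplify(candidate.diff(x) - candidate)
print(residual)  # zero residual means the candidate satisfies the ODE
```

The same substitute-and-simplify check works for algebra answers too, making it a cheap safeguard when using LLMs for math.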

Conclusion

  • Code Generation & Code Conversion: Both models perform similarly in these tasks, and both should be considered tools that aid developers rather than replacements for them in the software development process.
  • Book Summarization: Claude-2 exhibits a remarkable capacity to analyse and process extensive textual information thanks to its expansive context window. For summarizing or analysing large volumes of text, such as books, the Claude-2-100k model is the better option.
  • Small Document Analysis: Both models perform similarly.
  • Data Analysis: GPT-4–32k has a significant advantage in terms of response time, which may be a crucial factor in real-time data analysis scenarios.
  • Math Skills: GPT-4 excelled across the tested problem types, providing accurate solutions and clearer explanations than Claude-2.

Overall: While Claude-2–100k offers a larger context window with 100,000 tokens, GPT-4–32k slightly outperforms Claude-2 in various aspects.

However, it’s important to keep in mind that extensive context, as offered by Claude-2–100k, can be advantageous for tasks requiring in-depth analysis of large datasets, a capacity that GPT-4’s 32k token limit cannot match. Accepting more input allows for a richer understanding of data, which can be incredibly valuable in real-world applications.

Ultimately, choosing the right model depends on the specific requirements of your project and the balance between context and performance.

About the author

Vaishnavi R is a Junior Data Scientist at the Version 1 AI Labs.
