Exploring Claude Sonnet’s Multi-Modal Capabilities in Amazon Bedrock

Published in

Artificial Intelligence in Plain English

13 min readMar 10, 2024

In this blog, we delve into the multi-modal capabilities of Anthropic’s Claude Sonnet model, exploring its potential applications within the automotive and manufacturing industries.

Amazon Web Services (AWS) recently announced[1] support for Claude 3 models, including Sonnet, within its Amazon Bedrock service. Boasting a 2x performance improvement[2] over its predecessors, Claude 2 and Claude 2.1, Sonnet offers reduced hallucination, enhanced analysis, forecasting, code generation, increased mathematical capabilities, and multi-lingual support with higher levels of intelligence. One of Sonnet’s key strengths lies in its ability to understand various image formats, such as diagrams, charts, photographs, and scanned documents, in addition to natural language. It supports both image and text modalities, enabling a more comprehensive understanding of complex data. For further details, please refer to the science paper released by Anthropic[3].

In this blog, we will explore Sonnet’s multi-modal capabilities through the following use cases:

OCR — Simple text extraction and extraction from nested structures
Diagram analysis — Interpreting architectural diagrams, industrial piping and equipment diagrams, and analyzing chemical properties through charts — comparing molar mass versus boiling points and some fictitious organic compounds to demonstrate its capabilities beyond its training data
Predictive maintenance — Detecting dents and repairs in cars from images
Multi-modal RAG (Retrieval-Augmented Generation) — Allowing users to supply both text and images as input for querying, comparing, and analyzing data

For the analysis presented in this blog, I have utilized images from the comprehensive cars (CompCars) dataset[4].

You can find the code used in this blog in this Github repo.

Before executing the code, ensure that you have enabled access to the Anthropic Claude 3 Sonnet model from the model access page in the Amazon Bedrock console[5].

Let’s embark on this exploration!

Optical Character Recognition

Let’s begin our evaluation with text extraction, a crucial capability for understanding and interpreting complex documents and images. Consider this page from an automobile manual, vertically divided into two sections. The right portion covers the Tire Pressure Monitoring System (TPMS) and includes a table with recommended tire pressures.

When asked to extract the recommended tire pressures and present the response as a markdown table, the LLM provided the following result:

This output is impressive, as the image had a nested structure, with the table embedded within it. Despite this complexity, the model was able to accurately extract the relevant data for us.

Next, we introduce an image with additional intricacies. This image, sourced from a car service center, lists customer requests, recommendations based on the inspection by the service center, failed inspections or warning signs observed, whether the customer approved the recommended actions, and their associated costs.

When instructed to provide the list of customer requests and the warning signs observed during the inspection — which falls under the second section of the image — the model responded as follows:

As you can see, the extraction accuracy is remarkable. The model successfully extracted the customer requests and caution items, despite extracting additional sections that were not part of the initial request. With further prompting and refinement, this capability can be enhanced.

Analysis of Diagrams

Next, let’s evaluate the model’s ability to analyze diagrams. We’ll start with the analysis of a solution architecture diagram for an enterprise data lake house application[6].

The model was invoked with a simple prompt: “Explain this architecture diagram. Identify the components.” It provided the following response:

As you can see, the model successfully extracted the list of components, knowledge sources, and data flows. It correctly highlighted that the architecture follows a data lake pattern, ingesting data from various sources into an AWS S3 data lake, with AWS Glue performing ETL transformations. The processed data can be queried using Athena, visualized with QuickSight, and consumed by downstream applications like SAP BusinessObjects.

How does the model handle the analysis of industrial piping and equipment diagrams?

We’ll analyze an Amine Treatment process [7] — an acid scrubbing process used in petrochemical plants and refineries to remove hydrogen sulfide (H2S) and carbon dioxide (CO2) from alkylamine gases. This process involves two major components: the Absorber and the Regenerator, and removes acids from gases (referred to as “sour gas”).

When the model was asked to describe the diagram, it came-up with the following response.

The image depicts a typical process flow diagram for an amine gas treating unit used for removing acid gases like hydrogen sulfide (H2S) and carbon dioxide (CO2) from sour natural gas streams. The main components shown are:

1. Absorber: A tall vertical column where the sour gas enters at the bottom and contacts the lean amine solution flowing down from the top tray. The rich amine solution laden with absorbed acid gases exits from the bottom tray.

2. Rich Amine: The amine solution rich in absorbed acid gases.

3. Condenser: Cools and condenses the acid gas vapors from the regenerator.

4. Reflux Drum: Receives the condensed acid gas and reflux from the top tray of the regenerator.

5. Regenerator: A vertical column where the rich amine solution is heated to release the absorbed acid gases as vapor from the top tray. The lean amine solution exits from the bottom tray.

6. Reboiler: A heat exchanger that provides the heat required to regenerate the lean amine solution by vaporizing and stripping the acid gases.

7. Lean Amine: The regenerated amine solution that is lean in acid gas content.

The typical operating ranges for the absorber (35-50°C, 5-205 atm) and regenerator (115-126°C, 1.4-1.7 atm at tower bottom) are also provided.

It’s worth noting that there are variations of the Amine treatment process, and the specific implementation may differ based on the type of inlet amine used. However, the response accurately describes the diagram and the overall process depicted.

As a next step, we evaluate the model’s code generation capabilities by asking it to generate the process flows as a JSON and generate Python code to create a block diagram of the process. The Python code generated by the model produced the following block diagram:

While there are some gaps (e.g., Sweet gas is represented as an input to the Absorber) in this flow, the overall accuracy in representing the end-to-end process is commendable.

Analysis of charts

Claude Sonnet has been trained to interpret chart data, enabling it to analyze visual elements and identify relationships among the variables represented in the charts.

We’ll start with the analysis of a chart showing the boiling points of organic chemicals (e.g., alkanes, alkenes, ethers, and alcohols) as a function of molar mass [8].

When asked to compare the characteristics of ethers and alcohols from this chart, the model provided the following observation:

Based on the chart, we can compare the characteristics of ethers and alcohols in terms of their boiling points relative to their molar masses.

Ethers, represented by the yellow line, exhibit lower boiling points compared to alcohols (represented by the light blue line) across the range of molar masses shown. This suggests that ethers are generally more volatile and have weaker intermolecular attractions compared to alcohols of similar molar masses.

For example, at a molar mass around 60, the boiling point of an ether compound is approximately 40°C, while the boiling point of an alcohol compound with a similar molar mass is around 80°C, indicating a higher boiling point for the alcohol.

Some specific examples of ethers and alcohols that can be compared from the chart include:

1. Ether with molar mass around 60 (e.g., diethyl ether, CH3CH2OCH2CH3) has a boiling point around 40°C.
2. Alcohol with molar mass around 60 (e.g., 1-propanol, CH3CH2CH2OH) has a boiling point around 80°C.

This difference in boiling points can be attributed to the presence of hydrogen bonding in alcohols, which leads to stronger intermolecular attractions and higher boiling points compared to ethers, which lack hydrogen bonding capabilities.

From the chart, it appears that alcohols generally have higher boiling points than ethers of similar molar mass. The curve for alcohols is consistently above the curve for ethers across the range of molar masses shown. This higher boiling point range for alcohols can be explained by their ability to form hydrogen bonds, which increases the intermolecular attractive forces and consequently raises the boiling points.

Although the color of the curves for ethers and alcohols was incorrect, the model’s observation that alcohols have higher boiling point ranges compared to ethers was accurate. Furthermore, the model provided a plausible explanation for this observation, attributing it to the hydrogen bonding capability of alcohols — likely drawing from its training data.

How about its interpretation capabilities if observed features are not part of its training? To evaluate the model’s interpretation capabilities when the observed features are not part of its training, I created a chart showing the relationship between fictitious properties, absolute vapor density, and thermal stability index of fictitious organic compounds: Hexanitrobenzene, Ethylpentanol, and Dimethyloctane.

When asked to analyze and compare the characteristics of these organic compounds, the model produced the following response:

Based on the chart, the vapor density (labeled as "Abs Vapor density") varies differently for the three organic compounds shown:

1. Hexanitrobenzene (orange line): The vapor density increases relatively slowly as the Thermal Stability Index increases from 0 to around 100, after which it starts rising more steeply.

2. Ethylpentanol (blue line): The vapor density remains low until a Thermal Stability Index of around 60, then it starts increasing more rapidly in an upward curve.

3. Dimethyloctane (green line): The vapor density shows the steepest rise among the three compounds, increasing very sharply as the Thermal Stability Index increases beyond around 100.

So in summary, Dimethyloctane exhibits the highest vapor density increase with increasing Thermal Stability Index, followed by Ethylpentanol and then Hexanitrobenzene showing the slowest vapor density increase over the given range of Thermal Stability Index values.

Based on the chart, Dimethyloctane appears to have the steepest increase in absolute vapor density as the thermal stability index increases, particularly after a thermal stability index of around 100. Hexanitrobenzene and Ethylpentanol have more gradual increases in absolute vapor density with increasing thermal stability index, and their curves are closer together compared to Dimethyloctane.

Although the color of the lines was incorrect, the model’s observation was remarkably accurate. It correctly identified that Dimethyloctane exhibits the steepest rise in absolute vapor density among the products after a thermal stability index of 100.

I strongly recommend reviewing the findings by Anthropic on the Claude Opus model’s ability to analyze charts related to world GDP. They conducted an analysis with both real and simulated data, and for real data, the model’s predictions were within 5% of the actual values [9].

Predictive maintenance

One of the compelling use cases for Claude Sonnet is leveraging its capabilities for predictive maintenance. We can harness its multi-modal abilities to detect damages such as dents and scratches on surfaces like automotive bodies.

When presented with an image and asked to detect and describe any observable damages, the model produced the following high-quality observation:

Yes, the image shows some damage and scratches on the side of the car near the front wheel well area. There appear to be several noticeable scratches or dents in the white paint along the side body panel right behind the front wheel and leading edge of the driver's side door.

The model’s response demonstrates its proficiency in analyzing visual data and accurately identifying and describing physical defects or damages. Its detailed observation highlights the location and nature of the dent, providing a clear and concise assessment of the issue.

While a simple example was presented here, this capability holds significant value in predictive maintenance scenarios — early detection of damages or potential issues can enable proactive measures to prevent further deterioration or costly repairs. Furthermore, the model’s ability to process and interpret visual data can be extended to various other domains, such as manufacturing, infrastructure inspection, and quality assurance, enabling intelligent automation and enhancing decision-making processes.

Multimodal RAG

With Claude Sonnet, we can build a powerful solution for integrating image information into retrieval-augmented generation (RAG) applications. By leveraging its multimodal capabilities, it can produce concise text summaries from images, which can then be embedded and stored alongside raw text content in a vector database like Chroma. We will ingest multi modal embeddings generated using Titan Multimodal Embeddings G1 into the vector store. During retrieval, both the image summaries and text chunks are fetched, allowing Sonnet to synthesize answers by jointly considering the raw images and textual information. This approach ensures that the valuable insights captured in images are not lost, enhancing the overall quality and comprehensiveness of the generated responses. Below architecture diagram explains the flow.

You can generate embeddings with Bedrock Titan and provide either text or image or both when invoke_model API is called.

body = {}
    if txt:
        body = {"inputText":txt}
    if imgbase64:
        body["inputImage"] = imgbase64
    body = json.dumps(body)
    accept = "application/json"
    contentType = "application/json"

response = br.invoke_model(
        body=body,modelId=titan_embed_model_id,accept=accept, contentType=contentType
    )

To query for matching text and images based on input query from the user, we first generate embeddings and search the vector store.

search_embeddings = embed_img_text(query_text,imagebase64)
search_results = collection.query(query_embeddings=search_embeddings,n_results=n)

Refer to the mentioned repo for full code.

As an example, user takes a picture of her dashboard and asks what’s wrong with her car.

We search the vector store for matching texts and images and add them to the context along with the user query and supplied image. The model responds with the following

"analysis": "The image shows that the instrument cluster or dashboard display of the vehicle is indicating a 'SRS Malfunction Service Required' warning. Additionally, the 'BRAKE' warning light is illuminated on the rightmost gauge. This suggests that there are issues with the vehicle's Supplemental Restraint System (SRS) and brake system.",
"rationale": "The 'SRS Malfunction Service Required' warning typically indicates a problem with the vehicle's airbag system or related safety components. This could be due to a faulty sensor, wiring issue, or other malfunction in the SRS. The illuminated 'BRAKE' warning light indicates a problem with the brake system, which could be related to low brake fluid, worn brake pads, or an issue with the brake components or hydraulic system. Both of these warnings need to be addressed promptly as they relate to critical safety systems in the vehicle."You can restrict response to the parts and user manual of the given vehicle to provide relevant response. For example when asked about type of part of the below image

model responded with

"analysis": "No, the object shown in the image is not a car part. It appears to be a wrench or spanner, which is a hand tool used for gripping and turning nuts, bolts, and other mechanical components.",
"rationale": "While wrenches and other tools are essential for performing automotive repairs and maintenance, they are not parts that are permanently installed on a vehicle. The image depicts a common open-ended wrench with a blue plastic handle and a metal body with openings of different sizes on each end to fit different sizes of nuts and bolts. Finding a loose wrench outside of your car likely means that it was accidentally left there after working on the vehicle."

In the above example, orchestration is done in the code. For a LangChain based example, you can refer to their cookbook[10] and swap the embedding and LLM models.

Conclusion

We have utilized simple prompting throughout this exercise, allowing Claude Sonnet to showcase its impressive multi-modal capabilities. The model’s performance across various tasks, including OCR, diagram analysis, chart interpretation, and predictive maintenance, has been remarkable.

Overall, the multi-modal nature of Claude Sonnet, combined with its ability to understand and reason across multiple modalities, opens up a wide range of possibilities for business workflows.

References

[1]AWS blog post on Claude 3 Sonnet availability
[2] Anthropic blog post on Claude 3 family
[3] Anthropic science paper on Claude 3 family
[4]The Comprehensive Cars (CompCars) dataset
[5] Amazon Bedrock model access documentation page
[6] AWS lake house architecture diagram
[7] Wikipedia article — Amine gas treatment
[8] Wikipedia article — Boiling points as a function of molar mass
[9]Anthropic video on Claude Opus as an economic analyst
[10] LangChain Multimodal cookbook

Thank you for taking the time to read and engage with this article. Your support in the form of following me and sharing the article is highly valued and appreciated. The views expressed in this article are my own and do not necessarily represent the views of my employer. If you have any feedback and topics you want to cover, please reach me at https://www.linkedin.com/in/gopinathk/

In Plain English 🚀

Thank you for being a part of the In Plain English community! Before you go:

Be sure to clap and follow the writer ️👏️️
Follow us: X | LinkedIn | YouTube | Discord | Newsletter
Visit our other platforms: Stackademic | CoFeed | Venture | Cubed
More content at PlainEnglish.io

Artificial Intelligence in Plain English

Exploring Claude Sonnet’s Multi-Modal Capabilities in Amazon Bedrock

Optical Character Recognition

Analysis of Diagrams

Analysis of charts

Predictive maintenance

Multimodal RAG

Conclusion

References

In Plain English 🚀

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Published in Artificial Intelligence in Plain English

Written by Gopi Krishnamurthy

No responses yet