Building an Image Understanding API with Claude and Amazon Bedrock

Introduction

In this article, we will demonstrate how to use Anthropic Claude 3 on Amazon Bedrock for image understanding and integrate it into a callable API. This work is based on the lab titled “Building with Amazon Bedrock and LangChain: Multimodal & image labs, Lab M-9: Image understanding” from the AWS workshop catalog.

As we delve deeper into the digital era, the development of multimodal models like Anthropic Claude 3 has become crucial in enhancing machine understanding. These models excel in processing and generating content across different data forms, such as text and images, with remarkable capabilities in image-to-text tasks. By translating visual data into text, we unlock a wealth of information that can automate and refine processes across various sectors. For example, in e-commerce, image-to-text capabilities can automatically categorize products from images alone, improving search efficiency and accuracy. This technology also aids in generating automatic photo descriptions that enhance user experiences by providing information that might not be explicitly stated in product titles or descriptions.

This introduction sets the stage for discussing the specific functionalities and applications of image understanding APIs, highlighting the transformative potential they hold in various industries.

The Claude 3 model family by Anthropic sets new industry benchmarks across various cognitive tasks. It includes models like Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, each offering increasingly powerful performance. Claude 3 models excel in analysis, nuanced content creation, and real-time data extraction, making them ideal for our image understanding application.

Run the lab

In this second section, we will follow the original lab steps to set up our environment and build the image understanding application. The first step is to set up the environment (Cloud9) and Amazon Bedrock by following the setup page instructions provided here.
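Before moving on, it is worth confirming that the environment can reach Bedrock and that Claude 3 access has been enabled. A minimal check, assuming AWS credentials and a default region are already configured in the Cloud9 environment (the byProvider filter simply narrows the listing to Anthropic models):

import boto3

# List the Anthropic foundation models visible to this account and region.
# If Claude 3 Sonnet is missing, enable model access in the Bedrock console.
bedrock = boto3.client("bedrock")
models = bedrock.list_foundation_models(byProvider="Anthropic")
for model in models["modelSummaries"]:
    print(model["modelId"])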

Once the environment is set up, we will proceed by pasting the provided code snippets into the respective Python files as instructed in the lab.

Setting Up the Supporting Library

The first part involves creating the supporting library that connects the Streamlit front end to the Bedrock back end. Open the image_understanding_lib.py file in the workshop/labs/image_understanding folder and add the following code:

import boto3
import json
import base64
from io import BytesIO

#get a BytesIO object from file bytes
def get_bytesio_from_bytes(image_bytes):
    image_io = BytesIO(image_bytes)
    return image_io

#get a base64-encoded string from file bytes
def get_base64_from_bytes(image_bytes):
    resized_io = get_bytesio_from_bytes(image_bytes)
    img_str = base64.b64encode(resized_io.getvalue()).decode("utf-8")
    return img_str

#load the bytes from a file on disk
def get_bytes_from_file(file_path):
    with open(file_path, "rb") as image_file:
        file_bytes = image_file.read()
    return file_bytes

#get the stringified request body for the InvokeModel API call
def get_image_understanding_request_body(prompt, image_bytes=None, mask_prompt=None, negative_prompt=None):
    input_image_base64 = get_base64_from_bytes(image_bytes)

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": input_image_base64,
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ],
            }
        ],
    }

    return json.dumps(body)

#generate a response using Anthropic Claude
def get_response_from_model(prompt_content, image_bytes, mask_prompt=None):
    session = boto3.Session()

    bedrock = session.client(service_name='bedrock-runtime') #creates a Bedrock client

    body = get_image_understanding_request_body(prompt_content, image_bytes, mask_prompt=mask_prompt)

    response = bedrock.invoke_model(body=body, modelId="anthropic.claude-3-sonnet-20240229-v1:0", contentType="application/json", accept="application/json")

    response_body = json.loads(response.get('body').read()) #read the response

    output = response_body['content'][0]['text']

    return output

The image_understanding_lib.py file serves as the supporting library for the image understanding application. It contains helper functions to process image data and functions to interact with Amazon Bedrock. Here’s a breakdown of the key functions:

  • get_bytesio_from_bytes(image_bytes): Converts file bytes to a BytesIO object, which is a necessary step for further processing.
  • get_base64_from_bytes(image_bytes): Encodes the image bytes into a base64 string, which is used to prepare the image for the API call.
  • get_bytes_from_file(file_path): Reads and returns the bytes from a specified file path.
  • get_image_understanding_request_body(prompt, image_bytes): Constructs the request body for the Bedrock API call, including the base64-encoded image and the user’s prompt.
  • get_response_from_model(prompt_content, image_bytes): Calls the Bedrock API with the constructed request body and returns the model’s response.

These functions enable the application to convert images into a suitable format for the API, send the request to the model, and handle the response.
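The library can also be exercised on its own, which is a quick way to confirm Bedrock connectivity before wiring up the front end. A minimal sketch, assuming one of the lab's sample images is available at images/food.jpg:

import image_understanding_lib as glib

# Load a sample image from the lab folder and ask Claude 3 for a caption.
image_bytes = glib.get_bytes_from_file("images/food.jpg")
response = glib.get_response_from_model(
    prompt_content="Please provide a brief caption for this image.",
    image_bytes=image_bytes,
)
print(response)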

Creating the Streamlit Front-End App

Next, we create the Streamlit front-end application. Open the image_understanding_app.py file in the same folder and add the following code:

import streamlit as st
import image_understanding_lib as glib

st.set_page_config(layout="wide", page_title="Image Understanding")

st.title("Image Understanding")

col1, col2, col3 = st.columns(3)

prompt_options_dict = {
    "Image caption": "Please provide a brief caption for this image.",
    "Detailed description": "Please provide a thoroughly detailed description of this image.",
    "Image classification": "Please categorize this image into one of the following categories: People, Food, Other. Only return the category name.",
    "Object recognition": "Please create a comma-separated list of the items found in this image. Only return the list of items.",
    "Subject identification": "Please name the primary object in the image. Only return the name of the object in <object> tags.",
    "Writing a story": "Please write a fictional short story based on this image.",
    "Answering questions": "What emotion are the people in this image displaying?",
    "Transcribing text": "Please transcribe any text found in this image.",
    "Translating text": "Please translate the text in this image to French.",
    "Other": "",
}

prompt_options = list(prompt_options_dict)

image_options_dict = {
    "Food": "images/food.jpg",
    "People": "images/people.jpg",
    "Person and cat": "images/person_and_cat.jpg",
    "Room": "images/room.jpg",
    "Text in image": "images/text2.png",
    "Toy": "images/toy_car.jpg",
    "Other": "images/house.jpg",
}

image_options = list(image_options_dict)

with col1:
    st.subheader("Select an Image")

    image_selection = st.radio("Image example:", image_options)

    if image_selection == 'Other':
        uploaded_file = st.file_uploader("Select an image", type=['png', 'jpg'], label_visibility="collapsed")
    else:
        uploaded_file = None

    if uploaded_file and image_selection == 'Other':
        uploaded_image_preview = glib.get_bytesio_from_bytes(uploaded_file.getvalue())
        st.image(uploaded_image_preview)
    else:
        st.image(image_options_dict[image_selection])

with col2:
    st.subheader("Prompt")

    prompt_selection = st.radio("Prompt example:", prompt_options)

    prompt_example = prompt_options_dict[prompt_selection]

    prompt_text = st.text_area("Prompt",
        value=prompt_example,
        height=100,
        help="What you want to know about the image.",
        label_visibility="collapsed")

    go_button = st.button("Go", type="primary")

with col3:
    st.subheader("Result")

    if go_button:
        with st.spinner("Processing..."):

            if uploaded_file:
                image_bytes = uploaded_file.getvalue()
            else:
                image_bytes = glib.get_bytes_from_file(image_options_dict[image_selection])

            response = glib.get_response_from_model(
                prompt_content=prompt_text,
                image_bytes=image_bytes,
            )

            st.write(response)

The image_understanding_app.py file sets up the Streamlit front-end for the image understanding application. It provides a user interface to upload images and input prompts, then displays the generated results. Here’s a breakdown of the key sections:

  • Streamlit Page Configuration: Configures the page layout and title.
  • User Interface Setup:
    - Image Selection: Allows users to select an image from predefined options or upload their own image.
    - Prompt Selection: Provides various prompt options for users to describe what they want to know about the image.
    - Result Display: Shows the result generated by the model in response to the user’s input.

The application interacts with the image_understanding_lib.py library to process the image and prompt, send the request to the model, and display the response to the user.
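To preview the app, run Streamlit from a terminal in the lab folder. The port number below is an assumption based on the Cloud9 workshop environment; adjust it to whatever your setup expects:

cd ~/environment/workshop/labs/image_understanding
streamlit run image_understanding_app.py --server.port 8080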

After running the app successfully and previewing it, you will see a user interface similar to the screenshot below.

Next, we will discuss the modifications made to convert the Streamlit app into a callable API.

Modified Code Explanation and Comparison

image_understanding_lib_v2.py

This modified version of the original image_understanding_lib.py includes additional functionality and changes to support the API integration. Below are the detailed explanations and comparisons:

  1. Helper Functions: These remain largely unchanged and serve the same purpose as in the original code, namely processing image data.
    - get_bytesio_from_bytes(image_bytes): Converts file bytes to a BytesIO object.
    - get_base64_from_bytes(image_bytes): Encodes image bytes into a base64 string.
    - get_bytes_from_file(file_path): Reads bytes from a file at a given path.
  2. New Function: upload_new_image(image_bytes): This is a placeholder function intended for handling the image upload process. The actual implementation needs to be added based on the specific requirements for storing the uploaded images (a possible sketch follows the code listing below).
  3. Function Modification: get_image_understanding_request_body(prompt, image_bytes=None, mask_prompt=None, negative_prompt=None): Constructs the request body for the Bedrock API call, including the base64-encoded image and the user’s prompt. This remains similar to the original but is explicitly designed for the new API context.
  4. Function Modification: get_response_from_model(prompt_content, image_bytes=None, mask_prompt=None):
    - Checks if image_bytes is None and returns a default response if no image is provided.
    - Calls the Bedrock API with the constructed request body and returns the model’s response.

import boto3
import json
import base64
from io import BytesIO

# Convert file bytes to a BytesIO object
def get_bytesio_from_bytes(image_bytes):
    image_io = BytesIO(image_bytes)
    return image_io

# Get a base64-encoded string from file bytes
def get_base64_from_bytes(image_bytes):
    resized_io = get_bytesio_from_bytes(image_bytes)
    img_str = base64.b64encode(resized_io.getvalue()).decode("utf-8")
    return img_str

# Load bytes from a file on disk
def get_bytes_from_file(file_path):
    with open(file_path, "rb") as image_file:
        file_bytes = image_file.read()
    return file_bytes

# Upload a new image
def upload_new_image(image_bytes):
    # Implement your image upload logic here and return the file path or bytes of the uploaded image
    pass

# Get the stringified request body for the InvokeModel API call
def get_image_understanding_request_body(prompt, image_bytes=None, mask_prompt=None, negative_prompt=None):
    input_image_base64 = get_base64_from_bytes(image_bytes)

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": input_image_base64,
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ],
            }
        ],
    }

    return json.dumps(body)

# Generate a response using Anthropic Claude
def get_response_from_model(prompt_content, image_bytes=None, mask_prompt=None):
    if image_bytes is None:
        return "Default response for no image provided."

    session = boto3.Session()
    bedrock = session.client(service_name='bedrock-runtime') # Creates a Bedrock client

    body = get_image_understanding_request_body(prompt_content, image_bytes, mask_prompt=mask_prompt)

    response = bedrock.invoke_model(body=body, modelId="anthropic.claude-3-sonnet-20240229-v1:0", contentType="application/json", accept="application/json")

    response_body = json.loads(response.get('body').read()) # Read the response

    output = response_body['content'][0]['text']

    return output
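The upload_new_image placeholder above needs a real implementation before the upload endpoint can return anything useful. One possible sketch, which simply writes the bytes to the local images folder under a generated name (the directory and naming scheme are assumptions for illustration, not part of the lab):

import os
import uuid

# Sketch: persist uploaded bytes under images/ and return the saved file path.
# The images/ directory and the UUID-based filename are illustrative assumptions.
def upload_new_image(image_bytes):
    os.makedirs("images", exist_ok=True)
    file_path = os.path.join("images", f"upload_{uuid.uuid4().hex}.jpg")
    with open(file_path, "wb") as f:
        f.write(image_bytes)
    return file_path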

Flask API Deployment

The image_understanding_app_v2.py file has been modified to use Flask to deploy the application as an API. This allows users to interact with the image understanding functionality through HTTP requests.

  1. Flask Setup: Initializes the Flask application.
  2. API Endpoints:
    - /image_understanding: Accepts a POST request with a prompt and an image file. It reads the image, processes it using the functions in image_understanding_lib_v2.py, and returns the response.
    - /upload_image: Accepts a POST request with an image file. It reads the image and calls the upload_new_image function.

from flask import Flask, request, jsonify
import image_understanding_lib_v2 as glib

app = Flask(__name__)

@app.route('/image_understanding', methods=['POST'])
def image_understanding():
    prompt = request.form.get('prompt')
    image_file = request.files.get('image')

    if image_file is None:
        return jsonify({'error': 'Image file is required.'}), 400

    image_bytes = image_file.read()
    response = glib.get_response_from_model(prompt, image_bytes)
    return jsonify({'response': response})

@app.route('/upload_image', methods=['POST'])
def upload_image():
    image_file = request.files.get('image_file')

    if image_file is None:
        return jsonify({'error': 'No image file provided.'}), 400

    image_bytes = image_file.read()
    image_path = glib.upload_new_image(image_bytes)
    return jsonify({'image_path': image_path})

if __name__ == '__main__':
    app.run(debug=True)
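To start the API locally, run the file directly; Flask's development server listens on port 5000 by default, which matches the curl examples below. For anything beyond local testing, a production WSGI server such as gunicorn would be a better choice than debug mode:

cd ~/environment/workshop/labs/image_understanding
python image_understanding_app_v2.py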

Testing the API

The test image is a random invoice I found online; I added it to the images folder, as shown in the attached image.

To test the API, use the following curl command and check the response:

curl -X POST http://localhost:5000/image_understanding \
-H "Content-Type: multipart/form-data" \
-F "prompt=Please provide a detailed description of this image." \
-F "image=@/home/ubuntu/environment/workshop/labs/image_understanding/images/fake_invoice.jpg"

This test prompts the model to provide a detailed description of the invoice image.

Response Example

{
"response": "This image appears to be an invoice from the company Slack. The invoice number is 1223113 and it is dated December 1, 2023. The payment terms are Net 45, with a due date of January 15, 2024.\n\nThe invoice is addressed to MineralTree at 101 Arch Street, Boston, MA 02110. The balance due on the invoice is $1,725.00.\n\nThe item listed on the invoice is \"Business+ Monthly User License - August 2023\". The quantity is 115 licenses at a rate of $15.00 each, totaling $1,725.00.\n\nThe subtotal, tax (0%), and total amount all match the balance due of $1,725.00.\n\nThe invoice displays the Slack logo and branding, as well as Slack's address of 500 Howard Street, San Francisco, CA 94105."
}

Second Test

Below is the second test:

curl -X POST http://localhost:5000/image_understanding \
-H "Content-Type: multipart/form-data" \
-F "prompt=Please summarize the text content of this invoice image." \
-F "image=@/home/ubuntu/environment/workshop/labs/image_understanding/images/fake_invoice.jpg"

This test prompts the model to summarize the text content of the invoice image.

Response Example

{
"response": "This image appears to be an invoice from the company Slack for their Business+ Monthly User License service. The invoice is addressed to a company called MineralTree located in Boston, MA. The invoice date is December 1, 2023, with a payment term of Net 45 days and a due date of January 15, 2024. The total balance due for 115 user licenses at a rate of $15.00 each is $1,725.00. The invoice includes details such as the billing addresses for both companies, the invoice number, payment terms, due date, and a breakdown of the item, quantity, rate, and amount."
}
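The same endpoint can also be called programmatically instead of via curl. A minimal client sketch using Python's requests library (the image path mirrors the curl examples and is an assumption about your environment):

import requests

# Send a prompt and an image file to the local Flask API and print the model's answer.
url = "http://localhost:5000/image_understanding"
with open("/home/ubuntu/environment/workshop/labs/image_understanding/images/fake_invoice.jpg", "rb") as f:
    response = requests.post(
        url,
        data={"prompt": "Please summarize the text content of this invoice image."},
        files={"image": f},
    )
print(response.json()["response"])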

Summary

In this implementation, we demonstrated how to leverage Amazon Bedrock to call Anthropic Claude 3 for image understanding and integrate it into a callable API. We followed the steps from the AWS workshop lab titled “Building with Amazon Bedrock and LangChain: Multimodal & image labs, Lab M-9: Image understanding,” and made some modifications to suit our requirements.

Possible Applications:

  1. Automated Document Processing: This implementation can be used to automate the processing of invoices, receipts, and other documents, extracting key information for further analysis or record-keeping.
  2. Content Moderation: The API can help in moderating image content on platforms by identifying and categorizing images, ensuring they meet the platform’s guidelines.
  3. Accessibility Enhancement: The image understanding API can be used to generate accessible alternative text and captions for images, improving accessibility for visually impaired users.
  4. Image-Based Search: Integrating this API with a search engine can enable users to perform searches based on the content of images, enhancing the search experience.
  5. Custom Visual Content Generation: Businesses can use this API to generate descriptions and summaries for visual content, aiding in marketing, documentation, and content creation processes.

This project showcases the potential of integrating advanced AI models like Anthropic Claude 3 with scalable cloud services to build powerful and versatile applications.
