In-depth benchmarking: Bedrock RAG (different Anthropic models) + Intro to Mistral

Madhur Prashant
10 min read · Dec 28, 2023


Purpose

The purpose of this blog is to evaluate and compare the performance of different Anthropic models, specifically focusing on their integration with Amazon Bedrock’s knowledge bases. Knowledge bases aggregate data from various sources into a centralized repository, facilitating efficient information retrieval and data management. In this context, the knowledge base is configured to support Retrieval Augmented Generation (RAG) for a medical use case, with a large collection of health documents stored in an S3 bucket.

The benchmarking process involves creating a knowledge base via the Boto3 SDK and linking it to a medical data source in an Amazon S3 bucket. The key performance metrics being tracked include inference latency, throughput, transactions per second, and average latency across all models when interacting with the knowledge base.

This assessment is crucial for understanding how different models perform in real-world scenarios, especially in the context of medical data, which often requires high accuracy and quick response times. By analyzing these metrics, we can determine the most efficient and effective models for specific use cases, leading to more informed decisions when deploying these technologies in practical applications.

NOTE: I work at AWS, but the thoughts, ideas and implementations on these blogs are my own.

Now without further ado, let’s take a look at the code walkthrough below!

Code Walkthrough

Benchmarking responses through Mistral Instruct and Bedrock Knowledge Bases on latency, accuracy, and relevancy, all via the Boto3 SDK

Configure your Bedrock clients:

import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from IPython.display import Markdown, display
from langchain.embeddings import BedrockEmbeddings

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
agent_client = boto3.client('bedrock-agent')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                                    region_name='us-east-1',
                                    config=bedrock_config)

# We will be using the Titan Embeddings model to generate our embeddings.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-g1-text-02", client=bedrock_client)
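This walkthrough assumes the knowledge base itself already exists. For completeness, creating one via Boto3 looks roughly like the sketch below; the role ARN, OpenSearch Serverless collection ARN, and index/field names are hypothetical placeholders you would replace with your own resources.

# Sketch: create a vector knowledge base backed by OpenSearch Serverless.
# All ARNs and names below are placeholders, not real resources.
kb_response = agent_client.create_knowledge_base(
    name='MedicalKB',
    roleArn='arn:aws:iam::123456789012:role/BedrockKBRole',  # hypothetical role
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            # Same Titan embeddings model used above
            'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-g1-text-02'
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': 'arn:aws:aoss:us-east-1:123456789012:collection/abc123',  # hypothetical
            'vectorIndexName': 'medical-index',
            'fieldMapping': {
                'vectorField': 'vector',
                'textField': 'text',
                'metadataField': 'metadata'
            }
        }
    }
)
knowledge_base_id = kb_response['knowledgeBase']['knowledgeBaseId']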

Create a data source that you can attach to this KB:

# Define the S3 configuration for your data source
s3_configuration = {
    'bucketArn': 'arn:aws:s3:::medicaldata2039',  # Replace with the ARN of your S3 bucket
    'inclusionPrefixes': ['*']  # Assuming you want to include all files in the bucket
                                # (note: prefixes appear to be matched literally, so
                                # omitting this field is the safer way to ingest everything)
}

# Define the data source configuration
data_source_configuration = {
    's3Configuration': s3_configuration,
    'type': 'S3'  # Type of data source, in this case, S3
}

# Replace with your knowledge base ID
knowledge_base_id = ''

# Create the data source
try:
    data_source_response = agent_client.create_data_source(
        knowledgeBaseId=knowledge_base_id,
        name='MedicalDataSource',
        description='DataSource for medical data',
        dataSourceConfiguration=data_source_configuration
    )
    # Pretty print the response
    pprint.pprint(data_source_response)
except Exception as e:
    print(f"Error occurred: {e}")

Creating a data source in Amazon Bedrock involves configuring a connection to an external storage system where your data is hosted. In this case, the data is stored in an Amazon S3 bucket named medicaldata2039. The process starts by defining an S3 configuration, which includes the bucket’s ARN and inclusion prefixes. The inclusion prefixes are used to specify which files in the bucket should be included; this can range from a specific file name to a wildcard ‘*’ to include all files.
Once the S3 configuration is set, it’s incorporated into a larger data source configuration, which is then used to create the data source through the Amazon Bedrock API. The API call requires details like the name and description of the data source, as well as the ID of the knowledge base to which this data source is to be added.
After the data source is successfully created and linked to the knowledge base, the next step is to initiate a synchronization process. This process involves Amazon Bedrock scanning the specified S3 bucket, based on the inclusion prefixes, and ingesting the data into the knowledge base. During this synchronization, the data is processed, and potentially, embeddings are created based on the configured embedding model. This makes the data searchable and retrievable through the knowledge base.
Once synchronization is complete, the content from the S3 bucket is available in the knowledge base. You can then perform queries against this knowledge base to retrieve relevant information based on your search criteria. This integration allows for a seamless connection between your stored data in S3 and the powerful search and retrieval capabilities offered by Amazon Bedrock, making it a robust solution for managing and accessing large volumes of data efficiently.
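As a concrete example, the synchronization step described above can be kicked off and polled with the bedrock-agent client. This is a minimal sketch, assuming the data source ID comes from the create_data_source response above:

import time

# Kick off ingestion (sync) for the data source created above
data_source_id = data_source_response['dataSource']['dataSourceId']
ingestion_response = agent_client.start_ingestion_job(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id
)
job = ingestion_response['ingestionJob']

# Poll until the ingestion job finishes
while job['status'] not in ('COMPLETE', 'FAILED'):
    time.sleep(10)
    job = agent_client.get_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        ingestionJobId=job['ingestionJobId']
    )['ingestionJob']
print(f"Ingestion job finished with status: {job['status']}")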

Now that you have created your KB, let's retrieve and generate from it:

import boto3
import pprint
from botocore.client import Config

pp = pprint.PrettyPrinter(indent=2)
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})

model_id = "anthropic.claude-instant-v1"
region_id = ""
kb_id = ""

Now, let's retrieve and generate responses. Below is a minimal sketch of the retrieveAndGenerate helper, built on the bedrock-agent-runtime retrieve_and_generate API:

def retrieveAndGenerate(input, kbId, sessionId=None, model_id="anthropic.claude-instant-v1", region_id=""):
    # Minimal sketch: query the knowledge base and generate a grounded answer
    # via the bedrock-agent-runtime retrieve_and_generate API.
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    kwargs = {
        'input': {'text': input},
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,
                'modelArn': model_arn
            }
        }
    }
    # Reuse a session to keep conversational context across calls
    if sessionId:
        kwargs['sessionId'] = sessionId
    return bedrock_agent_client.retrieve_and_generate(**kwargs)
query = "What are trends in health insurance coverage?"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)
('The primary driver of declining enrollment in private health insurance has '
'been the increasing cost of health care, contributing to the rising '
'proportion of uninsured Americans. Approximately 16% of Americans lack '
'health insurance at any given time.')
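Beyond the generated text, the retrieve_and_generate response also carries citations pointing back to the retrieved S3 chunks, which helps when eyeballing relevancy and accuracy. A short sketch of pulling those out, assuming the standard response shape of the agent runtime API:

# Inspect which source chunks grounded the answer
for citation in response.get('citations', []):
    for ref in citation.get('retrievedReferences', []):
        print(ref['location']['s3Location']['uri'])
        print(ref['content']['text'][:200], '...')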

Retrieving and generating for a set of questions on the medical health documentation, for benchmarking:

questions = {
    1: "What are some ways in which an individual’s health can be maintained or improved?",
    2: "How does the current American health care system contribute to innovation in treatments?",
    3: "What are the key objectives of the Administration's policies in health care reform?",
    4: "How does consumer-directed health insurance plans fit into the Administration's health care policies?",
    5: "What proposal did the President make in the State of the Union Address regarding health insurance?",
    6: "What factors contribute to health alongside health care services?",
    7: "How has the trend of obesity changed in the United States since the late 1970s?",
    8: "What are some common conditions that affect job productivity through absenteeism and presenteeism?",
    9: "What has been the trend in national health care spending in the United States since 1960?",
    10: "How has life expectancy in the United States changed since 1900, and what pattern is observed from birth and from age 65?"
}

import time
import pandas as pd

async def generate_responses(model_id, region_id, kb_id, questions):
    responses = []
    for qid, query in questions.items():
        start_time = time.time()
        # If retrieveAndGenerate is not an async function, call it directly
        response = retrieveAndGenerate(query, kb_id, model_id=model_id, region_id=region_id)
        end_time = time.time()
        latency = end_time - start_time
        generated_text = response['output']['text']
        responses.append({
            "Question ID": qid,
            "Question": query,
            "Response": generated_text,
            "Inference Latency (s)": latency,
            "Model": "anthropic-claude-instant"
        })
    return pd.DataFrame(responses)

model_id = "anthropic.claude-instant-v1"
region_id = ""
kb_id = ""
results_df = await generate_responses(model_id, region_id, kb_id, questions)

# Display or save the results
print(results_df)
results_df.to_csv("query_responses_latency.csv", index=False)
Question ID                                           Question  \
0 1 What are some ways in which an individual’s he...
1 2 How does the current American health care syst...
2 3 What are the key objectives of the Administrat...
3 4 How does consumer-directed health insurance pl...
4 5 What proposal did the President make in the St...
5 6 What factors contribute to health alongside he...
6 7 How has the trend of obesity changed in the Un...
7 8 What are some common conditions that affect jo...
8 9 What has been the trend in national health car...
9 10 How has life expectancy in the United States c...
Response  Inference Latency (s)  \
0 Individual's health can be maintained or impro... 5.500159
1 The American health care system contributes to... 5.462244
2 The key objectives of the Administration's pol... 4.812972
3 The Administration supports consumer-directed ... 3.352419
4 The President proposed changing the tax treatm... 3.226382
5 Sorry, I am unable to assist you with this req... 2.228114
6 The prevalence of obesity in the United States... 3.828322
7 Some common conditions that affect job product... 4.539162
8 National health care spending in the United St... 3.640423
9 Life expectancy in the United States has incre... 3.342381
Model  
0 anthropic-claude-instant
1 anthropic-claude-instant
2 anthropic-claude-instant
3 anthropic-claude-instant
4 anthropic-claude-instant
5 anthropic-claude-instant
6 anthropic-claude-instant
7 anthropic-claude-instant
8 anthropic-claude-instant
9 anthropic-claude-instant
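With the per-question latencies in a DataFrame, a quick pandas aggregation gives mean and tail latency. A minimal sketch of how these runs can be summarized:

# Summarize latency across the ten questions
latency = results_df["Inference Latency (s)"]
print(f"Mean latency: {latency.mean():.2f}s")
print(f"p90 latency:  {latency.quantile(0.9):.2f}s")
print(f"Max latency:  {latency.max():.2f}s")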

More metrics: word counts, throughput, and TPS

import time
import pandas as pd

def count_words(text):
    return len(text.split())

async def generate_responses(model_id, region_id, kb_id, questions):
    responses = []
    for qid, query in questions.items():
        start_time = time.time()
        # Call the function to get a response
        response = retrieveAndGenerate(query, kb_id, model_id=model_id, region_id=region_id)
        end_time = time.time()
        latency = end_time - start_time
        generated_text = response['output']['text']
        input_word_count = count_words(query)
        output_word_count = count_words(generated_text)
        word_throughput = output_word_count / latency if latency > 0 else 0
        # TPS here is the inverse of latency, i.e., sequential single-request throughput
        tps = 1 / latency if latency > 0 else 0
        responses.append({
            "Question ID": qid,
            "Question": query,
            "Response": generated_text,
            "Input Word Count": input_word_count,
            "Output Word Count": output_word_count,
            "Word Throughput (words/s)": word_throughput,
            "Transactions per Second (TPS)": tps,
            "Inference Latency (s)": latency,
            "Model": model_id
        })
    return pd.DataFrame(responses)

# Example usage
model_id = "anthropic.claude-instant-v1"
region_id = ""
kb_id = ""
results_df = await generate_responses(model_id, region_id, kb_id, questions)

# Display or save the results
print(results_df)
results_df.to_csv("query_responses_latency.csv", index=False)

OUTPUT

Question ID                                           Question  \
0 1 What are some ways in which an individual’s he...
1 2 How does the current American health care syst...
2 3 What are the key objectives of the Administrat...
3 4 How does consumer-directed health insurance pl...
4 5 What proposal did the President make in the St...
5 6 What factors contribute to health alongside he...
6 7 How has the trend of obesity changed in the Un...
7 8 What are some common conditions that affect jo...
8 9 What has been the trend in national health car...
9 10 How has life expectancy in the United States c...
Response  Input Word Count  \
0 Individual's health can be maintained or impro... 14
1 The American health care system contributes to... 13
2 The key objectives of the Administration's pol... 13
3 The Administration supports consumer-directed ... 13
4 The President proposed changing the tax treatm... 16
5 Sorry, I am unable to assist you with this req... 9
6 The prevalence of obesity in the United States... 15
7 Some common conditions that affect job product... 13
8 National health care spending in the United St... 16
9 Life expectancy in the United States has incre... 22
Output Word Count  Word Throughput (words/s)  \
0 56 17.247064
1 63 12.711360
2 49 18.302368
3 65 20.219222
4 20 8.170691
5 10 5.063330
6 55 22.123053
7 19 6.813371
8 63 17.261546
9 102 24.651588
Transactions per Second (TPS)  Inference Latency (s)  \
0 0.307983 3.246929
1 0.201768 4.956197
2 0.373518 2.677249
3 0.311065 3.214763
4 0.408535 2.447773
5 0.506333 1.974985
6 0.402237 2.486094
7 0.358598 2.788634
8 0.273993 3.649731
9 0.241682 4.137665
Model  
0 anthropic.claude-instant-v1
1 anthropic.claude-instant-v1
2 anthropic.claude-instant-v1
3 anthropic.claude-instant-v1
4 anthropic.claude-instant-v1
5 anthropic.claude-instant-v1
6 anthropic.claude-instant-v1
7 anthropic.claude-instant-v1
8 anthropic.claude-instant-v1
9 anthropic.claude-instant-v1
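Finally, let's run both Claude variants over the same questions and compare their average latency: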
import pandas as pd
import asyncio

async def main():
    questions = {
        1: "What are some ways in which an individual’s health can be maintained or improved?",
        2: "How does the current American health care system contribute to innovation in treatments?",
        3: "What are the key objectives of the Administration's policies in health care reform?",
        4: "How does consumer-directed health insurance plans fit into the Administration's health care policies?",
        5: "What proposal did the President make in the State of the Union Address regarding health insurance?",
        6: "What factors contribute to health alongside health care services?",
        7: "How has the trend of obesity changed in the United States since the late 1970s?",
        8: "What are some common conditions that affect job productivity through absenteeism and presenteeism?",
        9: "What has been the trend in national health care spending in the United States since 1960?",
        10: "How has life expectancy in the United States changed since 1900, and what pattern is observed from birth and from age 65?"
    }

    # Generate responses for the first model
    model_id_1 = "anthropic.claude-instant-v1"
    results_df_1 = await generate_responses(model_id_1, region_id, kb_id, questions)

    # Generate responses for the second model
    model_id_2 = "anthropic.claude-v2"
    results_df_2 = await generate_responses(model_id_2, region_id, kb_id, questions)

    # Combine results
    combined_results = pd.concat([results_df_1, results_df_2])

    # Calculate and print average latency for each model
    avg_latency_1 = results_df_1["Inference Latency (s)"].mean()
    avg_latency_2 = results_df_2["Inference Latency (s)"].mean()
    print(f"Average Latency for {model_id_1}: {avg_latency_1} seconds")
    print(f"Average Latency for {model_id_2}: {avg_latency_2} seconds")

    # Optionally, save combined results
    combined_results.to_csv("combined_query_responses_latency.csv", index=False)

# Run the main function
await main()
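To compare the two models side by side on every metric at once, a quick pandas groupby over the combined results works. A minimal sketch, reading back the CSV saved above:

import pandas as pd

# Per-model averages across the numeric benchmark columns
combined = pd.read_csv("combined_query_responses_latency.csv")
summary = combined.groupby("Model")[
    ["Inference Latency (s)", "Word Throughput (words/s)", "Transactions per Second (TPS)"]
].mean()
print(summary)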

Mistral Instruct using LlamaIndex: running retrieval on the same questions over the same data to benchmark and test

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

## Now let's load the data from the PDFs
documents = SimpleDirectoryReader("medical_datapdf").load_data()

## Prompt engineer it
from llama_index.prompts.prompts import SimpleInputPrompt

system_prompt = "You are a medical assistant. You will answer all medical questions accurately based on the instructions and context you have."
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = "<|USER|>{query_str}<|ASSISTANT|>"

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'mistralai/Mistral-7B-Instruct-v0.1',
    'SM_NUM_GPUS': json.dumps(1)
}

# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role=role,
)

# Deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

# Send a quick smoke-test request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
---------!
[{'generated_text': "My name is Julien and I like to play guitar. I'm a beginner and I'm trying to learn how to play."}]

Creating embeddings using the Hugging Face all-mpnet-base-v2 sentence-transformers model

!pip install pdfplumber

import pdfplumber

def extract_text_from_pdf(pdf_path):
    # Concatenate the text of every page in the PDF
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            text += page.extract_text() or ''
    return text

pdf_text = extract_text_from_pdf('medical_datapdf/ERP-2008-chapter4 (1).pdf')

query = "What are trends in health insurance coverage?"

# Mistral Instruct runs behind a text-generation (TGI) container, so "inputs"
# must be a single prompt string rather than a question/context dict; the
# [INST] tags follow the Mistral-7B-Instruct chat format. A full PDF chapter
# may exceed the model's context window, so you may need to truncate pdf_text.
prompt = f"<s>[INST] {system_prompt}\n\nContext: {pdf_text}\n\nQuestion: {query} [/INST]"
data_for_model = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 256}
}

# Send the request to the SageMaker endpoint (the predictor JSON-serializes the dict for us)
response = predictor.predict(data_for_model)
print(response)
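To actually use the all-mpnet-base-v2 embeddings this section is named for, and to retrieve relevant chunks instead of stuffing the whole PDF into the prompt, the LlamaIndex pieces imported earlier can be wired to that sentence-transformers model. A sketch, assuming the llama_index 0.9-era API:

from llama_index import VectorStoreIndex, ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

# Embed the loaded documents locally with all-mpnet-base-v2
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2")
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieve the top chunks for the query and use them as context for Mistral
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(query)
context = "\n\n".join(node.get_content() for node in nodes)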

You can benchmark further by looping the same questions dictionary against this endpoint to make sure it works, as sketched below.
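A minimal sketch of that loop, mirroring generate_responses above; it reuses the questions dictionary, the system_prompt, the extracted pdf_text, and the deployed predictor, and each timing covers the full endpoint round trip:

import time
import pandas as pd

def benchmark_mistral(predictor, questions, context):
    # Mirror the Bedrock benchmark: time each question against the Mistral endpoint
    rows = []
    for qid, query in questions.items():
        prompt = f"<s>[INST] {system_prompt}\n\nContext: {context}\n\nQuestion: {query} [/INST]"
        start_time = time.time()
        result = predictor.predict({"inputs": prompt, "parameters": {"max_new_tokens": 256}})
        latency = time.time() - start_time
        rows.append({
            "Question ID": qid,
            "Question": query,
            "Response": result[0]['generated_text'],
            "Inference Latency (s)": latency,
            "Model": "mistralai/Mistral-7B-Instruct-v0.1"
        })
    return pd.DataFrame(rows)

mistral_df = benchmark_mistral(predictor, questions, pdf_text)
print(mistral_df["Inference Latency (s)"].mean())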

Results

  1. Average Latency for anthropic.claude-instant-v1: 3.258636808395386 seconds
  2. Average Latency for anthropic.claude-v2: 9.073037600517273 seconds
