Observability with LLM Agents — Part 2

How to trace Amazon Bedrock Agents with Amazon OpenSearch

Felix Huthmacher
6 min read · Jan 21, 2024

In the tutorial below we walk through the steps needed to add observability to our LLM agent with OpenTelemetry and OpenSearch.

It is recommended to review part 1 before proceeding with part 2.

Now let’s review our observability pipeline.

Figure 1: High-level solution architecture

First we have the Bedrock Agent, which, among other things, leverages a Lambda function to complete a user request. Within that Lambda function we use OpenTelemetry for instrumentation.

OpenTelemetry is an open-source observability framework that aims to standardize the generation, collection, and management of telemetry data (traces, metrics, and logs, the latter still experimental).

We use OpenTelemetry to collect traces, metrics, and logs from our Bedrock agent and forward them to OpenSearch Ingestion. OpenSearch Ingestion then writes the data into OpenSearch, where we can analyze it.

Deployment Steps

In part 1 we used this IaC template to create the following resources:
- Amazon Simple Storage Service (Amazon S3) bucket
- AWS Glue database, crawler, and table for the sample dataset
- 3 AWS Lambda functions & Lambda layers
- 2 IAM roles
- Amazon Elastic Container Registry repository (hosts the container image for the Lambda function)
- Bedrock Knowledge Base & Bedrock Agent
- OpenSearch Serverless collection as the backend for the Bedrock Knowledge Base
- Provisioned OpenSearch cluster for observability
- OpenSearch Ingestion pipeline for observability
- Temporary EC2 instance to pull and push the Docker image to ECR

This CloudFormation template already includes all components required for the observability instrumentation.
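
If you want to double-check that the stack from part 1 deployed successfully before wiring up observability, a quick look at its status and outputs is enough. The sketch below uses boto3; the stack name is a placeholder, so substitute whatever name you chose when deploying the template in part 1.

import boto3

# Placeholder stack name -- replace with the name you used in part 1
STACK_NAME = "agent-observability-stack"

cfn = boto3.client("cloudformation", region_name="us-east-1")

# describe_stacks returns the stack status and any outputs defined by the template
stack = cfn.describe_stacks(StackName=STACK_NAME)["Stacks"][0]
print(f"{stack['StackName']}: {stack['StackStatus']}")

for output in stack.get("Outputs", []):
    print(f"  {output['OutputKey']} = {output['OutputValue']}")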

Let’s review the main components of the above observability solution architecture.

1. OpenTelemetry Lambda Layer

OpenTelemetry does not yet provide a container image for AWS Lambda. That's why we unpack the existing AWS Lambda OTel Python layer as part of the Docker build process, as shown in the Dockerfile extract below.

# Pass region and credentials into the build so the AWS CLI can fetch the layer
ARG AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-"us-east-1"}
ARG AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID:-""}
ARG AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY:-""}
ENV AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION}
ENV AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
ENV AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}

# Tools needed to download and unpack the layer archive
RUN yum install unzip aws-cli -y

RUN mkdir -p /opt

# Download the AWS-managed OTel Python Lambda layer and unpack it into /opt
RUN curl $(aws lambda get-layer-version-by-arn --arn arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-21-0:1 --query 'Content.Location' --output text) --output layer.zip
RUN unzip layer.zip -d /opt
RUN rm layer.zip

It is important to note that, at this point, OpenTelemetry does not officially support serverless containers. Nonetheless, the above worked for me, so please let me know about your experience in the comments below.

2. OTEL collector

To configure the OpenTelemetry collector in our AWS Lambda extension, we use the configuration below.

extensions:
  # https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/sigv4authextension/README.md
  sigv4auth:
    region: "us-east-1"
    service: "osis"

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  logging:
    verbosity: detailed

  otlphttp:
    auth:
      authenticator: sigv4auth
    compression: none
    traces_endpoint: ${TRACE_ENDPOINT}
    metrics_endpoint: ${METRICS_ENDPOINT}
    logs_endpoint: ${LOGS_ENDPOINT}

service:
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]

    metrics:
      receivers: [otlp]
      exporters: [otlphttp]

    logs:
      receivers: [otlp]
      exporters: [otlphttp]

3. OpenSearch cluster

Go to the Amazon OpenSearch Service console and review the endpoint of the domain we created as part of the IaC deployment in part 1.

Go to the security configuration tab and verify that you have an access policy similar to the one shown below.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:<your account #>:domain/<your domain name>/*"
    }
  ]
}
Figure 2: OpenSearch Access Policy
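
If you prefer to verify this from code rather than the console, the domain configuration can also be pulled via boto3. The snippet below is a rough sketch; `<your domain name>` stands in for the actual domain created in part 1.

import json
import boto3

# Placeholder -- use the domain name created by the part 1 template
DOMAIN_NAME = "<your domain name>"

opensearch = boto3.client("opensearch", region_name="us-east-1")

# The domain status includes the endpoint and the resource-based access policy (a JSON string)
domain = opensearch.describe_domain(DomainName=DOMAIN_NAME)["DomainStatus"]
print("Endpoint:", domain.get("Endpoint"))
print(json.dumps(json.loads(domain["AccessPolicies"]), indent=2))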

4. OpenSearch Ingestion pipeline

You should have 3 pipelines as shown below.

Figure 3: OpenSearch Ingestion Pipelines

Verify that the OpenSearch Ingestion pipelines have been created, following the instructions outlined in the documentation.
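
As an alternative to clicking through the console, you can also list the pipelines and their status with boto3. A minimal sketch, assuming the three pipeline names created by the template exist in your account:

import boto3

# OpenSearch Ingestion (OSIS) client
osis = boto3.client("osis", region_name="us-east-1")

# List all pipelines in the account/region and print name and status
for pipeline in osis.list_pipelines()["Pipelines"]:
    print(pipeline["PipelineName"], pipeline["Status"])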

5. IAM role (e.g. bedrock-agent-finance)

Verify that the IAM role that is associated with the AWS Lambda function has sufficient permissions to ingest data into the respective OpenSearch Ingestion pipeline as shown below.


"Version": "2012-10-17",
"Statement": [
{
"Sid": "PermitsWriteAccessToPipeline",
"Effect": "Allow",
"Action": "osis:Ingest",
"Resource": "arn:aws:osis:us-east-1:<your account #>:pipeline/trace-pipeline"
},
{
"Sid": "PermitsWriteAccessToPipeline2",
"Effect": "Allow",
"Action": "osis:Ingest",
"Resource": "arn:aws:osis:us-east-1: <your account #>:pipeline/metrics-pipeline"
},
{
"Sid": "PermitsWriteAccessToPipeline3",
"Effect": "Allow",
"Action": "osis:Ingest",
"Resource": "arn:aws:osis:us-east-1: <your account #>:pipeline/log-pipeline"
}
]
}

6. AWS Lambda environment variables

To enable the AWS Distro for OpenTelemetry in our Lambda function, we need the following environment variables.

Figure 4: AWS Lambda environment variables

a. AWS_LAMBDA_EXEC_WRAPPER set to “/opt/otel-instrument” enables the auto-instrumentation.

b. OPENTELEMETRY_COLLECTOR_CONFIG_FILE set to “/var/task/collector.yml” references the configuration file which defines our observability pipeline.

c. OTEL_SERVICE_NAME set to “FinancialAgent” specifies the application/service name, which we then can use to identify traces, metrics, and logs in OpenSearch.

d. LOGS_ENDPOINT set to the corresponding OpenSearch Ingestion endpoint created earlier.

e. METRICS_ENDPOINT set to the corresponding OpenSearch Ingestion endpoint created earlier.

f. TRACE_ENDPOINT set to the corresponding OpenSearch Ingestion endpoint created earlier.

g. OTEL_LOG_LEVEL set to “DEBUG” specifies the log level.
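
If you need to set or adjust these variables outside of the IaC template, you can do so in the console or, for example, with boto3 as sketched below. The function name and ingestion endpoints are placeholders; note that update_function_configuration replaces the entire environment block, so include all variables in one call.

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Placeholder function name and ingestion endpoints -- substitute your own values
lambda_client.update_function_configuration(
    FunctionName="<your agent lambda function>",
    Environment={
        "Variables": {
            "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
            "OPENTELEMETRY_COLLECTOR_CONFIG_FILE": "/var/task/collector.yml",
            "OTEL_SERVICE_NAME": "FinancialAgent",
            "LOGS_ENDPOINT": "https://<log-pipeline ingestion URL>",
            "METRICS_ENDPOINT": "https://<metrics-pipeline ingestion URL>",
            "TRACE_ENDPOINT": "https://<trace-pipeline ingestion URL>",
            "OTEL_LOG_LEVEL": "DEBUG",
        }
    },
)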

7. Observability pipeline

7a. Traces

While OpenTelemetry supports auto-instrumentation, we added minimal manual instrumentation to define traces and metrics as needed. Below is an example of such manual instrumentation for traces; it is already included in the source code from part 1.

from opentelemetry import trace

# Create a tracer from the global tracer provider
tracer = trace.get_tracer("FinancialAgent")

@tracer.start_as_current_span("FinancialAgent_lambda_handler")
def handler(event, context):
    # <do something>
    ...

7b. Metrics

Our sample application also includes a couple of metrics such as counters for the number of invocations of the different methods.

from opentelemetry import metrics

# Acquire a meter.
meter = metrics.get_meter(__name__)

# Now create counter instruments to make measurements with
agent_counter = meter.create_counter(
    "agent.calls",
    description="The number of agent calls",
)
get_investment_research_counter = meter.create_counter(
    "get_investment_research.calls",
    description="The number of get_investment_research calls",
)
get_existing_portfolio_counter = meter.create_counter(
    "get_existing_portfolio.calls",
    description="The number of get_existing_portfolio calls",
)
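
These counters only produce data points once they are incremented inside the corresponding functions. A minimal usage sketch, with attribute names that are my own and not part of the sample code:

def handler(event, context):
    # Count every agent invocation; the attributes become filterable dimensions in OpenSearch
    agent_counter.add(1, {"agent.name": "FinancialAgent"})
    # ... dispatch to get_investment_research / get_existing_portfolio ...

def get_existing_portfolio(customer_id):
    # Count calls to this specific action group method
    get_existing_portfolio_counter.add(1, {"agent.name": "FinancialAgent"})
    ...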

7c. Logs

We also included logging, which is currently in an experimental state in OpenTelemetry, with the manual instrumentation shown below.

import logging
import os

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.http._log_exporter import (
    OTLPLogExporter,
)
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.extension.aws.resource._lambda import (
    AwsLambdaResourceDetector,
)
from opentelemetry.sdk.resources import get_aggregated_resources

# Logger provider that attaches Lambda resource attributes to each log record
logger_provider = LoggerProvider(
    resource=get_aggregated_resources(
        [
            AwsLambdaResourceDetector(),
        ]
    ),
)
set_logger_provider(logger_provider)

# Export log records to the local collector extension over OTLP/HTTP
exporter = OTLPLogExporter(endpoint='http://0.0.0.0:4318/v1/logs')
logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)

# Attach OTLP handler to root logger
logging.getLogger().addHandler(handler)

# Create different namespaced loggers
loggerAgent = logging.getLogger("financeagent.handler")
loggerAgent.setLevel(os.environ['OTEL_LOG_LEVEL'])
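
With the handler attached to the root logger, anything logged through the standard logging module is exported through the collector to the log pipeline. A short usage sketch; the force_flush call is my own addition to push buffered records out before the Lambda execution environment is frozen:

def handler(event, context):
    # Emit an application log record; it is routed through the OTLP LoggingHandler above
    loggerAgent.info("FinancialAgent invoked")

    # ... handle the agent request ...

    # Flush any buffered log records before the function returns
    logger_provider.force_flush()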

8. Review Agent runtime in Amazon OpenSearch

With our observability pipeline in place, we can now go into our Amazon OpenSearch cluster and review the traces, metrics, and logs from our previous test prompts (e.g., “Should I invest in Amazon?”).

In OpenSearch, navigate to Observability → Traces in the menu sidebar, and select one of the traces.

Figure 5: Trace Analytics Overview
Figure 6: Financial Agent Trace

Traces make it easy to identify potential bottlenecks in our microservice architecture. For example, here we can quickly spot that the Yahoo API call (which we use to retrieve financial information) is relatively slow and error-prone.

To review logs and metrics, create corresponding index patterns following the documentation here.

Figure 7: OpenSearch Index Patterns

Now we can review logs and correlate them with traces.

Figure 8: OpenSearch “Agent Finance” logs

We can also create dashboards to visualize our metrics, as well as configure alerting and corresponding monitors.

Figure 9: OpenSearch metric example

Summary

In part 2 of this tutorial we covered how to leverage OpenTelemetry for instrumentation and how to forward traces, metrics, and logs to a central monitoring solution such as Amazon OpenSearch.

This architecture does not replace purpose-built GenAI observability solutions like Arize, LangSmith, or WhyLabs. But it is a good starting point for performance benchmarks that cover the entire GenAI application stack, as it allows you to easily identify bottlenecks within a microservice architecture.

What can be improved

In a future blog post I would like to explore how to best incorporate the Bedrock Agents trace capability, and how to integrate this solution with other GenAI observability solutions such as LangSmith.


Felix Huthmacher

Solution Architect with expertise in developing business cases and designing and implementing AI/ML solutions supporting business processes across industries