Observability with Amazon CloudWatch, AWS X-Ray, Prometheus, and Grafana

Exploring the various aspects of Observability — metrics, alarms, tracing, and dashboards — and the tools you can use for each.

Michael Sambol
Cloud Native Daily

--

Introduction

Observability is an essential and often overlooked aspect of software systems. It is especially important as applications scale and issues become difficult to diagnose. In this blog, we’ll discuss various methods of observability, piggybacking on my previous blog about HL7 message processing. We’ll cover the following aspects of observability:

  • Metrics: these include application and infrastructure metrics. Application metrics include successful and failed API calls, database queries, and cache hit rate, to name a few. Infrastructure metrics include memory usage, disk I/O, database latency, etc.
  • Alarms: alerts when metrics cross certain thresholds. For example, you may want an alarm when failed API calls exceed 5% of all requests.
  • Tracing: allows you to see requests as they traverse the system and the parts of the system that introduce bottlenecks and high latencies.
  • Dashboards: so you can see system performance through a single pane of glass.

Nothing above is particularly innovative, but observability can be an afterthought and take a backseat to application code. There is plenty written on this subject, but I want to bring together various aspects of observability with a real-world example and deployable code.

Tech Stack

We’ll discuss a variety of products and services in this blog, some with overlapping functionality. There’s a plethora of great tools in the software world, and your organization may have chosen specific ones for a variety of reasons. The blog will cover observability through the lens of AWS since this is where we deployed the HL7 solution in my previous blog. Some services are native to AWS but others can be deployed elsewhere. We’ll also keep cost in mind, as services often bill on data ingestion, which adds up quickly when you scale. Our primary tech stack includes:

  • Amazon CloudWatch: an application and infrastructure monitoring service from AWS that integrates seamlessly with other AWS services. We’ll use CloudWatch for logs, metrics, alarms, and tracing.
  • AWS X-Ray: a distributed tracing service from AWS that helps analyze and debug applications by providing a complete view of requests as they traverse through the system.
  • Prometheus: an open-source monitoring system and time series database. We’ll use the managed flavor on AWS, Amazon Managed Service for Prometheus.
  • Grafana: an open-source analytics and interactive visualization tool. Again we’ll use the managed flavor, Amazon Managed Grafana.

Ownership

I am a strong believer that observability and the response to system degradation should be owned by the team developing the application. This is a primary tenet of DevOps. The team will write better code and add the necessary observability bits if they are the ones responding to pages at 3 AM! You may have teams in your organization that provide observability and monitoring infrastructure, but the application team should be responsible for publishing observability data and responding to failed system events.

Metrics

Metrics are at the forefront of observability because they provide the data to make decisions when the system isn’t behaving as it should. Below we’ll discuss three methods to gather metrics in the context of our Lambda function that was discussed in Build an HL7 Data Lake.

Example #1 — CloudWatch Logs with metric filters

AWS Lambda integrates natively with Amazon CloudWatch Logs. All logs and print statements generated by the HL7 processing function are sent to a CloudWatch Logs log group. Further, you have the ability to filter logs based on patterns and create metrics. We’ll keep it simple for our HL7 example and create metrics around the number of successful and failed HL7 messages processed. To do so, we print a success statement when we’ve finished processing the message in our HL7 Lambda code:

success = {
    'result': 'SUCCESS',
    'message': SUCCESS,
    'lambdaFunctionName': context.function_name,
}
print(json.dumps(success))

We then create a metric filter in the CDK infrastructure code based on this pattern, filtering on result:

new logs.MetricFilter(this, 'MetricsExample1Success', {
  logGroup: hl7Lambda.logGroup,
  metricNamespace: namingPrefix,
  metricName: metricOneSuccess,
  filterPattern: logs.FilterPattern.stringValue('$.result', '=', 'SUCCESS'),
  metricValue: '1',
  dimensions: {
    'LambdaFunctionName': '$.lambdaFunctionName'
  }
})

We’ll skip the failure code since it’s almost identical, but running messages through the system, we see the following in CloudWatch:

Observability CloudWatch

Example #2 — CloudWatch Logs with embedded metric format

CloudWatch embedded metric format is another method to publish metrics to CloudWatch. In this scenario, you publish logs to CloudWatch Logs with a specific JavaScript Object Notation (JSON) format, instructing CloudWatch Logs to extract metrics embedded in the JSON. The AWS documentation has examples of what the JSON schema needs to look like in order for CloudWatch Logs to automatically extract metrics:

{
  "_aws": {
    "CloudWatchMetrics": [
      {
        "Metrics": [
          {
            "Name": "Time",
            "Unit": "Milliseconds",
            "StorageResolution": 60
          }
        ],
        ...
      }
    ]
  },
  "Time": 1
}

Using the PutLogEvents API, the JSON is sent to CloudWatch Logs, as demonstrated here in the AWS documentation. However, you can see from the verbosity of the example that this would quickly bloat your code. Thankfully, AWS provides a Python package called aws-embedded-metrics that makes publishing metrics in this format trivial (source code here). I’ve added a Lambda layer (a convenient way to package libraries and dependencies and make them available to our code) to the GitHub repo, importing the package as follows:

from aws_embedded_metrics import metric_scope

We then decorate the handler and update its signature:

@metric_scope
def handler(event, context, metrics):

With the help of the Python package, writing to CloudWatch Logs in the embedded metric format is as easy as the code below:

metrics.set_namespace(NAMING_PREFIX)
metrics.set_dimensions({'LambdaFunctionName': context.function_name})
metrics.put_metric('Example_2_Success', 1, 'Count')
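
For reference, a minimal sketch of attaching such a layer in the CDK code might look like the following (the construct names, asset path, and runtime are assumptions, not the repo's actual values):

import * as lambda from 'aws-cdk-lib/aws-lambda';

// Package aws-embedded-metrics and its dependencies as a Lambda layer
// (the asset path and runtime are assumptions).
const emfLayer = new lambda.LayerVersion(this, 'EmbeddedMetricsLayer', {
  code: lambda.Code.fromAsset('layers/aws-embedded-metrics'),
  compatibleRuntimes: [lambda.Runtime.PYTHON_3_9],
});

// Attach the layer to the HL7 processing function.
hl7Lambda.addLayers(emfLayer);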

Example #3 — Prometheus

Lastly, we’ll demonstrate how to export metrics to Amazon Managed Service for Prometheus for consumption by Amazon Managed Grafana. This took some trial and error, as the documentation on this subject isn’t expansive. The first step is to include a Lambda layer provided by AWS here. The layer provides a reduced AWS Distro for OpenTelemetry (ADOT) Collector and is built from a downstream repo of opentelemetry-lambda, which provides code to export metrics asynchronously from AWS Lambda functions. The collector is important because it provides a uniform way to “receive, process, and export telemetry data” in line with the OpenTelemetry specification. Rather than worrying about how to send metric data to our backend, we leverage the Collector and specify our backend details in a configuration file (described below). Here’s a diagram showing how metrics are exported:

Observability Architecture

To leverage this, we must include the following environment variables with the Lambda function:

AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-instrument
OPENTELEMETRY_COLLECTOR_CONFIG_FILE: /var/task/collector.yaml
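
If the function is defined with CDK, the ADOT layer and these environment variables might be wired up as in the sketch below (the layer ARN placeholder and construct names are assumptions; use the layer ARN that AWS publishes for your region and runtime):

import * as lambda from 'aws-cdk-lib/aws-lambda';

// Reference the ADOT Python layer published by AWS (the ARN here is a placeholder).
const adotLayer = lambda.LayerVersion.fromLayerVersionArn(
  this, 'AdotLayer', '<adot_python_layer_arn_for_your_region>'
);
hl7Lambda.addLayers(adotLayer);

// Wrap the runtime with the OTel instrumentation and point it at our collector config.
hl7Lambda.addEnvironment('AWS_LAMBDA_EXEC_WRAPPER', '/opt/otel-instrument');
hl7Lambda.addEnvironment('OPENTELEMETRY_COLLECTOR_CONFIG_FILE', '/var/task/collector.yaml');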

We then add a file called collector.yaml to our deployment package:

extensions:
  sigv4auth:
    service: "aps"
    region: "us-east-2"
receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheusremotewrite:
    endpoint: "<prometheus_url_from_aws_console>/api/v1/remote_write"
    auth:
      authenticator: sigv4auth
  logging:
    loglevel: debug
service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [logging, prometheusremotewrite]
  telemetry:
    logs:
      level: debug
Admittedly, this took some effort to get working. I removed the gRPC protocol from the OTLP receivers. Upon looking at the upstream code mentioned above, I found that OTLP over gRPC is not supported by the Python SDK (at the time of this writing). You can either remove it from collector.yaml, or specify the following environment variable to default to HTTP:

OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf

It’s also necessary to enable active tracing for the Lambda function, found under Monitoring and operations tools in the Lambda console.
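
If you prefer to set this in CDK rather than the console, a minimal sketch (the construct names, runtime, and asset path are assumptions) looks like the following:

import * as lambda from 'aws-cdk-lib/aws-lambda';

// Enable active (X-Ray) tracing on the function definition.
const hl7Lambda = new lambda.Function(this, 'Hl7Lambda', {
  runtime: lambda.Runtime.PYTHON_3_9,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  tracing: lambda.Tracing.ACTIVE,
});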

Finally, by adding a few bits of code to our Lambda function, we can send metrics to Prometheus as follows:

from opentelemetry import metrics as ot_metrics
from opentelemetry.metrics import get_meter_provider

meter = ot_metrics.get_meter(__name__)
meter_provider = get_meter_provider()
counter = meter.create_counter(
    name="invocation_count", unit="1", description="Number of invocations"
)
...
counter.add(1)
if hasattr(meter_provider, 'force_flush'):
    meter_provider.force_flush(1000)
Running test data through the system, we can see the invocation_count metric displayed in Grafana after it's sent to Prometheus.

I’ll leave out the bits about connecting Grafana with Prometheus, but drop me a note if you have any issues.

Further reading on OpenTelemetry

Alarms

Metrics are great for monitoring our application, but we also need to be alerted when metrics cross thresholds because of system degradation. We can do this with Amazon CloudWatch alarms. First, we add a math expression to our CDK infrastructure code, using the failure and total invocation metrics to calculate the percentage of all invocations that are failures:

const metricFailurePercentage = new cw.MathExpression({
  label: 'HL7_Processing_Failure_Percentage',
  period: Duration.minutes(5),
  expression: "100 * (m1/m2)",
  usingMetrics: {
    m1: metricFailures,
    m2: metricTotal,
  }
})

We then create an alarm in the CDK code, triggering it when the failure percentage reaches 5% or higher:

new cw.Alarm(this, "Hl7FailureAlarm", {
  metric: metricFailurePercentage,
  alarmName: "HL7 Failure Percentage",
  alarmDescription: "HL7 Failure Percentage",
  datapointsToAlarm: 1,
  evaluationPeriods: 1,
  threshold: 5,
  comparisonOperator: cw.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  treatMissingData: cw.TreatMissingData.IGNORE,
})

When the alarm threshold is crossed, it can trigger a number of actions. For example, the alarm may publish to an Amazon SNS topic, which in turn could page an on-call engineer.
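
As a minimal sketch of that wiring in CDK, assuming the alarm above is captured in a variable named hl7FailureAlarm and that the topic name is a placeholder:

import * as sns from 'aws-cdk-lib/aws-sns';
import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';

// Topic the on-call tooling (email, paging integration, etc.) subscribes to.
const alarmTopic = new sns.Topic(this, 'Hl7AlarmTopic');

// Notify the topic when the alarm fires, and again when it returns to OK.
hl7FailureAlarm.addAlarmAction(new cwActions.SnsAction(alarmTopic));
hl7FailureAlarm.addOkAction(new cwActions.SnsAction(alarmTopic));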

Tracing

Tracing allows us to see requests as they traverse the system and the parts of the system that may introduce bottlenecks and high latencies. In our example, we can see how long it takes to read HL7 data from Amazon S3, how long it takes to put JSON into Amazon Kinesis Data Firehose, and the startup time for the Lambda container.

Tracing often requires extensive code changes, but we will leverage the same Lambda layer we used for Prometheus to enable auto-instrumentation: aws-otel-lambda. This Lambda layer bundles OpenTelemetry Python and a minimal AWS Distro for OpenTelemetry (ADOT) Collector. I’ve linked two excellent AWS resources at the end of this section that provide additional context. For the purpose of this blog, I’ll outline how we implemented tracing. First, we update collector.yaml with the following:

exporters:
  awsxray:
  ...
service:
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, awsxray]
    ...
We also add permissions to the Lambda execution role, specifically:

xray:PutTraceSegments
xray:PutTelemetryRecords
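
In CDK, one way to grant these actions (a sketch, assuming the hl7Lambda construct from earlier) is:

import * as iam from 'aws-cdk-lib/aws-iam';

// Allow the function's execution role to send trace data to X-Ray.
hl7Lambda.addToRolePolicy(new iam.PolicyStatement({
  actions: ['xray:PutTraceSegments', 'xray:PutTelemetryRecords'],
  resources: ['*'],
}));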

We previously enabled active tracing for the Lambda function under Monitoring and operations tools, and it is required here as well. The environment variable below is also necessary, but we already added it when writing metrics to Prometheus:

AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-instrument

Running data through the system, we see a breakdown of various parts of the code:

Observability Timeline

Further reading on tracing

Conclusion

I hope this blog was informative. Drop me a note if you have questions or comments.
