How to monitor Lambda with CloudWatch Metrics

Yan Cui
Lumigo

--

With AWS Lambda, you get basic observability built into the platform through CloudWatch, which offers support for both metrics and logging.

CloudWatch Metrics gives you basic metrics, visualization and alerting, while CloudWatch Logs captures everything that is written to stdout and stderr. In this post, we will take a deep dive into CloudWatch Metrics to see how you can use it to monitor your Lambda functions, and where its limitations lie.

CloudWatch Metrics

You get all the basic telemetry about the health of a function out of the box:

  • Invocation Count
  • Invocation Duration
  • Error Count
  • Throttled Count

In addition, you also have some metrics that are only relevant to specific event sources:

  • Iterator Age: only relevant for functions with a Kinesis or DynamoDB stream event source. This metric measures the time between a record being written to the stream and when it was received by your function. It tells you if your function is falling behind in the stream.
  • DLQ Errors: only relevant for async event sources (e.g. S3, SNS, CloudWatch Events) where you can configure a dead letter queue (DLQ) to catch failed events. This metric tells you that the Lambda service is not able to publish the failed events to the designated DLQ.

In addition to these built-in metrics, you can also record custom metrics and publish them to CloudWatch Metrics. You can do this in a number of ways, including:

  • Make PutMetricData API requests. But keep in mind that making these API calls as part of your function’s invocation adds to its execution time; for user-facing APIs, this translates directly into extra latency for your users (see the sketch after this list).
  • Write custom metrics to stdout, which are then captured by CloudWatch Logs. You can then use metric filters to parse and capture them as metrics. Since logs are collected asynchronously, this approach does not have any latency overhead. However, it adds delay to when you will see those metrics.
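
To make this concrete, here is a minimal sketch of both approaches in Python with boto3. The namespace, metric name and dimension values are hypothetical placeholders.

    import json
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def record_metric_via_api(value):
        # Approach 1: call PutMetricData directly. The API call runs as part
        # of the invocation, so it adds to the function's execution time.
        cloudwatch.put_metric_data(
            Namespace="MyApp",  # hypothetical namespace
            MetricData=[{
                "MetricName": "ItemsProcessed",  # hypothetical metric name
                "Dimensions": [
                    {"Name": "FunctionName", "Value": "my-function"},
                ],
                "Unit": "Count",
                "Value": value,
            }],
        )

    def record_metric_via_stdout(value):
        # Approach 2: write the metric to stdout, where CloudWatch Logs
        # captures it asynchronously and a metric filter can turn it into
        # a metric. No latency overhead, but the metric appears with a delay.
        print(json.dumps({"metric": "ItemsProcessed", "value": value}))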

CloudWatch Metrics Limitations

A number of valuable metrics are sadly missing, including:

  • Concurrent Executions: CloudWatch does report this metric, but only for functions with reserved concurrency. However, it’s a useful metric to have for all functions.
  • Cold Start Count
  • Memory Usage and Billed Duration: Lambda reports these in CloudWatch Logs, at the end of every invocation. But they are not available as metrics. You can, however, turn them into custom metrics using metric filters.
  • Timeout Count: timeouts are a special type of systematic error that should be recorded as a separate metric. So often I have seen teams waste valuable time searching for error messages in the logs, only to realize that there weren’t any because their function had timed out. Instead, you should log these timeout events and use metric filters to record them as a custom metric (see the sketch after this list).
  • Estimated Cost: another useful metric to have would be the estimated cost of a function. This can help you make informed decisions on which functions to optimize. For example, it makes no sense to optimize a function whose net spend per month is $10. The effort and cost of optimizing the function would far outweigh any potential savings.
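
As a sketch of the metric-filter approach, the snippet below creates two filters with boto3: one that parses Billed Duration out of Lambda’s REPORT log lines (the space-delimited pattern follows the example in the AWS docs), and one that counts “Task timed out” messages. The log group and namespace are placeholders, and REPORT lines with extra fields (e.g. Init Duration on cold starts) may need an adjusted pattern.

    import boto3

    logs = boto3.client("logs")
    LOG_GROUP = "/aws/lambda/my-function"  # placeholder log group

    # Extract Billed Duration from the REPORT line Lambda writes at the
    # end of every invocation. Each field in the space-delimited pattern
    # is named so $billed_duration_value can be used as the metric value.
    logs.put_metric_filter(
        logGroupName=LOG_GROUP,
        filterName="billed-duration",
        filterPattern=(
            '[report_label="REPORT", request_id_label="RequestId:", request_id_value, '
            'duration_label="Duration:", duration_value, duration_unit="ms", '
            'billed_duration_label1="Billed", billed_duration_label2="Duration:", '
            'billed_duration_value, billed_duration_unit="ms", '
            'memory_size_label1="Memory", memory_size_label2="Size:", '
            'memory_size_value, memory_size_unit="MB", '
            'max_memory_used_label1="Max", max_memory_used_label2="Memory", '
            'max_memory_used_label3="Used:", max_memory_used_value, '
            'max_memory_used_unit="MB"]'
        ),
        metricTransformations=[{
            "metricName": "BilledDurationMs",
            "metricNamespace": "MyApp",  # placeholder namespace
            "metricValue": "$billed_duration_value",
        }],
    )

    # Count timeouts by matching the message Lambda logs when a function
    # times out.
    logs.put_metric_filter(
        logGroupName=LOG_GROUP,
        filterName="timeouts",
        filterPattern='"Task timed out"',
        metricTransformations=[{
            "metricName": "TimeoutCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",
        }],
    )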

Another problem with CloudWatch Metrics is that its percentile metrics for Lambda don’t work consistently. When it comes to monitoring latencies, you should use percentiles instead of the average. However, when a function experiences more than ~100 invocations per minute, the percentile latencies stop working! This is a critical issue that we have raised with AWS, and hopefully, it will be addressed in the near future. In the meantime, you can fall back to using a combination of average and max duration. For APIs, you can also use API Gateway’s Latency and IntegrationLatency metrics instead.

CloudWatch Dashboards

You can also set up dashboards in CloudWatch at a cost of $3 per month per dashboard (first 3 are free). CloudWatch supports a variety of widget types, and you can even include query results from CloudWatch Logs Insights.

You can compose your dashboards with any metrics from CloudWatch (including custom metrics). For example, a dashboard composed of several API Gateway metrics can highlight the health and performance of an API at a glance.

You can also use Metric Math to create computed metrics and include them in your dashboards. For example, a Status Codes widget can use Metric Math to calculate the number of 2XX responses, which is not available as a built-in metric (see the sketch below).
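
To give you an idea of what the source looks like, here is a sketch of such a Status Codes widget published with boto3’s put_dashboard. The dashboard name, API name and region are hypothetical; the e1 expression computes 2XX as total requests minus 4XX and 5XX errors.

    import json
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    status_codes_widget = {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": "Status Codes",
            "region": "us-east-1",  # placeholder region
            "stat": "Sum",
            "period": 60,
            "metrics": [
                # m1-m3 are raw API Gateway metrics; e1 is the Metric Math
                # expression that derives the 2XX count from them.
                ["AWS/ApiGateway", "Count", "ApiName", "my-api",
                 {"id": "m1", "visible": False}],
                [".", "4XXError", ".", ".", {"id": "m2"}],
                [".", "5XXError", ".", ".", {"id": "m3"}],
                [{"expression": "m1 - m2 - m3", "label": "2XX", "id": "e1"}],
            ],
        },
    }

    cloudwatch.put_dashboard(
        DashboardName="my-api-health",  # placeholder name
        DashboardBody=json.dumps({"widgets": [status_codes_widget]}),
    )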

Once you have handcrafted your dashboard, you can click Actions, then View/edit source, to see the code behind the dashboard.

You can then codify the dashboard as an AWS::CloudWatch::Dashboard resource in a CloudFormation template. You will have to parameterize some of the fields, such as the API name and region, so that the template can be used for different stages and regions (see the sketch below).
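
As a minimal sketch of that parameterization (the widget contents here are illustrative, not a full dashboard), you can generate the resource from Python:

    import json

    def dashboard_resource(api_name, region):
        # A minimal AWS::CloudWatch::Dashboard resource. api_name and
        # region are the fields you would parameterize per stage/region.
        body = {
            "widgets": [{
                "type": "metric",
                "width": 24,
                "height": 6,
                "properties": {
                    "title": f"{api_name} latency",
                    "region": region,
                    "period": 60,
                    "metrics": [
                        ["AWS/ApiGateway", "Latency", "ApiName", api_name,
                         {"stat": "p95"}],
                    ],
                },
            }],
        }
        return {
            "Type": "AWS::CloudWatch::Dashboard",
            "Properties": {
                "DashboardName": f"{api_name}-dashboard",
                "DashboardBody": json.dumps(body),
            },
        }

    # For example, rendered into a template's Resources section:
    template = {"Resources": {"ApiDashboard": dashboard_resource("my-api", "us-east-1")}}
    print(json.dumps(template, indent=2))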

Designing service dashboards

As a rule of thumb, you should limit dashboards to only the most relevant and significant information about the health of a system. For APIs, consider including the following:

  • 95th/99th percentile and max response times.
  • The number of 2XX, 4XX and 5XX responses.
  • The error rate, i.e. the percentage of requests that did not complete successfully.

It’s simple and tells me the general health of the API at a glance.

“Keeping it simple” is easily the most important advice for building effective dashboards. It’s also the most difficult to follow because the temptation is always to add more information to dashboards. As a result, they often end up cluttered, confusing to read and slow to render as there are far too many data points on the screen.

Here are a few tips for building service dashboards:

  • Use simple (boring) visualizations.
  • Use horizontal annotations to mark SLA thresholds, etc.
  • Use a consistent colour scheme.
  • Put the most important metrics at the top to create a hierarchy. Also bear in mind that widgets below the fold are rarely seen.

This page has some simple guidelines for designing dashboards. Stephen Few’s Information Dashboard Design is also a great read if you want to dive deeper into data visualization with dashboards.

CloudWatch Metrics Alerting

Besides the per-function metrics, CloudWatch also reports a number of metrics that are aggregated across all functions in the region, such as ConcurrentExecutions and UnreservedConcurrentExecutions.

While most of these aren’t very useful (given the lack of specificity), I strongly recommend that you set up an alert against the ConcurrentExecutions metric. Set the alert threshold to ~80% of the regional concurrency limit (defaults to 1000 in most regions). When you raise this soft limit via support, don’t forget to update the alert to reflect the new regional limit.
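
Here is a minimal sketch of that alarm with boto3, assuming the default regional limit of 1000 (hence a threshold of 800) and a hypothetical SNS topic for notifications.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="lambda-concurrent-executions-80pct",
        Namespace="AWS/Lambda",
        MetricName="ConcurrentExecutions",  # account-level, no dimensions
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=800,  # ~80% of the default 1000 regional limit
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
    )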

For individual functions, consider adding the following alerts for each:

  • Error rate: use metric math to calculate the error rate (error count / invocation count). Alert when the error rate goes above, say, 1% (see the sketch after this list).
  • Timeouts: as discussed earlier, CloudWatch does not publish a separate metric for timeout errors. Instead, you should create a metric filter to capture timeout messages as a custom metric (as sketched earlier) and set an alert on it.
  • Iterator age: for stream-based functions, set an alert against the IteratorAge metric so you know when your function is drifting behind.
  • SQS message age: for SQS functions, set an alert against the ApproximateAgeOfOldestMessage metric on the queue. As this metric goes up, it signals that your SQS function is not keeping up with throughput.
  • DLQ errors: set an alert when the number of DLQ errors is greater than 0. This is usually a bad sign. The DLQ is your last chance to capture failed events before they’re lost. So if Lambda is not able to publish them to the DLQ, then data is lost.
  • Throttling: we sometimes use reserved concurrency to limit the max concurrency of a function and throttling would be expected behaviour in those cases. But for functions that do not have a reserved concurrency, we should have alerts for when they’re throttled. This is especially true for user-facing API functions, where we cannot count on built-in retries and the throttling impacts user experience.
  • API latency: for APIs, especially user-facing APIs, you should set up alerts based on your SLA/SLO. For example, alert when the 95th percentile latency is over 3s for five consecutive minutes. This alerts you to degraded performance in the system. It’s possible to do this with Lambda duration too, but I find it better to alert on API Gateway’s Latency metric because it’s closer to an end-to-end metric. If the degraded performance is due to problems in API Gateway, you still want to be notified, as it has user impact nonetheless.
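
To illustrate the first of these, here is a sketch of an error rate alarm using boto3’s put_metric_alarm with a Metric Math expression. The function name, SNS topic ARN and thresholds are hypothetical.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def lambda_metric(metric_id, metric_name, function_name):
        # Helper: one raw Lambda metric entry for the Metrics parameter.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": function_name},
                    ],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the expression below is alarmed on
        }

    cloudwatch.put_metric_alarm(
        AlarmName="my-function-error-rate",
        Metrics=[
            lambda_metric("errors", "Errors", "my-function"),
            lambda_metric("invocations", "Invocations", "my-function"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
        Threshold=1.0,  # alert when the error rate goes above 1%
        EvaluationPeriods=5,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
    )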

So that’s a lot of alerts we have to set up! Since most of them follow a certain convention, we should automate the process of creating them. The ACloudGuru team created a handy plugin (serverless-plugin-aws-alerts) for the Serverless framework. However, it still requires a lot of configuration, especially if you don’t agree with the plugin’s defaults.

My preferred approach is to automatically create alerts with CloudFormation macros. If you want to learn more about CloudFormation macros and how to create them, check out this excellent post by Alex Debrie.

Summary

In this post, we took a deep dive into how you can use CloudWatch Metrics to monitor your Lambda functions.

We looked at the metrics that you get out-of-the-box, and how to publish custom metrics. We explored some of the limitations with CloudWatch Metrics. We saw what you can do with dashboards in CloudWatch and discussed some tips for designing a service dashboard. Finally, we discussed what alerts you should set up so that you are duly notified when things go wrong.

In my next post, we will take a deep dive into CloudWatch Logs to see how you can use it to help debug issues, as well as its limitations.

See you next time! And, of course, don’t hesitate to get in touch if you have any questions about this article or CloudWatch Metrics in general.

Monitor & debug your serverless application effortlessly! Get alerted as soon as an issue occurs and instantly drill down to see a virtual stack trace & correlated logs.

Set up your free Lumigo account today & start fixing serverless issues in a fraction of the time! Find out more

Originally published at https://lumigo.io on July 2, 2019.
