Luke Demi
Jun 27, 2019

Unless you’re willing to buy entirely into AWS CloudWatch or X-Ray for all of your serverless observability needs, integrating observability systems with AWS Lambda is a frustrating endeavor.

Gathering events from applications in the dark, legacy, “serverful” world was a straightforward task — simply embed an agent process onto each server in your infrastructure to batch up system metrics and application events (“telemetry”) before asynchronously delivering them to your metrics/logs/event provider of choice.

In an AWS Lambda world, custom agents and unsanctioned background tasks are a pattern of the past. Function execution stats are collected exclusively by AWS CloudWatch, spans are sent via UDP to an X-Ray agent embedded on every Lambda host, and logs are collected asynchronously by some background process and relayed to CloudWatch Logs. In its serverless ecosystem, AWS has replaced all of the traditional agents with its own.

In many ways, these changes can be spun as a great convenience rather than as a vendor lock-in trap. After all, for those willing to buy in completely to the AWS observability ecosystem, no more configuration is required — simply emit traces to X-Ray or logs to CloudWatch, and make do with the AWS observability interface.

Yet many of us are already using non-AWS tooling or simply demand more than what CloudWatch or X-Ray can provide. What options exist to send events to observability tooling outside of the AWS walled garden?

In the current state of the world, the available strategies boil down to either:

  1. Send telemetry directly to external observability tools during Lambda execution.
  2. Scrape or trigger off the telemetry sent to CloudWatch and X-Ray to populate external providers.

Spoiler: neither option is ideal.

Send

Under the hood, Lambda functions are just containers that are unfrozen (or instantiated) when invoked by an event. Once a function has returned a response, it is frozen again, where it waits either for another invocation or to be garbage collected and sent to Lambda heaven.
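To make that freeze/thaw behavior concrete, here is a minimal, illustrative sketch (the handler and its return shape are my own, not tied to any real application): module-level state initialized on a cold start survives across warm invocations, and vanishes only when the container itself is reclaimed.

```python
# Illustrative only: module-level state is initialized once per container,
# on cold start, and persists while the container is merely frozen.
invocation_count = 0

def handler(event, context):
    global invocation_count
    invocation_count += 1
    # On a warm container this counter keeps climbing between invocations.
    # When the container is eventually garbage collected, the count -- and
    # anything else buffered in memory, such as unsent telemetry -- is lost.
    return {"invocations_on_this_container": invocation_count}
```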

Since Lambda functions are frozen following their invocation, any telemetry must be sent during a function’s invocation. Otherwise, events batched up across invocations or sent from asynchronous background threads can be lost when the container is garbage collected.

This leads to the big problem with the “Send” approach: any runtime telemetry must either be sent across the network as events occur or batched into a report sent at the conclusion of the function’s execution (but before returning success). The sad state of affairs is that with the “Send” approach, users must be comfortable either losing some of their events or paying a latency penalty on every single Lambda invocation.
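As a rough sketch of the batched variant of the “Send” approach, the handler below buffers events during the invocation and flushes them over HTTPS before returning. The endpoint, payload shape, and `do_work` helper are placeholders I invented for illustration, not any particular vendor’s API.

```python
# Sketch of the "Send" approach: buffer events during the invocation and
# flush them synchronously before returning.
import json
import time
import urllib.request

TELEMETRY_URL = "https://telemetry.example.com/v1/events"  # hypothetical endpoint

def handler(event, context):
    events = []

    def record(name, **fields):
        events.append({"name": name, "timestamp": time.time(), **fields})

    record("invocation.start", request_id=context.aws_request_id)
    result = do_work(event, record)  # application logic can emit events too
    record("invocation.end")

    # The flush happens on the caller's clock: the function cannot return
    # until this request completes (or fails) -- that is the latency penalty.
    req = urllib.request.Request(
        TELEMETRY_URL,
        data=json.dumps(events).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
    except Exception:
        pass  # the trade-off: drop the events, or fail the invocation

    return result

def do_work(event, record):
    record("work.done")
    return {"ok": True}
```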

Major latency penalty aside, the “Send” approach has many benefits:

  • No additional infrastructure is required within AWS, since the telemetry is sent directly to the provider.
  • Telemetry is low latency. As soon as the Lambda function returns, we can feel secure that our events are being processed by our provider.
  • No additional cost is incurred to process events after the fact (as we’ll see with the “Scrape” approach below).

Scrape

The most common approach to bypass the per-invocation performance penalty of the “Send” approach is to instead “Scrape” CloudWatch and X-Ray to gather metrics/logs/traces into your provider of choice.

A commonly utilized strategy is for functions to print events or metrics to standard output. These standard output events can then be wired up to trigger a separate Lambda function, which forwards them to external logging or metrics services. Frustratingly, AWS only allows a single Lambda function to attach to any single CloudWatch “Log Group,” so users who want to send events or metrics to multiple providers must fan out through an intermediary, such as a Lambda function writing into a Kinesis Data Stream.
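For concreteness, a forwarder function attached to a CloudWatch Logs subscription receives batches as base64-encoded, gzipped JSON. The sketch below decodes that payload and hands the log events to a `forward_to_provider` helper, which is a stand-in for whatever external API or Kinesis put you would actually use.

```python
# Sketch of a CloudWatch Logs "forwarder" Lambda. Subscription deliveries
# arrive as base64-encoded, gzipped JSON under event["awslogs"]["data"].
import base64
import gzip
import json

def handler(event, context):
    payload = json.loads(
        gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    )
    records = [
        {
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": e["timestamp"],
            "message": e["message"],
        }
        for e in payload["logEvents"]
    ]
    forward_to_provider(records)

def forward_to_provider(records):
    # Placeholder: ship to your logging/metrics vendor of choice.
    print(f"forwarding {len(records)} log events")
```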

Gathering trace or invocation stats for a given function is even less elegant: providers must actively scrape the AWS API every X minutes (and polling too frequently runs into hidden AWS API rate limits). Manually scraping traces leads to another vendor lock-in trap: how do you tie together the traces sent to X-Ray with the traces from calling services? Vendors have been forced to come up with clever techniques to tie these traces together.
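A typical polling setup is a scheduled Lambda that pulls recent trace summaries from the X-Ray API and forwards the full traces. The sketch below shows one way that might look; the five-minute window and the `forward_to_provider` helper are assumptions of mine, not a prescribed configuration.

```python
# Sketch of polling the X-Ray API on a schedule (e.g., a cron-triggered Lambda).
from datetime import datetime, timedelta, timezone

import boto3

xray = boto3.client("xray")

def handler(event, context):
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=5)  # arbitrary window for illustration

    summaries = []
    paginator = xray.get_paginator("get_trace_summaries")
    for page in paginator.paginate(StartTime=start, EndTime=end):
        summaries.extend(page["TraceSummaries"])

    # BatchGetTraces accepts at most 5 trace IDs per call.
    trace_ids = [s["Id"] for s in summaries]
    for i in range(0, len(trace_ids), 5):
        traces = xray.batch_get_traces(TraceIds=trace_ids[i:i + 5])["Traces"]
        forward_to_provider(traces)

def forward_to_provider(traces):
    # Placeholder: push into your tracing backend of choice.
    print(f"forwarding {len(traces)} traces")
```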

The “Scrape” method solves the exact opposite problems of “Send”: users avoid the latency penalty on Lambda invocations, but in turn are forced to build (potentially expensive) Rube Goldberg-style machines to relay and scrape logs and traces from AWS’s products.

An Ideal World

Observability for Lambda is in a particularly annoying place — either buy in to AWS’s offerings (CloudWatch/X-Ray), take a performance hit, or build a Rube Goldberg machine.

Yet there could be a better way. Rather than force users to build elaborate systems to “scrape” telemetry or push it out from inside Lambda functions, AWS could provide some type of UDP listening agent on each Lambda host. These agents could perform a similar function to the existing X-Ray agent, but rather than send events to AWS’s X-Ray service, they would forward them to a customer-owned Kinesis Stream. Maybe even call them Lambda Event Streams.
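To be clear, no such agent exists; the sketch below only imagines what emitting to a hypothetical local UDP listener could look like, loosely modeled on how the X-Ray daemon accepts segments on a local port. The address, port, and payload format are all invented for illustration.

```python
# Purely hypothetical: if AWS exposed a local UDP listener that relayed
# telemetry to a customer-owned Kinesis stream, emitting from inside a
# function could look roughly like this.
import json
import socket
import time

AGENT_ADDR = ("127.0.0.1", 2100)  # made-up port for the imagined agent

_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(name, **fields):
    datagram = json.dumps({"name": name, "timestamp": time.time(), **fields})
    # Fire-and-forget: no network round trip on the invocation's critical path.
    _sock.sendto(datagram.encode(), AGENT_ADDR)

def handler(event, context):
    emit("invocation.start", request_id=context.aws_request_id)
    # ... application work ...
    emit("invocation.end")
    return {"ok": True}
```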

In this ideal new world, users could plug and play Lambda functions (custom or vendor provided) to consume whatever log/event/metric/trace telemetry that their Lambda functions are generating. No more latency penalty on event generation, and no more Rube Goldberg machines on the backend!

Since AWS isn’t going to fix this issue overnight, we’re all left with this annoying trade-off: use the AWS observability tooling wholesale, send metrics during execution, or scrape them into our own systems. Personally, I’m going to stick with the “Send” approach: paying the additional latency is a small cost compared to the hassle of setting up a series of points of failure inside my own infrastructure. I’d love to use AWS’s tooling for everything, and I think AWS is getting a lot closer with products like CloudWatch Insights, but AWS needs to take custom CloudWatch metrics just a little bit more seriously for me to abandon third-party products entirely.