4 Reasons Why You Should Publish Monitoring Data Async in AWS Lambda

Learn the motivation behind Thundra’s asynchronous approach to publishing monitoring data

Serkan Özal
Jan 4, 2018 · 4 min read

Many of the monitoring solutions on the market were not designed with the nature of the AWS Lambda environment in mind. Many of these tools publish monitoring data synchronously, within the request, which is an anti-pattern for serverless monitoring for the following reasons:

Longer request duration:

Stateless environment:

Using background threads to publish monitoring data is not a good idea either, because the container is frozen while it is not handling a request. AWS Lambda doesn’t allocate CPU resources to a frozen container, so background threads can’t publish monitoring data during that time. If data publishing takes place during request execution instead, it increases the duration, as discussed in the previous problem.

In other words, you cannot send monitoring data in batches if you cannot tolerate data loss. You need to send the monitoring data before the invocation ends, because your code only runs while the container is actively handling a request.

Data publish failures:

In case of data publish failures:

  • You may retry the publish attempts until they succeed. But the request can still fail because of the maximum 5-minute execution time limit, in which case you lose the monitoring data. Even if a retry succeeds, the function execution is delayed by the retries.
  • You may skip sending the data if you can tolerate data loss. If the failures end quickly, the losses might be acceptable depending on your monitoring needs and expectations. But the failures might continue for hours (or even days). In that case, you would have no idea whether there is a problem with your Lambda functions, and this is not acceptable for most systems.
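The trade-off in the first option can be sketched in Java as a retry loop bounded by the time left in the invocation. This is only an illustration, not Thundra’s implementation: `BoundedRetryPublisher`, `publishAttempt`, and the safety margin are hypothetical names, and `remainingMillis` is assumed to be backed by `Context.getRemainingTimeInMillis()` in a real handler.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

/** Illustrative only: retries a synchronous publish until it succeeds or the time budget runs out. */
final class BoundedRetryPublisher {

    /** Outcome of the publish attempts. */
    enum Result { PUBLISHED, GAVE_UP }

    /**
     * @param publishAttempt     hypothetical call that sends the data and returns true on success
     * @param remainingMillis    time left in the invocation (assumption: in a real handler this
     *                           could be context::getRemainingTimeInMillis)
     * @param safetyMarginMillis time kept in reserve so the function itself can still finish
     */
    static Result publish(BooleanSupplier publishAttempt,
                          IntSupplier remainingMillis,
                          int safetyMarginMillis) {
        while (remainingMillis.getAsInt() > safetyMarginMillis) {
            if (publishAttempt.getAsBoolean()) {
                return Result.PUBLISHED;
            }
            // A failed attempt has already consumed invocation time; loop and try again.
        }
        // Out of budget: the monitoring data for this invocation is dropped.
        return Result.GAVE_UP;
    }
}
```

Either outcome has a cost: `PUBLISHED` after several attempts means a slower request, and `GAVE_UP` means the data is gone, which is the dilemma described in the two bullets above.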

Access within VPC:

Publishing monitoring data asynchronously

Capture the metric within your Lambda function code and log it using the provided logging mechanisms in Lambda. Then, create a CloudWatch Logs metric filter on the function streams to extract the metric and make it available in CloudWatch. Alternatively, create another Lambda function as a subscription filter on the CloudWatch Logs stream to push filtered log statements to another metrics solution. This path introduces more complexity and is not as near real-time as the previous solution for capturing metrics. However, it allows your function to more quickly create metrics through logging rather than making an external service request.
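As a rough sketch of the first step, a metric can be written as a single JSON log line instead of being sent over the network. The class name, the field names (`type`, `name`, `value`, `unit`), and the `logSink` parameter are assumptions made for this example; in a real handler the sink could be `context.getLogger()::log`.

```java
import java.util.Locale;
import java.util.function.Consumer;

/** Illustrative only: formats a metric as one JSON log line. Field names are hypothetical. */
final class MetricLogLine {

    /** Builds the JSON text for a single metric sample. */
    static String format(String name, double value, String unit) {
        // Locale.ROOT keeps the decimal separator a dot regardless of the default locale.
        return String.format(Locale.ROOT,
                "{\"type\":\"Metric\",\"name\":\"%s\",\"value\":%.3f,\"unit\":\"%s\"}",
                escape(name), value, escape(unit));
    }

    /** Writes the metric to the given sink; no network call is made here. */
    static void emit(Consumer<String> logSink, String name, double value, String unit) {
        logSink.accept(format(name, value, unit));
    }

    // Minimal escaping of backslashes and quotes only; a JSON library would be more robust.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Because each line is a complete JSON document, a JSON-style metric filter pattern such as `{ $.type = "Metric" }` could select these events. The exact pattern depends on the field names you choose.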

In our approach, trace, metric, and log data are logged in a structured JSON format through the `com.amazonaws.services.lambda.runtime.LambdaLogger` provided by `com.amazonaws.services.lambda.runtime.Context`. AWS Lambda then sends these logs to CloudWatch asynchronously, without affecting request performance, because logs printed through `LambdaLogger` are written to shared memory in the container under the hood and shipped to CloudWatch later.

We also have another Lambda function, let’s call it the “monitor lambda”, which subscribes to the log groups of the monitored Lambda functions with a subscription filter so that it is triggered only by monitoring data. The “monitor lambda” can then send the received data to Elasticsearch (directly, or indirectly through a Kinesis or Firehose stream) to be queried and analyzed later.

Also, since CloudWatch invokes the “monitor lambda” with the Event invocation type (the other invocation types are RequestResponse and DryRun), if a DLQ is specified for the “monitor lambda”, the monitoring data of invocations that still fail after a few retries is put into the specified SQS queue automatically by AWS. Thus, we don’t lose any monitoring data. Another Lambda function, triggered by a scheduled CloudWatch event, then polls the DLQ and invokes the “monitor lambda” with the polled data.
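On the receiving side, a function subscribed to a log group gets the log events as a Base64-encoded, gzip-compressed JSON document in the `awslogs.data` field of its input. The following Java sketch shows how a function like the “monitor lambda” might unpack that field before forwarding the contents. The class and method names are hypothetical, and `encode` exists only to build sample payloads for local testing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: unpacks the "data" field of a CloudWatch Logs subscription event. */
final class SubscriptionPayload {

    /** Decodes Base64, then gunzips, and returns the JSON text that carries the log events. */
    static String decode(String base64GzipData) {
        byte[] compressed = Base64.getDecoder().decode(base64GzipData);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream json = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                json.write(chunk, 0, read);
            }
            return new String(json.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Inverse of decode; used here only to create sample payloads for local tests. */
    static String encode(String json) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed at this point, so the buffer holds the complete archive.
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }
}
```

The decoded JSON contains a `logEvents` array whose `message` fields hold the original log lines. Parsing them and forwarding them to a store such as Elasticsearch is left out of this sketch.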

The following diagram shows our async monitoring architecture:

Follow our Thundra blog and sign up for early access on thundra.io to see how AWS Lambda applications can be monitored with Thundra.
