Serverless observability made easy with Lambda Powertools for Java — Metrics & Logging

This post describes how we simplified our code and improved logging, tracing and metrics collection in our API and Lambda functions, using Lambda Powertools for Java. This second part is focused on logging and metrics.

Disclaimer
I Love My Local Farmer is a fictional company inspired by customer interactions with AWS Solutions Architects. Any stories told in this blog are not related to a specific customer. Similarities with any real companies, people, or situations are purely coincidental. Stories in this blog represent the views of the authors and are not endorsed by AWS.

In the first part of this series, we introduced the tree components of observability — Logging, Tracing and Metrics — and detailed the tracing pillar.

This second part comes back on the metrics we implemented last year (see article) and how we simplified and enriched our metrics collection with Lambda Powertools. It also dives into the logging details, especially about adding contextual information in the logs and using structured logging, and how Lambda Powertools cares about this for us.

Metrics

Last year we used the Embedded Metrics Format (EMF) to create metrics from our Lambda functions. EMF is a JSON formatted log standard that allows CloudWatch to automatically extract metrics from CloudWatch logs. Now that we’ve added Lambda Powertools to our project, we wanted to see if it could also simplify our metrics code. We first add the powertools-metrics dependency in our build.gradle file:

dependencies {
aspect ‘software.amazon.lambda:powertools-metrics:1.12.1’
}

And remove the dependency software.amazon.cloudwatchlogs:aws-embedded-metrics. Then, in the code, we only had to make some minor changes:

  • We retrieve the MetricsLogger from Lambda Powertools instead of instantiating one:
MetricsLogger metricsLogger = MetricsUtils.metricsLogger();
  • While it’s still possible, we don’t need to specify the namespace everywhere in the code. We can set it using an environment variable:
env.put(“POWERTOOLS_METRICS_NAMESPACE”, “DeliveryApi”);
  • Adding the @Metrics annotation allows us to wrap the handler method and permits us to remove the call to flush at the end of the method. Not only that, but Lambda Powertools also ensures that metrics are flushed for every call, even in the case of an exception withinin the function. This greatly simplifies the error handling (finally blocks) in our code. The annotation also provides the captureColdStart option that we can set to automatically capture the number of cold starts as an additional metric.
@Tracing
@Metrics(captureColdStart = true)

public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent input, Context context) {
// ...
}
  • Apart from that, we can add dimensions and collect metrics exactly the same way as before:
metricsLogger.putDimensions(DimensionSet.of("FunctionName", "CreateSlots"));
metricsLogger.putMetric("SlotsCreated", rowsUpdated, Unit.COUNT);
  • As a result, we have almost the same logs in Cloudwatch:
X-Ray trace id in the metrics
X-Ray trace id in the metrics
  • The main difference in the logs is that we now have a function_request_id and xray_trace_id. Those 2 identifiers, and especially the trace id, are really helpful when investigating an issue. Thanks to it, we can correlate the collected metrics and the trace in X-Ray. For example, it would be normal for the CreateSlots function to take more time when multiple slots have to be created.
  • In addition we also have the cold starts, which is actually a really important metric to track to provide the best experience to our users, for example enabling provisioned concurrency.
Cold starts automatically measured by powertools-metrics
Cold starts automatically measured by powertools-metrics

In terms of metrics, since we already had implemented EMF, the step to move to Lambda Powertools is very low. It slightly simplifies the collection of metrics: thanks to the annotation, we ensure that the flush method is always called and we quickly forget the frustration of last year, when metrics were not showing in the logs due to missing that call. But the real benefit is the integration with the tracing module. Having this trace id in the logs is clearly a must and really helps in troubleshooting issues.

Logging

As mentioned in the introduction, we already had a set of logs in the Lambda functions of our Delivery API. Unfortunately we discovered the hard way they were not really useful:

  1. First, we had a bunch of debug logs, … which are obviously disabled in production. In order to get some debug logs in prod, we’d need to modify our log4j2.xml config file and redeploy the functions. Not really ideal during a crisis.
  2. Second, the few logs we had in Cloudwath were raw messages. While this is humanly readable, it is really hard to search and extract valuable information without being a grep expert or using more advanced/expensive tools (ElasticSearch, Splunk, Datadog, …).
  3. Lastly, the logs were missing a lot of context information: a user id for example, the farm id, … a request id or the famous trace id I mentioned earlier. Due to this lack of information, they were really hard to filter and to correlate with any event.

Introducing Lambda Powertools logging module enabled us to solve all this problems. Just like the other modules, we need to add the dependency to our build.gradle file:

dependencies {
aspect ‘software.amazon.lambda:powertools-logging:1.12.1’
}

Log level

With Lambda Powertools, changing the log level is as easy as setting an environment variable for your Lambda function: POWERTOOLS_LOG_LEVEL. By changing this variable to DEBUG, we are now able to get much more logs… much much more, too much actually.

And Lambda Powertools comes with a handy solution: sampling. We can let the log level at a higher level (WARN for example) and configure another environment variable: POWERTOOLS_LOGGER_SAMPLE_RATE. By setting this variable to let say0.1 , we get 10% of Lambda calls with logs at the DEBUG level while keeping the rest at WARN level. During an incident, if this is not enough, we can increase this value to get more verbose logs and be able to troubleshoot. Without redeploying the function.

Structured logging and additional context

These two points are probably the most useful pieces we’ve recently added in all our projects. And that’s not a surprise to find them also in the Well-Architected Framework:

Well-Architected Framework — Serverless Lens — Operational Excellence

The additional context is mandatory to understand what happened. A log message must contain an identifier, be it technical — like the trace id or a request id — or business like in our case the farm id and user id. Ideally it will contain both, so that you are able to correlate a technical event with a user issue or a user event with a technical issue. Otherwise it is completely useless.

The structured logging, on the other side, permits to filter and query logs more easily, so you don’t spend hours browsing and searching something in the logs.

Lambda Powertools provides both without much effort:

  • To get the structured logging, we need to configure our log4j2.xml to use the JsonTemplateLayout and a layout provided in the Lambda Powertools library:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
<Appenders>
<Console name="JsonAppender" target="SYSTEM_OUT">
<JsonTemplateLayout eventTemplateUri="classpath:LambdaJsonLayout.json" />
</Console>
</Appenders>
<Loggers>
<Logger name="JsonLogger" level="INFO" additivity="false">
<AppenderRef ref="JsonAppender"/>
</Logger>
<Root level="info">
<AppenderRef ref="JsonAppender"/>
</Root>
</Loggers>
</Configuration>
  • To get the additional context, we simply add the @Logging annotation to the handleRequest method. We can complement this with additional information using custom keys, in our case the farmId:
@Logging
@Tracing
@Metrics
public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent input, Context context) {
// ...
LoggingUtils.appendKey("farmId", farmId);// ...
}
  • Optionally we can log the event that triggered the function using the logEvent option: @Logging(logEvent = true).

As a final result we get the following in CloudWatch logs:

Structured logs (JSON)
Structured logs (JSON)

In addition to the standard timestamp, log level, class and message, we get a lot of valuable information:

  • Information about the function itself (arn, name, version).
  • If it was during a cold start or no.
  • The “business” context (farm id, user id, slot id).
  • And last but not least the X-Ray trace id that will permit us to correlate this with others logs, and to jump in X-Ray for further analysis of the request.

And this is were it becomes very interesting. Using CloudWatch Logs Insights, it is possible to query our logs, apply filters, and get some stats. Thanks to the structured logs, CloudWatch Logs Insights is able to discover fields (on the right in the screenshot below). Fields that we can then use on the left to query our logs. For example here, I’m looking for all the events for a specific farm id, across two Lambda functions:

CloudWatch Logs Insights (filter & query on the left, discovered fields on the right)
CloudWatch Logs Insights

The logging module of Lambda Powertools is probably the one that has brought us the most value in our troubleshooting sessions, together with CloudWatch Logs Insights. Structured logs and additional elements of context really helped us figure out a few issues.

Side Notes on performance

Even though we know that Lambda Powertools simplifies observability, we also didn’t want to slow down our application with new libraries. As we didn’t really measure durations and latencies before adding Lambda Powertools, it’s not easy to say. But now that we know CloudWatch Logs Insights, I tried to get numbers before and after and here they are:

Percentiles (duration & initDuration) for the BookingSlot Lambda function without Lambda Powertools
Percentiles (duration & initDuration) for the BookingSlot Lambda function with Lambda Powertools (Logging/Metrics/Tracing)

It’s not totally fair for the test with Lambda Powertools, as we have added Tracing and also a bunch of additional logs. The logging module also provides much more than what we had initially. And we clearly prefer focusing on providing features for our farmers rather than implementing this by ourself, in each of our Lambda functions.

Conclusion

Everything fails, all the time. — Werner Vogels

That’s what happened to us, and we didn’t have the right tools to figure out what happened: almost no usable log and no trace. We had metrics and dashboard and alarms, and we knew something was happening but no way to troubleshoot. By implementing tracing, structured logging and adding more context to our logs, it is now easier to find out an issue and to solve it.

We could have done like we did last year for metrics and implement it on our own, but using Lambda Powertools for Java really simplifies things. With just an annotation and a few lines of code, we get everything. We can spend time to optimise it and deliver features. And using the three modules together provides really useful insights on the health of our application.

See the source code on GiHub.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store