AWS Lambda Observability Best Practices

Best practices for Lambda observability to identify and resolve issues quickly and ensure optimal performance for your serverless functions.

Moksha
Cloud Native Daily
8 min read · Jul 19, 2023

AWS Lambda is a serverless compute service offered by Amazon Web Services. It simplifies application development by eliminating the need to provision and manage the underlying infrastructure manually.

However, this infrastructure abstraction limits visibility into the inner workings of serverless functions. Therefore, monitoring and debugging can be challenging when using serverless functions like Lambda. This is where observability tools like AWS CloudWatch, AWS X-Ray, Helios, and Datadog come into play. They provide essential insights into logs, metrics, and traces, enabling effective monitoring, debugging, and troubleshooting in serverless environments.

In this article, I will discuss combining these tools with best practices to improve observability in AWS Lambda. These best practices will help you identify and resolve issues quickly and ensure optimal performance for your serverless functions.

1. Best practices when using logs

Logging in AWS Lambda is often overlooked and misused, resulting in unexpectedly high costs for CloudWatch Logs. Hence, it is crucial to be mindful of what is logged, as each log message carries metadata of approximately 70 bytes, contributing to the storage costs. Additionally, every Lambda function with CloudWatch permissions generates START, END and REPORT logs for each execution, adding around 340 bytes per invocation.

Sample log events
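To make the overhead figures above concrete, here is a rough back-of-the-envelope sketch. It uses the approximate 70-byte and 340-byte figures quoted in this section; the traffic numbers in the usage line are hypothetical, and the result is an estimate of log volume only, not a bill.

```ruby
# Rough estimate of CloudWatch Logs volume for one function.
# The overhead constants are the approximate figures quoted in the text;
# the traffic numbers in the usage example are hypothetical.
METADATA_BYTES_PER_MESSAGE = 70
PLATFORM_BYTES_PER_INVOCATION = 340 # START, END and REPORT lines

def estimated_log_bytes(invocations:, messages_per_invocation:, avg_message_bytes:)
  per_message = avg_message_bytes + METADATA_BYTES_PER_MESSAGE
  per_invocation = PLATFORM_BYTES_PER_INVOCATION + messages_per_invocation * per_message
  invocations * per_invocation
end

bytes = estimated_log_bytes(invocations: 1_000_000,
                            messages_per_invocation: 5,
                            avg_message_bytes: 200)
puts format('%.2f GB per million invocations', bytes / 1e9)
```

Even with no application logging at all, the platform lines alone add up to roughly 340 MB per million invocations under these assumptions.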

Configure a retention period

You can configure a retention period for logs to minimize storage costs. In addition to cost considerations, it is important to ensure that logs provide sufficient information for debugging purposes. Therefore, enabling log levels based on severity, such as DEBUG, INFO, WARN, and ERROR, can be beneficial. By setting the appropriate log level, you can control the amount of logging data generated and focus on capturing relevant information for troubleshooting. These steps will help you balance effective debugging and cost efficiency in AWS Lambda functions.

custom:
  logLevelMap:
    prod: info
    staging: info
  logLevel: ${self:custom.logLevelMap.${opt:stage}, 'debug'}

provider:
  environment:
    LOG_LEVEL: ${self:custom.logLevel}
  logRetentionInDays: 30
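On the function side, the LOG_LEVEL environment variable set in the configuration above still has to be read by the code. A minimal sketch using Ruby's standard Logger might look like the following; the helper name level_from_env and the fallback to INFO are assumptions made for illustration.

```ruby
require 'logger'

# Map the LOG_LEVEL environment variable to a Logger severity,
# falling back to INFO for missing or unknown values (an assumption).
def level_from_env(value)
  case value.to_s.downcase
  when 'debug' then Logger::DEBUG
  when 'info'  then Logger::INFO
  when 'warn'  then Logger::WARN
  when 'error' then Logger::ERROR
  else Logger::INFO
  end
end

# Created once, outside the handler, and reused across invocations.
LOGGER = Logger.new($stdout)
LOGGER.level = level_from_env(ENV['LOG_LEVEL'])

def handler(event:, context:)
  LOGGER.debug("raw event: #{event.inspect}") # dropped unless LOG_LEVEL=debug
  LOGGER.info('processing event')
  'Success'
end
```

With this in place, the prod and staging stages emit only INFO and above, while other stages keep the verbose DEBUG output.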

CloudWatch for Monitoring

You can use AWS CloudWatch Logs as a monitoring service that collects and stores logs from your AWS resources, applications, and services in near real-time. It allows you to view logs, search for specific error messages, and set alarms for specific events. When troubleshooting asynchronous AWS Lambda flows, CloudWatch Logs can provide valuable information about the execution of your Lambda functions. If an error occurs during the execution of a Lambda function, the error message and stack trace are also recorded in CloudWatch Logs.

In addition, you can use third-party monitoring tools like Helios to monitor and troubleshoot issues in AWS Lambda. Helios provides actionable insights into your Lambda workflow and lets you fetch logs from CloudWatch directly into the Helios platform with a single click.

Helios monitoring panel

Use JSON formatting

With CloudWatch, you have the option to use JSON format for log messages instead of plain strings. This makes it easier to filter log events on specific values, allowing for more efficient and targeted log analysis.

{
  "level": "info",
  "message": "Data ingestion completed",
  "data": {
    "items": 42,
    "failures": 7
  }
}

Example filter pattern: { $.message = "Data ingestion completed" }

CloudWatch Logs Insights
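As a sketch of how a function might emit such JSON log lines, the helper below builds one line with Ruby's standard json library. The helper name json_log_line and the extra timestamp field are assumptions for illustration, not part of any AWS API.

```ruby
require 'json'
require 'time'

# Build one JSON log line with the level/message/data shape used above.
# The timestamp field is an illustrative addition.
def json_log_line(level, message, data = {})
  JSON.generate(
    level: level,
    message: message,
    timestamp: Time.now.utc.iso8601,
    data: data
  )
end

def handler(event:, context:)
  # Anything written to stdout ends up in the function's log stream.
  puts json_log_line('info', 'Data ingestion completed', { items: 42, failures: 7 })
  'Success'
end
```

Because every field is a JSON property, a filter such as { $.data.failures > 0 } can select only the runs that had failures.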

2. Best practices when using metrics

AWS Lambda automatically publishes several metrics to CloudWatch, such as the number of requests (invocations), the duration of requests, and error counts, while the memory used by each invocation is reported in its REPORT log line. Following best practices can help you get the most insight from that information.

CloudWatch log metrics

Set alarms and threshold values

Use CloudWatch Alarms to monitor your Lambda metrics and trigger notifications when certain thresholds are breached. You can set up alarms for each metric that you want to track, specifying the threshold values that indicate potential issues or anomalies.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

These threshold values should be customized based on your application’s expected behavior and performance requirements, so consider factors such as acceptable response times, error rates, and concurrency limits. It’s also important to strike a balance between being alerted to critical issues and avoiding unnecessary notifications for minor fluctuations.
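One common way to strike that balance is to require several breaching datapoints within an evaluation window before alerting, rather than firing on a single spike. The sketch below only illustrates the idea in plain Ruby with hypothetical error counts; it is not CloudWatch's actual evaluation logic, which also has to deal with missing data.

```ruby
# Return true when at least `datapoints_to_alarm` of the most recent
# `evaluation_periods` values exceed the threshold.
def breaching?(values, threshold:, evaluation_periods:, datapoints_to_alarm:)
  recent = values.last(evaluation_periods)
  recent.count { |v| v > threshold } >= datapoints_to_alarm
end

error_counts = [0, 1, 9, 0, 0] # hypothetical per-minute error counts

# A "1 out of 5" rule fires on the single spike.
puts breaching?(error_counts, threshold: 5, evaluation_periods: 5, datapoints_to_alarm: 1)
# A "3 out of 5" rule ignores it and waits for a sustained problem.
puts breaching?(error_counts, threshold: 5, evaluation_periods: 5, datapoints_to_alarm: 3)
```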

You can also use Helios to set up alerts and notifications for Lambda functions. Helios supports customizations based on either applicative events or Lambda metrics from AWS.

Helios Alert and notifications

Use custom metrics

In addition to the standard metrics available for AWS Lambda, you can define custom metrics to track application-specific statistics. Custom metrics can be published at either standard resolution (one-minute granularity) or high resolution (down to one second). High-resolution metrics provide more immediate insight into Lambda invocations and let monitoring alarms fire sooner. Furthermore, a single metric can have multiple dimensions, allowing for layered analysis based on the application domain.
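One way to publish a custom metric from a function without calling the CloudWatch API is to print a log line in the CloudWatch embedded metric format (EMF). The article does not prescribe this technique, so treat the following as a sketch: the namespace, metric name, and dimension names are hypothetical, and the helper emf_line is written for illustration only.

```ruby
require 'json'

# Build a log line in CloudWatch embedded metric format (EMF).
# Namespace, metric and dimension names are hypothetical.
def emf_line(name:, value:, unit:, dimensions:, high_resolution: false, now: Time.now)
  payload = {
    '_aws' => {
      'Timestamp' => (now.to_f * 1000).to_i, # milliseconds since epoch
      'CloudWatchMetrics' => [{
        'Namespace' => 'MyApp/Ingestion',
        'Dimensions' => [dimensions.keys], # one dimension set
        'Metrics' => [{
          'Name' => name,
          'Unit' => unit,
          'StorageResolution' => high_resolution ? 1 : 60
        }]
      }]
    },
    name => value
  }
  # Dimension values live at the top level, next to the metric value.
  JSON.generate(payload.merge(dimensions))
end

puts emf_line(name: 'FailedItems', value: 7, unit: 'Count',
              dimensions: { 'Stage' => 'prod', 'Source' => 'orders' },
              high_resolution: true)
```

Here a StorageResolution of 1 requests a high-resolution metric, and the two dimensions allow the same metric to be sliced by stage and by source.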

3. Best practices in distributed tracing

Microservices have revolutionized software development by breaking down monolithic applications into smaller, loosely coupled services. Although this shift towards distributed systems brings several benefits, it also introduces several challenges in managing and monitoring due to their interdependency and the complexity of their interactions.

Use AWS X-Ray

AWS X-Ray provides valuable insights into the performance and behavior of your distributed systems, enabling you to understand the flow of requests across microservices.

Trace records: Trace records in AWS X-Ray consist of three primary subsegments that capture different phases of the Lambda function’s lifecycle:

  • The initialization subsegment: Encompasses various tasks performed by Lambda, such as resource configuration, environment creation or unfreezing, function code and layer downloading, extension initialization, runtime initialization, and execution of function initialization code.
  • The invocation subsegment: Captures the phase of invoking the Lambda function handler, starting from runtime registration and concluding when the runtime is prepared to send a response.
  • The overhead subsegment: Represents the steps occurring between the runtime response and the subsequent call. During this phase, the runtime completes all necessary tasks related to the invocation and prepares to freeze the Lambda sandbox.

Furthermore, AWS X-Ray processes the transmitted trace data and generates a service map, providing a visual representation of the distributed system’s services and their dependencies. It also generates searchable summaries that help identify bottlenecks and analyze performance.
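A small step that helps connect logs to traces is to include the trace ID in every log line. The sketch below assumes the tracing header is available to the function in the _X_AMZN_TRACE_ID environment variable in the usual Root=...;Parent=...;Sampled=... form; the parsing helper is written for illustration and is not part of the X-Ray SDK.

```ruby
# Parse an X-Ray tracing header such as
# "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
# into a hash of its key/value fields. Unknown fields are kept as-is.
def parse_trace_header(header)
  header.to_s.split(';').each_with_object({}) do |pair, fields|
    key, value = pair.split('=', 2)
    fields[key] = value if key && value
  end
end

def handler(event:, context:)
  trace = parse_trace_header(ENV['_X_AMZN_TRACE_ID'])
  # Logging the trace ID lets you jump from a log line to its trace.
  puts "trace_id=#{trace['Root']} sampled=#{trace['Sampled']} processing event"
  'Success'
end
```

Searching the logs for a trace ID taken from the service map then returns the log lines of exactly that request.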

Helios is another tool with similar features for distributed tracing. Unlike AWS X-Ray, which requires code changes, Helios offers simple instrumentation without code changes. It provides end-to-end visibility into microservices, including AWS Lambda, allowing developers to monitor performance and troubleshoot issues effectively.

👉 Check out the official Helios documentation to learn more.

Helios Lambda function trace

4. Best practices in error handling and exception tracking

Error handling and exception tracking are essential for developing robust and reliable serverless applications. Appropriate handling of errors and exceptions makes it much easier for developers to debug and troubleshoot while improving the user experience.

For example, most errors can be extracted from log files in AWS CloudWatch Logs. Here are some types of errors that can be retrieved from Lambda (in Ruby):

require 'aws-sdk-lambda'

# rescue any service error returned by the AWS Lambda API
begin
  # do stuff
rescue Aws::Lambda::Errors::ServiceError
  # ...
end


# rescue one specific error type, e.g. CodeStorageExceededException
begin
  # do stuff
rescue Aws::Lambda::Errors::CodeStorageExceededException
  # ...
end


# Errors raised inside a handler are returned in JSON format
def handler(event:, context:)
  puts "Processing event..."
  [1, 2, 3].first("two")
  "Success"
end

# sample
{
  "errorMessage": "no implicit conversion of String into Integer",
  "errorType": "Function<TypeError>",
  "stackTrace": [
    "/var/task/function.rb:3:in `first'",
    "/var/task/function.rb:3:in `handler'"
  ]
}
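To make such errors easy to find in CloudWatch Logs, a handler can log the exception as a single structured line before letting it propagate. The following is a minimal sketch; the wrapper name with_error_logging and the logged field names are assumptions for illustration.

```ruby
require 'json'

# Run the block, log any exception as one JSON line, then re-raise so
# that the invocation is still recorded as failed.
def with_error_logging(request_id)
  yield
rescue StandardError => e
  puts JSON.generate(
    level: 'error',
    request_id: request_id,
    error_type: e.class.name,
    message: e.message,
    stack: (e.backtrace || []).first(5) # keep the log line small
  )
  raise
end

def handler(event:, context:)
  # context is nil in local runs, hence the safe navigation operator.
  with_error_logging(context&.aws_request_id) do
    [1, 2, 3].first('two') # raises TypeError
    'Success'
  end
end
```

Re-raising matters here: swallowing the exception would hide the failure from the Errors metric and from any alarms built on it.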

Visualizing errors

Tools like AWS X-Ray and Helios offer error-handling capabilities to visualize application components and analyze the requests that resulted in errors and performance bottlenecks. The AWS Lambda status page in Helios provides an aggregated view of all Lambda functions instrumented by Helios. It gives a snapshot of the number of errors in a function along with other stats, such as the number of invocations and the number of timeouts.

Helios Lambda status page

5. Best practices in performance optimization and profiling

To optimize application performance, continuous monitoring and analysis of metrics before and after Lambda implementation is essential. There are a few vital metrics to monitor to understand the performance and availability of your function:

  • Invocations (the number of times the function code is invoked, including invocations that end in a function error)
  • Duration (the time the function code spends processing an event)
  • Errors (the number of invocations that resulted in a function error)
  • Throttles (the number of invocation requests rejected because no concurrency was available, so the function code never ran).
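Raw counts are easier to interpret as rates. The sketch below derives an error rate and a throttle rate from metric sums taken over the same time window; the numbers in the usage line are hypothetical. It relies on throttled requests not being counted as invocations, which is why they are added back to obtain the total number of requests.

```ruby
# Derive rates from raw metric sums over the same time window.
def rates(invocations:, errors:, throttles:)
  requests = invocations + throttles # throttled requests never became invocations
  {
    error_rate: invocations.zero? ? 0.0 : errors.to_f / invocations,
    throttle_rate: requests.zero? ? 0.0 : throttles.to_f / requests
  }
end

result = rates(invocations: 980, errors: 49, throttles: 20)
puts format('error rate: %.1f%%, throttle rate: %.1f%%',
            result[:error_rate] * 100, result[:throttle_rate] * 100)
```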

Be cautious of the cost factor

The performance of Lambda functions is influenced by memory allocation, which also determines the CPU share a function receives. Functions with higher memory allocations may run faster, but each millisecond of execution costs more. To balance the two, you can gradually increase the memory allocation and run tests to determine which configuration delivers the best performance for the cost. AWS Lambda Power Tuning, an open-source state machine powered by AWS Step Functions, helps optimize performance and cost configurations.
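The trade-off can be sketched by comparing configurations in GB-seconds, the unit in which Lambda compute time is billed. The durations below are hypothetical measurements, and the comparison ignores per-request charges, so it shows relative compute cost only.

```ruby
# Relative compute cost of one invocation in GB-seconds.
def gb_seconds(memory_mb:, duration_ms:)
  (memory_mb / 1024.0) * (duration_ms / 1000.0)
end

# Hypothetical measurements: average duration (ms) observed at each memory size (MB).
measurements = { 128 => 2400, 512 => 520, 1024 => 300, 2048 => 280 }

measurements.each do |memory_mb, duration_ms|
  cost = gb_seconds(memory_mb: memory_mb, duration_ms: duration_ms)
  puts format('%5d MB  %5d ms  %.3f GB-s', memory_mb, duration_ms, cost)
end

cheapest = measurements.min_by { |mb, ms| gb_seconds(memory_mb: mb, duration_ms: ms) }.first
puts "cheapest configuration: #{cheapest} MB"
```

In this made-up data set, 512 MB is both cheaper and much faster than 128 MB, while going beyond 1024 MB buys little speed for a lot more cost. That is the kind of pattern a tuning run is meant to reveal.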

Optimize your code

Reducing the size of the code artifact deployed to Lambda functions is another way to optimize performance. By minimizing the amount of code shipped to the function, it can be downloaded and executed faster during cold starts. Additionally, instead of using one large function, breaking it down into smaller functions dedicated to specific goals can reduce latency and make maintenance easier.

Furthermore, moving initialization code outside the handler function ensures it only executes during the initial cold start, not during subsequent warm invocations. Incorporating multi-threaded programming practices to maximize parallel execution on multiple vCPU cores can significantly reduce the latency and cost of a function.
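Both ideas can be sketched in a few lines. The load_config helper is a hypothetical stand-in for expensive setup such as creating SDK clients, and the global counter exists only to show that the setup runs once. Note the caveat in the comments: in standard Ruby (CRuby), threads overlap while waiting on I/O but do not execute Ruby code in parallel across cores.

```ruby
# Counts how often the initialization code runs (for demonstration only).
$init_runs = 0

# Hypothetical stand-in for expensive setup work.
def load_config
  $init_runs += 1
  { 'table' => 'orders' }
end

# Runs once per execution environment (cold start), not once per invocation.
CONFIG = load_config

def handler(event:, context:)
  ids = event.fetch('ids', [])
  # Independent lookups run in threads. In CRuby this overlaps I/O waits
  # rather than running Ruby code on several cores at once.
  results = ids.map { |id| Thread.new { "#{CONFIG['table']}/#{id}" } }.map(&:value)
  { 'results' => results, 'init_runs' => $init_runs }
end

2.times { puts handler(event: { 'ids' => [1, 2] }, context: nil) }
```

Both invocations report the same init_runs value, because the second one reuses the already initialized CONFIG instead of loading it again.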

Wrapping up

Improving observability in Lambda functions involves focusing on various aspects, including logging, metrics, and traces. AWS provides a suite of tools, including AWS CloudWatch, AWS X-Ray, and AWS CloudTrail, to enhance application and infrastructure observability within the context of AWS Lambda. Additionally, Helios offers a comprehensive range of tools and solutions for visualizing and analyzing logs and metrics generated by Lambda functions. By leveraging these tools and best practices, developers can achieve a higher level of observability, leading to improved application reliability and performance.
