Metrics for your AWS serverless services

“You cannot improve what you cannot measure”

In an increasingly complex world of distributed systems and microservices, where services are built and deployed continuously, monitoring the health of your services is key to operating your applications successfully at scale. For effective monitoring, each of your services should emit metrics that help an observer understand the health of that service in isolation.

Metrics give you quick insight into the current state of your application and help you correlate how its different subsystems behave under load. They can then serve as a navigational guide for investigating and mitigating issues, both proactively and reactively. They can also help you analyze trends as the input to your system changes, or as the system itself changes through continuous deployment.

What to measure?

In this post, we look primarily at metrics for AWS FaaS and related services, namely Lambda, API Gateway, and CloudFront.

One big mindset change when choosing what to measure for serverless services is that you don’t have to worry about system-level metrics (CPU, disk I/O, network) and can focus primarily on application metrics. The key metrics to measure depend on the type of service you are monitoring and on the SLAs you have in place.

For a web application or an API, at a minimum, you would want to know:

  1. Count of requests: This is a core metric that indicates usage of your web application or API. Watch the trend and make sure that any spikes are in line with expectations and not the result of a DDoS attack or other spurious activity. Similarly, pay close attention to unexpected drops, as they can indicate connectivity issues for your clients.
  2. Response time: This metric is one of the primary indicators of your application’s performance. Look for trends in response times, because a deviation from normal indicates performance issues within your code, in downstream systems, or both.
  3. 4xx/5xx errors: 4xx errors mostly indicate client errors, but you still want to investigate an increase in their count or rate, as it could indicate a bad rollout. 5xx errors indicate server-side errors, that is, issues with your code, so you want to know when their frequency rises beyond a certain threshold.
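One simple way to make "deviation from normal" concrete is to compare the latest response time against a baseline window. The following is a minimal sketch, not a production alerting rule: the function name `is_anomalous`, the sample values, and the choice of three standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(baseline_ms, latest_ms, k=3.0):
    """Return True if latest_ms is more than k standard deviations
    above the mean of the baseline response times (hypothetical rule)."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline_ms) + k * stdev(baseline_ms)
    return latest_ms > threshold


# Illustrative response times in milliseconds (made-up values).
baseline = [100, 110, 90, 105, 95]
print(is_anomalous(baseline, 120))
print(is_anomalous(baseline, 200))
```

The same idea applies to request counts and error rates; in practice you would likely tune `k` and the baseline window to your own traffic patterns.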

For FaaS services, the following metrics are a good start:

  1. Count of Invocations: The number of times the function was invoked, including both successful and failed invocations.
  2. Duration of Invocation: With serverless services, this is a number you want to optimize, because the cost of your service is proportional to the invocation duration. Together with the invocation count, it lets you project the execution cost of your service.
  3. Error Count: The number of invocations that failed due to handled or unhandled exceptions inside the function, including out-of-memory (OOM) errors and timeouts.
  4. Throttle Count: The number of invocation attempts that were throttled because the invocation rate exceeded the concurrency limits of your AWS account.

Tip 1: The invocation count includes both successful and failed invocations but excludes throttled ones. This is important to remember when you are troubleshooting an increase in the error rate.
Tip 2: An increase in the error count can be an indication that you need to revisit your Lambda configuration.
Tip 3: Pay close attention to the throttled invocation count: it indicates that you are hitting the AWS default limits, and you may want to open a support ticket with AWS to have them increased.
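The relationships between these counts can be sketched in a few lines. This is a rough illustration under stated assumptions: the function names are hypothetical, the prices are placeholder parameters rather than actual AWS rates, and the cost model (requests plus GB-seconds of memory and duration) ignores the free tier and billing granularity.

```python
def error_rate(invocations, errors):
    """Share of invocations that failed. Throttled attempts are not part
    of the invocation count, so they do not appear in the denominator."""
    return errors / invocations if invocations else 0.0


def throttle_rate(invocations, throttles):
    """Share of all invocation attempts (invoked + throttled) that were throttled."""
    attempts = invocations + throttles
    return throttles / attempts if attempts else 0.0


def projected_cost(invocations, avg_duration_ms, memory_mb,
                   price_per_gb_second, price_per_million_requests):
    """Rough execution cost; both prices are caller-supplied assumptions."""
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    request_cost = (invocations / 1_000_000) * price_per_million_requests
    return gb_seconds * price_per_gb_second + request_cost


# Placeholder prices for illustration only; check current AWS pricing.
cost = projected_cost(1_000_000, avg_duration_ms=200, memory_mb=512,
                      price_per_gb_second=0.00001,
                      price_per_million_requests=0.20)
print(round(cost, 2))
```

Halving the average duration in this model halves the duration-based part of the cost, which is why duration is usually the first number to optimize.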

How to measure?

If you are using AWS to build your web applications, APIs, or functions, AWS CloudWatch automatically collects and tracks metrics for your cloud services, for the most part without any additional configuration.

You can access the metrics using the AWS console, the AWS CLI, or the CloudWatch API. Alternatively, you can use Jazz, an OSS product, to manage the complete lifecycle of your serverless services. As of the v1.8 release of Jazz, you can view metrics for all your serverless services on the service details page.

1. API Gateway: AWS API Gateway integrates with CloudWatch to provide API execution metrics in near real time (NRT). By default, request counts (successful, 4xx, and 5xx) and response times are available in one-minute periods.

Tip: The default metrics are a good start, but to make them more useful you will want to turn ON “detailed monitoring”. You can do this from the AWS console (step 6(d)); alternatively, you can run the following command:
aws apigateway update-stage --rest-api-id ${YOUR_API_ID} --stage-name ${YOUR_API_STAGE} --patch-operations op=replace,path=/*/*/metrics/enabled,value=true --region ${YOUR_API_REGION}

With detailed monitoring turned ON, you can filter the above metrics by your API’s stage, resource, and method, which is what you need when you are trying to narrow down an issue.
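To illustrate filtering by stage, resource, and method, here is a minimal sketch that builds the request parameters for a CloudWatch metric query. The helper name `api_gateway_metric_params` and the example API name and path are hypothetical. The resulting dictionary is shaped for boto3's `get_metric_statistics` call, but the sketch itself makes no AWS calls.

```python
from datetime import datetime, timedelta, timezone


def api_gateway_metric_params(api_name, stage, resource, method,
                              metric_name="5XXError", minutes=60, now=None):
    """Build query parameters for one API Gateway metric, narrowed to a
    single stage/resource/method (needs detailed monitoring enabled)."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ApiGateway",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
            {"Name": "Resource", "Value": resource},
            {"Name": "Method", "Value": method},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,           # one-minute periods
        "Statistics": ["Sum"],  # total errors per period
    }


# "orders-api" and "/orders" are made-up example values.
params = api_gateway_metric_params("orders-api", "prod", "/orders", "GET")
# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
print(params["Dimensions"])
```

Dropping the `Resource` and `Method` dimensions would give the stage-level view instead.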

2. CloudFront: You can view metrics for your CloudFront website in the CloudWatch console. Request count, bytes uploaded/downloaded, and 4xx/5xx error rates are available as part of the free tier.

Tip 1: One of the dimensions required for CloudFront is “Region” and the value for this must be “Global”.
Tip 2: When viewing metrics using the CloudWatch console, change region to N. Virginia (us-east-1) as all CloudFront metrics are stored in this region.
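The two tips above can be captured in a small sketch. The helper `cloudfront_query` and the structure it returns are hypothetical and only group the pieces you would need when querying programmatically: the region to connect to, the metric with its dimensions, and a suitable statistic. The distribution ID shown is a made-up placeholder.

```python
# CloudFront metrics are stored in N. Virginia regardless of where viewers are.
CLOUDFRONT_METRICS_REGION = "us-east-1"


def cloudfront_query(distribution_id, metric_name="Requests"):
    """Describe a CloudFront metric query (illustrative structure only)."""
    # Error rates are percentages, so average them; counts and bytes are summed.
    stat = "Average" if metric_name.endswith("ErrorRate") else "Sum"
    return {
        "client_kwargs": {"region_name": CLOUDFRONT_METRICS_REGION},
        "metric": {
            "Namespace": "AWS/CloudFront",
            "MetricName": metric_name,
            "Dimensions": [
                {"Name": "DistributionId", "Value": distribution_id},
                {"Name": "Region", "Value": "Global"},  # required value
            ],
        },
        "stat": stat,
    }


# "EXAMPLEDISTID" is a placeholder, not a real distribution.
query = cloudfront_query("EXAMPLEDISTID", "5xxErrorRate")
print(query["client_kwargs"], query["stat"])
```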

3. Lambda: AWS Lambda tracks your function invocations and publishes metrics such as the total invocation count, error count, and throttled invocation count to CloudWatch, with no additional configuration required.
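As a sketch of how these Lambda metrics might be fetched together, the snippet below builds a list of queries in the shape expected by the CloudWatch `GetMetricData` API, including a metric math expression for the error rate. The helper name `lambda_metric_queries`, the function name `my-function`, and the five-minute period are assumptions; no AWS call is made here.

```python
def lambda_metric_queries(function_name, period=300):
    """Build GetMetricData-style queries for one Lambda function."""

    def stat_query(query_id, metric_name, stat="Sum"):
        return {
            "Id": query_id,  # referenced by the expression below
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": function_name},
                    ],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        }

    return [
        stat_query("invocations", "Invocations"),
        stat_query("errors", "Errors"),
        stat_query("throttles", "Throttles"),
        {
            # Metric math: percentage of invocations that ended in an error.
            "Id": "error_rate",
            "Expression": "100 * errors / invocations",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ]


queries = lambda_metric_queries("my-function")  # hypothetical function name
# With boto3, these could be passed as MetricDataQueries to
# cloudwatch.get_metric_data(...) along with StartTime and EndTime.
print([q["Id"] for q in queries])
```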

If you use Jazz, all of the configuration mentioned in this section is handled as part of the built-in CI/CD, giving your entire team and organization an integrated experience within Jazz’s dashboard.

Ephemeral serverless services, combined with the increased role of automation, make it harder to diagnose and troubleshoot issues in a live system. Collecting metrics is a critical part of managing such production applications, so it is important to get started on this as soon as possible.