Monitoring Serverless applications using AWS CloudWatch and Grafana

Kmar Ben Abdallah
limehome-engineering
5 min read · Jul 20, 2022
Grafana Dashboard

Limehome is a serverless-first company that deploys most of its microservices in the AWS Cloud, making heavy use of AWS Lambda and Amazon API Gateway. As a result, classic server metrics such as CPU usage, disk usage, and available memory are no longer relevant in our case.

One of our first goals as a platform team is to define the right metrics to increase observability and to create the alerts we actually need.

The challenges we faced are:

  • Identify metrics that can be collected continuously, which means they can't be tied to volatile resources such as individual Lambda function instances.
  • Minimize costs allocated to chosen solutions by leveraging existing services.

Below, we describe our approach to introducing effective monitoring.

Metrics

Standard AWS CloudWatch metrics

AWS CloudWatch offers a variety of default metrics for each service. You can check the full list under the related AWS namespace; for instance, you'll find the API Gateway and Lambda namespaces among others.

AWS default Namespaces
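As a quick illustration (not part of our setup), you can explore what a given namespace publishes with the ListMetrics API; a minimal sketch using boto3:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# List the default metrics AWS publishes under the Lambda namespace.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="AWS/Lambda"):
    for metric in page["Metrics"]:
        dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        print(metric["MetricName"], dims)
```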

Some of the useful metrics that we’re visualizing in Grafana:

  • Lambda-related metrics: Errors, Invocations, Duration, Throttles, ConcurrentExecutions
CloudWatch default Lambda metrics in Grafana
  • API Gateway-related metrics: 5XXError, 4XXError, Count, Latency, IntegrationLatency
CloudWatch default API Gateway metrics in Grafana
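Under the hood, these panels are plain CloudWatch queries. As an illustration, fetching the Lambda Errors series for a single function with boto3 could look like the sketch below (the function name is a placeholder):

```python
import datetime as dt
import boto3

cloudwatch = boto3.client("cloudwatch")
now = dt.datetime.now(dt.timezone.utc)

# Sum of Lambda errors per 5-minute bucket over the last 3 hours,
# i.e. the kind of series a Grafana CloudWatch panel plots.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-service-prod-handler"}],  # placeholder
    StartTime=now - dt.timedelta(hours=3),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```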

Create metrics based on AWS CloudWatch Logs metric filters:

Since we needed to add more custom metrics that developers can easily configure, we decided to create our own metrics based on the AWS CloudWatch logs we already collect.

The most used log group is tied to API Gateway custom access logging. This is additional logging that you can enable, with a customizable destination and format:

API Gateway Custom Access Logging
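As a sketch of what enabling it through code can look like (the REST API ID, stage name, and log group ARN are placeholders): the format below is built from API Gateway $context variables and yields the JSON fields ($.path, $.status, $.httpMethod) that the metric filters in the next section match on.

```python
import json
import boto3

apigateway = boto3.client("apigateway")

# JSON access-log format built from API Gateway $context variables.
log_format = json.dumps({
    "requestId": "$context.requestId",
    "path": "$context.path",
    "httpMethod": "$context.httpMethod",
    "status": "$context.status",
    "responseLatency": "$context.responseLatency",
})

apigateway.update_stage(
    restApiId="a1b2c3d4e5",  # placeholder REST API ID
    stageName="prod",        # placeholder stage name
    patchOperations=[
        {"op": "replace",
         "path": "/accessLogSettings/destinationArn",
         "value": "arn:aws:logs:eu-west-1:123456789012:log-group:api-access-logs"},  # placeholder
        {"op": "replace",
         "path": "/accessLogSettings/format",
         "value": log_format},
    ],
)
```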

Once the logs start streaming, you can create metric filters and start collecting useful data.

Examples of filters:

  • Count the successful POST requests (with status code 200):

{ ($.path = "/your_path/*/*") && ($.status = "200") && ($.httpMethod = "POST") }

  • Count the unsuccessful POST requests (with status code 400 or 500):

{ ($.path = "/your_path/*/*") && ($.status = "400" || $.status = "500") && ($.httpMethod = "POST") }
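If you manage these filters through code, a pattern like the first one above can be created with boto3; a minimal sketch (the log group, filter, and metric names are placeholders):

```python
import boto3

logs = boto3.client("logs")

# Turn matching access-log lines into a countable CloudWatch metric.
logs.put_metric_filter(
    logGroupName="/aws/api-gateway/access-logs",  # placeholder log group
    filterName="successful-post-requests",        # placeholder filter name
    filterPattern='{ ($.path = "/your_path/*/*") && ($.status = "200") && ($.httpMethod = "POST") }',
    metricTransformations=[
        {
            "metricName": "SuccessfulPostRequests",  # placeholder metric name
            "metricNamespace": "Custom/ApiGateway",  # placeholder namespace
            "metricValue": "1",  # each matching log line counts as 1
            "defaultValue": 0.0,
        }
    ],
)
```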

Once metrics are created, you can start visualizing them in Grafana and create alerts on top of them using Grafana itself or another solution of your choice.

Please refer to AWS’s official documentation for further information.

Create your custom AWS CloudWatch metrics

Sometimes we also need to create and push our own metrics that are tied to the application and can't be obtained from anywhere else.

You can integrate this into your code so that it pushes metric data when an event happens, for example.
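A minimal sketch of what such a push can look like with boto3 (the namespace, metric name, and dimension values are placeholders; we use invoice generation here only because it reappears in the alerting example below):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_invoice_generated(count: int = 1) -> None:
    """Push a custom metric data point when an invoice is generated."""
    cloudwatch.put_metric_data(
        Namespace="Limehome/Billing",  # placeholder namespace
        MetricData=[
            {
                "MetricName": "InvoicesGenerated",  # placeholder metric name
                "Dimensions": [{"Name": "Environment", "Value": "production"}],  # placeholder
                "Value": count,
                "Unit": "Count",
            }
        ],
    )
```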

Please refer to AWS’s documentation as well to learn more about the concept and how it can be implemented. Multiple programming languages are supported.

Alerts

Use Grafana Alerting:

Once metrics are being fetched correctly into Grafana, you can start creating alerts and defining notification channels.

In our case, we're using Grafana v7.5.5.

Once you create a Graph panel and add a query to collect your metrics, you can add your alerts as well.

Example:

Grafana Graph Panel Query
Grafana Graph Panel Alert

The alert will be triggered if the sum of the errors that happened during the last 3 hours is above 20.

Grafana alerting currently has a limitation: it cannot restrict an alert to fire only when a condition is met within a specific time range. For example, we would like to be notified if invoices were not generated between specific hours (tied to when the process runs), but we don't want to be alerted outside of that range, since no invoices are generated at those times anyway. To mitigate this, we built our own solution, presented below.

Custom Alerting Solution:

We have developed a solution built on top of a few AWS services: EventBridge rules, Lambda, and CloudWatch, integrated with Slack.

Since we always opt for Infrastructure as Code, this stack is developed using the Serverless Framework, so every resource is managed through code.

Mainly, we have defined an alert class where you specify the following (see the sketch after this list):

  • CloudWatch metric info: namespace, dimensions, name
  • Stat: Sum, Maximum, Average…
  • Start and end hours: the time range in which we should check the metric
  • Threshold and operator: the value to compare against, and whether the metric should be below or above it
  • Message: the text to send to the configured Slack channel
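In Python terms, the class looks roughly like the sketch below; the field names are illustrative rather than our exact implementation:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Alert:
    """One entry in our alerts list; field names are illustrative."""

    namespace: str              # CloudWatch metric namespace
    metric_name: str            # CloudWatch metric name
    dimensions: Dict[str, str]  # CloudWatch metric dimensions
    stat: str                   # "Sum", "Maximum", "Average", ...
    start_hour: int             # UTC hour when the check window opens
    end_hour: int               # UTC hour when the check window closes
    threshold: float            # the value to compare against
    operator: str               # "<" or ">": alert if the metric is below/above the threshold
    message: str                # text sent to the configured Slack channel
```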

So we have an EventBridge rule running every hour and triggering a Lambda function, which loops over our alerts list. If an alert's start time matches the current hour, the function pulls the needed metrics from CloudWatch; if the conditions to fire the alert are met, it sends a Slack message.

Custom Solution Architecture
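A simplified sketch of that hourly Lambda handler, reusing the Alert class above (the Slack webhook URL and the window arithmetic are illustrative; our actual implementation differs in detail):

```python
import datetime as dt
import json
import operator
import urllib.request

import boto3

cloudwatch = boto3.client("cloudwatch")
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook
OPERATORS = {"<": operator.lt, ">": operator.gt}

ALERTS = []  # populated with Alert instances from the sketch above


def handler(event, context):
    now = dt.datetime.now(dt.timezone.utc)
    for alert in ALERTS:
        if alert.start_hour != now.hour:
            continue  # only evaluate alerts whose check hour is now
        # Simplified window arithmetic: look back over the alert's time range.
        window_hours = (alert.end_hour - alert.start_hour) % 24 or 24
        stats = cloudwatch.get_metric_statistics(
            Namespace=alert.namespace,
            MetricName=alert.metric_name,
            Dimensions=[{"Name": n, "Value": v} for n, v in alert.dimensions.items()],
            StartTime=now - dt.timedelta(hours=window_hours),
            EndTime=now,
            Period=window_hours * 3600,
            Statistics=[alert.stat],
        )
        datapoints = stats["Datapoints"]
        value = datapoints[0][alert.stat] if datapoints else 0.0
        if OPERATORS[alert.operator](value, alert.threshold):
            # Condition met: notify the configured Slack channel.
            payload = json.dumps({"text": alert.message}).encode()
            request = urllib.request.Request(
                SLACK_WEBHOOK_URL,
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(request)
```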

Here are some good reads to learn more about our implementation:

Slack notification

You can adapt the custom solution to your needs, for example by using CloudWatch Alarms with a Lambda that pushes Slack messages.

You can also explore CloudWatch dashboards instead of Grafana; we picked the latter for the ease of customising panels and for cost reasons. AWS includes three CloudWatch dashboards in the free tier and charges $3 per dashboard per month beyond that.

Besides this setup, we're using Sentry for application monitoring, alongside other tools that complement our observability.

Thanks for going through the article, and happy implementing! ✋

By the way, check out the open roles in our Tech and Engineering team.
