Real-time Alerting from Application Logs in Amazon CloudWatch

Published in Insider Engineering · 4 min read · Sep 11, 2023

Monitoring health and performance is critical for any application and for the business it supports. When an SLA is breached or performance degrades, the application team must be alerted immediately so that it can investigate and resolve the issue.

Our data and machine learning teams run applications written in a wide variety of programming languages (Scala, Python, Go, Node.js) across several environments (EMR, EKS, Elastic Beanstalk, Lambda). With such diversity, we needed to standardise our logging and alerting processes. This post covers how we used Amazon SQS, Lambda, and CloudWatch to build a unified logging and alerting architecture.

Our requirements were to:

Collect all application logs in a single service with a standard structure.
Query the application logs.
Generate real-time alerts for critical errors.
Generate batch alerts for non-critical errors.

Amazon CloudWatch

Unsurprisingly, we selected Amazon CloudWatch as the centralised log repository, given its capabilities and its integrations with other AWS services. We structured the logging by creating a log group for each pipeline. Within these log groups, each application writes to its own log stream. This granular layout makes querying an individual application's logs easier, quicker, and more cost-effective.

CloudWatch does not require a pre-defined format for log messages. However, to standardise our queries we use a fixed log schema, which lets us write generic and powerful queries. For alert generation we use an error object in the log. This object contains key information such as the Slack destination for the alert, the alert period, and the details of the exception, including the stack trace. Below is a simplified example of a log object.

{
    "log_group": "my-log-group",
    "log_stream": "my-log-stream",
    "level": "error",
    "message": "The database connection failed!",
    "error": {
        "slack": "#database-alerts",
        "period": "realtime",
        "exception": "ConnectionError",
        "stack_trace": "..."
    }
}
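As a rough illustration of how a client might assemble a record in this shape, the sketch below builds the log object in Python. The function name `build_log` and its parameters are hypothetical and are not the API of our logging library; only the field names come from the example above.

```python
import json
import traceback
from typing import Optional


def build_log(log_group: str, log_stream: str, level: str, message: str,
              exc: Optional[BaseException] = None,
              slack: Optional[str] = None,
              period: str = "realtime") -> dict:
    """Assemble a log record that follows the schema shown above."""
    record = {
        "log_group": log_group,
        "log_stream": log_stream,
        "level": level,
        "message": message,
    }
    # The error object is attached only when there is an exception to report.
    if exc is not None:
        record["error"] = {
            "slack": slack,
            "period": period,
            "exception": type(exc).__name__,
            "stack_trace": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        }
    return record


try:
    raise ConnectionError("database unreachable")
except ConnectionError as err:
    log = build_log("my-log-group", "my-log-stream", "error",
                    "The database connection failed!",
                    exc=err, slack="#database-alerts")

print(json.dumps(log, indent=4))
```

A non-error call such as `build_log("my-log-group", "my-log-stream", "info", "started")` returns a record without the error field.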

Sending the Logs

Hundreds of applications run in production, and all of them need to send their logs to a central service. We developed a logging library that those applications can easily install and use. The library sends the logs to an Amazon SQS queue. An AWS Lambda function consumes that queue, validates the logs, groups them by log group, and issues PutLogEvents API calls through the AWS SDK to send the logs to CloudWatch.
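The consumer side of this flow can be sketched roughly as follows. This is a minimal illustration, not our production function: the required-field check, the use of the SQS `SentTimestamp` attribute as the event timestamp, and the `logs_client` parameter are all assumptions made for the example. The `put_log_events` call is the boto3 name for PutLogEvents.

```python
import json
from collections import defaultdict

# Assumed validation rule: every log must carry these fields.
REQUIRED_FIELDS = ("log_group", "log_stream", "level", "message")


def group_sqs_records(sqs_event: dict) -> dict:
    """Validate SQS messages and group them by (log group, log stream)."""
    batches = defaultdict(list)
    for record in sqs_event.get("Records", []):
        try:
            log = json.loads(record["body"])
        except (KeyError, ValueError):
            continue  # skip messages that are not valid JSON
        if not isinstance(log, dict) or any(f not in log for f in REQUIRED_FIELDS):
            continue  # skip logs that do not match the schema
        batches[(log["log_group"], log["log_stream"])].append({
            # SQS stamps each message with SentTimestamp (epoch milliseconds).
            "timestamp": int(record["attributes"]["SentTimestamp"]),
            "message": json.dumps(log),
        })
    # PutLogEvents expects the events of a batch in chronological order.
    for events in batches.values():
        events.sort(key=lambda e: e["timestamp"])
    return dict(batches)


def handler(event, context, logs_client=None):
    """Lambda entry point; logs_client would be boto3.client("logs") in AWS."""
    if logs_client is None:
        import boto3  # imported lazily so the grouping logic runs without it
        logs_client = boto3.client("logs")
    for (group, stream), events in group_sqs_records(event).items():
        logs_client.put_log_events(
            logGroupName=group, logStreamName=stream, logEvents=events
        )
```

The sketch leaves out details a real consumer would need, such as creating missing log streams and splitting batches that exceed the PutLogEvents limits.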

Querying Logs

Users can query the logs with CloudWatch Logs Insights. Its query language lets us write powerful queries, which can also be saved for future reference. Log messages are stored as strings, but CloudWatch automatically discovers the fields of a message when it is in JSON format. This also permits a flexible schema for the logs: for example, the error field can be omitted when the log is not an error log.
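To make this concrete, here is a hedged sketch of a Logs Insights query over the schema above, together with a small helper that runs it through the SDK. The helper names `build_error_query` and `run_query` are hypothetical; `start_query` and `get_query_results` are the boto3 CloudWatch Logs calls, and `logs_client` is assumed to be `boto3.client("logs")`.

```python
import time


def build_error_query(limit: int = 20) -> str:
    """Return a Logs Insights query listing the most recent error logs."""
    return "\n| ".join([
        "fields @timestamp, @logStream, message, error.exception",
        'filter level = "error"',
        "sort @timestamp desc",
        f"limit {limit}",
    ])


def run_query(logs_client, log_group: str, query: str,
              minutes: int = 60, poll_seconds: float = 1.0) -> list:
    """Run a query over the last `minutes` minutes and wait for the result."""
    end = int(time.time())  # StartQuery takes epoch seconds
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=query,
    )
    while True:
        response = logs_client.get_query_results(queryId=started["queryId"])
        # Any status other than Scheduled/Running is final.
        if response["status"] not in ("Scheduled", "Running"):
            return response["results"]
        time.sleep(poll_seconds)


print(build_error_query())
```

Nested JSON fields such as `error.exception` are addressed with dot notation, which is what makes the fixed schema convenient to query.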

Sending Automated Slack Alerts

Slack alerts are sent for every error log in the log groups that use this alerting architecture. Real-time and batch alerts have different integration requirements, so events are created and sent through two separate paths. Each alert message includes the issue and a hyperlink to a CloudWatch Logs Insights query for the details, so the team can move straight from the alert to investigating the specifics of the problem.

Sample Slack alert with a CloudWatch query for details
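A message of this kind could be assembled along the following lines. This is only a sketch: `build_slack_alert` is a hypothetical helper, the payload shape (a channel plus mrkdwn text) is an assumption about what the Slack Service accepts, and the query URL is passed in by the caller rather than constructed here.

```python
def build_slack_alert(log: dict, query_url: str) -> dict:
    """Build a Slack message payload for one error log."""
    error = log["error"]
    text = (
        f"*{error['exception']}* in `{log['log_group']}/{log['log_stream']}`\n"
        f"{log['message']}\n"
        # Slack's mrkdwn link syntax is <url|label>.
        f"<{query_url}|Open in CloudWatch Logs Insights>"
    )
    return {"channel": error["slack"], "text": text}


sample_log = {
    "log_group": "my-log-group",
    "log_stream": "my-log-stream",
    "level": "error",
    "message": "The database connection failed!",
    "error": {"slack": "#database-alerts", "period": "realtime",
              "exception": "ConnectionError", "stack_trace": "..."},
}
# Placeholder URL for illustration only.
payload = build_slack_alert(sample_log, "https://example.com/insights-query")
print(payload["text"])
```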

Real-time Alerts

Real-time alerts are configured by setting the error.period value to “realtime” in the application logs. We deliver them with CloudWatch Logs subscription filters, so each log group needs a subscription filter for this purpose. The filter forwards matching error logs to a Lambda function, which sends the alert message through the Slack Service to the Slack channel or user named in the error log.

Sample CloudWatch Logs Subscription Filter
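On the receiving side, CloudWatch Logs delivers subscription data to Lambda as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The sketch below decodes that payload and keeps the logs marked as real-time. The function names are hypothetical, and the filter pattern mentioned in the comment is an assumed example rather than our exact configuration.

```python
import base64
import gzip
import json

# An assumed filter pattern for the subscription, in JSON filter syntax:
#   { $.level = "error" && $.error.period = "realtime" }


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload CloudWatch Logs passes to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def realtime_errors(event: dict) -> list:
    """Return the parsed logs whose error.period is "realtime"."""
    payload = decode_subscription_event(event)
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # control messages carry no log events
    alerts = []
    for log_event in payload["logEvents"]:
        try:
            log = json.loads(log_event["message"])
        except ValueError:
            continue  # ignore messages that are not JSON
        error = log.get("error") if isinstance(log, dict) else None
        if isinstance(error, dict) and error.get("period") == "realtime":
            alerts.append(log)
    return alerts
```

Re-checking the period inside the function is a defensive choice: it keeps the handler correct even if the subscription filter pattern is broader than intended.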

Batch Alerts

Batch alerts are configured by setting the error.period value to “daily” or “hourly”, which defines the aggregation period of the alerts. A separate Lambda function, triggered hourly by an EventBridge event, fetches the recent error logs from CloudWatch Logs with FilterLogEvents requests. Once the data is retrieved, the function sends the alert messages through the Slack Service to the Slack channel or user named in the error log.
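The fetch step might look roughly like this. It is a sketch under stated assumptions: `fetch_batch_errors` is a hypothetical name, the filter pattern is an assumed example, and `logs_client` stands for `boto3.client("logs")`, whose `filter_log_events` call corresponds to FilterLogEvents. Results are grouped by Slack destination so that each channel can receive one aggregated message.

```python
import json
from collections import defaultdict


def fetch_batch_errors(logs_client, log_group: str, period: str,
                       start_ms: int, end_ms: int) -> dict:
    """Collect error logs for one period, grouped by Slack destination."""
    # Doubled braces produce the literal { } of the JSON filter pattern.
    pattern = f'{{ $.level = "error" && $.error.period = "{period}" }}'
    kwargs = {
        "logGroupName": log_group,
        "startTime": start_ms,  # FilterLogEvents takes epoch milliseconds
        "endTime": end_ms,
        "filterPattern": pattern,
    }
    by_destination = defaultdict(list)
    while True:
        page = logs_client.filter_log_events(**kwargs)
        for event in page.get("events", []):
            log = json.loads(event["message"])
            by_destination[log["error"]["slack"]].append(log)
        token = page.get("nextToken")
        if not token:
            return dict(by_destination)
        kwargs["nextToken"] = token  # keep paging until all results are read
```

For an hourly run, `start_ms` and `end_ms` would cover the previous hour; how daily alerts are windowed is left open here.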

CloudWatch Dashboards

It’s worth noting that CloudWatch Logs Insights queries can be directly embedded into CloudWatch dashboards. This functionality enables the creation of specialised dashboards for error logs or application-based metrics, offering a unified view for monitoring multiple applications at a glance.
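As an illustration, a dashboard with one Logs Insights widget per log group could be generated like this. The helper name, widget sizes, titles, and region value are assumptions for the example; the body follows the CloudWatch dashboard JSON structure for widgets of type `log`.

```python
import json


def build_error_dashboard(log_groups: list, region: str) -> str:
    """Return a dashboard body with one error-log widget per log group."""
    widgets = []
    for row, group in enumerate(log_groups):
        widgets.append({
            "type": "log",
            # Stack the widgets vertically, each spanning the full width.
            "x": 0, "y": row * 6, "width": 24, "height": 6,
            "properties": {
                "region": region,
                "title": f"Recent errors: {group}",
                "view": "table",
                "query": (
                    f"SOURCE '{group}' | fields @timestamp, message, error.exception"
                    ' | filter level = "error" | sort @timestamp desc | limit 20'
                ),
            },
        })
    return json.dumps({"widgets": widgets})


body = build_error_dashboard(["my-log-group"], "eu-west-1")
print(body)
```

The resulting string can be passed as `DashboardBody` to the `put_dashboard` call of a boto3 CloudWatch client.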

Conclusion

By combining AWS services such as CloudWatch, Lambda, and SQS, we built a unified logging and alerting architecture that meets our specific needs. Real-time and batch alerts are delivered through Slack, so application teams can act and investigate immediately. CloudWatch Logs Insights offers powerful querying capabilities and dashboard integrations, giving us a complete view of our applications’ statuses.

If you want to read more about CloudWatch usage at Insider you can check this other post with more code examples and this PHP integration. Follow us on the Insider Engineering Blog to read more about our Agile Best Practices, AWS solutions at scale, and engineering stories.