Decoding the Digital Trail: A Deep Dive into Software Logging Part 2

Brandon Fajardo
Published in Four Nine Digital
13 min read · Aug 7, 2024

An error log being held by a deep sea diver.

This blog marks part two of my two-part series on software logging. If you are new to the series, I encourage you to read part one to learn more about the basics of logging. Otherwise, if you would like to explore more advanced usage of logging through the AWS CloudWatch Logs service, feel free to stick around!

In this blog, we will begin by exploring what CloudWatch is and what the Logs feature offers. Next, we will define what a log event, a log stream and a log group are. Finally, we will attempt to solve real-world production issues through fictional scenarios that I will present. Along the way, I hope to reveal aspects of logging that will excite you and provide a bit of commentary on how we can begin to resolve such issues. So without further ado, let’s get to it!

Note: Although the blog primarily focuses on logging within the AWS ecosystem using CloudWatch, I encourage you to continue reading, as many of the features and debugging methods I will discuss are common across various logging services.

What is CloudWatch and CloudWatch Logs?

CloudWatch is a comprehensive monitoring and observability service designed to provide insights into the performance and operational health of AWS resources and applications.

It encompasses several features beyond just logs, including but not limited to: metrics, alarms, dashboards, and events. CloudWatch Logs, on the other hand, is the feature of the CloudWatch service dedicated to managing and analyzing log data. In this blog, we will spend most of our time with the latter.

To name a few key benefits, the Logs service:

  • Centralizes logs generated by various sources including AWS services, applications and custom sources.
  • Enables searching and filtering of log data.
  • Supports the creation of alarms based on specific log events or patterns.

Put simply, CloudWatch Logs is like a big diary for your entire application. It collects and stores all the important notes (logs) your resources write, no matter where they come from. This means that within the AWS ecosystem, you could theoretically group all of your resources’ logs into one place and receive all the benefits listed above.

Consolidating all resource logs into one log group.

Note: Some AWS services have built-in support for CloudWatch Logs and automatically send logs without the need for additional configuration (e.g. EC2, Lambda and RDS), while other services are a bit trickier and may need additional setup.

What is a log event, log stream and log group?

Log event

A log event is simply a record of some activity that occurred in your system. Log events are the individual entries within a log stream.

Example of a log event.
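In code, a log event often ends up as a single structured JSON line. As a rough sketch of the idea (the logEvent helper and its fields are my own invention here, not part of any AWS SDK):

```javascript
// Hypothetical helper that turns a message plus context into one
// structured log event (a single JSON line on stdout).
function logEvent(level, message, context = {}) {
  const event = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context,
  };
  // In a Lambda, anything written to stdout/stderr becomes a log event.
  console.log(JSON.stringify(event));
  return event;
}

// Example: the kind of entry a ticketing service might emit.
logEvent("error", "Error processing purchase of ticket.", {
  username: "brandon123",
});
```

Keeping every event in a predictable JSON shape like this is what makes the filtering and querying we do later in this blog possible.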

Log stream

A log stream is a sequence of log events that share the same source. Log streams within a log group represent different instances or sources of your application or service.

Example of a log stream.
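For Lambda specifically, log streams are named after the date, the function version and an instance identifier, so each concurrent execution environment writes to its own stream. A typical stream name looks roughly like this (the trailing hash is illustrative):

```
2024/08/07/[$LATEST]a1b2c3d4e5f64789a1b2c3d4e5f64789
```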

Log group

A log group is a container for log streams that share the same retention, monitoring and access control settings. Log groups help you organize logs for different resources, applications or environments. Consolidating your logs into a single “location” is accomplished by employing log groups.

Example of a log group.

You may choose to edit the log group for your specific service(s). This can be done through the AWS console or through code.

Editing the log group in the AWS console.
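As one example of doing it through code, retention can be set from the command line with the AWS CLI; a sketch, assuming a log group named App-Logs (the name is illustrative):

```
# Keep logs in the hypothetical App-Logs group for 30 days.
aws logs put-retention-policy \
  --log-group-name App-Logs \
  --retention-in-days 30
```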

By understanding the difference between a log event, log stream and a log group you can begin to better navigate around the CloudWatch Logs console to manage and organize logs more effectively. As your system grows, having a clear understanding of these concepts will ensure that you can efficiently handle larger volumes of log data. A disorganized environment can lead to wasted time and effort when trying to locate relevant data, especially during critical incidents or high traffic days.

Let’s keep these concepts in mind as we will now make the pivot into the next section of the blog.

Setting the stage

Fictional band, Bear Claw

You work for a company that sells tickets to various events, including concerts, sports events, theatre performances, and other live entertainment shows. A famous rock band, Bear Claw, is looking to perform at a venue selling a total of thirty thousand tickets to be split between their Saturday and Sunday shows. The tickets will be sold on your website on Monday, the week of the show. Expecting a spike in traffic, you’ve been tasked by your manager to address and resolve any production issues on that day. It is now Monday and you are currently on standby.

Issue #1

It has been reported that many users have been experiencing issues purchasing a ticket for Sunday’s show. They are met with a notification error that reads “An error occurred”.

Error message received upon clicking the Purchase button.

This is a common problem in production where a front end flow (such as payments) gets compromised due to a faulty backend. In this scenario, let’s assume an AWS Lambda function (think of a regular function) gets triggered when the purchase button is clicked.

How might we use CloudWatch Logs to help us debug? Let’s consider what the purchase function might look like.

exports.handler = async (event) => {
  const { username, artist, day, promo } = JSON.parse(event.body)

  try {
    const availableTicket = await getAvailableTicket(artist, day)

    if (!availableTicket.length) {
      throw Error(`${artist}: All tickets for ${day} are sold out.`)
    }

    await purchaseTicket(username, promo, availableTicket)

    console.info(`Ticket to ${artist} for ${day} has been purchased successfully by ${username}.`)

    return {
      statusCode: 200,
      body: {
        message: `Ticket to ${artist} for ${day} has been purchased successfully.`
      }
    }
  } catch (e) {
    const logData = {
      username,
      promo,
      message: `Error processing purchase of ticket. ${e.message}`
    }

    // Log the structured error so it lands in CloudWatch Logs.
    console.error(JSON.stringify(logData))

    return {
      statusCode: 500,
      body: {
        message: 'An error occurred'
      }
    }
  }
}
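To see how the sold-out path behaves, here is a minimal local sketch; the stubbed helpers and the simplified handler are illustrative, not the real implementation:

```javascript
// Illustrative stand-ins for the real data helpers.
const getAvailableTicket = async (artist, day) => []; // simulate: sold out
const purchaseTicket = async () => {};

// Simplified handler, runnable locally.
const handler = async (event) => {
  const { username, artist, day, promo } = JSON.parse(event.body);
  try {
    const availableTicket = await getAvailableTicket(artist, day);
    if (!availableTicket.length) {
      throw Error(`${artist}: All tickets for ${day} are sold out.`);
    }
    await purchaseTicket(username, promo, availableTicket);
    return { statusCode: 200, body: { message: "Purchased." } };
  } catch (e) {
    // This is the log event that would land in CloudWatch Logs.
    console.error(JSON.stringify({
      username,
      promo,
      message: `Error processing purchase of ticket. ${e.message}`,
    }));
    return { statusCode: 500, body: { message: "An error occurred" } };
  }
};

// Simulate a click of the Purchase button for the Sunday show.
handler({
  body: JSON.stringify({
    username: "brandon123",
    artist: "Bear Claw",
    day: "02-02-2024",
  }),
}).then((res) => console.log(res.statusCode)); // prints 500
```

Because the stub returns an empty array, the error path runs every time, which is exactly the scenario we are about to debug in the console.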

1. Start by opening the CloudWatch Logs console and navigating to the Log groups tab on the left-hand side. Since we want to debug the purchaseTicket Lambda function, we can use the search bar and filter for it. Click into it.

Note: Lambda log groups get created after the first invocation by default.

Searching for the correct log group using the AWS console.

2. Click into the latest log stream and observe all the log events.

Observing the log events.

Resolution

Very easily, we can see that all tickets to Sunday’s show are sold out based on the log event message:

“Error processing purchase of ticket. Bear Claw: All tickets for 02-02-2024 are sold out.”

Looking at the function, we can see that getAvailableTicket likely returned an empty array, which is why an error was thrown and the console.error inside the catch block was executed. Sadly, however, it appears that the front end does not reflect the current state of the inventory, so we can apply one of two fixes:

  • Update the front end message to notify customers that tickets for this particular date have been sold out.
  • Push a temporary fix to disable the purchase button for Sunday’s show.

Debrief

With the correct logs, it took only a matter of minutes (or even seconds) to determine what went wrong. That said, the sample function provided was no more than 50 lines of code; real debugging targets can span hundreds of lines (if not more), be deeply nested or rely on third-party APIs, all of which extend time to resolution.

Evidently, not all errors in an application can be attributed to a broken backend. In our case, the issue is inventory not being properly reflected on the front end. As backend developers, it is in our best interest to communicate this to the front end team for resolution.

Issue #2

Your manager is upset that it came down to customer reports and complaints for you to be notified of any failures happening in the application. You begin to wonder if there is a way to be automatically notified when a customer experiences any error.

Upset manager.

Manual testing is not a very scalable approach to identifying bugs in a production environment. It can work as a sanity check after the website has been deployed; however, we cannot be sure that all bugs will be caught. In an ideal world, developers would get notified the moment an error occurs. Luckily for developers, this notion of having an alarm does exist, and within the AWS ecosystem it exists through CloudWatch Alarms.

In some scenarios, you may want to create CloudWatch Alarms based on log data. If your logs contain specific error messages or patterns, you can create an alarm to trigger notifications or actions when those patterns are detected. For example, you could create an alarm (SMS, email, etc.) for when 3 consecutive errors in a particular Lambda function occur within a span of three minutes.
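Conceptually, the evaluation behind such an alarm can be sketched as a check over recent error timestamps. This is an illustration of the idea only, not how CloudWatch implements it:

```javascript
// Returns true when `threshold` or more errors occur within
// `windowMs` milliseconds of each other (e.g. 3 errors in 3 minutes).
function shouldAlarm(errorTimestamps, threshold = 3, windowMs = 3 * 60 * 1000) {
  const sorted = [...errorTimestamps].sort((a, b) => a - b);
  for (let i = 0; i + threshold - 1 < sorted.length; i++) {
    // Compare the i-th error with the (i + threshold - 1)-th error.
    if (sorted[i + threshold - 1] - sorted[i] <= windowMs) return true;
  }
  return false;
}

const now = Date.now();
console.log(shouldAlarm([now, now - 30_000, now - 60_000])); // three errors in one minute -> true
console.log(shouldAlarm([now, now - 10 * 60_000]));          // too few, too spread out -> false
```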

Let’s keep this in mind as we proceed to set up an alarm for the purchaseTicket Lambda function.

  1. Let’s begin by opening the AWS console and navigating to the log group that we would like to monitor for errors — purchaseTicket in our case. We want to ensure alarms are received when a potential customer receives an error when a purchase attempt is made.

2. Locate the button that reads “Create metric filter”. Click it and fill in the following inputs.

Creating the metric filter.

Metric filters are rules that you define to extract specific information from log events and transform them into metric data. This allows you to create custom metrics based on the content of your log events. In our case, we want to extract errors. Click next and save changes.
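For plain-text logs, a filter pattern can be as simple as the term ERROR; for JSON-structured logs like the one our catch block emits, the pattern can key off a field instead. A filter pattern for our error might look something like this (the field name assumes our earlier log shape):

```
{ $.message = "Error processing purchase of ticket*" }
```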

3. Once the metric has been created, we need to now associate it with an alarm. Select the metric and click the “Create alarm” button on the right hand side of the screen.

List of metric filters.

4. Fill in the following inputs:

Form to configure an alarm.
Form to configure an alarm.

If you do not have an SNS topic, you can select the “Create new topic” option or open SNS (Simple Notification Service) in the console and add a topic from there.

Form to create a new SNS topic.

5. Click next and enter an alarm name and description, then hit the “Create alarm” button on the final step.

Note: On the Alarms page, if you have created a Topic, you will need to confirm your subscription to the topic in order to proceed.

List of alarms.
Email notification to subscribe to a topic.

6. For testing purposes, let’s remove the contents of the Lambda function and throw an error. Click Deploy and Test the function.

Lambda function code.

7. Observe notification after the configured time.

Notification received on a real device.

Voila! Just like that you are now able to identify errors the moment they occur.

Debrief

It is imperative that we create alerts for our applications so that we can be proactive and resolve issues as they come. As illustrated in the previous blog, any downtime in an application can have serious effects, including but not limited to: customer loss, failed financial transactions and faulty vehicle software leading to accidents. Establishing alerts can help minimize potential damage.

In practice, within software teams, this could mean having a setup where one member is on rotation for a week to address and resolve any production errors that come up. On days when traffic is high, it could be a joint effort between multiple members. Doing it this way ensures that issues can be quickly resolved the moment an alarm goes off.

Issue #3

A particular customer with the username brandon123 has reported an issue on the checkout page. He claims that purchasing a ticket results in an error message and does not provide any other details.

It is common for a single user to report an error without providing all the details. While other users may not be experiencing the same issue, cases like this are often investigated in isolation. Because we want to home in on a particular user, brandon123, we can take advantage of log filtering and try to find the logs generated for this user.

Let’s take a look at how we can use the CloudWatch Logs Insights feature to help us figure out what went wrong.

  1. We can begin by opening the CloudWatch console and navigating to the Logs Insights tab.
Navigating to the Logs Insights tab.

Logs Insights allows you to select multiple log groups and run queries against them to filter logs. Queries are written in a query language built by AWS (for the purpose of this demo, we will use the query generator to help us generate queries).

2. Select the log group(s). In this scenario, let’s assume that our application’s logs (including the purchaseTicket Lambda) have been consolidated into one log group named App-Logs.

Selecting log groups to run a query against.

3. Next, replace the current query with the following and click “Run query”.

fields @message
| sort @timestamp desc

This query fetches log entries, includes only the message content (@message), and sorts them by timestamp in descending order so the most recent logs appear first. This keeps the output uncluttered and limited to the fields we are interested in.

Log events.

Because we are debugging a production application, after all, we should expect log events with different log levels, sources and information. The output of this query could span from a few logs to hundreds or even thousands, depending on the size of your user base.

4. Using the AI query generator, let’s find only the error logs where the username property has a value equal to brandon123.

Note: Although the query generator can fail from time to time on poorly worded prompts, deferring query generation to ChatGPT is a great alternative.

Using the AI query generator.
Query results.
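The generated query might come out looking something like this (assuming our logs are JSON, whose fields such as username Logs Insights auto-discovers):

```
fields @timestamp, @message
| filter username = "brandon123" and @message like /Error/
| sort @timestamp desc
```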

Resolution

As we can see, the customer made 3 attempts to purchase a ticket, each resulting in an error message that reads “Promo code: undefined does not exist”. If we assume the promo code is null by default, undefined could suggest one of two things:

1. That the promo code they have entered is incorrect.
2. Upon searching for the promo code on the backend (inside of the purchaseTicket function) there was an error.

Whatever the cause may be, we have identified that an undefined promo code has slipped into our system, and we can start to narrow down possible entry points. Solutions can vary, but it could be as simple as changing the error message on the front end to notify users of a broken promo code, or adding a validation step (before purchase) to check whether a promo code exists. If it exists, or the promo field is left blank, customers can proceed with the purchase. If the promo does not exist, we can consider disabling the purchase button.
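The validation step could be as small as a lookup before the purchase flow ever starts. A sketch, where the promo list and the isValidPromo helper are invented for illustration:

```javascript
// Hypothetical set of currently active promo codes.
const ACTIVE_PROMOS = new Set(["BEARCLAW10", "EARLYBIRD"]);

// A blank promo field is fine; an unknown code is not.
function isValidPromo(promo) {
  if (promo === null || promo === undefined || promo === "") return true;
  return ACTIVE_PROMOS.has(promo);
}

console.log(isValidPromo(""));            // blank field -> true (proceed)
console.log(isValidPromo("BEARCLAW10"));  // known code -> true
console.log(isValidPromo("undefined"));   // a stringified undefined -> false
```

Note how a stringified "undefined" (the kind of value that produced our error message) would be rejected here before any purchase is attempted.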

Debrief

The significance of log levels and structured logging becomes evident within a tool like Logs Insights. It becomes apparent how easily one can search and filter logs, and additionally leverage the query generator to pinpoint errors. It can handle large volumes of log data, making it suitable for applications of any scale. Not only will it expedite your ability to debug, but it will also allow you to see trends through the provided charts and diagrams. This makes Logs Insights a great tool for any team to use.

Outro

This is the end and you’ve made it! I hope you enjoyed learning about logging as much as I enjoyed writing this blog. To give you a recap on what we’ve learned together in this entire series, we:

  • Learned the pros and cons of logging
  • Learned the semantics of logging and using the correct log levels
  • Learned what should be logged and how it should be logged
  • Learned CloudWatch, CloudWatch Logs, Alarms and Logs Insights
  • Learned how to better debug production errors

In the span of two blogs, hopefully I have convinced you that taking the time to write good logs can save you a lot of time, future heartache and stress. Because let’s face it: as your responsibilities as a software developer grow, one day someone will need answers regarding the health of your application, and one day your application will need help firefighting in production. Having good logs will not only boost your confidence in the software you are building but also ease your mind when the inevitable storm decides to come.

Today, you can equip yourself and your team with the right logs and tools to become the hero you want to be — and the hero your team needs.

Happy logging!

Four Nine is a full-service Digital Agency located in Vancouver, BC. Learn more about us by visiting our website or contact us at info@fournine.digital.
