The Importance of Structured Logging In AWS (and Anywhere Else)

Using cloudwatch queries and cloudwatch logs insights to “dive deep” and increase system reliability

Connor Butch
May 31, 2022

As always, you can jump to the code for this article by going here.

Yes, I know I’ve written about observability a lot.

I’ve written about using x-ray to its fullest extent by leveraging traces, annotations, metadata, and exceptions.

I’ve written about using emf to generate custom metrics, while giving you the capability to run queries on high cardinality properties (such as company Id) in a cost effective manner when needed.

I’ve written about using cloudwatch logs insights to generate a custom metric to quantify the number of lambda timeouts (which can be hard to debug using out of the box metrics).

I’ve written about using AspectJ to automatically time method execution and create metrics from the results.

I’ve also shown how to create x-ray traces with a single annotation.

At this point, this flood of observability-related articles probably has some of you looking like this:

a cartoon image of an exasperated man pulling on his hair is shown
Yes, I know some of you really are looking at this article like this guy

And yet, this is another article focused on observability. Why? Because it is the most important thing (besides security) that an enterprise system needs to be successful. If you look at Amazon and Google, they are hyper-focused on observability: you can find tons of talks about the topic, Amazon has two Builders’ Library articles about dashboards (which display metrics), and Amazon includes observability as a key point of the operational excellence pillar, which states a goal of “…gain insight into their operations, and to continuously improve supporting processes and procedures to deliver business value”. Some ways observability can help deliver business value include:

  • the ability to make business decisions that maximize profit (you want a bigger bonus, right?). For example, you can look at the cost per order shipped, and determine that it is more profitable to ship a few big items, as opposed to many smaller ones on a given truck.
  • the ability to gain insight into user experience and minimize customer impact (if your api is taking 9 seconds to give a response because it makes synchronous calls, you now know this and can refactor it)
  • the ability to diagnose and remediate problems — when a problem occurs, you now have the information to determine what is going wrong and how to fix it
  • the ability to run a/b tests — you can test two versions of your system to determine which is more desirable (i.e. if one version of your mobile app leads to increased sales conversions, then you should likely opt for that design)
  • the ability to make informed decisions about technology choices — for example, you can see that at scale, a nosql database like dynamo performs much faster and more consistently than a relational database

Sadly, despite its foundational status, I find that observability is often neglected by academia, coding bootcamps, and internal training, and is rarely discussed even among many “senior” software engineers and mid/upper management. An emphasis on observability is something I wish someone had impressed upon me earlier in my career, as it would have saved me a lot of stress and helped me design higher quality systems. It is my hope that these articles will help you avoid the stumbling blocks that I encountered earlier in my career.

More Specifics

Okay, rant about observability over 😉. Let’s look at structured logging. Structured logging means we output our logs in json (or yaml) format, with specific attributes, such as correlation id and log level. I’ve included an example of a structured log below, but don’t worry too much about the specific information being logged, as there is more on this later in the “How should I format logs to get these benefits” section; the important thing is that it is written in json/yaml.

{
  "correlationId": "9ac54d82-75e0-4f0d-ae3c-e84ca400b3bd",
  "requestId": "58d9c96e-ae9f-43db-a353-c48e7a70bfa8",
  "commitHash": "9d9154e",
  "level": "INFO",
  "message": "Returning response {statusCode: 200, body: []}"
}
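To give a flavor of how a record like this can be produced (the full implementation is covered in part two), here is a minimal log4j2 sketch. It assumes a JSON layout (for example, JsonTemplateLayout) is configured so that the message and the ThreadContext entries end up as json fields; the ids and values below are placeholders.

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.ThreadContext;

public class StructuredLogExample {

    private static final Logger LOGGER = LogManager.getLogger(StructuredLogExample.class);

    public static void main(String[] args) {
        // In a real service these values come from the incoming request and the build;
        // they are hard-coded here purely for illustration.
        ThreadContext.put("correlationId", "9ac54d82-75e0-4f0d-ae3c-e84ca400b3bd");
        ThreadContext.put("requestId", "58d9c96e-ae9f-43db-a353-c48e7a70bfa8");
        ThreadContext.put("commitHash", "9d9154e");
        try {
            // With a JSON layout configured, this produces a record like the one shown above.
            LOGGER.info("Returning response {statusCode: 200, body: []}");
        } finally {
            // Clear the context so the values do not leak into unrelated log records.
            ThreadContext.clearAll();
        }
    }
}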

Now that we’ve defined structured logging and looked at a simple example, let’s see why you MUST implement structured logging for all of your logs (not just the metrics you wish to publish with emf).

Then, in a subsequent article (soon to be published) we will see how I implemented structured logging using aws lambda, log4j2, lambda layers, and middleware (and the concepts apply to other languages as well).

Let’s See It In Action

First, I’m going to generate some load (750 requests) against my application using gatling. This load will show us how useful structured logging is, and how it holds up, in a high volume system.

Here you can see me executing the gatling simulation to generate load
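The simulation code itself is not shown in this article, but a minimal sketch along these lines could generate that load using Gatling’s Java DSL; the base URL, path, and ramp duration are placeholder assumptions, not the values used in the original test.

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import java.time.Duration;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class SearchLoadSimulation extends Simulation {

    // Placeholder endpoint; point this at your own API Gateway stage.
    private final HttpProtocolBuilder httpProtocol =
            http.baseUrl("https://example.execute-api.us-east-1.amazonaws.com/prod");

    // Each virtual user issues a single search request.
    private final ScenarioBuilder scn = scenario("structured logging load test")
            .exec(http("search").get("/search?query=widgets"));

    {
        // Inject 750 users over one minute to approximate the load described above.
        setUp(scn.injectOpen(rampUsers(750).during(Duration.ofMinutes(1))))
                .protocols(httpProtocol);
    }
}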

Usefulness of structured logging

Now that we’ve simulated traffic to our application, let’s look at three situations which demonstrate the value of structured logging.

Situation one: an important customer (say a VP of operations) inquires, “Why did MY request fail?”

Consider this: your system is working really well. It is handling thousands of requests per minute and hundreds of thousands of requests per hour. All of a sudden, the VP of operations calls you and says, “I am running a really important client demo, and my request failed. Can you please see why and fix it ASAP? My correlation id is 30878b07-654b-4322-b7a3-0472e5b452c6.” You open your logs and realize you have had hundreds of lambda instances active in the last hour. Searching through all of these log streams (even if you knew the exact time) is not really a viable option.

You look at your dashboard (you do have dashboards, right?) and see your error rate is less than 1%, so his request was likely an anomaly. You search cloudwatch logs for his correlation id and see the failure was a network timeout when interacting with dynamodb. You tell him it was just a transient failure and ask him to please try again (if he used your autogenerated open api client that includes automatic retries, he wouldn’t be calling you 😉). He clicks submit again and is happy that his request worked.

Without structured (json) logs, digging through these log streams would have been all but impossible. Even though some tools do allow plain text searching, it is not going to be as efficient at scale (querying indexed json is much quicker than regex matching). Here is a video of me searching cloudwatch log streams (notice the query syntax is different from the one I use later for cloudwatch logs insights).

notice how easy it is to search a ton of logs for those relevant to our correlation id

the query I ran was { $.contextMap.correlationId = "18d3107e-db33-4688-a60c-5d1b585ba649" }. There is more information on queries later in the article.

Situation two: A cloudwatch alarm alerts you of a higher than normal error rate, and you want to find similarities among failed requests so you can do a root cause analysis

You are on call when a cloudwatch alarm notifies you that your system is experiencing higher than normal error rates. Since you are using aws sam with canary deploys, async lambdas with dlq, and automatic rollbacks, the system continues operating as normal. However, you need to figure out the root cause of the problem (out of band) and apply a fix.

Using cloudwatch logs insights, you can easily filter your logs to only errored requests and gather context around why they failed. In this case, we are interested in the search query that caused the error, so you run the query below.

I search for errors, and notice the queries seem to be longer
fields @timestamp, @message
| filter level = "ERROR"
| limit 20

After looking at the searches that result in errors, you notice that many of them seem to be quite long, so you suspect lengthier queries might be causing the errors. To gather further evidence, you use cloudwatch logs insights to find the shortest searches that ended in error. You now see that every query ending in error in the last thirty minutes is at least 66 characters long. Armed with this evidence, you can improve the reliability of the system by limiting the length of queries or by modifying the application to support longer searches.

the shortest failed query had a length of 66 characters
fields strlen(contextMap.searchQuery) as queryLength
| filter level = "ERROR"
| sort queryLength asc
| limit 10

Situation three: You are debugging an error in your application, but the stack trace doesn’t line up with the lines of code in your repo… so you wonder, what commit is this using anyways?

Consider that an alarm has gone off for your application. You have rollbacks in place, so production is working and there is minimal customer impact, but you are trying to debug the error in your codebase. The stack trace shows a null pointer on line 102. However, your lambda code is only 94 lines long. After beating your head against this for a few minutes, you realize that you could be looking at the wrong commit hash (some of the code might still be “baking” in a lower environment, or production has a canary deploy, so traffic is split between two revisions). You quickly run a query to confirm which git commit hash the errors are occurring in, and now you are reasonably assured the error has already been resolved.

(no video here, I’ll leave it to you to figure this one out 😉)

Important Tip: there are different query formats depending on whether you are querying a log stream or using cloudwatch logs insights

One thing that I always forget is that the syntax for searching logs differs depending on which part of cloudwatch you are in.

Cloudwatch log stream query examples

{ $.contextMap.correlationId = "18d3107e-db33-4688-a60c-5d1b585ba649" }
{ $.level = "ERROR" }

Cloudwatch logs insights query examples

fields @timestamp, @message
| sort @timestamp desc
| filter correlationId = "88f38a99-0d61-44dc-b6e1-51d11a0deb73"
| limit 20

fields @timestamp, @message
| sort @timestamp desc
| filter level = "ERROR"
| limit 20

fields strlen(contextMap.searchQuery) as queryLength
| filter level = "ERROR"
| sort queryLength asc
| limit 10

How should I format logs to get these benefits?

Now that we’ve seen the benefits of structured logging, you are likely asking, “how should I format my logs? What should be included?”. First of all, logging should be done in a way that is consistent across services. From a practical perspective, that means that all logs should use properly formatted json (with double quotes surrounding all keys and string values, and including commas between fields). And this json MUST contain the following fields, regardless of the service or team:

  • correlation id — a unique id that is generated when an event reaches the system, and is propagated to all downstream services.
  • request id — a unique id that is generated for each request. Each call to a downstream service should return a unique request id.
  • commit hash — the git commit hash for this component. Especially important in a world of canary deploys and CI/CD.
  • (log) level — an enum that denotes the importance of a message. Most logging frameworks define these for you (e.g. trace, debug, info, warn, error)

I always include these fields at the top of the json in my logs, before the message field. This is to remind myself that it’s important to confirm that I am looking at the correct request, that it is associated with the same transaction, and that I am referencing the correct version of the source code.
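To make this concrete, here is a minimal sketch of how a lambda handler might populate these required fields before any business logic runs. It assumes log4j2 with a JSON layout, an incoming X-Correlation-Id header as the source of the correlation id, and a COMMIT_HASH environment variable injected at deploy time; those names are illustrative assumptions, not a prescribed convention.

import java.util.Map;
import java.util.UUID;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.ThreadContext;

public class SearchHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    private static final Logger LOGGER = LogManager.getLogger(SearchHandler.class);

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
        Map<String, String> headers = event.getHeaders();
        // Propagate the caller's correlation id if one was sent; otherwise start a new one.
        String correlationId = (headers != null && headers.get("X-Correlation-Id") != null)
                ? headers.get("X-Correlation-Id")
                : UUID.randomUUID().toString();

        ThreadContext.put("correlationId", correlationId);
        ThreadContext.put("requestId", context.getAwsRequestId());
        // Illustrative: the commit hash is assumed to be passed in as an environment variable at deploy time.
        ThreadContext.put("commitHash", System.getenv("COMMIT_HASH"));
        try {
            LOGGER.info("Handling search request");
            // ... business logic would go here ...
            return new APIGatewayProxyResponseEvent().withStatusCode(200).withBody("[]");
        } finally {
            // Clear the context so values do not bleed into the next invocation on a warm container.
            ThreadContext.clearAll();
        }
    }
}

Part two shows how to pull this boilerplate out of individual handlers using a lambda layer and middleware.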

In addition to these fields, it is often useful (but not required) for teams to include additional json keys in their logs. They can do this by adding key-value pairs to the thread context (you can see examples of this in part two of this article). I would usually suggest treating this like annotations in x-ray: any field that you would like to search on, and that can safely be recorded in logs, should be added. For example, it would probably make sense to include fields for the following attributes (a brief sketch follows the list):

  • order id for a shipping company
  • email address for a marketing company
  • transaction id for a banking company
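As a sketch of how one of these team-specific fields might be added (the orderId key and the method shown are purely illustrative), log4j2’s CloseableThreadContext scopes the key to a block and removes it automatically:

import org.apache.logging.log4j.CloseableThreadContext;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class OrderShippingService {

    private static final Logger LOGGER = LogManager.getLogger(OrderShippingService.class);

    public void shipOrder(String orderId) {
        // "orderId" is a domain-specific key, analogous to an x-ray annotation.
        try (CloseableThreadContext.Instance ctx = CloseableThreadContext.put("orderId", orderId)) {
            // Every record logged inside this block carries the orderId field,
            // so it can be filtered on in cloudwatch logs insights.
            LOGGER.info("Shipping order");
        }
        // Outside the try-with-resources block the key is removed from the thread context.
    }
}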

Structured Logging vs X-ray

If you’ve read my article on x-ray, you might be thinking, “wait, why do I need this if I can use x-ray? Don’t they do the same thing?”. While they can accomplish many of the same things, keep in mind a few of the points I made in the x-ray article, namely that the default x-ray sampling rule only records the first request each second plus 5% of additional requests. That means that, for roughly 95% of the requests you want to inspect, the trace will not be available. If you choose to increase the sampling rate to 100%, it can get expensive. And even if you were to pay to save traces for every event, it is difficult to derive metrics from x-ray compared to cloudwatch logs insights. Think about the example above where we found the length of the shortest failing query. That was accomplished with a few lines of query code in the cloudwatch console, but it would take hundreds of lines of code to do the same thing with x-ray (and you would need an ide and build tooling to write it).

Hence, while x-ray is a useful tool, I believe you can achieve a superset of its capabilities using cloudwatch logs insights in a more cost-optimized manner. Thus, I would suggest always logging the information to cloudwatch, and only supplementing it with x-ray if you feel it really adds value.

Conclusion

In this article, we were reminded of the benefits of observability. We then focused on how structured logging lends itself to more observable systems. We looked at three examples of questions that can be answered using structured logging: looking for information on a specific request, finding commonality among errors, and confirming the git commit hash when errors occur so we can trust the stack trace line numbers. Finally, we saw that while x-ray offers somewhat similar capabilities, we can achieve these capabilities (and more) using cloudwatch logs insights, all while saving money.

I hope this article helps you design and operate more reliable, performant, and cost optimized systems. Be on the lookout for part two, in which I show how to implement these technologies using aws lambda, log4j2, lambda layers and middleware.


Further Reading

Here Werner Vogels (long-time CTO of Amazon) discusses structured logging at length.

