Still learning about logging

James Langley
Kingfisher-Technology
14 min read · Nov 29, 2023

In my first blog on this site, I mentioned a large monolith to microservices programme we are working on; before beginning development we spent some time thinking about what we wanted to do differently. One area that we knew could be improved was logging.

Where were we coming from?

The legacy system was an on-premise, multi-instance, monolith. Every instance could handle every kind of request, but each received just a portion of the total load. Starting and stopping these instances was (mostly) under our control — there was no auto-scaling and no chaos testing.

All the logs relating to a single request were generated by a single process. And because the processes were stable we were able to use session-pinning to improve performance, which meant that all the logs from a single customer session usually came from the same process.

Although in some cases the monolith used correlation IDs in interactions with other systems, those correlation IDs were not explicitly captured in log messages; some messages contained them by chance, but most messages did not include them.

The messages were unstructured plain text; and they were written by lots of different people over a long period of time. So there was a lot of inconsistency around what information log messages contained, how they were formatted, and when they were used.

Why were we unhappy with what we already had?

Person peering through a peephole
Photo by Dmitry Ratushny on Unsplash

Limited log retention

Historically development had been funded via (short-lived) projects; the future costs of on-premise log storage and the associated access control management were charged to the project teams, but the long-term value mostly benefitted the separate (long-lived) support and operations team. This gave project teams an incentive to keep log retention to a minimum.

Furthermore, if we over-provisioned storage for logs, we couldn’t “unpurchase” unnecessary hardware. In fact, decommissioning would probably incur significant additional costs, potentially well in excess of the hardware resale value, particularly if it had held sensitive data which needed to be securely wiped.

If we under-provisioned, lead times for acquiring additional hardware were significant, and costs would impact project budget forecasts. But the pain (and cost) of acquiring additional hardware could be avoided by projects if they implemented log storage locally on the servers which created the logs. So typically this was the approach taken.

To prevent large log files from using up all the disk space and causing crashes, the log files were capped at a maximum size and automatically rotated (without being exported).
If too much time had passed between an event occurring and an investigation starting, there was a danger that the interesting logs would already have been overwritten by later messages.
This scenario was made more likely by the strict enforcement of segregated duties — developers did not have direct access to the production servers and logs. If a developer wanted to see the logs they had to ask the support / operations team to send them a copy, and this delay increased the risk of useful logs being lost.

Poor log quality

Because the messages were unstructured, and named things inconsistently, they were hard to search / filter; typically people had to resort to tools such as grep.

There were no IDs for any of the messages, so generating usable dashboards or metrics from them was almost impossible; instead we relied on our Application Performance Monitoring (APM) tool to expose metrics, typically based on exceptions.

Because the monolith was written in Java, error handling typically used exceptions rather than domain events / states. Often people would throw an exception rather than write a log message.

Exceptions typically produced stack traces, generating log messages dozens if not hundreds of lines long; this again made log rotation and message loss more likely, especially when the system was encountering unexpected conditions or errors.

Logging principles

As well as bearing these pain-points in mind, to help us work out what we wanted to do instead, we asked ourselves some fundamental questions about logging:

Why do we log?

Logs (along with metrics and traces) are a means of providing observability for our systems; they give insights into state and behaviour.

Although logs might capture system interactions and changes, they are not a formal audit mechanism.

Who are logs for?

Logs are for the benefit of the people operating and supporting the system; they are for the benefit of IT teams.

Logs are not intended for consumption by users. Notifications, response messages, and status codes are how a system communicates with its users.
Logs are not intended for consumption by business teams. Analytics provide the observability that business teams need.

Where should we store the logs?

We should be able to view all the logs relating to a single transaction / request (and ideally all logs relating to a single batch / session) in a single place regardless of which component generated the message.
Components might still have “local” logs containing just the messages they have generated, but these must be in addition to the combined view.

When should we log a message?

Support teams are generally interested in whether systems are operating effectively and as expected. They want to know when components start and stop; they want to know the progress of long-running tasks / batches; and they want to know when components cannot read their inputs or publish their outputs.

Support teams are generally not interested in log messages which describe the happy path of individual short-lived transactions / processes, especially when these occur in high volumes. Although support teams are interested in determining process failure rates, these can be more effectively tracked by dedicated metrics than by drawing inferences from logs.

Although logs are usually records of something that happened, sometimes we need to record what the code is about to do, because occasionally the failure mode for the next action is a system hang / crash which prevents the failure from being logged after it has occurred. This usage is rare in modern systems where typically a debugger or system tracing can be used instead of “just in case” log messages.

How should we format the logs?

Although logs are published as a sequence and might tell a story, we typically don’t want to read them from start to finish. Instead we want to be able to focus on specific areas of interest and ignore irrelevant messages.
Therefore messages must be structured (for example, in JSON) to allow easy parsing, filtering, and sorting.

To support this, the structured data must include:

  • a correlation ID, so we can understand which messages from different components relate to each other
  • a timestamp, so we can understand when things happened

If a timer’s tick resolution is too coarse (multiple events can be generated in a single tick), there might also be a need for an additional in-tick sequence / counter to record the relative order of events. But for us, timer ticks are fast enough.

Whether the source clocks need to be globally synchronised across all components, or whether locally consistent inside a single component is good enough, will depend on the system.
For us, in the case where a downstream component appears to have processed a transaction before an upstream component, we can safely infer that this is due to differing clocks rather than to a breakdown in causality (assuming that we trust our correlation ID implementation is working correctly, and the messages do not in fact refer to separate transactions / requests).
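
To make this concrete, here is a minimal Kotlin sketch (not our production code; the field names are purely illustrative) of the kind of structured message we have in mind, carrying a correlation ID and a timestamp alongside the message text:

    import java.time.Instant

    // Illustrative only: builds one structured (JSON) log line with the fields
    // described above. A real logging framework would do this for us.
    fun structuredLogLine(correlationId: String, level: String, message: String): String {
        val fields = mapOf(
            "timestamp" to Instant.now().toString(), // when it happened
            "level" to level,                        // severity
            "correlation_id" to correlationId,       // ties messages from different components together
            "message" to message,
        )
        return fields.entries.joinToString(prefix = "{", postfix = "}") { (key, value) ->
            "\"$key\":\"$value\""
        }
    }

    fun main() {
        // e.g. {"timestamp":"2023-11-29T...","level":"INFO","correlation_id":"abc123","message":"Nightly batch started"}
        println(structuredLogLine("abc123", "INFO", "Nightly batch started"))
    }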

What information should we log?

When using logs as a substitute for a debugger or a tracer, terse messages (ideally unique) might be sufficient; but to meet our goal of providing observability we usually need to go further and identify:

  1. Where the log was generated
  2. What event happened
  3. Why that event is relevant to support teams

We should avoid or minimise the recording of sensitive data, such as Personally Identifiable Information (PII) or secrets / credentials (for example: passwords / API keys / access tokens). This reduces the effort needed to adequately protect data, and helps us meet compliance requirements such as GDPR legislation or PCI DSS.

Support teams do not read logs for amusement; they read them to understand whether they need to take any action and, if so, what action. They often need to triage quickly, and not always during normal working hours, so messages should be clear and explicit.
Support teams want relevant data about the single specific occurrence in the log message. However, if the message relates to a common scenario that requires a long explanation / description, it may make sense to document that scenario elsewhere and link to that document from the log, rather than include the explanation in every occurrence and turn each message into an essay.

If a link is included, it should be a navigable link (for example a hyperlink) rather than merely a reference / title which imposes an additional burden (finding the document) onto the reader. As Steve Krug has it: “Don’t make me think”, especially in the middle of the night!

An opportunity for a new approach

So having identified the pain points of our existing approach, and determined our guiding principles for the new system, what did we decide to do?

Log storage

Kingfisher’s existing cloud infrastructure already included a log aggregator. Sending all of our new logs there was an easy decision — not only was it available and proven, but using the cloud meant we did not need to worry about forecasting the required capacity: we would always have the right amount. It also meant we could give broad access to logs without giving people access to production servers.

Furthermore, using an aggregator meant that for internal web apps we were able to capture client-side logs alongside back-end logs. We hoped that this would allow IT teams to quickly identify UI issues without having to wait for colleagues to report them. (Reporting a UI problem or UX frustration is not the top priority if there is a queue of customers waiting to be served; and by the time a colleague is free to report a problem they might have forgotten some of the details, making it harder for support teams to understand the issue.)

Log formatting

The log aggregator’s preferred structured format was JSON, so this was another easy decision.

For correlation IDs we decided to use W3C trace-ids. This was because they are both an open standard and one recognised by our APM tool. Support teams often use both the APM tool and logs to investigate issues, because each provides a different lens, so using the same identifier in both places makes it easy for teams to connect the different sets of data.

We also decided to recommend the use of human-readable message IDs.
The purpose of using IDs is to avoid breaking any monitors or dashboards as the actual content of log messages evolves over time.
The idea behind making the IDs human-readable was both so that the type of message could be understood at a glance, and so that any typos were unlikely to accidentally match another ID.
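
As a sketch of how the two IDs fit together in code (assuming SLF4J 2.x’s fluent API and a JSON encoder that renders MDC entries and key-value pairs as fields; the names "trace_id" and "message_id" are illustrative conventions rather than anything mandated by the tools):

    import org.slf4j.LoggerFactory
    import org.slf4j.MDC

    private val logger = LoggerFactory.getLogger("product.import")

    // Sketch only: one message carrying both a W3C trace-id (shared with the APM tool)
    // and a human-readable message ID that dashboards and monitors can rely on.
    fun logImportCompleted(traceId: String) {
        MDC.put("trace_id", traceId) // propagated correlation ID
        try {
            logger.atInfo()
                .addKeyValue("message_id", "nightly_import_completed") // human-readable, stable ID
                .log("Nightly product import completed")
        } finally {
            MDC.remove("trace_id")
        }
    }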

Log quality

We wanted the messages logged by the various components in the new system to be of much higher quality, and ideally actionable by the support teams. So we decided to treat messages almost like support tickets.

To help development teams understand what we thought good looked like, we suggested the following guidelines (with a short sketch after the list):

  • Messages should clearly indicate (via the log level) whether they represent just a state change or a (potential) problem
  • Messages should clearly indicate (via the ID) what kind of event they are. Example: invalid_page_size
  • Messages should contain information about the specific instance of the event. Example: “Page size requested: 1001”. (The information included should be in a dedicated field, not just in the main message text.)
  • Messages relating to a deviation from desired behaviour / state should indicate what the desired behaviour / state is. Example: “Page size requested: 1001 is out of expected range: 1–100” (Again the information included should be in a dedicated field, not just in the main message text.)
  • Messages relating to an unhappy path should include the following: possible impacts; possible causes & ways to distinguish between them; and possible remedies & when to use them. Example: “Product details repository request failed while retrieving product details for SKU with id=ab123. This limits the abilities of the Product pages. Please check product details repository availability. Other possible reasons: HTTP client issues, network issues.”
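
In code, a message following these guidelines might look like this sketch (again assuming SLF4J 2.x’s fluent API with a JSON encoder; the function and field names are illustrative):

    import org.slf4j.LoggerFactory

    private val log = LoggerFactory.getLogger("catalogue.search")

    // Sketch only: the page-size example from the guidelines above.
    fun rejectPageSize(requestedPageSize: Int) {
        log.atWarn()                                                // a (potential) problem, via the log level
            .addKeyValue("message_id", "invalid_page_size")         // what kind of event, via the ID
            .addKeyValue("page_size_requested", requestedPageSize)  // the specific instance, as a dedicated field
            .addKeyValue("page_size_expected_range", "1-100")       // the desired state, as a dedicated field
            .log("Page size requested: {} is out of expected range: 1-100", requestedPageSize)
    }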

Signal-to-noise ratio

The use of log levels can significantly improve the signal-to-noise ratio, and enables support teams to easily filter out irrelevant information. However, it took us some time to reach a common view on what each log level should represent / be used for, particularly the distinction between ERRORs and WARNINGs.

We agreed early on that ERROR represented a critical issue, but WARNING did not; however, that then raised the question of how to determine criticality, particularly where the component encountering the issue might not be fully aware of the context in which it is being used.
For example: If we try to fetch serialised data from a cache, but fail to deserialise that data into a Kotlin object (perhaps the data model changed, and the cache holds a previous version), is this a critical error? From the point of view of the cache integration this is a critical error: it has failed to obtain the information from the cache and no recovery is possible. However, from the point of view of the attempted action this might just cause a delay rather than a complete failure: the information could instead be retrieved from a system of record (just as would happen for a cache miss).

The final guidelines we settled on were as follows (with a sketch applying them to the cache example after the list):

  • DEBUG — for debugging, but not for operating or supporting the system
  • INFO — for providing context, general information and expected state changes
  • WARNING — for undesirable situations that either resolve automatically or have no possible resolution (so no action is required from the support team), but that are worth being aware of as a precursor to or early indicator of a more serious issue
  • ERROR — for undesirable situations with no automated resolution; these always require action from the support team (even if that is just to raise a formal bug ticket for a code change)
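
Applying these levels to the cache example above might look like the following sketch (all names are illustrative; the point is simply the choice of WARNING over ERROR because an automated fallback exists):

    import org.slf4j.LoggerFactory

    data class Product(val sku: String, val name: String)

    private val log = LoggerFactory.getLogger("product.cache")

    // Sketch only: deserialisation failure is undesirable but resolved automatically
    // by falling back to the system of record, so it is a WARNING rather than an ERROR.
    fun productDetails(
        sku: String,
        readCache: (String) -> ByteArray?,            // serialised data, or null on a cache miss
        deserialise: (ByteArray) -> Product,          // may throw if the data model has changed
        fetchFromSystemOfRecord: (String) -> Product,
    ): Product {
        val cached = readCache(sku) ?: return fetchFromSystemOfRecord(sku) // plain cache miss
        return try {
            deserialise(cached)
        } catch (e: Exception) {
            log.atWarn()
                .setCause(e)
                .addKeyValue("message_id", "cache_deserialisation_failed")
                .addKeyValue("sku", sku)
                .log("Cached product details could not be deserialised; falling back to system of record")
            fetchFromSystemOfRecord(sku)
        }
    }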

Controlling costs

Logging consumes system resources: CPU, IO, and storage. So any time we log we are making a bet that the future value of the log message outweighs the immediate cost of creating the message and the ongoing cost of retaining it.

Historically one way of hedging these bets was to switch off lower log levels (such as DEBUG) by default. If a situation arose where the missing logs were wanted, they would be temporarily enabled. This has a number of drawbacks:

  1. If the issue is transient / rare it might not recur while the additional logging is active.
  2. If the configuration is hardcoded then changing it would require deploying a new code version, which might unintentionally contain additional code or configuration changes.
    Similarly, if the configuration is held separately from the code but only read on system start-up, a restart might also unintentionally change more than just the log levels (for example, caches might be cleared or refreshed), which could also cause system behaviour to change.
  3. Heisenbugs — if the issue being investigated is affected by timing issues, then switching on the logging might alter timings enough to cause the system to behave differently. In frustrating cases this can result in a system behaving as expected when heavy logging is active, but misbehaving / behaving unexpectedly when only light logging is active.

Although we expected that in the cloud both the immediate and the ongoing costs would be lower (or at least more easily adjusted), and that therefore we could afford to log more, costs still needed to be considered. Helpfully our log aggregator charges us based on two separate metrics:

  1. A bulk storage cost (minimal), based on total size of data ingested
  2. An indexing cost (significant), based on the number of messages indexed for searching and filtering

This allowed us to send all the logs (including DEBUG) to the aggregator without incurring too much cost, and by excluding the DEBUG messages (the majority of the logs) from the index we avoided any further costs.
Although the unindexed messages are not directly searchable, we can retrieve them from bulk storage and then search / filter locally if we need to (for example, if the issue is not reproducible). Alternatively, if we know the issue will repeat we can temporarily enable indexing of all messages while we investigate.
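
For example, once an archive of unindexed messages has been pulled from bulk storage as newline-delimited JSON, filtering it locally is straightforward (a sketch; the file name and the trace_id field are assumptions):

    import java.io.File

    // Sketch only: pick out the unindexed messages belonging to one trace
    // from a locally downloaded archive of JSON log lines.
    fun messagesForTrace(archive: File, traceId: String): List<String> =
        archive.useLines { lines ->
            lines.filter { it.contains("\"trace_id\":\"$traceId\"") }.toList()
        }

    fun main() {
        messagesForTrace(File("debug-logs.ndjson"), "0af7651916cd43dd8448eb211c80319c")
            .forEach(::println)
    }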

What benefits have we achieved?

The new logs are:

  • Quicker to access, both for support teams and for developers (no need to get permission to access production servers)
  • Easier to search and filter (stack traces are a multi-line attribute rather than separate log messages)
  • Retained for longer
  • Linked to the data in our APM tool
  • Better quality
  • Safer to change (because of the use of message IDs)

Furthermore our logging cost scales smoothly, rather than in big steps; and it is borne by the teams that derive the most benefit from the logs, who are in the best position to compare cost versus value.

What challenges have we encountered?

Previously our costs were generally higher, but they were fixed; with the pay-per-use model in the cloud it is possible for log-spamming rogue processes to generate significant costs.

Previously access to the logs was restricted, particularly in production; although this was frustrating from a development and support point of view, it did mean that any accidental recording of sensitive information (such as PII or credentials) would have limited exposure due to the small potential audience.

Although we tried to set standards for log messages, some messages come from third-party SDKs or open source libraries, which have their own standards. Even where we have the ability to directly modify these libraries to conform to our standards, the ongoing cost of maintaining these modifications as new versions of the libraries are released would be more than we want to incur.
Fortunately some of the SDKs are somewhat configurable. And for the others, the logging framework we use allows us to at least override their log levels, so we can align those even if we cannot change the message contents.

What lessons have we learned?

We needed to put additional monitoring in place (in production and test environments) to quickly identify any:

  • processes generating huge volumes of indexed logs, to prevent them from running up significant costs
  • sensitive data in the log messages

Where it was impossible to exclude sensitive data from the logs (either because the support team needed it to diagnose issues, or because the underlying technology could not be configured to suppress it) we had to implement role-based access control to limit the visibility of those specific logs.

Next steps / The future

An open road, with the sun rising ahead
Photo by Tim Foster on Unsplash

Using log monitoring to detect sensitive data in messages is not 100% reliable, and ideally, these checks would be done earlier in the development lifecycle.
Chris Imrie has suggested that, just as we should be aware of when we are dealing with unsanitised data so that we don’t create SQL injection or cross-site-scripting vulnerabilities, we should be equally careful not to leak sensitive data in log messages. So if we are interacting with a component or third party which holds sensitive data, we should not log the raw data they provide (at least not in production, even if we need it for debugging in development and test environments).
One idea we are considering is whether it would be possible to somehow tag or annotate types as safe-to-log, and use some build-time tool to ensure that log messages only include safe-to-log types. Options we are considering for this include: ArchUnit, the Kotlin Symbol Processing API, Arrow Analysis, and Manifold Systems.
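
As a very rough sketch of the general idea (using nothing more than the Kotlin type system, rather than any of the tools above), types could be required to opt in before they can be logged; all names here are hypothetical:

    // Sketch only: a type must explicitly opt in to being logged, and decides
    // which of its fields are safe to expose.
    interface SafeToLog {
        fun logFields(): Map<String, Any>
    }

    data class ProductRef(val sku: String) : SafeToLog {
        override fun logFields() = mapOf("sku" to sku)
    }

    data class Customer(val name: String, val email: String) // deliberately not SafeToLog

    fun logEvent(messageId: String, subject: SafeToLog) {
        // A Customer cannot be passed here; the code would not compile.
        println("message_id=$messageId fields=${subject.logFields()}")
    }

    fun main() {
        logEvent("product_viewed", ProductRef("ab123"))
        // logEvent("customer_updated", Customer("Jane", "jane@example.com")) // does not compile
    }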

OpenTelemetry is starting to move into the logging space as well as covering metrics and tracing. Combined with log aggregators and APM tools expanding into each other’s territory, this may mean in future we can unify all aspects of observability based on a single open standard.

So it seems that I still have more to learn about logging.

If you are interested in joining us on our journey, please check out our careers page.
