Logging Wisdom: How to Log

Published in

Unomaly Blog

11 min readMay 2, 2018

While the last post looked into why logging matters in the first place, this one will go a bit deeper and dig into the hows. During my past four years at Unomaly, I’ve worked with many companies and seen a lot of examples of how people produce application logs. In this post, I want to share that experience and give you some takeaways and practical tips for how to write better logs!

Much has already been written on this so in order to save precious space on the interwebs, and keep it new and interesting, I’ll refer to other places for the more already widely-accepted best practices in the Further Reading section at the end.

TL;DR

Since we live in the era of widespread attention and time deficiency, here’s the summarized version:

Log as if your program was writing a journal of its execution: major branching points, processes starting, etc., errors and other unusual events.
Structure your logs: Consider a hybrid approach: free text logs, with spliced in JSON or key-value pairs. Don’t invent new data formats
Try and avoid multiline logs, since they often get split up on the way to the log collector.
Never include sensitive user data in your logs, such as passwords or social security numbers.
Don’t swallow exceptions. Instead make them loud and let the severity of the event dictate the verbosity with which you log it.
When logging in the critical path, use sampling rather than aggregation.
Log in English. It is the lingua franca of software.

When to log

Think of logs as a journal for your program — that should be both readable by humans — and parsable by computers. It contains messages that describe what’s going on, along with the relevant context as key-value pairs. Also, keep in mind that it should be used for data mining purposes. Therefore — include relevant identifiers such as request ID’s, PID’s, user ID’s, etc.

But, why should you have to write a journal, when the program is already defined by its code and inputs? This is a very valid point. Why don’t we always store line-by-line traces of the program’s execution? It’d simply produce too much data — much of which is irrelevant. Logging allows you to chunk the operations of your program — similar to how comments can be used to explain whole sections of code.

Write your logs so that they tell your program’s flow of execution, at a at a level that is relevant to the whoever will be tasked with operating it. (the ops person, or you, in case of devops)

Some good places to add log statements are:

Major branching points in the code: e.g. for a web server, whether it’s a GET or POST request.
When errors or unexpected values are encountered
Significant business-specific events: e.g. that someone logged in, or some quota was reached.
Request failures and retries
Beginning and end of time-consuming batch operations

How to serialize your data

Let’s consider this example:

INFO pkgs.messaging sendMessage: sent message from jane to joe at time Fri, 20 Apr 2018 09:42: Hey Joe, where you goin’ with that logger of yours?

This is unparsable and will kill the spirits of good people. Why?

Non-standard date format.
Unquoted user data — making it inseparable from other meta data in the log.
There are no obvious key-value handles to use for parsing. You need custom regexes to extract any structured data from it — meaning it will typically not happen.

The problem is that when the developer wrote the above, it might have looked something like this:

logger.Info(“sent message from ${from} to ${to} at time ${t}: ${message}”, context)

Doesn’t look too bad, does it? The issue is that by serializing everything through a format string, all the metadata — i.e. what the different fields mean — is lost. Plus, even throughout the same codebase, you’ll often find many different ways of logging the same data.

Some structure is needed! However, structured logging is a bit like the Semantic Web: it’s a great idea and everyone likes it in theory, but the difficulty in establishing the necessary conventions make people shrug and stick to their old ways. “It was enough work putting together that web page — don’t tell me I have to annotate it as well.”

So, how do you write structured logs? First, pick a format. Few like XML (said noone), many like JSON. I’ll talk about doing structured logging in JSON, but most of the reasoning applies no matter what data format you choose.

Writing JSON logs

As easy as it sounds: “just write JSON logs” — still begs question of How? — one quite naive, yet not uncommon, way of doing it is:

{“type”: “INFO”, “message”: “INFO pkgs.messaging sendMessage: sent message from jane to joe at time Fri, 20 Apr 2018 09:42: Hey Joe, where you goin’ with that logger of yours?”}

This unfortunately does not improve much — the only piece of data that’s now in a structured form is the log level. The trick to writing good JSON logs is deciding a level of granularity that satisfies both developer flexibility, and the need for parsability. The other extreme is to make your whole log a properly formatted JSON structure:

{
    “type”: “INFO”,
    “module”: “social”,
    “method”: “sendMessage”,
    “event”: {
        “type”: “sent_successfully”,
        “from”: “jane”,
        “to”: “joe”,
        “message”: “Hey Joe, where you goin' with that logger of yours?”,
        “time”: 1524817341
    }
}

This allows for tools to automatically recognize the different pieces of information in the log as single tokens that can be subject to analysis, or categorization. In the best of worlds, I think the above is the way to write your logs. However, a common objection is that your logs become unreadable since they, without tooling, look like this:

{“type”:“INFO”,“module”:“social”, “method”:“sendMessage”, “event”:{ “type”:“sent_successfully”, “from”:“jane”, “to”: “joe”, “message”: “Hey Joe, where you goin’ with that logger of yours?”, “time”:1524817341 }}

Somewhat unreadable — which I think make some developers stick to unstructured logging. Also, I think there’s something easy about writing a free-text log as you write your code. Producing a consistent and neatly organized structure easily leads you into bikeshedding land, and the need for The One Way.

Here I want to present the Hybrid Approach: producing free-form logs with spliced in metadata in the form of JSON or key=value pairs, where relevant. The example could look like this:

INFO pkg=social method=sendMessage: successfully sent message {“from”: “jane”, “to”: “joe”, “time”: 1524817341, “message”: “Hey Joe, where you goin’ with that logger of yours?”}

Sure, the log-level, and the message type are not annotated. However, the number of choices that go into the above, is much lower, than if the whole log was written in an unstructured format. This makes keeping a consistent practice throughout a codebase more realistic. Another nice facet of this is that it makes the logs fairly easy to read without piping it through a JSON formatter.

Also, smart log analysis tools, will automatically pick up the key-values and JSON fragments it can find, and add it as metadata to the event, so it can be searched and indexed — which is often enough to properly categorize all events. (This is what we concluded when we reworked the Unomaly tokenizer, something we will probably cover in a later post.)

As event logs increasingly becomes a subject for data mining, I think more established conventions will emerge. However, at this date — I’d vote for the hybrid approach — since it has a very low threshold of adoption — it usually just means more consistently sticking to a few practices one already knows of.

If you’re including complex or hierarchical data in your logs, write those parts as JSON.
Use simple key=value, with quotes if needed, for simpler data. The key is succinctness, so wrapping it as a JSON just makes it harder to read.
Invent as little as possible — whenever you feel like logging a pipe-separated list of numbers to make it look like a matrix — just stop and take a walk.

Multiline logs and reasonably sized events

This is a matter of opinion, but I like to think of log messages as brief events — in the order of 100 to 1000s of character long. Instead of describing a large process happening in your program as single large events with tons of data, try to break it up into a number of smaller events each describing the bigger process as a number of steps.

Size in the order of a kilobyte.
Each event should have enough context to make sense on its own. You shouldn’t require looking at the surrounding events to understand what it means.
However, as long as there’s enough information in it to search for more info — such as a request ID — to find other related events.

Instead, think of your log messages as events. They should describe one thing, with enough context to make it possible to see what’s going on, or know where to go look for more information. If that internal data structure is needed for debugging — the logs won’t suffice, and they will have to hook a debugger into the program instead.

Some programs will dump huge internal data structures as log output. After spending a couple of hours feeling particularly clever figuring how to write a left-adjust format string, you produce a beautiful one-page table representation of your very important internal data structure. Developer experience improves, your team cheers and life is good.

However, in another location things aren’t quite as bright. An ops person is trying to figure out why things are broken, but only has this to work with:

oc=0x1A 1ffA 38bd 1f3d 85af 12fa d901 be02 […]

Your nicely formatted multi-line data dump was broken up and mutilated along the way. Multiline logs are supported in theory — but, in practice, most log routing setups get them wrong.

Therefore, it’s oftentimes good to avoid multiline logs. Or, wrapping your logs in a JSON makes the multiline problem go away — since that will make sure line breaks are escaped. However, the rule of keeping it brief and concise still remains — don’t log blob’s of JSON as single events.

Logging in the critical path by sampling

A common reason for not logging is request heavy processes where you couldn’t possibly do a bunch of logging for every request. Here, many just skip logging entirely and instead resort to reporting aggregates — such as periodic check: users_logged_in=10 with mean_session_time=12min. This is sure useful to see the overall health, but it’s meaningless when it comes to spotting bottleneck in individual requests.

A more powerful method is to sample: add log statements to your every request, but only output a X% sample of them. Each log statement could be wrapped in a shouldILog() type of statement that makes it a no-op unless it will be sampled for the particular request. Given enough requests, this will in time be enough to catch at least one instance of almost every single “type” of request — including those hitting tail latencies. Charity Majors makes the case for this in this talk. (I know I referenced that in my last post as well. I think she’s awesome!)

Don’t swallow the red event

“Never swallow exceptions!” — you might have heard this expression before. Again, if there’s one takeaway from this section, it’s that. To illustrate what this means, look at the following:

INFO: everything is fine
FATAL: universe reached heat death, errno=EINTR
INFO: flushed to database n_objects=10 time=30ms

Here, the reader of the logs will be dying to know what happened around second line. It’s apparent that something very serious happened, but nothing more is provided. This often happens through catch-all clause, that is followed by a log, and sometimes by a process.exit(), or worse, just that one thread crashing, turning the rest of the process into an episode of The Walking Dead.

Whenever you encounter an exception you should try and report everything of and around it. Often, the best thing to do is to just not handle it and let it rise to the top: panic, rethrow or whatever applies to your language. The default behavior of the language is usually to be very verbose about what happened. It’s more often when someone tries to capture it in order to rephrase it, that information is lost and turns into shit’s broken, exiting.

I’d like to conclude this section by proposing that: exceptional events should have exceptional consequences. If your program reaches a fatal state, you should provide more context to it, than you would for a more business-as-usual type of event.

Log levels and sensitive data

The syslog format has a number of levels for different types of logs. These are nicely defined in the standard, but you will often see them misused in practice. A good rule to follow is to consider what type of information you are outputting. Roughly, one can divide it into one of two categories:

Input data that the program works with, potentially highly sensitive. For example, a user’s social security number.
Data internal to the workings of the program itself — e.g. the current size of an LRU cache.

A common example of this is that that of cleartext passwords, a good program will mask them in the output, like this:

Connecting to database db=foobar username=root password=xxxxxx

However, note that this example blurs the line between the two with the inclusion the username. One has to practice a bit of common sense here.

A rule of thumb is: never include sensitive user data in your logs.

Also, make sure you are not running your production code in at DEBUG log level. As obvious as this might seem, I’ve seen it happen many times. Likely because someone was in a hurry to debug a broken process, and hardcoded it. There are many ways of doing this, but keeping your log level as part of your runtime configuration — that is different for dev and prod — prevents this.

“English ****, do you speak it?!”

You never know who, in some distant country, will one day have to look at your logs, and curse at you for using your, surely wonderful, but to many, unintelligible mother tongue. The Internet is (or at least should be) borderless — let’s make it as inclusive as possible. Therefore, write your logs in English.

Wrapping up

Hopefully, this has shed some lights on to the do’s and don’ts of logging! This is just my share of experience, so I’d love to hear yours as well. Please give a shout-out to @unomaly — we’d love to see your gems, as well as your nightmares when it comes to logging practices.

In the next post, we will talk about anomalies — how you can apply anomaly detection to your logs. This blog is not a supposed to be a marketing channel, but rather a product-agnostic place for our tech team to express ourselves and what we’re working on. So in the spirit of full disclosure: this is Unomaly’s core business, but I think this is relevant no matter what product you use for log analysis: looking at your data through a lense that only highlights things that are different than from before.

Thanks for reading, and stay tuned!

Peace, out.
– Emil