On Observability and Logging

Emil Stenqvist
Published in Unomaly Blog
Mar 27, 2018

It has been argued that the digital revolution, and in particular the Internet, is a breakthrough in human history comparable to the invention of written language or the discovery of vaccines. It has fundamentally changed how we live and interact. However, contemporary news is filled with stories of software failing or being exploited: everything from leaked credit data and military sabotage to the Mars rover Spirit almost being killed by a software bug. It seems that the responsibilities of software have surpassed our ability to monitor and safeguard it.

For the past four years I have worked for a company called Unomaly, which does algorithmic log analysis. This has given me and my team a lot of experience with how organisations monitor their computer systems. This post is the first in a series on observability and the essential role log data plays in it: a lowest common denominator, a signal that almost every piece of software emits, and one we can listen to.

Monitoring and Observability

The task of producing good software and making it run reliably is associated with a plethora of words and concepts: monitoring, log analysis, penetration testing, auditing, metrics, reliability engineering, and so on. Central to all of this, however, is observability: how well you can infer the internal state of your systems by listening to the external outputs and signals they produce.

Twitter has a blog series that suggests a definition of observability, in the case of software systems, as consisting of:

  1. Monitoring
  2. Alerting/visualisation
  3. Distributed systems tracing infrastructure
  4. Log aggregation/analytics

Cindy Sridharan has an excellent post that delves further into this topic. One central conclusion is that monitoring differs from the others in that it is symptom-based, with checks such as “Is the site up or down?”, and will only notify you once something is already broken rather than help you detect trouble proactively.

Being able to do full tracing is obviously the holy grail, but it’s tied to specific tech stacks and hence has a higher threshold for getting started. Also, you can’t really trace things you didn’t develop yourself. What about metrics? Metrics are numbers, such as the average request rate or the number of times a database table has been accessed. They are fundamentally different from logs in that they are aggregates. Charity Majors gave a great talk about this at the Strange Loop conference, one of her points being that metrics don’t allow you to ask new questions. Logs, by contrast, tell stories of events, and you can ask many different questions of them after they’ve been recorded.
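To make that point concrete, here’s a minimal sketch in Python, assuming a hypothetical access log with one JSON event per line. It answers a question nobody set up a metric for, retroactively, from the raw events themselves; a pre-aggregated counter could never do this:

import json
from collections import Counter

def requests_per_user_agent(log_path):
    # Derive a brand-new aggregate from already-recorded events.
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            counts[event.get("userAgent", "unknown")] += 1
    return counts

for agent, n in requests_per_user_agent("access.log").most_common(5):
    print(n, agent)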

The Universality of Logs

Logging is simply programs emitting lines of text detailing what they’re doing. It can be informative, such as reporting progress on a batch job, as well as reporting problems such as hardware failures or break-in attempts. Many different devices and systems emit logs: network equipment, a VM hypervisor, the Linux kernel, your JVM, as well as your applications. You’ve probably seen billboards that broke and spilled their guts in the form of pseudo-human-readable text: their logs!

They can read like friendly messages:

sshd Accepted publickey for jane from 10.46.250.46 port 58912

Or look more like data dumps:

ACPI: XSDT 0x00000000FC00DDC0 000054 (v01 Xen HVM 00000000 HVML 00000000)

They can be structured, e.g. in the form of JSON:

{"remoteIP": "127.0.0.1", "host": "localhost", "request": "/index.html", "query": "", "method": "GET", "status": "200", "userAgent": "ApacheBench/2.3", "referer": "-"}
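As an illustration, here’s a minimal sketch of how an application might emit such structured logs, using Python’s standard logging module. The JsonFormatter and its field names are our own invention for the example, not any standard:

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Serialise each record as one JSON object per line.
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request": getattr(record, "request", None),
            "status": getattr(record, "status", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("webapp")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Fields passed via `extra` become attributes on the record.
log.info("handled request", extra={"request": "/index.html", "status": "200"})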

Producing textual logs dates all the way back to the 1970s and the birth of modern computer systems. I found this magical piece of computer history: a man page from UNIX V1, by no less than Dennis Ritchie (dmr) and Ken Thompson (ken), the creators of C and UNIX, describing how the currently logged-in users are written to /tmp/utmp. Not exactly logging per se, but something of an embryo!

Logging carries a long legacy, and almost all programs produce logs in one form or another. And this is where their power lies: modern software is composed of many existing parts, developed independently of each other, according to different paradigms and trends. As a result, agreeing on one way of exposing structured metrics or APIs for runtime inspection is practically impossible. Logging, therefore, is a lowest common denominator shared by essentially all software: the most basic artefact one should start utilising before moving on to anything else.

Loggers, centralize!

Since we no longer live in the mainframe era (even though, in some ways, we seem to be moving back there), there’s a need to collect logs from many different systems and store them in a central location. For this purpose, Eric Allman invented syslog in the early 1980s, the first standard for logging. Most importantly, it separated the concern of producing log messages from that of transporting and storing them. It also labels each log message with a source host, a timestamp, and some additional metadata.

<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - BOM'su root' failed for lonvick on /dev/pts/8

And even though much work has been done since, with more sophisticated systems for collecting and storing log data, most remain compatible with the syslog protocol.
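Most language runtimes can speak syslog out of the box. As a minimal sketch, here’s how a Python program might hand its messages off to a syslog daemon, assuming one is listening on localhost, UDP port 514:

import logging
import logging.handlers

# The handler takes care of transport; the application only produces messages.
handler = logging.handlers.SysLogHandler(address=("localhost", 514))
log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.warning("'su root' failed for lonvick on /dev/pts/8")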

Wrapping up

With this post, I want to make the case that logs are an under-utilised asset. First of all, they provide a more unfiltered insight into a program’s execution than other methods; if the logging is verbose enough, you can almost follow the actual code path just by reading the logs. Secondly, most developers naturally write logs as part of their workflow. Every developer has at some point logged into a broken production system to look at the logs and figure out what happened.

Logs can also be mined for patterns that are not obvious at face value: events tagged as fatal can be part of normal operation, and conversely, an informational log message can be a signal that something very serious is about to happen. Looking for anomalies in the logs, something we’ll cover in a later post, can likewise help you spot very subtle clues of unwanted behaviour, such as breaches, bugs or hardware errors.
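To give a flavour of what that can look like (the real treatment comes in that later post), here’s a deliberately naive sketch that masks out the variable parts of each line and flags lines whose rough shape has rarely been seen before:

import re
from collections import Counter

def template(line):
    # Mask hex values first, then numbers, so variable parts don't hide the pattern.
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line.strip()

def rare_lines(lines, threshold=2):
    counts = Counter(template(l) for l in lines)
    return [l for l in lines if counts[template(l)] <= threshold]

with open("system.log") as f:
    for line in rare_lines(f.readlines()):
        print("anomalous?", line.rstrip())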

In the next post we will dig more deeply into how to actually write good logs, e.g. what they should and should not contain. Thanks for reading, and stay tuned!

PS: In the meantime, check us out at unomaly.com!
