Logs and Metrics

At last Wednesday’s Go meetup, one of the speakers mentioned:

This wasn’t posed as a question. This was a statement.

This wasn’t the first time I was hearing this. A lot of people conflate the two or mistake one for the other, perhaps because many monitoring tool vendors claim to offer both. To be fair, there is a certain amount of overlap between the two even if they differ in enough ways to be considered separate entities in their own right.

The Primacy of Observability

Observability is what is required in order to gain visibility into the behavior of modern day applications and infrastructure. Why, some go so far as to say that observability is more important than even unit tests.

For one thing, unit tests only ever test the behavior of a system against a specified set of inputs. For another, tests are conducted in very controlled (often heavily mocked) environments. While the very few who do fuzz their code benefit from having their code tested against a set of randomly generated input, fuzzing can only comprehensively test against the set of inputs to one service. End-to-end testing might allow for some degree of holistic testing of the system and fault injection/chaos engineering might help us gain a reasonable degree of confidence about our system’s ability to withstand these failures, but complex systems fail in complex ways and there’s is no testing under the sun that enables one to predict every last vector that could contribute towards a failure.

Observability tools, on the other hand, tell the truth about exactly how a system behaved in the face of:

1. the elasticity of the production environment
2. the unpredictability of inputs
3. the vagaries of both upstream and downstream dependencies

Traditionally, much of alerting was derived from blackbox monitoring methods. Blackbox monitoring generally refers to observing a system from the outside — think Nagios style checks. Blackbox monitoring is useful in being able to identify the symptoms of a problem (site is down or DNS is not resolving) but not the root cause (site down because a primary database’s data directory accidentally got rm — rf'ed or DNS not resolving because a hosted DNS provider is facing a DDoS attack).

Both logs as well as metrics, on the other hand, are examples of whitebox monitoring. Whitebox monitoring refers to a category of monitoring tools and techniques that work with data reported from the internals of a system. Together with distributed request tracing, metrics and logs form the three pillars of observability.

With some parts (or even all) of infrastructure increasingly being outsourced to SaaS tools that can only be monitored from the outside, blackbox monitoring still has its place. However, for systems internal to an organization, alerts derived from blackbox monitoring techniques are slowly but surely falling out of favor, as the data gathered from the internals of an application can result in far more meaningful and actionable alerts compared to alerts derived from external pings.

Alerting is only one example of how we can put to best use the data collected from the internals of systems. Whitebox monitoring is key to all manners of other functionality as well, such as debugging, dashboarding, predictive analysis, exploration of the long tail and anomaly detection to name a few.

Some people hold the opinion that every log message should be a metric. Others believe that there is a clear demarcation between what falls within the ambit of logging and what is better off instrumented as a metric. This post isn’t going to champion the use of one over the other, but rather aims to provide an honest look at the costs, benefits and challenges involved in collecting and analyzing both logs and metrics.

Because, as it so happens, the difference between logs and metrics isn’t just “where you send them to”.

Logs

A log is an immutable record of discrete events that happened over time. Some people take the view that events are distinct compared to logs, but I’d argue that for all intents and purposes they can be used interchangeably.

Logs in general come in three forms:

1. Plaintext — A log record might take the form of free-form text. This is also the most common format of logs.
2. Structured — Much evangelized and advocated for in recent days. Typically this is logs emitted in the JSON format.
3. Binary — think logs in the Protobuf format, MySQL binlogs used for replication and point-in-time recovery, systemd journal logs, the pflog format used by the BSD firewall pf which often serves as a frontend to tcpdump.

There are a number of sources from which these logs can originate such as:

1. Application code instrumented by developers
2. Application middleware like Rack that can be specifically configured for this purpose
3. Application servers and reverse proxies that come with their own custom log formats and can be configured to emit a ton of extremely useful information about every incoming HTTP request
4. Database replication logs such as MySQL’s binlog or MongoDB’s oplog 
5. Message brokers like Kafka or NSQ
6. Caches like memcached and Redis which have their own logging format
7. Load balancers like HAProxy
8. Firewalls, routers, switches
9. Most SaaS/PaaS/IaaS/CaaS services

#9 notwithstanding, logs are usually forwarded to a centralized log aggregator or parser like Scribe/heka/logstash to then be transformed and preprocessed before feeding it into a data store for further processing and indexing.

One of the biggest benefits of logs is that they are omniscient about every single request that an HTTP application serviced or every single packet that entered or left a network interface card. The information present in a single log line can be queried with a tool as simple as grep(in the case of plaintext logging) or jq, or alternatively, can be processed, indexed and persisted for more comprehensive analysis. In my experience, there’s never been a time when I haven’t been able to find an answer to a question about a recent request with a simple tail -f foo.log | ag "blah" .

With structured logging, one enjoys the latitude to capture just about anything that one might perceive to be of interest. For this reason, such type of log data can support high dimensionality, the sort that is great for things like:

1. exploratory analysis of outliers
2. auditing
3. analytics like measuring revenue, billing, user engagement
4. real time fraud analysis
5. DDoS detection

Logs, in particular, shine when comes to providing valuable insight along with ample context into the long tail that averages and percentiles don’t surface. There’s a famous paper from Google about the tail at scale, which goes into great detail about how tail-latencies that fall outside the P99 range can cause an adverse effect on the end user experience. The paper is well worth a read, because it also proceeds to provide many tips for circumventing the P99 tail latency.

By far, the biggest drawbacks of logs is how operationally and monetarily expensive they are to process and store. Albeit structured logging doesn’t suffer from some of the downsides inherent in pure string-based log manipulation, it’s still far more voluminous to store as well as slower to query and process, as evident from the pricing models of log (event) based monitoring vendors.

There’s ELK in the open source space, but no one I know likes or wants to operate ELK. Even if your organization is staffed with a team of Operations engineers who are experts in operating ELK, there might be other drawbacks. Case in point — one of my friends was telling me about how he would often see a sharp downward slope in the graphs in Kibana, not because traffic to the service was dropping but because ELK couldn’t keep up with the indexing of the sheer volume of data being thrown at it. Another friend of mine who was an SRE at Google told me that:

Google had huge clusters just for logs, so for jobs that needed their logs saved, there was a logsaver running alongside the main task in the alloc (what you’d call a pod in k8s), and it would ship the logs to a log storage cell. Google’s production offline storage team actually also backed up all logs including all the ad stat/click logs, which were huuuuuuuge. Not all jobs had their logs persisted remotely; some just logged locally for diagnostics, like for instance the many jobs that weren’t handling user requests but just some infrastructure tasks. These logs were lying around in the borglet for a while to inspect after a crash or a problem, then cleaned up automatically later.

With more and more organizations adopting increasingly complex architectures, problems caused by the long tail can no longer be ignored as something that only companies like Google face. However, it’s also true that unless you’re a Google or a Facebook, you probably don’t have the aforementioned infrastructure to support ingesting, processing and persisting such a colossal volume of logs. Even if you’re not interested in logs for anything more than running a primitive tail -f, not automating (or even setting up) the process of log rotation or failing to have any sort of monitoring on the size of the /var/log mount has the ability to take the entire host down. Logs also have the potential to greatly drive up bandwidth costs if left unchecked.

Furthermore, without sophisticated tooling to aggregate and de-duplicate them, logs can get painfully noisy to sift through or plain repetitive, especially in the case when an application is throwing a ton of exceptions or errors after a bad deploy. Exception trackers (I think of these as logs++) have come a long way in the last few years and provide a far superior UI than a plaintext file or blobs of JSON to inspect exceptions. Exception trackers also provide full tracebacks, local variables, inputs at every subroutine or method invocation call, frequency of occurrence of the error/exception and other metadata invaluable for debugging. Exception trackers aim to do one thing — track exceptions and application crashes — and they tend to do this really well. While they don’t eliminate the need for logs, exception trackers augment logs — if you’ll pardon the pun — exceptionally well.

“Only log actionable data” might sound great in theory, but debug logging is sometimes a necessary evil, something that is only ever useful to someone who has a deep understanding of the codebase or runtime. Like for instance, when I’m debugging something, I might already know the symptom (such as a crash or increased timeouts from a service) and maybe even the root cause (malformed input to the application or unexpected event loop stalls), and what I might really be aiming to do is connect the dots between root cause and symptom by retracing the code path traversed while servicing the request. Adding such debug logging might not be a problem if one is logging to a local file on disk that gets rotated every so often, but it very much becomes a problem if these logs are being shipped to a vendor with a $$/GB/month pricing model.

Another potential issue with logging is the performance and idiosyncrasies of the logging library used. Most performant logging libraries allocate very little, if any, and are extremely fast. However, the default logging libraries of many languages and frameworks are not the cream of the crop, which means the application as a whole becomes susceptible to suboptimal performance due to the overhead of logging. Additionally, log messages can also be lost unless one uses a protocol like RELP to guarantee reliable delivery of messages. This becomes especially important of one is using log data for billing or payment purposes.

Last, while logs can most definitely be used for alerting, I find alerting based on log data to be not terribly ideal. Alerts should always be actionable (corollary — only log/measure actionable data) and quick, which might not always be possible with data derived from log analysis.

Metrics

The official definition of metrics is:

a set of numbers that give information about a particular process or activity.

Metrics are usually measured over intervals of time — in other words, a time series. The official definition of time series :

a list of numbers relating to a particular activity, which is recorded at regular periods of time and then studied. Time series are typically used to study, for example, sales, orders, income, etc.

Since metrics are just numbers measured over intervals of time, they can be compressed, stored, processed and retrieved far more efficiently than logs representing plaintext records or JSON blobs. Metrics are optimized for storage and enable longer retention of data, which can in turn be used to build dashboards to reflect historical trends. Additionally, metrics better allow for gradual reduction of data resolution over time, so that after a certain period of time data can be aggregated into daily or weekly frequency.

One of the biggest drawback of historical time series databases has been the representation of metrics which didn’t lend itself very well toward exploratory analysis or filtering. The hierarchical metric model and the lack of tags or labels in systems like Graphite especially hurt in this regard. Modern monitoring systems like Prometheus represent every time series using a metric name as well as additional key-value pairs called labels. This allows for a high degree of dimensionality in the data model. A metric is identified using both the metric name and the labels. Metrics in Prometheus are immutable; changing the name of the metric or adding or removing a label will result in a new time series. The actual data stored in the time-series is called a sample and it consists of two components — a float64 value and a millisecond precision timestamp.

By and large, the biggest advantage of metrics based monitoring over logs is the fact that unlike log generation and storage, metrics transfer and storage has a constant overhead. Unlike logs, the cost of metrics doesn’t increase in lockstep with user traffic or any other system activity that could result in a sharp uptick in observability data.

What this means is that with metrics, an increase in traffic to an application will not incur a significant increase in disk utilization, processing complexity, speed of visualization and operational costs the way logs do. Metrics storage increases with the number of time series being captured (when more hosts/containers are spun up, or when new services get added or when existing services are instrumented more), but unlike statsd clients that send a UDP packet every time a metric is recorded to the statsd daemon (resulting in a directly proportional increase in the number of metrics being submitted to statsd compared to the traffic being reported on!), client libraries of systems like Prometheus aggregate time series samples in-process and submit them to the Prometheus server upon a successful scrape (which happens once every few seconds and can be configured).

Metrics collection with StatsD
Metrics collection with Prometheus

Metrics, once collected, are also more malleable to mathematical and statistical transformations such as sampling, aggregation, summarization and correlation, which make it better suited for monitoring and profiling purposes. Metrics are also better suited to trigger alerts, since running queries against an in-memory time series database is far more efficient than running a query against a distributed system like ELK and then aggregating the results before deciding if an alert needs to be triggered. Of course, systems like Facebook’s Scuba strictly query only in-memory data, but the operational overhead of running a Scuba-esque system, even if it were open source, isn’t something worth the trouble for most.

What metrics to collect?

Tom Wilkie proposed the RED method as a guideline for instrumenting applications. This also is what SRE’s at Google monitor for Google’s own API products as well.

The Prometheus server itself is agnostic to the type of metrics and treats all of them as untyped time series data, but the Prometheus client libraries support four metric types — counters , histograms, summaries, gauges, which together are sufficient to collect both RED and USE metrics.

Request rate and error rate (the RE of the RED/E of the USE) can be obtained using counters. The saturation (S of USE) can be obtained using gauges. The duration of requests (D of the RED) can be measured using histograms.

If you’re interested in learning more about these:

Conclusion

Logs and metrics are complementary. There exist closed source systems that are impossible to instrument, in which case, logs are one’s best bet to gain some semblance of observability into such systems. Tools like mtail allow logs to be converted into metrics by parsing syslog style messages and emitting time-series. Log/event processing often requires setting up a streaming infrastructure like Kafka streams or the now defunct heka. Stream processing can furnish a treasure trove of extremely valuable real-time insights but warrants an entire engineering team in its own right. For plain observability purposes, exception trackers and distributed tracing (both which belong to the logging family) are far better suited.

Many small to mid-sized companies I know have chosen to pay a vendor a small fortune than maintain their own logging infrastructure. While Nagios style monitoring systems might’ve been a nightmare to operate, new time-series databases like the one in Prometheus were built from the ground up with ease of operation in mind (single Go binary, no clustering), which make it possible for even smaller companies to operate it without much hassle. While these systems don’t obviate the need for log processing and storage, they do make it easy to get up and running with a fairly sophisticated monitoring stack.