Breaking Down the Observability Problem for Distributed Systems

Apr 3, 2018

Recently there have been many great posts that define more precisely the distinctions between different tools and problems related to system observability. This is a helpful discussion, and if you haven’t read this Monitoring and Observability post by Cindy Sridharan or this post by Earnest Mueller, you should explore those first.

That said, if you are already hanging out in the Observability problem space, you’ll realize it’s a really big problem to solve, one that deserves a clearer taxonomy and separation of concerns. Also, this is the Cloud, so things don’t just work, and it’s hard to tell a good product from good evangelism.

It’s just Logging

This statement is a helpful reality check, but it is also subject to many common misconceptions when applied to distributed systems.

Let’s start, however, with the simple beauty of the truth of the statement. It’s what building a culture of observability is really about. It starts with developers, just logging. A good platform should handle the rest. It’s easy to get caught up in metric formats, protobuf definitions, and push vs pull implementations, but those are all derivatives of the log. These “innovations” are just ideas designed to solve the many hard problems associated with distributed systems, and none of them is perfect or without tradeoffs. Having debates about these solutions is what building your culture of observability is all about.

It’s more than just Logging

Because every software team working in the cloud shares a common set of challenges around observability, there are many open source projects and closed source products to choose from. Because it’s a complicated ecosystem, for many operators these products can create as many problems as they solve, and a tremendous amount of time and money (toil) is spent planning and configuring these systems. To that end, I recommend “hiring products” that best solve the following distinct problems. Many products do all of these things, but most only do one or two particularly well.

To assess products, operators should carefully consider their own behaviors when it comes to troubleshooting and monitoring. They should consider their team, and remember that the people running your software are (currently still) human beings. The industries you work in will also determine the regulatory and compliance requirements pertinent to your observability plan. I may do a breakdown of different products in the future, but this post is intended to focus on the problem space, not the solution space.

Opinionated Structure

Once your teams understand the value of recording information to an immutable record, the need for a sensible structure quickly emerges. This is because the vast majority of logs should be intended for machines to read, not humans. That is to say, systems should be designed to emit metrics, and the production of time-series metrics is fundamentally a subset of the idea of logging. Among metrics there are further definitions that are sensible and that commonly emerge across platforms. Despite not being new ideas, however, these are often still expressed in proprietary formats.
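To make the "metrics are a subset of logging" idea concrete, here is a minimal sketch in Python. The field names (`ts`, `name`, `value`, `unit`, `tags`) and the helper functions are assumptions made up for illustration; they do not follow any particular product's format.

```python
import json
import time


def log_event(name, value, unit, **tags):
    """Serialize one observation as a single machine-readable JSON log line.

    The field names here are hypothetical, not a standard schema.
    """
    record = {
        "ts": time.time(),  # when the observation was made (Unix seconds)
        "name": name,       # what was observed
        "value": value,     # the measurement itself
        "unit": unit,       # how to interpret the value
        "tags": tags,       # dimensions to group or filter by
    }
    return json.dumps(record, sort_keys=True)


def to_metric_point(line):
    """Derive a (name, timestamp, value) time-series point from a log line."""
    record = json.loads(line)
    return record["name"], record["ts"], record["value"]


line = log_event("http.request.duration", 0.231, "seconds",
                 service="checkout", status=200)
print(line)
print(to_metric_point(line))
```

The point of the sketch is that the developer only logs; because the line has a predictable structure, a platform downstream can turn it into a time series without any extra work at the source.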

Many products treat an opinionated structure as part of the transport, but I consider these two domains separate. There are opinions about your structure that may affect your transport, but there are also transport protocol decisions to make that are not inherently tied to your log structure.

Transport and Routing

For logging in the cloud, it’s important to transport logs from the source of production to a consumer (more on those consumers in a second). Not transporting these logs immediately is a major source of information loss, and information reliability is the star metric of the transport job.

Managing this transport is incredibly difficult. It requires knowledge or opinions about the rates of production and the rates of consumption, and knobs for operators to twist when there is an imbalance. This means solving for service discovery and horizontal scaling (I see you Zookeeper).
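The production versus consumption imbalance can be sketched with a bounded in-memory buffer. This is a toy model, not how any real log shipper works: the `BoundedLogBuffer` class, its `capacity` knob, and the drop-on-full policy are all assumptions chosen to show where information loss comes from.

```python
import queue


class BoundedLogBuffer:
    """Toy transport buffer that drops records when the consumer falls behind."""

    def __init__(self, capacity):
        # capacity is the operator's knob: more memory buys more slack.
        self._queue = queue.Queue(maxsize=capacity)
        self.accepted = 0
        self.dropped = 0

    def offer(self, record):
        """Try to enqueue a record; count it as dropped if the buffer is full."""
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1
            return False
        self.accepted += 1
        return True

    def drain(self, max_records):
        """Remove and return up to max_records records, oldest first."""
        batch = []
        while len(batch) < max_records:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def loss_ratio(self):
        """Fraction of offered records that were lost."""
        total = self.accepted + self.dropped
        return self.dropped / total if total else 0.0


# Simulate a producer that emits 3 records per tick and a consumer
# that only ships 2 per tick.
buffer = BoundedLogBuffer(capacity=5)
for tick in range(10):
    for i in range(3):
        buffer.offer(f"tick={tick} record={i}")
    buffer.drain(max_records=2)

print("accepted:", buffer.accepted, "dropped:", buffer.dropped)
print("loss ratio:", round(buffer.loss_ratio(), 2))
```

Raising `capacity` only delays the loss. As long as production outpaces consumption, the real fix is more consumers, which is where service discovery and horizontal scaling come in.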

Additionally if you are providing a platform as a service you will need to integrate with solutions for authentication and authorization. These details often influence the structure of your logs, and it can be easy to conflate the definition of the log structure with the transport needs.

Monitoring

Once you have routed and aggregated your isolated observations (individual logs) to a central piece of software you are going to want to do the cool stuff.

Charting and thresholds are somewhat commoditized at this point, but monitoring is also where interesting things like machine learning happen. All that said, the killer app for monitoring is notifying operators when things go wrong.
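As a rough illustration of the commoditized part, a threshold check over a sliding window fits in a few lines. The `ThresholdMonitor` class and its parameters are hypothetical; real monitoring products offer far richer rules.

```python
from collections import deque


class ThresholdMonitor:
    """Signal an alert when the mean of the last `window` values exceeds `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.values = deque(maxlen=window)  # old values fall off automatically

    def observe(self, value):
        """Record a value and return True if operators should be notified."""
        self.values.append(value)
        # Wait for a full window so a single early spike does not page anyone.
        window_full = len(self.values) == self.values.maxlen
        mean = sum(self.values) / len(self.values)
        return window_full and mean > self.threshold


# Hypothetical error-rate samples, one per scrape interval.
monitor = ThresholdMonitor(threshold=0.5, window=3)
for error_rate in [0.1, 0.2, 0.9, 0.9, 0.9, 0.1]:
    if monitor.observe(error_rate):
        print(f"ALERT: mean error rate over the last 3 samples is high (latest={error_rate})")
```

Averaging over a window is one simple way to trade detection speed for fewer false alarms. It is a design choice made for this sketch, not the only sensible one.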

Escalation

While most of these tools focus on the technical problems of making observations, operations teams also need to consider the actions required to repair the system. This often includes complex social information about who knows what, where they live (timezone), and when they can or cannot be contacted (e.g. vacation). It could also include machines that automatically scale infrastructure without human intervention.
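The social side of escalation can be modeled as data too. The sketch below is an assumption-heavy simplification: the `Responder` fields, the fixed UTC offsets (no daylight saving), and the "first reachable person wins" policy are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Responder:
    name: str
    utc_offset_hours: int      # simplification: fixed offset, ignores DST
    on_vacation: bool = False
    day_start: int = 9         # local hour at which they become reachable
    day_end: int = 18          # local hour at which they stop being reachable

    def is_reachable(self, now_utc):
        """Return True if this person can be paged at the given UTC time."""
        if self.on_vacation:
            return False
        local_tz = timezone(timedelta(hours=self.utc_offset_hours))
        local_hour = now_utc.astimezone(local_tz).hour
        return self.day_start <= local_hour < self.day_end


def pick_responder(responders, now_utc):
    """Return the first reachable responder in priority order, or None."""
    for responder in responders:
        if responder.is_reachable(now_utc):
            return responder
    return None


# Hypothetical rotation, listed in priority order.
rotation = [
    Responder("alice", utc_offset_hours=-7),
    Responder("bob", utc_offset_hours=1, on_vacation=True),
    Responder("carol", utc_offset_hours=0),
]
incident_time = datetime(2018, 4, 3, 15, 0, tzinfo=timezone.utc)
chosen = pick_responder(rotation, incident_time)
print(chosen.name if chosen else "nobody reachable, escalate to automation")
```

When nobody is reachable, the `None` result is the natural place to hand off to automated remediation such as scaling infrastructure.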

Indexed & Cold Storage

There are auditing and support use cases where humans looking at actual logs is still the best option. Because of this, many operators like to have a set of logs available for open text searching. Additionally, for compliance reasons, log data sometimes must be preserved for several years. These are costly efforts in the cloud, and operators need to consider carefully the cost and risk implications.
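One way to reason about the cost tradeoff is an age-based tiering policy. The sketch below uses made-up numbers: the 30-day search window and 7-year retention period are placeholders, since the real values depend on your budget and your regulators.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy values; substitute your own requirements.
INDEXED_FOR = timedelta(days=30)        # kept in the searchable index
RETAINED_FOR = timedelta(days=365 * 7)  # kept in cheaper cold storage


def storage_tier(written_at, now):
    """Classify a log record by age into 'indexed', 'cold', or 'expired'."""
    age = now - written_at
    if age <= INDEXED_FOR:
        return "indexed"
    if age <= RETAINED_FOR:
        return "cold"
    return "expired"


now = datetime(2018, 4, 3, tzinfo=timezone.utc)
for days_old in (1, 90, 365 * 8):
    written_at = now - timedelta(days=days_old)
    print(days_old, "days old:", storage_tier(written_at, now))
```

Keeping the policy as two explicit durations makes the cost and risk conversation concrete: shortening the indexed window saves money on search, while shortening retention is a compliance decision.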

5 Distinct Problem Spaces of Observability

Serverless & The Future of Observability

Serverless is an exciting new trend that necessitates a further and more precise understanding of these problem distinctions. Solutions that conflate these boundaries will struggle to adapt to a serverless paradigm, while platforms that have clear boundaries and opinionated solutions can provide highly reliable and scalable logging solutions.