Observability: the new wave or buzzword?

OH — “Observability — because devs don’t like to do “monitoring” we need to package it in new nomenclature to make it palatable and trendy.”

— Cindy Sridharan (@copyconstruct) July 28, 2017

I think you’ll be hearing more about observability in 2018. In this post, I share why I believe the term has emerged and how it differs from monitoring.

Failing Softly

Like the human body, our software systems have evolved over the past several years to fail hard less often:

Today, it’s significantly easier for engineering teams of all sizes to prevent hard failures: we have great caching tools to prevent overloading a database, our servers and containers will automatically recover or take themselves out of rotation on failure, and a single microservice may break while the main app remains usable.

Failure is evolving

Think about a traditional three-tier monolith: you might set up alerts on two key metrics, 95th percentile response time and error rate. When an alert fires, the root cause is usually isolated to one of those three tiers (often the database).
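To make that concrete, here is a minimal sketch of that style of threshold alerting, assuming request data is available as simple in-memory records; the Request shape, the thresholds, and the check_alerts helper are illustrative, not any particular monitoring product’s API:

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List

@dataclass
class Request:
    duration_ms: float   # total time spent serving the request
    error: bool          # True if the request returned an error response

# Illustrative thresholds; real values depend on the app's SLOs.
P95_THRESHOLD_MS = 500.0
ERROR_RATE_THRESHOLD = 0.01  # 1%

def check_alerts(recent: List[Request]) -> List[str]:
    """Return the monolith-style alerts that fire for a window of requests.

    Assumes the window contains at least two requests.
    """
    alerts = []
    durations = [r.duration_ms for r in recent]
    # quantiles(..., n=20) returns 19 cut points; index 18 is the 95th percentile.
    p95 = quantiles(durations, n=20)[18]
    error_rate = sum(r.error for r in recent) / len(recent)
    if p95 > P95_THRESHOLD_MS:
        alerts.append(f"p95 response time {p95:.0f}ms exceeds {P95_THRESHOLD_MS:.0f}ms")
    if error_rate > ERROR_RATE_THRESHOLD:
        alerts.append(f"error rate {error_rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
    return alerts
```

Two numbers, two alerts, and a short list of suspects when one of them fires.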

Now, think about an app backed by several microservices: those services may reside on the same host (or may not), may share the same database (or may not), may reside in entirely different datacenters, or may even be owned by an outside vendor. Different teams likely own each of those services and choose the components they are most productive with, leading to massively heterogeneous systems.

When there are more ways for a system to fail, it takes more time to identify the cause of a failure. Most of the time spent resolving a performance issue goes to the investigation, not to implementing the fix. These soft-failing systems can be more difficult to debug when things go wrong.

Failure is shifting to application code

As infrastructure gets better at recovering from hard failures on its own, responsibility for diagnosing and fixing the soft failures that remain shifts to developers and their application code.

Enter observability

Monitoring tells you whether the system works. Observability lets you ask why it’s not working.

— Baron Schwartz (@xaprb) October 19, 2017

In “Monitoring in the time of Cloud Native,” Cindy Sridharan views monitoring (alerting and overview dashboards) as the tip of the observability iceberg, with debugging beneath it. We’ve had alerting, dashboards, and debugging tools for decades, so why are we now deciding this needs an overarching name?

If we treat where the time is spent resolving a performance incident as the iceberg, a greater percentage of that iceberg is now in debugging.

As the composition of monitoring and debugging work has shifted, we need an overarching term to describe the parts together. That term is observability.

I think we see evidence of this in the tooling that emerged first at the large players: tools that provide better debugging, and platforms that make it easier to understand the moving parts of distributed systems (Google’s Dapper, Facebook’s Canopy and Scuba, and Twitter’s Finagle).

To understand the relationship between the two parts of observability, I find it helps to ask these questions:

  • Monitoring — are you alerted on known failures and do you have a good starting point after receiving an alert?
  • Debugging — are you able to quickly isolate a problem using as few team members as possible?

Observability trends to watch for

  • Developer-focused tooling — Monitoring has traditionally been owned by operations teams, not developers. As more of the work of resolving incidents shifts to changes in application code, the tooling will need to shift its focus as well.
  • Faceting on dimensions — Soft failures are often isolated to a specific set of dimensions that gets lost in top-level monitoring views. However, grouping metric data by different facets (e.g., show me the performance of our app for our largest customers) requires significant tradeoffs in real-time availability, sampling, UI, and the amount of historical data that is stored (see the faceting sketch after this list).
  • Integrated tooling — Deciding how tightly logging, metrics, and tracing should be integrated with one another.
  • DRY’ing up observability code — Instrumenting a web request likely means it emits some combination of logging, metrics, and tracing. Writing three separate pieces of code for this is painful. Concepts like OpenTracing may evolve to address this (see the decorator sketch after this list).
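A minimal sketch of what faceting buys you, assuming metric points are just (customer, duration) records held in memory; the sample data, the customer names, and the p95 helper are hypothetical:

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical records: (customer_id, duration_ms). In a real system these
# would be metric points tagged with dimensions (customer, endpoint, host, ...).
requests = [
    ("acme", 1800.0), ("acme", 2100.0), ("acme", 1950.0),
    ("small-co", 120.0), ("small-co", 95.0), ("small-co", 110.0),
    ("small-co", 130.0), ("small-co", 90.0), ("small-co", 105.0),
]

def p95(values):
    # 95th percentile; fall back to the single value for tiny samples.
    return quantiles(values, n=20)[18] if len(values) >= 2 else values[0]

# Top-level view: one number for the whole app.
overall = p95([duration for _, duration in requests])

# Faceted view: the same metric grouped by the customer dimension.
by_customer = defaultdict(list)
for customer, duration in requests:
    by_customer[customer].append(duration)
faceted = {customer: p95(durations) for customer, durations in by_customer.items()}

print(f"overall p95: {overall:.0f}ms")
for customer, value in sorted(faceted.items()):
    print(f"  {customer}: p95 {value:.0f}ms")
# The top-level number says the app is slow; only the faceted view shows the
# slowness is isolated to "acme", the largest customer.
```

The hard part in practice is not the group-by itself but doing it over high-cardinality dimensions, in near real time, with sampled data and long retention, which is exactly the set of tradeoffs listed above.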
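And a sketch of one way that duplication could be DRY’ed up, assuming a single decorator emits the log line and the metric in one place; observed, record_metric, and the METRICS store are hypothetical stand-ins, and the span handling is only marked in comments where a tracer such as OpenTracing would plug in:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("observability")

# Hypothetical in-process metric store standing in for a real metrics client.
METRICS = {}

def record_metric(name: str, value: float) -> None:
    METRICS.setdefault(name, []).append(value)

def observed(name: str):
    """Wrap a function so one decorator emits the log line, the metric,
    and (in a real setup) the trace span, instead of three hand-written blocks."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            # A real tracer (e.g., OpenTracing) would start a span here.
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                log.info("%s finished in %.1fms", name, elapsed_ms)  # logging
                record_metric(f"{name}.duration_ms", elapsed_ms)     # metrics
                # ...and the span would be finished and annotated here.  # tracing
        return wrapper
    return decorator

@observed("checkout.charge_card")
def charge_card(order_id: int) -> bool:
    time.sleep(0.05)  # pretend to call a payment service
    return True

if __name__ == "__main__":
    charge_card(42)
    print(METRICS)
```

The instrumentation lives in one place, so adding or changing a signal does not mean touching every instrumented function.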

TL;DR

Our systems fail hard less often but fail soft in more ways, so most of the time spent resolving an incident now goes to debugging rather than alerting. Observability is the umbrella term for that whole effort: monitoring (alerting and dashboards) plus the debugging beneath it.


Originally published at blog.scoutapp.com on December 30, 2017.
