Metrics Ain’t Monitoring

Cliff
Published in The Opsee Blog
Jan 29, 2016

This post on scaling in AWS made the rounds the other day and in general I thought it was a decent guide. The one thing I’d take exception to is the advice to only start monitoring once you have more than 500,000 users. To me this seems like an awfully long time to wait before setting up monitoring. Why wait that long? Upon reflection, I think the author is making the same mistake that I’ve seen most of the monitoring industry make for the past several years. That is, conflating monitoring with observability.

Observability

Observability products are organized around the use case of exploring and visualizing telemetry. In practice, this usually looks like a wall of charts that updates as new data comes in. The telemetry gives developers a limited view into how their code is behaving, typically through a mix of standardized platform metrics and custom instrumentation added to the code by hand.
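
To make that concrete, here is a minimal sketch of what that kind of hand-added instrumentation often looks like, assuming the Prometheus Python client; the metric names and the handle_checkout() workload are hypothetical examples, not anything from a particular product.

```python
# A minimal sketch of custom instrumentation feeding an observability system.
# Assumes the Prometheus Python client; metric names and handle_checkout()
# are hypothetical stand-ins for real application code.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total", "Checkout requests handled")
LATENCY = Histogram("checkout_latency_seconds", "Checkout request latency")

@LATENCY.time()              # record how long each call takes
def handle_checkout():
    REQUESTS.inc()           # count every request we serve
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the telemetry collector
    while True:
        handle_checkout()
```

Every metric here had to be anticipated and wired in by a developer before it could ever show up on a chart.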

The problem with observability products is that the scale involved forces some hard tradeoffs. Generally speaking, observability systems that offer historical analysis must trade it against query complexity. The systems that offer the richest analysis do so in a streaming fashion: they run analytics in memory on the incoming stream of data and store the results for later analysis. The perennial problem with using observability systems to fix production outages is that you have to specify the data to be collected before you need it. In other words, if you didn't anticipate needing a piece of telemetry, you won't have it during an outage. This leads either to a kitchen-sink philosophy of collecting everything possible from the outset, or to a slow accretion of telemetry added after every production incident.

These scale and complexity problems are what make observability systems difficult to set up and expensive to maintain. And because the industry widely conflates observability with monitoring, we're left with recommendations such as not setting up monitoring until you reach 500,000 users.

Monitoring

Monitoring, as opposed to observability, is oriented more toward alerting developers that there is a problem to investigate. To be clear, monitoring shares some aspects with observability, in particular the collection of telemetry. But monitoring is not necessarily about deep-dive analysis of individual components. Especially as distributed systems and microservice architectures become the norm, it's much more valuable to clearly identify the impacted services. The monitoring system should provide a jumping-off point for either a deeper diagnosis or a quick resolution to the incident. And as the transition to containerized and microservice architectures continues, the ability to easily issue quick, simple fixes like restarts will become critical to easing the burden of being on-call for production infrastructure. Thus the true north for any monitoring product is distilling data down to a crisp signal: is the system broken or not?
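
As a rough illustration of that crisp signal, here is a minimal health-check sketch in Python; the endpoint URL, timeout, and "alert" action are hypothetical placeholders, not a description of any particular product's implementation.

```python
# A rough sketch of monitoring as a crisp up/down signal rather than a wall
# of charts: probe a service endpoint and alert only when it looks broken.
# The URL, timeout, and alerting action are hypothetical placeholders.
import urllib.error
import urllib.request

HEALTH_URL = "http://my-service.internal/health"  # hypothetical endpoint

def service_is_healthy(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the service answers its health check with a 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False  # unreachable or erroring counts as broken

if __name__ == "__main__":
    if service_is_healthy():
        print("OK: service is healthy")
    else:
        # In a real system this would page someone or trigger a quick fix
        # such as a restart; here we just print the signal.
        print("ALERT: service health check failed")
```

The point is the shape of the output: one unambiguous answer to "is it broken?", which then points you at the service that needs a deeper look or a quick restart.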

That’s why at Opsee we talk a lot about building a monitoring product that simply gets out of your way. If you can rely on only being alerted when things are well and truly broken, then you don’t have to sit and stare at graphs all day. Such a product would enable you to stop fighting fires and get back to shipping code, and that’s what we’re building. Opsee is in private beta right now, so sign up and reserve your spot.
