Operational Focus: Why Symptoms, not Causes?

Aron Eidelman
Google Cloud - Community
5 min read · Nov 3, 2022

“Users don’t care why something is not working, but that it is not working.”

How can we turn this platitude into something that helps Ops teams?

Let’s start with a traditional model, where Ops focuses on infrastructure, and we wait for customers to tell us something is wrong:

A set of hugging faces representing customers stands before a box titled “promises to users.” The box connects to a larger box with a question mark, representing a chain of unknown causes, which all lead to a final box titled “infrastructure.” Next to the infra box is a cat, representing Ops. Above the diagram is a smiley face representing the business.

Let’s consider the worst-case scenario in this traditional state.

Users experience an issue: the business made a promise to users, and it isn’t coming true. But the infrastructure is fine.

Users don’t care.

The Ops team may only have a small, partial view, and this partial view leads to another potential issue.

Say things are going well for the business, and more users start using their service.

Traditional Ops might be panicking even when something good is happening for users. And they might have a legitimate reason to be concerned!

Operations is ultimately a business problem, not just a technical one.

We need to be able to see the causal chain between different layers of a system.

We see a chain of dependencies surfacing differently as a mix of clear and ambiguous causes.

We also see layers of redundancy that allow for lower-level infrastructure failures without impacting users.

Moving from this conceptual awareness, you can think of how to identify and measure different areas of interest. Based on how apparent they are to users, we can group them into symptoms and causes.

Now that we have a model of the causal order, Ops can focus more on the same area of concern as the rest of the business: the users.

When issues arise, starting from a few symptoms, Ops can find the cause more efficiently than before.
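
To make that grouping concrete, here is a minimal sketch, using hypothetical metric names and thresholds, of treating user-facing signals (error rate, latency) as symptoms that page a human, while infrastructure signals (CPU, queue depth) are kept only as troubleshooting context:

```python
# A minimal sketch with hypothetical metric names and thresholds: symptoms are
# what users can feel; causes are infrastructure signals users cannot see.

SYMPTOMS = {"error_rate": 0.01, "p99_latency_s": 0.5}   # user-visible thresholds
CAUSES = {"cpu_utilization": 0.9, "queue_depth": 1000}  # infrastructure thresholds

def triage(metrics: dict) -> dict:
    """Page only on symptoms; record breached causes as troubleshooting context."""
    page = [name for name, limit in SYMPTOMS.items() if metrics.get(name, 0) > limit]
    context = [name for name, limit in CAUSES.items() if metrics.get(name, 0) > limit]
    return {"page_oncall": page, "troubleshooting_context": context}

# The CPU spike is context for debugging; only the user-visible error rate pages.
print(triage({"error_rate": 0.02, "cpu_utilization": 0.95}))
# -> {'page_oncall': ['error_rate'], 'troubleshooting_context': ['cpu_utilization']}
```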

But if we know that causes precede symptoms, don’t we want to know when causes start to look wrong in advance?

Isn’t a symptoms-first approach more reactive and less predictive, regardless of whether we know the causal chain?

These are valid concerns if causes remain as consequential as before, and if we still need to do more to mitigate the impact of a failure deep within our system.

So suppose that, instead of building those mitigations, we alert on causes.

We run the risk of being overwhelmed by alerts for causal failures. Alert fatigue and a high noise-to-signal ratio do not help us fix things faster.

Firefighting hardly seems more manageable if we’re merely aware of more fires.

How do we get out of this mess?

Ideally, we would ask, “What would it take to only alert on symptoms and not causes?”

We would build in layers of automation that obviate the need for alerts.

Why? Alerts need to be actionable, so we should instead have a system ready to handle the failure itself; when it can, there is no action left for a human to take.

With the ultimate goal of turning off alerts for causes, we automate as much as possible and progressively move closer to just the symptoms.
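
As a rough sketch of what that automation could look like (every name and threshold here is hypothetical), a cause-level failure gets an automated fix first, and a human is paged only if the user-facing symptom persists:

```python
# A hypothetical sketch: a cause-level failure (an unhealthy backend) is handled
# by automation instead of an alert, and on-call is paged only if the user-facing
# symptom (error rate above its threshold) remains after remediation.

ERROR_RATE_THRESHOLD = 0.01  # symptom: more than 1% of requests failing

def remediate(backend: dict) -> None:
    """Automated fix for a cause; no human alert involved."""
    if not backend["healthy"]:
        backend["healthy"] = True       # stand-in for a restart or replacement
        backend["restarts"] += 1

def handle_failure(backend: dict, error_rate: float) -> bool:
    """Try automation first; return True only if on-call still needs paging."""
    remediate(backend)
    return error_rate > ERROR_RATE_THRESHOLD

backend = {"healthy": False, "restarts": 0}
print(handle_failure(backend, error_rate=0.002))  # False: fixed quietly, no page
print(backend)                                    # {'healthy': True, 'restarts': 1}
```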

Even in tossing away alerts, at no point are we turning off monitoring.

We still need to monitor causes for troubleshooting, cost control, and so forth, but we grow increasingly confident in our ability to focus primarily on the symptoms.

Even with automation and monitoring in place, we accepted earlier that any technical system guarantees some failures.

Beyond the types of failures that we can prepare for, there are still unknown potential causes.

With a pattern for handling newly discovered causes, we avoid the need to obsess over them.

A bit of project work saves us from a lot of future toil. In a little time, we can return our focus to users. But we do it with the expectation that failure is inevitable, and we’re ready to discover future unknown causes.

Apply this perspective to orient discussions about expected improvements to Ops.

Consider when an IT leader says, “We want complete, end-to-end visibility.”

In that case, though, what is the main priority?

“We want to be aware when something goes wrong.”

If you’ve designed a system to handle failure, what does it mean to “go wrong?”

There is a provocative way to get people to think about these issues:

“Starting tomorrow, turn off all alerts except for user-facing symptoms. Any objections?”

You will get a litany of objections: dependencies, a lack of redundancy, and gaps in monitoring. It would be too abrupt to make this move all at once.

The point is really to ask:

“What will it take to work towards that ideal state?”

It’s up to Ops to care more about why something isn’t working, even if users don’t. The change in perspective here isn’t merely about transitively caring about the same things; empathy is only a starting point.

Instead, what a user-centric perspective gives us is a different set of values:

  • There are more possible causes of issues in our system than possible moves in chess; accept the ambiguity and focus on the most relevant.
  • What started as “business concerns” may result in discovering new technical issues that we didn’t previously see.
  • Starting with users and alerting Ops on symptoms is the sanest way to approach debugging. Alerting exclusively on symptoms should be our goal.
  • Automation isn’t a side project or a luxury. It’s the best means of attaining our goal with confidence.

Happy hunting!
