Stop getting paged for useless alerts

Annika Garbers
Published in SignifAI
3 min read · Oct 4, 2018

If you’ve ever been an on-call SRE, you know the struggle of pager fatigue. The everyday balance between maintaining existing infrastructure and developing sustainable solutions is tough enough, but add a constant stream of alerts and it can get exhausting pretty quickly!

In an ideal world, on-call engineers would be alerted only for production issues that are both urgent and important. Plot the alerts from many production systems on an Eisenhower matrix, though, and most of them land in the other three quadrants.

This means that more than 75% of the time, the notifications stressing you out aren't worth the sweat. So where is all of this noise coming from?

Irrelevant alerts
Unused services, decommissioned projects, and issues that are actively being handled by other teams are all sources of alert noise that are just prevalent enough to annoy you, but not quite enough to justify the legwork of turning them off at the source. These notifications come from all over your production system and tend to get quickly “acked” but largely ignored, since there usually isn’t an actionable issue underneath.
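If killing these at the source isn't worth the legwork yet, a small suppression list in whatever layer routes your alerts can at least keep them off the pager. Here's a minimal Python sketch; the service names and alert fields are made up for illustration:

    # Sources of noise nobody on this rotation owns anymore (hypothetical names).
    SUPPRESSED_SERVICES = {"legacy-billing", "analytics-poc"}

    def should_page(alert):
        """Drop alerts from services that are decommissioned or handled by another team."""
        return alert.get("service") not in SUPPRESSED_SERVICES

    alerts = [
        {"service": "checkout-api", "summary": "5xx rate above 2%"},
        {"service": "legacy-billing", "summary": "disk 80% full"},
    ]
    print([a["summary"] for a in alerts if should_page(a)])  # only the checkout-api alert pages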

Low-priority alerts
Some noisemakers indicate problems that may eventually need to be addressed but sit low on the priority list. Keeping these alerts configured can be a useful reminder to eventually investigate the root cause, but in the short term they're adding little value.
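One way to keep the reminder without the page is to route anything below a severity threshold to a ticket queue that gets reviewed during business hours. A rough sketch, with an assumed numeric severity field and an arbitrary threshold:

    PAGE_THRESHOLD = 2  # severity 1-2 wakes a human, 3 and up becomes a ticket

    def route(alert):
        """Send urgent alerts to the pager and the rest to a queue for business hours."""
        return "pager" if alert["severity"] <= PAGE_THRESHOLD else "ticket-queue"

    print(route({"severity": 1, "summary": "primary database unreachable"}))  # pager
    print(route({"severity": 4, "summary": "TLS cert expires in 30 days"}))   # ticket-queue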

Flapping alerts
Acking flapping issues can feel like playing whack-a-mole. These alerts are a good indicator of a growing problem in your system, but they're also a distraction when you're trying to problem-solve, sometimes prompting SREs to temporarily silence pages or blindly acknowledge everything that comes in. Unrelated issues can get buried in piles of flapping notifications, which quietly undermines your team's ability to notice the problems that matter.
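A common way to tame flapping is to count recent state transitions and hold further pages once the flip rate crosses a threshold, so you get paged once about the instability rather than on every flip. A simplified sketch; the window size and threshold are arbitrary choices, not anyone's recommended defaults:

    from collections import deque

    class FlapDetector:
        """Suppress repeat pages for a check that keeps flipping between OK and CRITICAL."""

        def __init__(self, window=20, max_transitions=6):
            self.recent_states = deque(maxlen=window)
            self.max_transitions = max_transitions

        def is_flapping(self, state):
            """Record the latest state and report whether the check is flapping."""
            self.recent_states.append(state)
            states = list(self.recent_states)
            transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
            return transitions >= self.max_transitions

    detector = FlapDetector()
    for state in ["OK", "CRITICAL"] * 5:
        flapping = detector.is_flapping(state)
    print(flapping)  # True: page once about the instability instead of on every flip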

Duplicate alerts
Similar to flapping alerts, but more a symptom of aggressive monitoring configuration than of an underlying production issue, duplicate alerts are another source of pager fatigue. You're aware of the problem after the first notification, so every additional alert telling you it's still there just adds frustration.
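A first line of defense is to fingerprint each alert and swallow repeats that arrive within a quiet window. A bare-bones sketch, assuming each alert carries a service and check name (the 10-minute window is an arbitrary example):

    import time

    DEDUP_WINDOW_SECONDS = 600  # treat repeats within 10 minutes as duplicates
    last_seen = {}

    def is_duplicate(alert, now=None):
        """Return True if an identical alert already paged us recently."""
        now = time.time() if now is None else now
        fingerprint = (alert["service"], alert["check"])
        previous = last_seen.get(fingerprint)
        last_seen[fingerprint] = now
        return previous is not None and now - previous < DEDUP_WINDOW_SECONDS

    print(is_duplicate({"service": "api", "check": "latency"}, now=1000.0))  # False
    print(is_duplicate({"service": "api", "check": "latency"}, now=1200.0))  # True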

Correlated alerts
These are the toughest, but possibly the most important, sources of noise to identify. Getting to the root cause of an issue is much faster when you have context about its impact across your full stack; without it, you can end up down rabbit holes of investigation and troubleshooting that aren't worth your time.
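Even a crude form of correlation, like grouping alerts that fire close together and share a host, shows why this context matters. A toy sketch; a real correlation engine weighs far more signals than a shared hostname and a two-minute window:

    from itertools import groupby

    def correlate(alerts, window_seconds=120):
        """Group alerts that share a host and fired within window_seconds of each other."""
        alerts = sorted(alerts, key=lambda a: (a["host"], a["timestamp"]))
        incidents = []
        for _, host_alerts in groupby(alerts, key=lambda a: a["host"]):
            current = []
            for alert in host_alerts:
                if current and alert["timestamp"] - current[-1]["timestamp"] > window_seconds:
                    incidents.append(current)
                    current = []
                current.append(alert)
            if current:
                incidents.append(current)
        return incidents

    alerts = [
        {"host": "db-1", "timestamp": 100, "summary": "disk I/O saturated"},
        {"host": "db-1", "timestamp": 130, "summary": "replication lag rising"},
        {"host": "db-1", "timestamp": 160, "summary": "slow queries from the app tier"},
    ]
    print(len(correlate(alerts)))  # 1 incident instead of 3 separate pages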

A holistic solution
SignifAI was founded by a team of SREs who saw hope for a clearer picture: a streamlined, automatically prioritized stream of urgent and important alerts. The layers of machine learning-driven filters and correlation logic powering Chewie and the SignifAI Decisions engine look for all of these sources of noise and adapt over time to surface more relevant alerts, so your team can stay focused on the issues that matter.

Curious? Check out more about SignifAI here:

  • Chewie, next-level incident management on top of your existing platform
  • Decisions, the engine powering alert correlation across your full stack

Originally published at blog.signifai.io on October 4, 2018.
