Alerting Guidelines

Aron Eidelman
Google Cloud - Community
Feb 15, 2023
An abstract diagram of a few “alert” icons above a surface, leading down to a branching graph that ends in round nodes, similar to a root structure.

The ideas reflected in this post do not necessarily reflect the opinions, attitudes, and statements of my employer or anyone associated with me.

Engineering organizations need to be able to quickly and easily identify and resolve issues, but this need often takes a backseat compared to other priorities. With so many things to track, it can be challenging to know which ones need alerts. It is possible to set up too many alerts, create too much noise, and lose relevance when faced with real problems. Because of this, we need to clarify why we are creating an alert in two respects (a brief sketch in code follows the list below):

  • the relevance of what we’re monitoring to our business goals
  • the intended outcome once we notify someone there’s an issue
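
To make this concrete, here’s a minimal sketch of how a team might record both of these alongside each alert. The class names, fields, and outcome categories below are hypothetical, not any particular tool’s schema:

    from dataclasses import dataclass
    from enum import Enum


    class IntendedOutcome(Enum):
        """What we expect to happen once the notification arrives (illustrative)."""
        PAGE_FOR_IMMEDIATE_ACTION = "page"   # wake someone up now
        TICKET_FOR_FOLLOW_UP = "ticket"      # handle during working hours
        SITUATIONAL_AWARENESS = "fyi"        # no direct action expected


    @dataclass
    class AlertDefinition:
        """A hypothetical alert record that forces us to state why the alert exists."""
        name: str
        signal: str                  # the metric or symptom being watched
        user_relevance: str          # how the signal ties back to users or business goals
        intended_outcome: IntendedOutcome


    checkout_latency = AlertDefinition(
        name="checkout-latency-degraded",
        signal="p95 latency of /checkout requests",
        user_relevance="Slow checkouts frustrate users and directly cost revenue.",
        intended_outcome=IntendedOutcome.PAGE_FOR_IMMEDIATE_ACTION,
    )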

Understanding the relevance of what we’re monitoring can help us support triage in advance. When an issue occurs, we should have enough context to gauge its impact on users, potential cost, and what other problems it can take priority over. Relevance also helps us identify which signals warrant setting objectives. For example, latency is incredibly relevant in a customer-facing application that serves real-time data — yet irrelevant in a weekly cron job to clean up old files across systems. Relevance based on context helps us answer why, what, and when to alert.

The outcomes we intend from an alert notification vary widely, from immediate response to situational awareness. We should not be waking up our on-call engineers for an SLO breach that is outside their power to fix and whose main consequence is to our release velocity. Nor should we be tucking away all of our alerts in emails, since the more critical issues could slip by. Determining the outcomes we want helps us figure out whom to alert and how.

Why should I create alerts for only some things?

When I create an alert for something, the assumption is that someone can and should take action if notified.

Think of alerts you might experience in everyday life:

  • A flash flood warning on our phone
  • A text message about a declined credit card transaction
  • An email about a “login from a new device”

Each of these alerts could exist on a website or in a log, but we’re getting notified directly about each one. We often also get instructions about what to do:

  • Avoid roads, stay at home, and prepare an emergency kit
  • Agree that you intended to make the transaction and try again or report it as fraud
  • Trust the new device, or block access and immediately change your password

In some cases, even if we feel the alert isn’t relevant or we’re already aware of the issue, we still see the value of having the alert on in general. And even if we need to do some deeper digging, e.g., looking up the name of the store where the transaction took place or verifying that the device matches what we’re using, the alert is a good starting point.

The zombie strategy of alerting, where too many metrics have alerts, potentially leads to a situation where the critical issues do not stand out. The saying goes, “If everything is important, nothing is important.” Imagine getting a flash flood warning for every flash flood on Earth, not just the ones in your area. Imagine a text message for every transaction you make, not just the declined ones. How about an email every time you log in, even from the same device? These would condition you to ignore the more relevant warnings or otherwise distract you with noise.

We have a starting point: we should have alerts for things insofar as we can fix, influence, or control them, and insofar as not knowing about the issue could be worse for us. Still, that leaves a large set of possibilities that we need to trim down, starting with what to alert on.

What should I alert on?

As we increasingly automate systems, something interesting happens: the number of things that require human action should decrease. So what stays the same?

No matter how much we automate, the constant focus is relevance to users. In Why Focus on Symptoms, Not Causes, we explored why user-facing symptoms are a better area of focus for alerting than causes.

We should alert on things that are actionable and relevant to users. A few examples (with a rough sketch in code after the list):

  • Availability. 500s, unintentional 400s, hanging requests, redirects to malicious sites–all count. Whether it’s the entire site or a small third-party component, anything that disrupts the critical user journey should be considered “unavailability.”
  • Latency. Responses should be fast, at least when it’s humans that are waiting.
  • Integrity/durability. The data should always be safe. Even if the data is temporarily unavailable, it should be correct when it comes back.
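
As a rough illustration, here is what symptom checks for those three categories could look like. The thresholds, field names, and the WindowStats shape are made up for this sketch, not recommendations:

    from dataclasses import dataclass


    @dataclass
    class WindowStats:
        """Aggregated request statistics over one evaluation window (illustrative)."""
        total_requests: int
        failed_requests: int       # 500s, unintentional 400s, hangs, bad redirects
        p95_latency_ms: float
        checksum_mismatches: int   # detected data-integrity violations


    def symptom_breaches(stats: WindowStats,
                         max_error_ratio: float = 0.01,
                         max_p95_latency_ms: float = 300.0) -> list[str]:
        """Return which user-facing symptoms are out of bounds in this window."""
        breaches = []
        if stats.total_requests and stats.failed_requests / stats.total_requests > max_error_ratio:
            breaches.append("availability")
        if stats.p95_latency_ms > max_p95_latency_ms:
            breaches.append("latency")
        if stats.checksum_mismatches > 0:
            breaches.append("integrity")
        return breaches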

Starting from symptoms, you may occasionally discover that there are still “causes” you need to alert on–something deep within your system that, while its behavior may be invisible to users, can still impact them down the road. Ideally, you’ll move to a state where you can automate responses to any of these issues, but it makes sense to have a plan in place.

An essential rule of thumb to keep in mind is that alerts are only the “start” of an action or an investigation; they do not form the entirety of your strategy. Not all metrics should have a corresponding alert. If a system grows and becomes more automated, it would be normal to monitor more and alert less.

When should I trigger an alert?

Picking the right time to trigger an alert depends on how actionable the issue is and how severe its impact is.

Suppose latency exceeds 300ms, or there is a spike in 500 errors for less than a few seconds. In each case, the anomaly may be worth looking into — but these are less severe than when high latency or error rates persist for several minutes. If an issue is ongoing, it needs direct intervention, whereas an issue that has already happened may require investigation. These deserve different levels of attention.
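
One way to picture this is a trigger that only fires once a condition has held continuously for a minimum duration, so that a brief blip can be looked at later while a sustained breach interrupts someone. This is only a sketch with invented names, not a specific monitoring product’s feature:

    from datetime import datetime, timedelta, timezone
    from typing import Optional


    class PersistenceTrigger:
        """Fire only after a condition has held continuously for `hold_for` (a sketch)."""

        def __init__(self, hold_for: timedelta):
            self.hold_for = hold_for
            self._breaching_since: Optional[datetime] = None

        def observe(self, breaching: bool, now: datetime) -> bool:
            if not breaching:
                self._breaching_since = None   # brief blips reset the timer
                return False
            if self._breaching_since is None:
                self._breaching_since = now
            return now - self._breaching_since >= self.hold_for


    # A few seconds of 500s won't fire the alert; five sustained minutes will.
    trigger = PersistenceTrigger(hold_for=timedelta(minutes=5))
    fired = trigger.observe(breaching=True, now=datetime.now(timezone.utc))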

Sometimes, an alert serves as an early warning for a situation that would be too difficult to act on by the time it actually arrives. For example, when monitoring the quota consumption of a cloud service, it’s necessary to know well in advance that a service is approaching its maximum usage so that a customer can request a limit increase, which can take time to approve.
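
A back-of-the-envelope version of that early warning, assuming linear growth and a made-up 14-day approval lead time, might look like this:

    def days_until_quota_exhausted(current_usage: float,
                                   quota_limit: float,
                                   daily_growth: float) -> float:
        """Estimate how many days remain until usage hits the limit (linear growth)."""
        if daily_growth <= 0:
            return float("inf")
        return (quota_limit - current_usage) / daily_growth


    def should_warn_early(current_usage: float,
                          quota_limit: float,
                          daily_growth: float,
                          approval_lead_time_days: float = 14.0) -> bool:
        """Warn while there is still enough runway to request and approve an increase."""
        remaining = days_until_quota_exhausted(current_usage, quota_limit, daily_growth)
        return remaining <= approval_lead_time_days


    # 8,500 of 10,000 units used, growing ~150/day: exhausted in ~10 days, so warn now.
    assert should_warn_early(8_500, 10_000, 150)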

Who should I notify, and how?

Once you have identified the people who need to know about an issue, you can decide how to notify them. For example, if the problem is urgent, you may need to page a person directly. You can create a ticket without paging anyone if the issue is less pressing. While this may seem painfully simple, a common problem is using only one notification channel for every severity level.

If someone gets paged all the time, even for minor issues, they get stressed out and distracted from what actually matters. If alerts go to a low-priority channel, like a group email (which tends to be the default for cloud quota consumption), people may only see them after it’s too late. Only notify people who need to know that there is a problem, and leave it to their discretion to inform others. It may take some planning, but the goal is to avoid spamming people with irrelevant notifications. Different channels enable prioritization and appropriate visibility.

If the on-call person is the only person who can resolve an urgent issue, you should page them. There are plenty of urgent situations where the person on-call can’t handle the resolution on their own, so paging them is just the first step in an urgent escalation. (See a separate discussion and examples of incident management here.) If the issue isn’t critical, the alert should generate a ticket in the queue. The on-call person can still work on it–just not as an “interrupt.”
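
As a simple sketch of that routing decision (the severity levels and channels here are illustrative, not a prescription), the mapping might look like:

    from enum import Enum


    class Severity(Enum):
        CRITICAL = "critical"   # ongoing user impact; someone needs to act now
        HIGH = "high"           # needs attention soon, but not this minute
        LOW = "low"             # worth tracking; no interrupt required


    def route_notification(severity: Severity) -> str:
        """Map severity to a channel so that not every alert becomes a page."""
        if severity is Severity.CRITICAL:
            return "page the on-call engineer"
        if severity is Severity.HIGH:
            return "file a ticket in the on-call queue"
        return "post to the team's low-priority channel"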

Concluding Guidelines

Having established the relevance and intended outcomes of our alerts, we can use quick rules of thumb to get the most out of them by keeping the signal-to-noise ratio high.

What to alert on:

  • Alerts should be actionable and relevant to users.
  • Some examples of things to alert on are availability, latency, and integrity/durability.
  • Alerts are the “start” of an action or an investigation; they may only represent a small portion of what you monitor.

When to alert on things:

  • Alerts can be more or less urgent depending on how long the issue has been going on and how severe the impact is. Consider that impact changes over time.
  • You can use alerts to give people enough time to act before there are consequences for issues such as quota consumption.

Who to alert, and how:

  • Only notify people you expect to act in response, and trust them to inform more people if needed.
  • Page the person on-call if the situation requires an immediate response.
  • If the issue is not urgent, consider creating a ticket instead of paging the person on-call.

If these guidelines helped you create effective alerts or you have any questions, please join us at the next Reliability Engineering Discussion.
