How to reduce alert noise and increase service performance visibility

A pragmatic approach to computed metric alerting

Gabriel Kent
Nov 6, 2019 · 12 min read

“False Alarm”

  • How many nights have you or your operations team been woken up unnecessarily?

The answer is most likely “too many”. Perhaps attempts have been made to “tune” these fragile alerts individually, but somehow they keep resurfacing, forcing you to play a never-ending game of whack-a-mole. This brittleness points to a fundamental flaw in your alerting design, one that must be corrected in order to reach a stable solution.

Understanding the current alerting design

Before jumping into a new and improved alerting design, we must first fully understand the current design and its limitations to ensure no net loss of functionality.

  • How does it actually work? (the documentation, if it exists, may not be sufficient)

Threshold Alerting

A common alerting strategy is to simply compare raw metric values to a pre-defined threshold. If the metric values exceed the threshold, then an alert is fired. At Invoca, we have our own alerting system, and its syntax for defining alerts reads like the following, where “metric” is the raw metric value and “4000” is the threshold:

warn_if metric > 4000

Service Breakage Case

Threshold alerting easily handles the service “breakage” case which is often characterized in a boolean manner.

It was working before, but now it’s not working at all.

Metrics will usually reflect this in the form of a wall or a cliff that continues on without returning to normal operating levels. Notice in the following example how the threshold is exceeded continuously after a particular point.

The curious case of “Metric Spikes”

It is common for even the healthiest systems to experience one-off spikes in metric data. These one-off metric spikes can occasionally exceed the threshold and cause alerts to fire when no real issue is present.

In order to guard against one-off metric spikes, an optional condition can be added which requires that the threshold must be exceeded for a specified amount of time before the alert is fired.

warn_if metric > 4000, for_at_least: 5.minutes

Rather than alerting immediately when the threshold is exceeded, the alert fires only after the threshold has been exceeded continuously for 5 minutes. The following example demonstrates the firing behavior for this configured alert:

Degraded Service Performance Case

What if the threshold is exceeded consistently but not continuously?

In other words, the service may return to normal operating levels intermittently, but is still consistently exceeding the threshold as shown below.

If we applied the aforementioned alert to this example, what would happen?

Notice how each spike exceeds the threshold for only 3 minutes at a time. Our alert requires that the threshold is exceeded for 5 continuous minutes before firing.

In this case, we would not have alerted even though the service may be experiencing enough degraded performance to warrant attention.
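To make the miss concrete, here is a minimal Python sketch of the “for at least” rule. It is not Invoca's implementation: the function name `fires_for_at_least`, the one-point-per-minute assumption, and both sample series are hypothetical and exist only for illustration.

```python
from typing import List


def fires_for_at_least(values: List[float], threshold: float, minutes: int) -> List[bool]:
    """Return, per data point, whether the alert would be firing.

    Assumes one data point per minute. The alert fires once the threshold
    has been exceeded for `minutes` consecutive points.
    """
    firing = []
    streak = 0  # consecutive points above the threshold so far
    for value in values:
        streak = streak + 1 if value > threshold else 0
        firing.append(streak >= minutes)
    return firing


# Hypothetical data: a sustained breach versus 3-minute spikes
# separated by 1-minute recoveries.
sustained = [500] * 5 + [10000] * 10
intermittent = [500] * 5 + ([10000] * 3 + [500]) * 4

print(any(fires_for_at_least(sustained, 4000, 5)))     # True
print(any(fires_for_at_least(intermittent, 4000, 5)))  # False
```

The intermittent series spends three quarters of its time over the threshold, yet the streak counter resets on every recovery and never reaches 5.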

All may not be lost. There is a known strategy for “handling” this case.

False hope: Applying a Moving Average to noncontiguous data

An alternative approach for alerting on noncontiguous data is to compute a moving average based on the last N data points. However, this requires that we adjust the alert to reference the computed average values instead of the raw metric data.

For example, we might specify a moving average with a window of 10 minutes which would equate to 10 data points since 1 data point is reported every minute. This means that every computed data point is the average of the last 10 reported data points.

With that understanding, let’s apply the following alert to our Degraded Service Performance example.

metric = movingAverage(…, '10min')
warn_if metric > 4000, for_at_least: 5.minutes

We successfully alerted! … but just barely

Computing the average smoothed out the spikes, which resulted in enough contiguous data points to alert on. However, the average was also lower in overall magnitude than the original spikes.

This is certainly expected when computing an average, but it almost didn’t stay above the threshold. Had the spikes been any lower while still exceeding the threshold, then the average might not have exceeded the threshold enough to alert as shown below.
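The effect is easy to reproduce with a small Python sketch. The `moving_average` and `spikes` helpers and the series values below are assumptions made up for illustration, not the graphed data: each series spends 3 minutes at a peak followed by 1 minute at a baseline of 500.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Average of the last `window` points, emitted once the window is full."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def spikes(peak: float, cycles: int = 5) -> List[float]:
    """Hypothetical series: 3 minutes at `peak`, then 1 minute at a 500 baseline."""
    return ([peak] * 3 + [500]) * cycles


high = moving_average(spikes(6000), 10)
low = moving_average(spikes(4500), 10)

print(min(high), max(high))  # 4350.0 4900.0 -> always above 4000
print(min(low), max(low))    # 3300.0 3700.0 -> never above 4000
```

Every spike in both series exceeds 4000, but only the taller spikes keep the 10-minute average above the threshold.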

Lower the threshold?

If we still want to alert in this situation even though the values are artificially lowered by the average, then the threshold should be lowered as well. In order to compensate, the threshold might need to be brought down from 4000 to 3500.

However, we’ve now lost the meaning of the original threshold which was likely tied to a specific SLA or SLO. This requires us to keep track of both the original performance requirement and what that requirement looks like after applying a moving average. Furthermore, if the moving average configuration ever changes, then the threshold would also have to change!

What if a single large spike happens?

Due to the large spike, the computed average exceeds the threshold for the length of the moving window and we alert unnecessarily. Any single source value change will directly affect the resulting average because averages are always dependent on their source values.

No matter how you adjust the threshold, there will always be a chance for a large enough spike to trigger an alert whenever a moving average is used.
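A quick Python sketch shows how far a single point can drag the average. The `moving_average` helper and the series (a 500 baseline with one 50000 reading) are hypothetical values chosen for illustration.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Average of the last `window` points, emitted once the window is full."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical series: a healthy 500 baseline with one large one-minute spike.
series = [500] * 15 + [50000] + [500] * 15
averaged = moving_average(series, 10)

# Count the averaged points that sit above the 4000 threshold.
over = [avg > 4000 for avg in averaged]

print(max(averaged))  # 5450.0
print(sum(over))      # 10
```

The spike stays inside the window for 10 consecutive averaged points, which is more than enough to satisfy a 5-minute “for at least” condition.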

Perhaps some are willing to risk noisy alerts by using moving averages, but…

We need something better. We need to handle the degraded service performance case without making ourselves susceptible to occasional metric spikes. As far as we can tell, there are no obvious alternatives available, so let’s be adventurous and innovate further!

Implementing a pragmatic alerting design

Now that we’ve reviewed the current design and its limitations, we are ready to continue the innovation process… but where do we go from here?

Instead of jumping into “fixing” the individual cases, we have found that it is better to first take a step back and dream up what we want.

  • What is our end goal?

Our End Goal for Alerting

At Invoca, we have invested in implementing clearly defined SLA, SLO, and SLI combinations for each of our services. These are essentially service performance requirements and measurements that an organization commits to maintain.

For example, a requirement might take the following form:

99% of requests should render landing pages successfully within 1 second

The next step is to implement a metric which measures our services against this requirement and ultimately alerts as soon as a service no longer meets this requirement.

Our end goal is to be notified as soon as our system fails to meet the performance requirements.

Our Alerting Interface Wish List

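  • Alert if and only if the performance requirement is not being met. Pay attention to consistent spikes, but ignore one-off spikes!
  • Directly and seamlessly translate the performance requirement into a coded alert, i.e. reduce or eliminate the guesswork of defining alerts!
  • Ability to standardize alert expressions. Keep it simple; keep it consistent.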

Maybe we can’t achieve everything on our wish list, but we at least have a better understanding of what we actually want. This understanding then informs our approach to figuring out how to achieve it.

There’s nothing truly innovative about this thought exercise. It is simply a pragmatic approach to design. However, we often become so wrapped up in “fixing” surface-level issues that we forget to look deeper and dream higher.

The Dream Realized

Moving Percent-Over-Threshold Alerting

“Moving what-now?” Yes, it sounds confusing, but let’s unpack this strategy that we came up with, piece by piece:

  • How did our goals help us innovate?

Goal-based Innovation

If you were expecting a step-by-step guide on how to force yourself to have an epiphany based on defined goals and wishes, then we have some sobering news for you…

Ideas can’t be forced, any more than you can force a seed to grow into a tree.

You can, however, provide nutrients, water, and sunlight to support natural growth. In the same manner, ideas can be nurtured, but with different supporting elements: problem familiarity, resource understanding, and goal visualization.

Innovation occurs when a new “neural connection” is made. We should not expect new connections to be made by repeatedly looking at the same problem from the same perspective. New perspectives open up new opportunities for neural connections. Developing and visualizing creative goals is just one way to open yourself up to new perspectives and new ideas.

That said, it is still important to understand the problem domain and limitations of existing resources in order to keep these “neural connections” grounded in reality.

The neural connection we “experienced” that sparked this new approach came from the goal of “translating performance requirements into coded alerts”. Our performance requirements are in the form of percentages, so we explored the problem from the perspective of percentages, which resulted in the idea for a computed Percent-Over-Threshold metric.

Percent-Over-Threshold (POT): Just another computed metric

Given a set of data points, count how many data points exceeded the defined threshold and return this count as a percentage of the total data points provided.

For example, if 3 out of 10 data points exceeded the threshold, then the computed POT would be 30%.

Threshold: 4000
DataPoints: [500, 500, 10000, 10000, 10000, 500, 500, 500, 500, 500]
Computed POT: 30
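
The same calculation can be sketched in a few lines of Python. The function name `percent_over_threshold` is an assumption for illustration and is not the name used in our alerting system; the data points are the ones from the example above.

```python
from typing import Sequence


def percent_over_threshold(points: Sequence[float], threshold: float) -> float:
    """Percentage of `points` that are strictly greater than `threshold`."""
    if not points:
        raise ValueError("at least one data point is required")
    over = sum(1 for p in points if p > threshold)
    return 100 * over / len(points)


data_points = [500, 500, 10000, 10000, 10000, 500, 500, 500, 500, 500]
print(percent_over_threshold(data_points, 4000))  # 30.0
```

Note that only the count of points over the threshold matters; how far over they are never enters the calculation.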

This is similar to the moving average in that we perform a calculation over the last N data points and return a single value for each calculation. The code representation might look something like the following:

metric = compute_percent_over_threshold(
  series: ...,
  threshold: 4000,
  minimum_samples: 10
)
warn_if metric >= 100

In order to trigger this alert, the threshold would have to be exceeded by 100% of the data points for the last 10 minutes (or last 10 data points). Of course, we already had this capability with the previous alerts. Even so, this computed metric truly shines once we start defining percentage thresholds below 100%. Doing so allows us to successfully alert on the degraded service performance use case. Let’s see how well POT handles this tough edge case!
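As a rough Python sketch of the moving calculation, assuming that `minimum_samples` acts as the window size and that one data point arrives per minute (the `moving_pot` helper and the series are hypothetical):

```python
from typing import List


def moving_pot(values: List[float], threshold: float, minimum_samples: int) -> List[float]:
    """POT over the last `minimum_samples` points, one value per full window."""
    result = []
    for end in range(minimum_samples, len(values) + 1):
        window = values[end - minimum_samples : end]
        over = sum(1 for v in window if v > threshold)
        result.append(100 * over / minimum_samples)
    return result


# Hypothetical series: 5 healthy minutes, then a sustained breach.
series = [500] * 5 + [10000] * 12
pot = moving_pot(series, 4000, 10)

print(pot)  # [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 100.0, 100.0]
print([value >= 100 for value in pot])
```

With a 100% requirement, the alert first fires on the window made up entirely of breaching points, which matches “exceeded continuously for 10 minutes”. Lowering the percentage is what lets the alert tolerate brief recoveries.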

Applying computed POT to the tough edge cases

#1 Degraded Service Performance

While the moving average was able to smooth out consistent spikes, it was also heavily influenced by the magnitudes of these spikes.

Comparing the graphs below, you’ll see that the computed POT data points are not affected at all by the changing magnitude of the spikes!

This allows us to define and maintain a consistent threshold to alert on:

Alert if at least 50% of the data points from the last 10 minutes exceed the threshold.
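
The magnitude claim can be checked with a short Python sketch. The `moving_pot` and `spikes` helpers and both series are assumptions for illustration (3 minutes at a peak, 1 minute at a 500 baseline), not the graphed data.

```python
from typing import List


def moving_pot(values: List[float], threshold: float, minimum_samples: int) -> List[float]:
    """POT over the last `minimum_samples` points, one value per full window."""
    result = []
    for end in range(minimum_samples, len(values) + 1):
        window = values[end - minimum_samples : end]
        over = sum(1 for v in window if v > threshold)
        result.append(100 * over / minimum_samples)
    return result


def spikes(peak: float, cycles: int = 5) -> List[float]:
    """Hypothetical series: 3 minutes at `peak`, then 1 minute at a 500 baseline."""
    return ([peak] * 3 + [500]) * cycles


high = moving_pot(spikes(10000), 4000, 10)
low = moving_pot(spikes(4500), 4000, 10)

print(high == low)  # True: spike height does not change the POT
print(min(low))     # 70.0: comfortably at or above a 50% alert threshold
```

Both series produce identical POT values, so the same 50% rule fires for tall and short spikes alike.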

[Figure: Consistent spikes, high magnitude]
[Figure: Consistent spikes, low magnitude]

#2 Single Large Spike

Again, the computed POT is unaffected by the magnitude of spikes. Reviewing the graph below, we see that the spike consisted of only a single data point and the computed POT simply reports that 10% of the last 10 data points exceeded the threshold.

The computed POT successfully remained below our 50% threshold!
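
Here is the same scenario as a Python sketch, using a hypothetical `moving_pot` helper and a made-up series with one very large one-minute spike:

```python
from typing import List


def moving_pot(values: List[float], threshold: float, minimum_samples: int) -> List[float]:
    """POT over the last `minimum_samples` points, one value per full window."""
    result = []
    for end in range(minimum_samples, len(values) + 1):
        window = values[end - minimum_samples : end]
        over = sum(1 for v in window if v > threshold)
        result.append(100 * over / minimum_samples)
    return result


# Hypothetical series: a healthy 500 baseline with one large one-minute spike.
series = [500] * 15 + [50000] + [500] * 15
pot = moving_pot(series, 4000, 10)

print(max(pot))                           # 10.0
print(any(value >= 50 for value in pot))  # False: no alert
```

Making the spike ten times larger would not change the result, because a single point can contribute at most one tenth of a 10-point window.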

[Figure: Spike consisting of a single data point]

Achieving Goals and Granting Wishes

What’s the point of having goals and making wish lists if we don’t celebrate completing them?

Let’s review how well we did:

Alert if and only if the performance requirement is not being met. Pay attention to consistent spikes, but ignore one-off spikes!

As demonstrated above, computed POT handles the tough edge cases without sacrificing sensitivity. +1 for robustness!

Directly and seamlessly translate the performance requirement into a coded alert. i.e. Reduce or eliminate the guesswork of defining alerts!

Since the computed POT metric is percentage based, we can now have a closer correlation between the alert and the corresponding SLA or SLO assuming their notation is also percentage based. +1 for congruity!

Ability to standardize alert expressions. Keep it simple; keep it consistent.

In addition to replacing moving averages, computed POT can also replace the standard contiguous-point based alerts. This gives us the option to completely standardize on the computed POT format if we so choose. While the concept is likely more straightforward to reason about, we did slightly increase the complexity by adding a second threshold to the mix (the percentage threshold). +1 for standardization, -1 for increased complexity

Overall, that’s a net quality gain!

FAQ for tuning computed POT alerts

What should I choose for minimum samples?

The minimum samples are essentially your window of calculation and have the following characteristics:

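  • Small windows are more sensitive to sudden changes. Each data point holds greater weight towards triggering the alert. Spikes are more likely to trigger the alert, but the alert will respond faster to issues.
  • Large windows are less sensitive to sudden changes. Each data point holds less weight towards triggering the alert. Spikes are less likely to trigger the alert, but the alert will respond more slowly to issues.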

In the case of critical system functionality, small windows may be worth the risk of false alarms in order to achieve a shorter response time. In the case of high variance systems, large windows may be worth having longer response times in order to avoid frequent false alarms.

Answer: Choose the minimum sample value according to the preferred alert response time and risk of false alarms.

What should I choose for a percentage threshold?

When choosing a percentage threshold, remember that the possible percentage values are coupled to the minimum samples specified. The possible percentage values for a given minimum sample size are as follows:
1 sample: [0%, 100%]
2 samples: [0%, 50%, 100%]
3 samples: [0%, 33%, 67%, 100%]
4 samples: [0%, 25%, 50%, 75%, 100%]
and so forth…
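
For other window sizes, the reachable values can be listed with a short Python sketch. Both helper names below are hypothetical, and values are rounded to one decimal place for display.

```python
import math
from typing import List


def possible_percentages(minimum_samples: int) -> List[float]:
    """Every POT value a window of `minimum_samples` points can produce."""
    return [round(100 * k / minimum_samples, 1) for k in range(minimum_samples + 1)]


def samples_needed(percent_threshold: float, minimum_samples: int) -> int:
    """How many points must exceed the threshold for POT >= percent_threshold."""
    return math.ceil(percent_threshold * minimum_samples / 100)


for n in (1, 2, 3, 4, 10):
    print(n, possible_percentages(n))

print(samples_needed(50, 3))   # 2 -> a 50% rule over 3 samples behaves like 67%
print(samples_needed(50, 10))  # 5
```

If the percentage threshold falls between two reachable values, the alert effectively rounds it up to the next one, which is why the window size and the percentage should be chosen together.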

Answer: The percentage threshold should be tied to the appropriate SLA or SLO, but make sure that it fits with the possible percentage values. Adjusting the minimum samples may be necessary to provide a better fit.

When should I specify a “for at least” condition?

Perhaps a suitable minimum sample size has already been chosen, but spikes are still an issue. Instead of increasing the minimum sample size, a “for at least” condition can be specified to guard against longer-lasting spikes that are generally not long enough to warrant an alert. Increasing the “for at least” duration has much the same effect as increasing the minimum sample size.

Answer: Add a “for at least” condition if increasing the minimum sample size is undesirable.

TL;DR Summary

  • False alarms are the worst

Thanks for reading! We hope your understanding of alerting has increased and that you’ll try out computed percent-over-threshold for yourself.

If you have any questions, feedback, or further ideas on the subject, please leave us a comment!

This post is one example of how at Invoca we are never satisfied with the status quo and are always pushing the boundaries to deliver high-quality, reliable service to our customers. We’re hiring.

Invoca Engineering Blog
