How to reduce alert noise and increase service performance visibility
A pragmatic approach to computed metric alerting
- How many nights have you or your operations team been woken up unnecessarily?
- How many critical alerts have you seen resolve within minutes of firing?
- How many times have you had to investigate and declare a “false alarm”?
The answer is most likely “too many”. Perhaps attempts have been made to “tune” these fragile alerts individually, but somehow they keep resurfacing, forcing you to play a never-ending game of whack-a-mole. This brittleness points to a fundamental flaw in your alerting design that must be corrected before a stable solution can be reached.
Understanding the current alerting design
Before jumping into a new and improved alerting design, we must first fully understand the current design and its limitations to ensure no net loss of functionality.
- How does it actually work? (the documentation, if it exists, may not be sufficient)
- What are the cases that need to be handled?
- Where does it fall short?
A common alerting strategy is to simply compare raw metric values against a predefined threshold: if a metric value exceeds the threshold, an alert fires. At Invoca, we have our own alerting system, and the syntax for defining alerts reads like the following, where “metric” is the raw metric value and “4000” is the threshold:
warn_if metric > 4000
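The alerting DSL above is Invoca-internal, but the underlying logic is a one-line comparison. As a rough illustration (not the actual implementation), here is the same check sketched in Python, using the 4000 threshold from the example:

```python
def threshold_alert(value, threshold=4000):
    """Fire whenever a raw metric value exceeds the fixed threshold."""
    return value > threshold

print(threshold_alert(4200))  # True  -> alert fires
print(threshold_alert(3900))  # False -> no alert
```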
Service Breakage Case
Threshold alerting easily handles the service “breakage” case which is often characterized in a boolean manner.
It was working before, but now it’s not working at all.
Metrics will usually reflect this in the form of a wall or a cliff that continues on without returning back to normal operating levels. Notice in the following example how the threshold is exceeded continuously after a particular point.
The curious case of “Metric Spikes”
It is common for even the healthiest systems to experience one-off spikes in metric data. These one-off metric spikes can occasionally exceed the threshold and cause alerts to fire when no real issue is present.
In order to guard against one-off metric spikes, an optional condition can be added which requires that the threshold must be exceeded for a specified amount of time before the alert is fired.
warn_if metric > 4000, for_at_least: 5.minutes
Rather than immediately alert when the threshold is exceeded, the threshold must instead be exceeded continuously for 5 minutes. The following example demonstrates the firing behavior for this configured alert:
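A minimal Python sketch of the “for at least” guard, assuming one data point per minute (names and values are illustrative, not Invoca’s actual implementation):

```python
def warn_if_sustained(values, threshold=4000, for_at_least=5):
    """Fire only if the threshold is exceeded for `for_at_least`
    consecutive data points (one point per minute)."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0  # reset the streak on any dip
        if run >= for_at_least:
            return True
    return False

# A 3-minute spike is ignored; a sustained 5-minute breach fires.
print(warn_if_sustained([5000, 5000, 5000, 100, 100]))         # False
print(warn_if_sustained([5000, 5000, 5000, 5000, 5000, 100]))  # True
```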
Degraded Service Performance Case
What if the threshold is exceeded consistently but not continuously?
In other words, the service may return to normal operating levels intermittently, but is still consistently exceeding the threshold as shown below.
If we applied the aforementioned alert to this example, what would happen?
Notice how each spike exceeds the threshold for only 3 minutes at a time. Our alert requires that the threshold is exceeded for 5 continuous minutes before firing.
In this case, we would not have alerted even though the service may be experiencing enough degraded performance to warrant attention.
All may not be lost. There is a known strategy for “handling” this case.
False hope: Applying a Moving Average to noncontiguous data
An alternative approach for alerting on noncontiguous data is to compute a moving average based on the last N data points. However, this requires that we adjust the alert to reference the computed average values instead of the raw metric data.
For example, we might specify a moving average with a window of 10 minutes which would equate to 10 data points since 1 data point is reported every minute. This means that every computed data point is the average of the last 10 reported data points.
With that understanding, let’s apply the following alert to our Degraded Service Performance example.
metric = movingAverage(…, '10min')
warn_if metric > 4000, for_at_least: 5.minutes
We successfully alerted! … but just barely
Computing the average smoothed out the spikes which resulted in enough contiguous data points to alert on. However, the average was also lower in overall magnitude compared to the original spikes.
This is certainly expected when computing an average, but it almost didn’t stay above the threshold. Had the spikes been any lower while still exceeding the threshold, then the average might not have exceeded the threshold enough to alert as shown below.
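To make the smoothing effect concrete, here is a hypothetical Python sketch: raw spikes to 7000 never exceed the 4000 threshold for 5 contiguous minutes, yet the 10-minute moving average does exceed it, at a much lower magnitude (all values are illustrative):

```python
def moving_average(values, window=10):
    """Trailing average, emitted once a full window is available."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Repeating pattern: 3-minute spikes to 7000, then 2-minute dips to 1000.
raw = [7000, 7000, 7000, 1000, 1000] * 4
avg = moving_average(raw, window=10)
print(avg[:3])  # [4600.0, 4600.0, 4600.0] -- above 4000, but far below 7000
```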
Lower the threshold?
If we still want to alert in this situation even though the values are artificially lowered by the average, then the threshold should be lowered as well. In order to compensate, the threshold might need to be brought down from 4000 to 3500.
However, we’ve now lost the meaning of the original threshold which was likely tied to a specific SLA or SLO. This requires us to keep track of both the original performance requirement and what that requirement looks like after applying a moving average. Furthermore, if the moving average configuration ever changes, then the threshold would also have to change!
What if a single large spike happens?
Due to the large spike, the computed average exceeds the threshold for the length of the moving window and we alert unnecessarily. Any single source value change will directly affect the resulting average because averages are always dependent on their source values.
No matter how you adjust the threshold, there will always be a chance for a large enough spike to trigger an alert whenever a moving average is used.
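A quick sketch of the problem: a single outlier inflates the trailing average for an entire window’s worth of data points (values are hypothetical):

```python
def moving_average(values, window=10):
    """Trailing average, emitted once a full window is available."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Healthy traffic near 500 with one enormous one-minute spike.
raw = [500] * 9 + [60000] + [500] * 10
avg = moving_average(raw, window=10)
print(max(avg))                    # 6450.0 -- well above the 4000 threshold
print(sum(a > 4000 for a in avg))  # 10 -- the average stays elevated for 10 minutes
```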
Perhaps some are willing to risk noisy alerts by using moving averages, but…
We need something better. We need to handle the degraded service performance case without making ourselves susceptible to occasional metric spikes. As far as we can tell, there are no obvious alternatives available, so let’s be adventurous and innovate further!
Implementing a pragmatic alerting design
Now that we’ve reviewed the current design and its limitations, we are ready to continue the innovation process… but where do we go from here?
Instead of jumping into “fixing” the individual cases, we have found that it is better to first take a step back and dream up what we want.
- What is our end goal?
- What is our wish list?
Our End Goal for Alerting
At Invoca, we have invested in implementing clearly defined SLA, SLO, and SLI combinations for each of our services. These are essentially service performance requirements and measurements that an organization commits to maintain.
For example, a requirement might take the following form:
99% of requests should render landing pages successfully within 1 second
The next step is to implement a metric which measures our services against this requirement and ultimately alerts as soon as a service no longer meets this requirement.
Our end goal is to be notified as soon as our system fails to meet the performance requirements.
Our Alerting Interface Wish List
- Alert if and only if the performance requirement is not being met. Pay attention to consistent spikes, but ignore one-off spikes!
- Directly and seamlessly translate the performance requirement into a coded alert. i.e. Reduce or eliminate the guesswork of defining alerts!
- Ability to standardize alert expressions. Keep it simple; keep it consistent.
Maybe we can’t achieve everything on our wish list, but we at least have a better understanding of what we actually want. This understanding then informs our approach to figuring out how to achieve it.
There’s nothing truly innovative about this thought exercise. It is simply a pragmatic approach to design. However, we often become so wrapped up in “fixing” surface level issues that we forget to look deeper and dream higher.
The Dream Realized
Moving Percent-Over-Threshold Alerting
“Moving what-now?” Yes, it sounds confusing, but let’s unpack this strategy that we came up with, piece by piece:
- How did our goals help us innovate?
- What’s the strategy?
- How does it perform against the aforementioned use cases?
- How many of our goals and wish list items did we ultimately achieve?
If you were expecting a step-by-step guide on how to force yourself to have an epiphany based on defined goals and wishes, then we have some sobering news for you…
Ideas can’t be forced, any more than you can force a seed to grow into a tree.
You can, however, provide nutrients, water, and sunlight to support natural growth. In the same manner, ideas can be nurtured, but with different supporting elements: problem familiarity, resource understanding, and goal visualization.
Innovation occurs when a new “neural connection” is made. We should not expect new connections to be made by repeatedly looking at the same problem from the same perspective. New perspectives open up new opportunities for neural connections. Developing and visualizing creative goals is just one way to open yourself up to new perspectives and new ideas.
That said, it is still important to understand the problem domain and limitations of existing resources in order to keep these “neural connections” grounded in reality.
The neural connection we “experienced” that sparked this new approach was due to the goal of “translating performance requirements into coded alerts”. Our performance requirements are in the form of percentages, so we explored the problem from the perspective of percentages which resulted in the idea for a computed Percent-Over-Threshold metric.
Percent-Over-Threshold (POT): Just another computed metric
Given a set of data points, count how many data points exceeded the defined threshold and return this count as a percentage of the total data points provided.
For example, if 3 out of 10 data points exceeded the threshold, then the computed POT would be 30%.
DataPoints: [500, 500, 10000, 10000, 10000, 500, 500, 500, 500, 500]
Computed POT: 30%
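The calculation itself is only a few lines; here is one possible Python sketch (the real Invoca implementation may differ):

```python
def percent_over_threshold(points, threshold=4000):
    """Percentage of the given data points that exceed the threshold."""
    if not points:
        return 0.0
    return 100.0 * sum(p > threshold for p in points) / len(points)

points = [500, 500, 10000, 10000, 10000, 500, 500, 500, 500, 500]
print(percent_over_threshold(points))  # 30.0
```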
This is similar to the moving average in that we perform a calculation over the last N data points and return a single value for each calculation. The code representation might look something like the following:
metric = compute_percent_over_threshold minimum_samples: 10
warn_if metric >= 100
In order to trigger this alert, the threshold would have to be exceeded by 100% of the data points over the last 10 minutes (i.e. the last 10 data points). Of course, we already had this capability with the previous alerts. Even so, this computed metric truly shines once we start defining percentage thresholds below 100%. Doing so allows us to successfully alert on the degraded service performance use case. Let’s see how well POT handles these tough edge cases!
Applying computed POT to the tough edge cases
#1 Degraded Service Performance
While the moving average was able to smooth out consistent spikes, it was also heavily influenced by the magnitudes of these spikes.
Comparing the graphs below, you’ll see that the computed POT data points are not affected at all by the changing magnitude of the spikes!
This allows us to define and maintain a consistent threshold to alert on:
Alert if 50% of the data points exceed the threshold for the last 10 minutes.
#2 Single Large Spike
Again, the computed POT is unaffected by the magnitude of spikes. Reviewing the graph below, we see that the spike consisted of only a single data point and the computed POT simply reports that 10% of the last 10 data points exceeded the threshold.
The computed POT successfully remained below our 50% threshold!
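Both edge cases can be replayed with a sliding-window POT sketch in Python (illustrative values; a 50% percentage threshold is assumed, as above):

```python
def moving_pot(values, threshold=4000, minimum_samples=10):
    """Computed POT over a sliding window of `minimum_samples` points."""
    return [100.0 * sum(v > threshold
                        for v in values[i - minimum_samples + 1:i + 1])
            / minimum_samples
            for i in range(minimum_samples - 1, len(values))]

# Degraded service: repeated 3-minute spikes. Every window reports 60%,
# regardless of spike magnitude -- comfortably above a 50% threshold.
degraded = [6000, 6000, 6000, 1000, 1000] * 4
print(set(moving_pot(degraded)))  # {60.0}

# One-off spike: at most 10% of any window exceeds the threshold.
spike = [500] * 9 + [60000] + [500] * 10
print(max(moving_pot(spike)))  # 10.0
```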
Achieving Goals and Granting Wishes
What’s the point of having goals and making wish lists if we don’t celebrate completing them?
Let’s review how we did against each item:
Alert if and only if the performance requirement is not being met. Pay attention to consistent spikes, but ignore one-off spikes!
As demonstrated above, we’ve shown that computed POT handles the tough edge cases without sacrificing sensitivity. +1 for robustness!
Directly and seamlessly translate the performance requirement into a coded alert. i.e. Reduce or eliminate the guesswork of defining alerts!
Since the computed POT metric is percentage based, we can now have a closer correlation between the alert and the corresponding SLA or SLO assuming their notation is also percentage based. +1 for congruity!
Ability to standardize alert expressions. Keep it simple; keep it consistent.
In addition to replacing moving averages, computed POT can also replace the standard contiguous-point based alerts. This gives us the option to completely standardize on the computed POT format if we so choose. While the concept is likely more straightforward to reason about, we did slightly increase the complexity by adding a second threshold to the mix (the percentage threshold). +1 for standardization, -1 for increased complexity
Overall, that’s a net quality gain!
FAQ for tuning computed POT alerts
What should I choose for minimum samples?
The minimum samples are essentially your window of calculation and have the following characteristics:
- Small windows are more sensitive to sudden changes. Each data point holds greater weight towards triggering the alert. Spikes are more likely to trigger the alert, but the alert will respond faster to issues.
- Large windows are less sensitive to sudden changes. Each data point holds lesser weight towards triggering the alert. Spikes are less likely to trigger the alert, but the alert will respond slower to issues.
In the case of critical system functionality, small windows may be worth the risk of false alarms in order to achieve a shorter response time. In the case of high variance systems, large windows may be worth having longer response times in order to avoid frequent false alarms.
Answer: Choose the minimum sample value according to the preferred alert response time and risk of false alarms.
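The response-time trade-off can be illustrated with a small Python experiment: given an issue that starts at minute 11, a 4-point window crosses a 50% percentage threshold sooner than a 10-point window (all names and numbers here are hypothetical):

```python
def first_fire_minute(values, window, pct=50, threshold=4000):
    """Minute at which a sliding-window POT alert first reaches `pct`."""
    for end in range(window, len(values) + 1):
        win = values[end - window:end]
        if 100 * sum(v > threshold for v in win) / window >= pct:
            return end
    return None

# Ten healthy minutes, then the metric jumps above the threshold.
series = [500] * 10 + [6000] * 10
print(first_fire_minute(series, window=4))   # 12 -- fires 2 minutes into the issue
print(first_fire_minute(series, window=10))  # 15 -- fires 5 minutes into the issue
```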
What should I choose for a percentage threshold?
When choosing a percentage threshold, remember that the possible percentage values are coupled to the minimum samples specified. The possible percentage values for a given minimum sample size are as follows:
1 sample: [0%, 100%]
2 samples: [0%, 50%, 100%]
3 samples: [0%, 33%, 67%, 100%]
4 samples: [0%, 25%, 50%, 75%, 100%]
and so forth…
Answer: The percentage threshold should be tied to the appropriate SLA or SLO, but make sure that it fits with the possible percentage values. Adjusting the minimum samples may be necessary to provide a better fit.
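The attainable values are simply k / minimum_samples for k = 0 through minimum_samples; a tiny Python helper (rounding to the nearest whole percent) reproduces the table above:

```python
def possible_percentages(minimum_samples):
    """All POT values a window of `minimum_samples` points can produce,
    rounded to the nearest whole percent."""
    return [round(100 * k / minimum_samples)
            for k in range(minimum_samples + 1)]

for n in (1, 2, 3, 4):
    print(n, possible_percentages(n))
# 1 [0, 100]
# 2 [0, 50, 100]
# 3 [0, 33, 67, 100]
# 4 [0, 25, 50, 75, 100]
```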
When should I specify a “for at least” condition?
Perhaps a suitable minimum sample size has already been chosen, but spikes are still an issue. Instead of increasing the minimum sample size, a “for at least” condition can be added to guard against longer-lasting spikes that still aren’t sustained enough to warrant an alert. Increasing the “for at least” condition is essentially equivalent to increasing the minimum sample size.
Answer: Add a “for at least” condition if increasing the minimum sample size is undesirable.
Key Takeaways
- False alarms are the worst
- Understand your system before you “fix” it
- Say NO to moving average based alerts!
- Goal-driven design instead of issue-driven
- Computed Percent-Over-Threshold alerting works!
- Practice honest celebration of goal completion
Thanks for reading! We hope your understanding of alerting has increased and that you’ll try out computed percent-over-threshold for yourself.
If you have any questions, feedback, or further ideas on the subject, please leave us a comment!
This post is one example of how at Invoca we are never satisfied with the status quo, and always pushing the boundaries for delivering high quality and reliable service to our customers. We’re hiring.