Tenets of SRE
Stephen Thorne
1101

In order to reduce false alerts, have you tried using a strategy of triggering alerts based on percentile values of raw monitor values instead of raw values?

If so, how has it performed? What has been its shortcomings?

By setting triggers to alert on percentiles instead of raw data, triggers adapt to changes in performance curves. Subsequently, one expects fewer false alerts to trigger in the dynamic operating range.

I’m going to be testing this technique in a few weeks as part of suite of apps for small-time system admins. It takes a little work to get the statistics accurately calculated and normalized, because sampling rates are not synchronous, and binary math can be a little fuzzy. Combining statistical curves can be a bit tricky also. Thankfully, the math has been confirmed by a Russian mathematician that published work before computers were in use. This process has been used to re-work a process network analysis aka critical path methodology of project management into an eXtra Dynamic Critical Path Method to produce more accurate estimates. I’m excited to see what I’ll learn from applying this machine learning technique in this way.