Pad the Denominator

Ken Shih
Published in Making Meetup
4 min read · May 27, 2021


Photo by Michal Matlon on Unsplash

Quick Tip: Pad the Denominator to avoid noisy error rate alarms

A little DevOps ditty… Ok, this isn’t an amazing insight, but no one’s ever mentioned this to me before, so I thought I’d share. Let me know what you think.

Scenario

You’ve rolled out a new service and you alert on error rate. Good on you!

It’s a new service, so traffic is sometimes low, and you find that a small number of recoverable errors, sometimes even a single error, is enough to trigger the alarm!

Everyone tells you that it’s a “noisy alert” and that makes you sad. What do you do?

Solution

Pad the Denominator to avoid noisy error rate alarms

That is, error rate is normally calculated like this:

error-rate = 100% * ( error-count / message-count )

and alert if error-rate > X% for Y data points. Hint: avoid alerting on a single data point!

Instead, do this:

error-rate = 100% * ( error-count / ( message-count + FUDGE-FACTOR))

That way, when traffic is low, your error rate is insensitive to typical recoverable errors, like slow startup time or transient network failures that your application’s retry logic already handles. Yet it will still alarm if your service is erroring significantly!
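For instance, with a FUDGE-FACTOR of 50 (a value you would tune to your own traffic levels), the same raw numbers produce very different rates at low and high volume:

1 error out of 2 requests: unpadded rate is 50%, padded rate is 100% * ( 1 / (2 + 50) ) ≈ 2%
600 errors out of 1000 requests: unpadded rate is 60%, padded rate is 100% * ( 600 / (1000 + 50) ) ≈ 57%

The low-traffic blip no longer looks like a disaster, while a real, sustained error rate still sails past any reasonable threshold.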

At the beginning of the life of a service:

  1. Traffic might be tiny. You may be doing a gradual rollout, alpha testing, or A/B testing in production, and since we know the network is not reliable, a few errors are expected, but each one is a big slice of a small request count.
  2. Traffic might be discontinuous. Projects don’t live in a vacuum: usage depends on strategic priorities, feature rollouts, and delays in other work streams, so your system might go through starts and stops.
  3. Your implementation is clunkier at the start. You should always start simple (last responsible moment and all), but that means things are clunky. For a new AWS Lambda, for example, you may not have handled cold starts well yet, you haven’t had time to set up a blue-green deployment, your memory settings could use tweaking, you don’t know about provisioned concurrency, and you might be new to the team, technology, or domain, so unexpected kinks are everywhere. Yet everything still works well for your users!

Hardening a system takes time, and you don’t want to wake yourself or a teammate at 3am over trivialities, causing alarm fatigue and loss of goodwill. But you still want to be paged when there’s a real problem!

What is nice about this little trick of padding your denominator is that your alert stays resilient across many conditions:

  1. Beginning of service life when traffic is low and spotty
  2. Middle of service life with regular & high traffic
  3. After incidents and in discontinuous recovery periods
  4. End of service life when traffic is decreasing

Padding the denominator lets your alarm, like a jazz musician, play through the changes.

Help keep burnout low on your team and for yourself.

Pad the Denominator!

Real Example: Testing Multi-Regional Support

This was an alarm that I noticed had been going off one or more times a day for weeks. Because it auto-resolved almost immediately every time, no one bothered to clean it up.

The problem was that we had introduced an experiment that added support for a second AWS region, us-west-2, alongside the region the service was originally deployed in, us-east-1.

The implementor did a great job with the service and had done the right thing by adding alarms. But because it was an experiment, us-west-2 got far less traffic than the original region, yet it used an almost identical Terraform template, including CloudWatch alarm definitions like:

metric_query {
  id         = "errorRate"
  expression = "((errors * 100) / requestCount)"
  label      = "4XX error rate"
}

Reason for the alarm: while some HTTP 4XXs are expected in cases of user error, an unusual number of them would be an aberrant case worth investigating. We were alerting when errorRate rose above 50%. Pretty conservative.
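For context, here is a minimal sketch of how a metric_query like the one above typically sits inside a full aws_cloudwatch_metric_alarm resource. The resource name, metric names, namespace, and evaluation settings below are illustrative assumptions, not our actual configuration:

resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "my-service-4xx-error-rate" # illustrative name
  comparison_operator = "GreaterThanThreshold"
  threshold           = 50  # alarm when errorRate > 50%
  evaluation_periods  = 3   # more than one data point, per the hint above

  # The math expression that produces the error rate
  metric_query {
    id          = "errorRate"
    expression  = "((errors * 100) / requestCount)"
    label       = "4XX error rate"
    return_data = true
  }

  # The raw error count (metric name and namespace are assumptions)
  metric_query {
    id = "errors"
    metric {
      metric_name = "4XXError"
      namespace   = "AWS/ApiGateway"
      period      = 60
      stat        = "Sum"
    }
  }

  # The raw request count (also assumed)
  metric_query {
    id = "requestCount"
    metric {
      metric_name = "Count"
      namespace   = "AWS/ApiGateway"
      period      = 60
      stat        = "Sum"
    }
  }
}

With that structure in place, the fix described below is a one-line change to the expression.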

us-east-1 was getting moderate but consistent traffic, averaging a few thousand requests per minute (rpm) and ranging from ~500 rpm to ~3000 rpm.

us-west-2 was getting about 50 rpm on average, but varied from ~0 rpm to ~100 rpm.

This meant there were periods when traffic was so low on us-west-2 that just a small handful of 4XXs would put us in alarm state!

To correct the problem, we padded the denominator…

expression = "((errors * 100) / (requestCount + 50))"

The goal was to ignore the error percentage when the request count was near 0, while staying responsive to consistently high error rates, especially once the service started taking normal, continuous traffic.
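To make the effect concrete (with illustrative request counts, not our actual traffic): 3 4XXs during a lull of 5 requests in a data point is a 60% unpadded rate, an instant alarm, but only 3 / (5 + 50) ≈ 5% padded, so it stays quiet. Meanwhile, 300 4XXs out of 500 requests is still 300 / (500 + 50) ≈ 55% padded, comfortably over the 50% threshold, so a real problem at normal traffic still pages someone.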

A little change like this can preserve countless hours of uninterrupted sleep and prevent loss of goodwill, jadedness, and alert fatigue. Making such habits normal practice for every on-call engineer can help combat burnout and PTSD, and keep engineers fresh for when they are needed in an actual emergency.

I know folks who do graphics programming have these sorts of fudges to, say, make an edge of a triangle look right, even if it looks mathematically incorrect in code. I guess we do the same things for DevOpsX as we do for UX!

Software engineering is a human endeavor. Taking care of these details, acknowledging the organic nature of where computer systems meet human systems, we endeavor to engineer humanely.
