Pager duty sucks. Nothing good happens when I’m on duty, and this is my week on. Most of the time, we just get hit with false alarms, no-ops, and strange anomalies (#wtfrackspace). Of course, all of this makes it tough to take real alerts seriously. Alert fatigue is real, and the struggle continues.

When I think of “on-call rotations”, I can’t help but think of dated ’90s TV medical dramas like “Chicago Hope” or music videos on MTV. Being on call in 2016 feels like two steps backwards. I was at a DevOps networking event last year where one of the attendees told me, “the next job I take won’t have call rotations”. I liked the sound of that.

I especially detest splitting up on-call rotations between team members. Single-person call rotations are set up for failure; call it an anti-pattern. When only a small group of people has to deal with failures, it means that uptime and quality are not priorities for your company. The company is really saying, “please shield the rest of us from that noise, so that we can work on other tasks, which are way more important than that alert”. On the flip side, alerts should be triggered only for monumental events, not microservice hiccups. Every alert should come with links to detailed SOPs, and each alert should have a ticket created programmatically. If these steps sound over the top for an alert, then your alerts are probably “events” and are better sent via email or a prompt on a NOC dashboard. Alerts should be treated with respect.
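
Here is a rough sketch, in Python, of the rule I have in mind: a signal only becomes a page if it is critical and carries an SOP link, and every page gets a ticket first. Anything else is treated as an event and goes to the dashboard. The names here (`Signal`, `route`, `create_ticket`, `page`, `notify_dashboard`) and the severity values are all made up for illustration; none of this is a real paging or ticketing API, so treat it as a starting point rather than an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Signal:
    """A monitoring signal (hypothetical shape, for illustration only)."""
    name: str
    severity: str                   # assumed values: "critical" or anything else
    sop_url: Optional[str] = None   # link to the runbook / SOP, if one exists


def route(
    signal: Signal,
    create_ticket: Callable[[Signal], str],
    page: Callable[[Signal, str], None],
    notify_dashboard: Callable[[Signal], None],
) -> str:
    """Page only for critical signals that have an SOP; otherwise log an event."""
    if signal.severity == "critical" and signal.sop_url:
        ticket_id = create_ticket(signal)   # ticket is created before anyone is woken up
        page(signal, ticket_id)
        return "alert"
    # No SOP or not critical: this is an event, not an alert.
    notify_dashboard(signal)
    return "event"


if __name__ == "__main__":
    # Stand-in callables; a real setup would call your ticketing and paging tools.
    def fake_ticket(s: Signal) -> str:
        return "TICKET-1"

    def fake_page(s: Signal, ticket_id: str) -> None:
        print(f"PAGE: {s.name} ({ticket_id}) SOP: {s.sop_url}")

    def fake_dashboard(s: Signal) -> None:
        print(f"EVENT: {s.name}")

    route(Signal("primary db down", "critical", "https://wiki.example.com/sop/db"),
          fake_ticket, fake_page, fake_dashboard)
    route(Signal("cache miss spike", "warning"),
          fake_ticket, fake_page, fake_dashboard)
```

The point of passing the ticketing and paging functions in as arguments is that the routing rule stays testable on its own, and a critical signal without an SOP never reaches anyone's phone until somebody writes the SOP.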

To be truly effective, I think organizations should alert the entire operations and software development teams. If the software or hardware alerts, no one gets out alive. If one person gets paged at 3 AM, everyone gets paged. Yes, that’s right, go ahead and send those pager duty alerts to the CTO. I bet the number of those “critical” 3 AM alerts starts to go down once everyone’s phone starts going off. So I say, stop worrying about the on-call tier levels and instead clean up the code, build highly redundant services, and incorporate proactive failure testing in the environment. You reap what you sow.