Alerting is actually a very small part of the “On-call” story
With a sliver of light peeking through my squinted eye, I reach toward the source of the noise jolting me awake.
I never sleep well when I’m on-call anyway. The anxiety of being paged, or of sleeping through a page, keeps me from resting very well.
I knocked it off the nightstand… again.
Twenty seconds feels more like several minutes before I’m able to ACK the alarm. Panic does strange things to one’s senses. Especially the perception of time.
When an alarm is going off at three-fucking-thirty-five in the morning, each blaring “Aaaaant! Aaaaant! Aaaaaant!” not only swells in volume but synchronizes your heart rate with its… Perfect. Menacing. Tempo.
I’ve got a problem, and I have no clue what it is or how bad!
I guess I’m getting out of bed. In the middle of the night. To fix a problem someone else caused.
Thanks for waking me up! Is there anything you can tell me?
I thought there was supposed to be some sort of “Transmogrified” piece of context with that alert. That’s what I was told anyway.
Having even the slightest clue about what this alarm is about would be super helpful right now!
Okay, what the hell is going on? Where do I look?
It’s an alarm from Zabbix.
At least that’s what the VictorOps timeline is telling me moments after I open the app on my phone.
A bunch of nonsense words and phrases in most of it, but at least I have a starting point.
I see something in the message that says CRITICAL and I can only wonder to myself.
“Is it really critical? Because I’m concerned my version of critical may somehow be different than your version of critical.”
Alright, so I’ve looked through the payload Zabbix sent in the alarm to VictorOps, but I have no idea what I’m looking at.
My best guess is one of two things:
- Zabbix is having a hard time keeping a pulse on the database cluster or
- Our datacenter in San Jose just broke off from California and is currently floating in the Pacific Ocean.
What’s my login to Zabbix again?
This is the first time I’ve been “on-call” since joining this startup. I’m in charge of Technical Support NOT Operations! I’m an old-skool Developer (at best). There is absolutely no reason I should be responsible for being on-call!
I have no opinion on curl versus wget or Emacs versus Vim.
I like Atom! Sometimes Sublime if I’m feeling really hip.
What’s that? Everyone’s on-call for something? Uh. OK.
Fun Fact: When you work for a small startup, you wear a lot of hats. Every day. All at the same time. Get used to it.
I really hope LastPass cooperates with me right now. I have no idea what my Zabbix credentials are.
I barely remember the steps to triage once I’m logged in to Zabbix! I’m pretty sure step one is: “look at the logs”.
WHAT F’ing LOGS?
There’s probably a runbook for all of this somewhere, but I have no clue where to find it and no time to dig around.
Usually I see a link to the correct runbook surfaced immediately with the alert in the timeline. Not this time. WTF?
You know what? I’m just going to text my co-worker. Chances are she’s going to be the one that has to fix this shit anyway.
I can either spend the next 10 minutes trying to look for something that I may not even be able to solve or I can tuck my tail between my legs and reach out for help immediately!
I’ll send her a text and while I’m waiting for a response I’ll poke around at some things I’m comfortable touching.
At least I can say with confidence “I looked around but didn't find anything.”
She responds to the text in under a minute!
If I didn’t recognize that I’m already an HR nightmare, I’d probably make an inappropriate comment about my gratitude right now.
A comforting feeling of “it’s gonna be OK” washed over me as we moved the conversation into our chat client.
I understand now why everyone is losing their shit over this ChatOps stuff.
I saw all of her actions unfold as though we were sitting next to each other. They showed up both in my preferred chat client (Slack) and in the VictorOps timeline.
She asked our chatbot to ping the database host, pulled a graph of the CPU load, and queried a log file all in the span of about 45 seconds.
And it all took place in chat. All of her actions were right there in line with our conversation!
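The mechanics behind those chat commands are simpler than they look. Here is a minimal, hypothetical sketch of how a ChatOps bot can map a chat message to a diagnostic command. The command names, the `!` prefix, and the host names are all made up for illustration; this is not the actual bot from the story, and a real bot would execute the returned command and post its output back into the channel for everyone to see.

```python
# Minimal sketch of ChatOps command dispatch (all names hypothetical).
# Map chat verbs to the argv of the diagnostic command they represent.
COMMANDS = {
    "ping": lambda host: ["ping", "-c", "1", host],
    "tail-log": lambda path: ["tail", "-n", "20", path],
}

def handle(message):
    """Parse a chat message like '!ping db01' into a command to run.

    Returns None for ordinary chat messages so the bot ignores them.
    """
    if not message.startswith("!"):
        return None  # not addressed to the bot
    verb, _, arg = message[1:].partition(" ")
    builder = COMMANDS.get(verb)
    return builder(arg) if builder else None

print(handle("!ping db01"))  # → ['ping', '-c', '1', 'db01']
```

The point of routing these through chat, rather than a private terminal, is exactly what the story shows: every command and its output lands in the shared conversation, visible to anyone watching.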
She asked if I found anything before she hopped on.
Embarrassed, I responded with “Not a whole lot. I was pretty useless. And my brain doesn’t seem to want to function at 3:45 a.m.”
In under 4 minutes she had diagnosed and resolved the problem. San Jose hadn’t gone anywhere, but there was something putting a serious load on our database.
I know, because I watched it all happen and saw the timestamps in chat for everything. We have a self-documented snapshot of EXACTLY what took place during the entire incident.
Not only that, but if something like this comes up again, I know the EXACT steps to perform to take care of it myself.
I learned more in 4 minutes than I had in the 4 weeks I’ve been with the company.
“We’ll have to keep an eye on this.”
“It’s not something we’ve seen before!”
That makes sense. That’s why there isn’t a runbook AND why there weren’t any awesome pieces of context included in the alert I acknowledged with one eye open.
2014 State of On-Call
According to a survey of over 500 “on-call” and DevOps professionals, only 5% of the Incident Lifecycle is the “Alarm Phase”.
Once you’ve acknowledged a page, it’s time to get busy saving your infrastructure from a total meltdown.
The next three steps (Triage, Investigation, and Identification) account for 73% of the entire incident lifecycle.
Most of that remaining time is spent collaborating with others, specifically in chat.
Not only were we able to solve the problem by collaborating over the issue via chat, but so much more.
We effectively fixed the problem, communicated what took place, taught others how to do it, and created most of the postmortem and remediation steps. And we did it all faster than it used to take us to do just one of those steps (in many cases).
Once you’ve got a handful of things you can do from within chat, you can’t imagine doing it any other way! Especially with stuff related to incident management.
Suck it Pager!! Challenge accepted! I can handle anything you and your Chaos Monkey cousins throw at me! Bring it!
The next morning
The postmortem is dunzo!
Pretty much our entire conversation took place in our Slack channel which was synchronized with the VictorOps timeline.
Run a “Postmortem Report” to show me everything that took place between three-fucking-thirty-five a.m. and four-go-back-to-bed a.m. … and the postmortem is basically done.
Every ChatOps command she ran and every question I had about the information. It was all part of the chat conversation… and subsequently built right into the postmortem.
… it was all there.
every. single. thing.
Even when our CTO leaned in to the timeline to see what was going on after he received a separate alert.
Not only was he able to IMMEDIATELY get caught up on what had taken place over the last few minutes, but he passed on some important information to us as well.
We’d only speculated that the new code the Devs had pushed out to “Production” had something to do with the problem we were seeing.
Now it was confirmed!
That’s it! I’m having a talk with the team about the Devs carrying the pager.
It’s freaking 2015.
We told them it was coming during the ITIL Crusades.
We warned them it was part of the Agile Software Development manifesto.
And now we’re 5 years into the DevOps chapter of software delivery history.
A Blameless culture
I felt like a total n00b that on-call shift.
The following morning during our daily scrum I tried to accept all of the blame and burden for not knowing exactly what to do under the circumstances and where to look for answers.
To my great surprise, this tiny little startup was a firm believer in “Blameless postmortems,” and what I thought would be an opportunity for shaming was transformed into an opportunity for learning. A “Learning Review.”
A coordinated effort to understand, in as much detail as possible, exactly how events unfolded.
By removing blame, we skipped right to a greater understanding of the facts and the specifics of what took place and in what order. Nobody withheld important information.
Blame wasn't removed to help ease tension with co-workers. (cough “the Devs”)
It was removed so that we could obtain an accurate account of exactly what took place.
So that we could find ALL of the circumstances that contributed to the situation. There’s no root cause to this particular incident any more than there is a root cause for yesterday morning’s success when our latest feature was released to the public with great fanfare.
There were many factors that played a role in that success.
Likewise, there are many factors that play a role in problems, incidents, and failures as well.
A runbook and Transmogrifier rule have been created for this type of problem now.
So, in the future, if something like this happens again, the second I acknowledge the problem from my phone, all of the answers on what to do next will be immediately provided with the alert as well.
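An annotation rule like that can be sketched in a few lines. This is a hypothetical illustration of the idea, not VictorOps’ actual Transmogrifier API: the rule pattern, the alert fields, and the runbook URL are all invented. The shape of the idea is simply “if the incoming alert matches a pattern, attach the relevant context before it reaches the on-call engineer.”

```python
# Hypothetical sketch of an alert-annotation rule (names and URL invented).
import re

RULES = [
    # (pattern matched against the alert message, annotation to attach)
    (re.compile(r"db.*load", re.IGNORECASE),
     {"runbook": "https://wiki.example.com/runbooks/db-high-load"}),
]

def annotate(alert):
    """Return a copy of the alert with any matching annotations attached."""
    annotated = dict(alert)
    for pattern, annotation in RULES:
        if pattern.search(alert.get("message", "")):
            annotated.setdefault("annotations", {}).update(annotation)
    return annotated

alert = {"message": "CRITICAL: db cluster under heavy load"}
print(annotate(alert)["annotations"]["runbook"])
```

The payoff is the one described above: the next engineer who acknowledges this class of alert at 3:35 a.m. gets the runbook link delivered alongside the page instead of hunting for it half-asleep.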
There’s more to the story of being “on-call” beyond that initial alert.
I want more than simply a messenger of bad news!
VictorOps seems to understand this pretty well. Their service can’t solve our problems for us, but they can make the process of incident management a whole lot better.
They don’t just page you and “peace out”, leaving you on your own to figure out what’s going on and what to do about it.
They stick with you through the entire incident lifecycle. From the alert … all the way to documentation.
If you’re ever on-call, you might want to check it out.