RigD.io Collaborative Automation for Incident Management
What’s the status on that Incident?
SlackOps for PagerDuty Part 4
No matter how effective you are at resolving incidents, poor incident update practices lead to headaches all around. If you don’t provide an update, you can be sure someone will interrupt you to ask “what’s the status on this incident?” When you have customer-impacting issues, not providing timely updates can significantly hurt customer satisfaction or lead to the outright loss of customers. A less obvious consequence is duplicated or wasted effort. If others don’t know what’s going on, they are likely to expend energy chasing down the same data or the wrong leads. This is especially true when several possibly related incidents are under way at once, or when a complex incident involves multiple teams.
There is really no good reason not to make regular incident updates, yet it happens all the time. It’s easy to forget to make an update when you are working the incident. Even if you are fortunate enough to have an organizational structure that allows for a comms owner on every incident, there are still often cases where updates are made internally but forgotten externally, or vice versa.
So let’s see how to speed up those updates and help ensure one never gets missed.
Step 1 Make your updates quickly right from Slack.
With PagerDuty you can post a note to an incident or you can provide a status update. There are benefits to both and both can be done with RigD in Slack. Start by typing
add pagerduty note
Then provide the incident number and your note.
Similarly for a PagerDuty incident status update start with
update pagerduty status
And again add the incident number and your status update.
One additional manual convenience we provide is our incident activities menu, which appears with every PagerDuty incident feed notification and whenever you get incident details in Slack.
Step 2 Use RigD Automation to Remind You to Make Incident Updates
Making those updates ad hoc in Slack will definitely add a measure of convenience, but it won’t combat forgetfulness. To do that we need to set up automated update reminders. We will again use a RigD flow, one that both makes an incident status update and adds a note, thus ensuring no one misses the latest update. We have another helpful guide to speed up the setup. Start from the PagerDuty help by typing
help with pagerduty
Then choose the Automate Incident Updates button
Your first update should always come after a set amount of time; it’s the one update that most often gets missed or delayed while you try to validate the problem and assess the impact. Choose a time for that first update; we recommend no more than 10 minutes.
Next you need to decide how to handle subsequent updates. Given that most major incidents last for hours, you will be making many follow-on updates, so you want to strike a balance in timing. You can also skip this input and choose the interval manually after each update. This can be helpful in managing the balance between over- and under-communication, but don’t forget to set it each time!
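To make the timing trade-off concrete, here is a minimal sketch in Python of how a first-update time and a fixed follow-on interval turn into a list of update times. The `update_schedule` function and its parameter names are hypothetical, made up for illustration; this is not how RigD or PagerDuty implement reminders.

```python
def update_schedule(duration_min, first_update_min, interval_min):
    """Return the minutes (from incident start) at which updates fall due.

    Hypothetical helper for illustration only; not part of RigD or PagerDuty.
    A final resolution update at `duration_min` is always included.
    """
    if first_update_min <= 0 or interval_min <= 0:
        raise ValueError("first_update_min and interval_min must be positive")
    times = []
    t = first_update_min
    # Collect every update that falls due before the incident is resolved.
    while t < duration_min:
        times.append(t)
        t += interval_min
    times.append(duration_min)  # the resolution update
    return times


# A five-hour incident, first update at 10 minutes, then every 30 minutes.
schedule = update_schedule(300, 10, 30)
print(len(schedule), schedule)
```

Trying a few intervals this way shows quickly how many updates you are signing up for over a long incident.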
Finally, choose some text for the RigD alias trigger to make it easy to initiate the update automation during an incident.
You now have everything you need to never again forget to make an incident update. Let’s see how it works in practice by typing our alias text
This update sequence will kick off in a Slack thread. Why do we use threads? Using a thread for this allows you to keep it in the forefront in Slack while you engage in discussion and coordination in your primary incident channel space. This helps reduce the potential to miss making an update and also prevents your update activity from distracting others in the main channel discussion.
Automated reminders reliably drive those incident updates, and making them right in Slack keeps them fast and simple.
Now when it comes to making customer-facing updates, we love Atlassian StatusPage and so do a lot of our customers. So we often see this flow modified to include both an internal update to the PagerDuty incident and an external update to a corresponding StatusPage incident. This is easy to do with RigD, and we are always available to help.
As with our previous parts, let’s take a look at the time savings and financial impact of this Slack-based approach. Assuming a relatively simple and well-understood update, posting it manually in the PagerDuty UI takes about 26 seconds. The average duration for a major incident is 300 minutes; let’s assume we make an update at 5 minutes, then every 30 minutes, and a final resolution update. That’s a total of 11 updates. If we make both a status update and a note each time for completeness, that is 22 postings and a total update time of 9 minutes 32 seconds. Using an automated RigD update flow, you are looking at 3 seconds to start the flow and about 6 seconds per update, for a total of just 69 seconds.
Using our benchmark of 7 major incidents a month and a cost of $5,600 per minute, RigD reduces the per-incident costs related to updates by an amazing $46,949. Incorporating this into our running total, the monthly major incident costs related to the activities discussed come to $428,585 without RigD vs just $56,186 with it.
You might be thinking that’s crazy; do companies really lose that much money? Consider that Amazon lost an estimated $90m in about 75 minutes according to this TechCrunch article. That’s $1.2 million per minute. It makes losing 428 thousand dollars in a month seem minor. Sure, none of us are Amazon’s size, but every dollar and minute lost matters regardless of company size.
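The update-time arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: every input is a figure quoted in this post, and the variable names are our own.

```python
# Inputs are the figures quoted in the post; variable names are illustrative.
UPDATES = 11                  # 5-minute update, every 30 minutes after, plus resolution
MANUAL_SECONDS = 26           # one posting in the PagerDuty UI
RIGD_START_SECONDS = 3        # starting the RigD update flow
RIGD_SECONDS_PER_UPDATE = 6   # each update made through the flow
COST_PER_MINUTE = 5600        # benchmark cost of a major incident, in dollars

# Manual path: a status update and a note for every update.
manual_total = UPDATES * 2 * MANUAL_SECONDS
# RigD path: one flow start, then one quick update each time.
rigd_total = RIGD_START_SECONDS + UPDATES * RIGD_SECONDS_PER_UPDATE

saved_seconds = manual_total - rigd_total
saved_dollars = saved_seconds / 60 * COST_PER_MINUTE

print(divmod(manual_total, 60))   # (minutes, seconds) spent on manual updates
print(rigd_total)                 # seconds spent with the RigD flow
print(round(saved_dollars))       # dollars saved per incident
```

Computed this way, the manual total is 9 minutes 32 seconds, the RigD total is 69 seconds, and the per-incident saving comes to roughly $46,947. That is within a few dollars of the figure quoted above; the small gap is presumably down to rounding in intermediate steps.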
The final part in the series will be out in no time, so be on the lookout!