How To Establish a High Severity Incident Management Program

Tammy Butow
Jan 19, 2018 · 1 min read

Over the past few years I’ve been involved in many production incidents. I’ve fixed them, I’ve caused them, I’ve watched them from afar, and I’ve been an Incident Manager.

Today I’m excited to share a paper I wrote that explains how you can establish a high severity incident management program with your team. I hope it helps bring you more reliable systems 💻 and more hours of sleep 😴.

I wrote this paper based on my own experiences, with input and feedback from my team at Gremlin. We have worked at a variety of companies, including Amazon, Netflix, Salesforce, Dropbox, DigitalOcean, National Australia Bank and Akamai.

I’d love to hear if you implement these incident management practices with your team. My DMs are open on Twitter: @tammybutow.

Want to chat about Incident Management, Chaos Engineering, SRE?
Join our Slack community:


Written by Tammy Butow

Principal Site Reliability Engineer @GremlinInc | Chaos Engineering ☁️ 💻 ⚡️💀 Previously @DigitalOcean @Dropbox @NAB @QUT