Over the past few years I’ve been involved in many production incidents. I’ve fixed them, I’ve caused them, I’ve watched them from afar, and I’ve been an Incident Manager.
Today I’m excited to share a paper I wrote that explains how you can establish a high severity incident management program with your team. I hope it brings you more reliable systems 💻 and more hours of sleep 😴.
How To Establish a High Severity Incident Management Program
High severity incident management is the practice of recording, triaging, tracking, and assigning business value to…
I wrote this paper based on my own experience, with input and feedback from my team at Gremlin. Between us, we have worked at a variety of companies, including Amazon, Netflix, Salesforce, Dropbox, DigitalOcean, National Australia Bank, and Akamai.
I’d love to hear if you implement these incident management practices with your team. My DMs are open on Twitter: @tammybutow.
Want to chat about Incident Management, Chaos Engineering, or SRE?
Join our Slack community: gremlin.com/community