Think of every high severity incident as a disaster in progress
Disclaimer: All opinions are my own
If you’re not familiar with the story of the Dutch boy with his finger in the dike, let me give you a much abridged version.
Much of the Netherlands is below sea level and susceptible to flooding. Man-made dikes around it prevent floods. On his way to school one day, a child sees a hole in the dike that’s starting to let water in. Knowing that the high pressure will quickly expand the hole and destroy the dike, he stems the flow of water early by blocking the hole with his finger and sending a friend for help.
Because he got there quickly enough, the adults are able to patch the hole and no real harm is done. The young lad saves lives and fortunes with his quick thinking.
Today I want to apply that same principle to your on call rotations and encourage you to think of every high severity incident as a disaster that’s in progress.
When you first get paged for a high severity incident it’s easy to mitigate the obvious impact, close it, and get back to your work. Depending on your organization, that might even be the reaction leadership wants you to have, because root-causing high severity incidents can be time consuming. But before you do, I’d encourage you to pause and put yourself in the mindset of someone who just stumbled across a boy with his finger in the dike.
You wouldn’t walk past that boy and quickly thank him, you wouldn’t pretend he wasn’t there and walk by on the other side of the street, you wouldn’t ask him to be prepared to stretch out and plug other holes that might appear, and you wouldn’t round up extra volunteers to stand ready to block other holes. You would patch the current hole and seek to understand what caused it.
The same concept applies to your high severity incidents when you’re on call for a technology team. Every high severity incident — whether it’s from an alarm you created or a customer complaining — is evidence of a disaster that’s happening. It might be quiet (like one stream of water), it might take a while before it impacts others (like a leak turning into a flood), it might even slow down for a while (like the tide receding and reducing the water flow), but every high severity incident is evidence of a disaster in progress.
And just like it isn’t smart to ignore the hole in a dike, or to send people running in every direction to stand by every hole they see and block it, it isn’t reasonable to move on from a high severity incident without at least being able to explain what happened and either why it doesn’t need to be fixed or why its impact will be limited.
This can be tough, especially if your tech team has more work than it has engineers (which is every team), but it’s important. If all of your engineers get tied up fighting individual problems, eventually you’ll run out of engineers. There’s also the cost of asking engineers to go above and beyond, or to accept the stressful on call rotations that result when only the minimum effort goes into preventing longer-term problems.
Here are a few questions to ask yourself to make sure the issue has been addressed:
- “How do I know this won’t happen again in 1 hour?”
- “What metrics can I watch to tell me if this is about to happen, or is happening right now?”
- “What will I do if that metric starts climbing?”
- “Can I tie this event back to the change that caused it?”
- “Do I have enough data to convince a very skeptical engineer I’ve fixed this?”
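To make the two metric questions above a little more concrete, here is a minimal sketch of what "watch a metric and decide what to do when it climbs" could look like. Everything in it is an assumption made for illustration: the `check_metric` function, the `MetricCheck` type, the threshold, and the crude "compare the newer half of the window with the older half" trend test are all hypothetical, and a real team would rely on its own monitoring system and alert rules instead.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MetricCheck:
    """Result of checking one metric against a hypothetical alert threshold."""
    breached: bool   # latest sample is at or above the threshold
    climbing: bool   # newer samples average higher than older samples
    action: str      # suggested next step for whoever is on call


def check_metric(samples: Sequence[float], threshold: float) -> MetricCheck:
    """Classify a window of samples (oldest first) for one metric.

    Hypothetical helper: the trend test is deliberately crude and only
    compares the mean of the newer half of the window with the older half.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to judge a trend")

    breached = samples[-1] >= threshold

    mid = len(samples) // 2
    older, newer = samples[:mid], samples[mid:]
    climbing = sum(newer) / len(newer) > sum(older) / len(older)

    if breached:
        action = "page on call: it is happening right now"
    elif climbing:
        action = "investigate: it may be about to happen"
    else:
        action = "no action: keep watching"
    return MetricCheck(breached, climbing, action)


if __name__ == "__main__":
    # Made-up error counts per minute, oldest first, with a made-up threshold.
    print(check_metric([10, 12, 15, 21], threshold=50).action)
    print(check_metric([10, 12, 15, 60], threshold=50).action)
```

The point is less the arithmetic than the habit: if you can write down the metric, the threshold, and the action ahead of time, you have an answer ready for "What will I do if that metric starts climbing?" before the next page arrives.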