Managing major incidents from home — What I learned from Mission Impossible
Full disclosure: I am not Tom Cruise (shocker, I know 🙃). What I am is an incident manager who, not unlike Ethan Hunt (portrayed by the mighty Tom), has to deal with life’s uncertainties and surprises. All the more so in this day and age, when COVID-19 has forced us all to stay at home and adapt to a new reality. In this blog post, I’ll share the changes our IM (Incident Management / Impossible Mission) process underwent to support working from home.
Monitoring — tweaking the “false positives”
The number one tool of an Incident Manager is an extensive monitoring system.
As you’ve read in my previous blog posts, we’ve set up a wide array of monitoring tools. COVID-19’s effect on the world shifted our business metrics, and with that came a lot of “false positive” alerts on our various business (and application) monitors.
In normal times, “false positives” are a “once in a while” nuisance at best, something that can be handled as a routine task. Nowadays, when time is a precious commodity, they are a major hassle that needs to be dealt with as soon as possible. During the last month, we carefully reviewed all our monitors, their respective thresholds and the business needs behind them, and applied the relevant changes to each tool. This way, we don’t have to respond to a nonexistent incident at strange hours…
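To make that review a bit more concrete, here is a minimal sketch of how a stale threshold could be spotted and re-derived from recent data. It is an illustration only, not the process or tooling we actually used: the function names, the "mean minus three standard deviations" rule and the sample numbers are all assumptions made up for this example.

```python
from statistics import mean, stdev


def is_false_positive_prone(old_threshold, recent_values, max_breach_ratio=0.5):
    """Return True if the old lower-bound threshold fires on most recent samples.

    max_breach_ratio is a hypothetical cut-off chosen for this sketch.
    """
    breaches = sum(1 for value in recent_values if value < old_threshold)
    return breaches / len(recent_values) > max_breach_ratio


def recalibrated_threshold(recent_values, sigmas=3.0):
    """Derive a new lower alert bound from recent samples (mean - sigmas * stdev).

    This rule is an assumption for illustration; your monitoring tool may
    offer its own baselining feature that is a better fit.
    """
    if len(recent_values) < 2:
        raise ValueError("need at least two samples to estimate the spread")
    return mean(recent_values) - sigmas * stdev(recent_values)


# Hypothetical metric: orders per hour, with a threshold tuned before lockdown.
old_threshold = 100
recent_orders_per_hour = [60, 55, 65, 58, 62]

if is_false_positive_prone(old_threshold, recent_orders_per_hour):
    new_threshold = recalibrated_threshold(recent_orders_per_hour)
    print(f"Old threshold {old_threshold} is stale, suggested: {new_threshold:.1f}")
```

Whatever number comes out of a calculation like this should still be checked against the business need behind the monitor before it goes into any tool.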
I also mentioned that by using the out-of-the-box capabilities of most Application Performance Monitoring tools, you can get an early warning of incidents (and save your business a lot of money by doing so). Well, as Louis Pasteur said, “Chance favors the prepared mind” (or was it that bad guy from “Under Siege 2” 🤔… I get confused these days 🙂). So we prepared: we finished mapping all our external services and their respective alerts.
The last item in this section is your third-party providers. We use several vendors for many of our services (finance, cloud, etc.). If one of them has a major incident, it affects us immediately. We carefully reviewed all our current connections, reached out to all our vendors and synced on who is on call for each of them.
I don’t need to quote good old Louis again, right? So let’s move on 🤓
Transparency — Don’t keep secrets
These are uncertain times. We’re all at home, and some of us are not always aware of what’s going on in other teams or of production-affecting incidents. What was a habit in our day-to-day work is now a core value when working from home on a major incident: always share the data (be it a risky deployment you need to carry out as a developer, or maintenance work you need to carry out as a DevOps engineer).
Every incident gets a written root cause analysis (RCA). Every RCA is shared with everyone in the company, and all are welcome to share input and suggestions for improvement. When you’re together in everything, you win big, even when the mission seems impossible (see what I did there? 😁)
Communication when you can’t see anyone
In normal times, when we face a major incident, besides the usual notification channels we use to alert all stakeholders and relevant engineers, we always engage with them in person.
We all sit in the same open space and we make it a habit to work with the engineers who solve the problem, shoulder to shoulder. Today, that shoulder is a health risk, so we rely more and more on technology to help us in managing the crisis.
Let’s be honest: it’s not the same.
It can never be the same as the personal touch, the encouragement we take from our fellow engineers and the human interaction. So we need to be on our toes when alerting on an incident. Texting in Slack or Hangouts just won’t cut it anymore. Virtual “war rooms” should be set up within 10 minutes of the incident start time if the root cause is not yet clear, and actual phone calls should be made if management needs to know. It’s a whole new world out there and every second counts.
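For those who like their rules written down, the two escalation rules above (war room within 10 minutes when the root cause is unclear, a real phone call when management needs to know) could be sketched roughly like this. The `Incident` class, the `next_actions` function and the wording of the action strings are hypothetical names invented for this example, not part of any tool we run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The 10-minute window comes from the rule described in this post.
WAR_ROOM_DEADLINE = timedelta(minutes=10)


@dataclass
class Incident:
    started_at: datetime
    root_cause_known: bool
    management_must_know: bool


def next_actions(incident, now):
    """Return the communication steps this incident calls for right now."""
    actions = []
    if not incident.root_cause_known:
        deadline = incident.started_at + WAR_ROOM_DEADLINE
        if now <= deadline:
            actions.append(f"open virtual war room by {deadline:%H:%M}")
        else:
            actions.append("war room overdue: open it now")
    if incident.management_must_know:
        actions.append("phone management (do not rely on chat)")
    return actions


# Hypothetical incident: started at noon, three minutes in, cause still unknown.
incident = Incident(
    started_at=datetime(2020, 4, 1, 12, 0),
    root_cause_known=False,
    management_must_know=True,
)
for action in next_actions(incident, now=datetime(2020, 4, 1, 12, 3)):
    print(action)
```

The point of spelling it out is not automation for its own sake: when nobody can tap you on the shoulder, an explicit checklist keeps the first ten minutes from slipping away.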
So, what’s next? That’s a good question.
Well, we live to fight another day… take each day as the challenge that it is, strive to do our best in a new reality and beat the living cr*& out of any major incident that comes our way (pardon my language 😈).
This is a call to all my brothers and sisters who manage major incidents by day (and night) and manage their families in between: Hang in there, be brave, do what needs to be done and keep that smile on your face 😎 I have complete and utter faith in you all!
The day will come when we’re back at our NOC, safe and sound with all our monitors on the wall, so until then… this is your friendly incident manager encouraging you to stay calm and Don’t Panic… 🤡