Managing major incidents from home — What I learned from Mission Impossible

Lior Avni
Apr 10, 2020 · 5 min read

Full disclosure: I am not Tom Cruise (shocker, I know 🙃). What I am is an incident manager who, not unlike Ethan Hunt (portrayed by the mighty Tom), also has to deal with life’s uncertainties and surprises. All the more so in this day and age, when COVID-19 has forced us all to stay at home and adjust to a new reality. In this blog post, I’ll share the changes our IM (Incident Management / Impossible Mission) process underwent to support working from home.

Monitoring — tweaking the “false positives”

The number one tool of an Incident Manager is an extensive monitoring system.

As you’ve read in my previous blog posts, we’ve set up a wide array of monitoring tools. COVID-19’s effect on the world changed our business metrics, and that change triggered a lot of “false positive” alerts in the various business (and application) monitors.

In normal times, “false positives” are a once-in-a-while nuisance at most, something that can be handled as a routine task. Nowadays, when time is a precious commodity, they are a major hassle that needs to be dealt with as soon as possible. Over the last month, we carefully reviewed all our monitors, their thresholds and the business needs behind them, and applied the relevant changes in each tool. This way, we don’t have to respond to a nonexistent incident at odd hours…
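To make the threshold review a bit more concrete, here is a minimal sketch in Python of why an old static threshold starts firing once the business baseline shifts. Everything in it is an assumption for illustration: the hourly order counts are made up, the function names are hypothetical, and the "mean minus three standard deviations" rule is just one simple way to derive a lower bound, not a description of how any particular monitoring tool works.

```python
from statistics import mean, stdev


def recalibrated_threshold(recent_values, sigmas=3.0):
    """Derive a lower alert bound from a window of recent metric values.

    Hypothetical rule: mean minus `sigmas` sample standard deviations,
    floored at zero because the metric is a count.
    """
    if len(recent_values) < 2:
        raise ValueError("need at least two data points")
    return max(0.0, mean(recent_values) - sigmas * stdev(recent_values))


def should_alert(current_value, threshold):
    """Alert when the metric drops below the lower bound."""
    return current_value < threshold


# Made-up hourly order counts, before and during the lockdown.
pre_covid = [1000, 980, 1020, 1010, 990]
lockdown = [400, 420, 380, 410, 390]

old_threshold = recalibrated_threshold(pre_covid)
new_threshold = recalibrated_threshold(lockdown)

# A perfectly normal lockdown hour trips the old threshold (a false positive)
# but not the recalibrated one.
print(should_alert(405, old_threshold))
print(should_alert(405, new_threshold))
# A real drop still alerts under the recalibrated threshold.
print(should_alert(300, new_threshold))
```

The point is not the specific formula but the habit: when the baseline moves, recompute the threshold from recent data instead of keeping a number that described a different world.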

I also mentioned that by utilizing the out-of-the-box capabilities of most Application Performance Monitoring tools, you can get an early warning of incidents (and save your business a lot of money by doing so). Well, as Louis Pasteur said, “Chance favors the prepared mind” (or was it that bad guy from “Under Siege 2” 🤔… I get confused these days 🙂), so we prepared and completed mapping all our external services and their respective alerts.

The last item in this section is your third-party providers. We use several vendors for many of our services (finance, cloud, etc.). If one of them has a major incident, it affects us immediately. We carefully reviewed all our current connections, reached out to all our vendors and synced on who is on call at each of them.

I don’t need to quote good old Louis again, right? So let’s move on 🤓

Transparency — Don’t keep secrets

These are uncertain times. We’re all at home, and some of us are not always aware of what’s going on in other teams or of production-affecting incidents. What became a habit in our day-to-day work is now a core value when working from home on a major incident: always share the data (be it a risky deployment you need to carry out as a developer, or maintenance work you need to carry out as a DevOps engineer).

Every incident gets a written root cause analysis (RCA), every RCA is shared with everyone in the company, and everyone is welcome to share input and suggestions for improvement. When you’re together in everything, you win big, even when the mission seems impossible (see what I did there? 😁)

Communication when you can’t see anyone

In normal times, when we faced a major incident, besides using the usual notification channels to alert all stakeholders and relevant engineers, we always engaged with them in person.

We all sit in the same open space and we make it a habit to work with the engineers who solve the problem, shoulder to shoulder. Today, that shoulder is a health risk, so we rely more and more on technology to help us in managing the crisis.

Let’s be honest — It’s not the same.

It can never be the same as the personal touch, the encouragement we get from our fellow engineers and the human interaction. So we need to be on our toes when alerting on an incident. Texting in Slack or Hangouts just won’t cut it anymore. A virtual “war room” should be set up within 10 minutes of the incident start time if the root cause is not yet clear, and actual phone calls should be made if management needs to know. It’s a whole new world out there and every second counts.
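If you like your rules written down, the escalation rule above fits in a few lines of Python. This is only a sketch of the rule as stated in this post; the function name, the wording of the steps and the example timestamps are hypothetical, and a real setup would hook into whatever paging and conferencing tools you use.

```python
from datetime import datetime, timedelta

# From the rule above: war room within 10 minutes if the root cause is unclear.
WAR_ROOM_DEADLINE = timedelta(minutes=10)


def escalation_steps(started_at, now, root_cause_known, management_needs_to_know):
    """Return the escalation steps that apply to an ongoing incident."""
    steps = []
    if not root_cause_known:
        deadline = started_at + WAR_ROOM_DEADLINE
        status = "OVERDUE" if now > deadline else "due"
        steps.append(f"open virtual war room ({status} {deadline:%H:%M})")
    if management_needs_to_know:
        # A text message is not enough here: pick up the phone.
        steps.append("phone management")
    return steps


# Hypothetical incident that started at 03:00, checked four minutes in.
started = datetime(2020, 4, 10, 3, 0)
print(escalation_steps(started, started + timedelta(minutes=4), False, True))
```

Writing the rule down like this also makes it easy to spot when you are already past the deadline, which is exactly when nobody has time to do the math.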

That’s a good question: what is next?

Well, we live to fight another day… take each day as the challenge that it is, strive to do our best in a new reality and beat the living cr*& out of any major incident that comes our way (pardon my language 😈).

This is a call to all my brothers and sisters who manage major incidents by day (and night) and manage their families in between: Hang in there, be brave, do what needs to be done and keep that smile on your face 😎 I have complete and utter faith in you all!

The day will come when we’re back at our NOC, safe and sound with all our monitors on the wall, so until then… this is your friendly incident manager encouraging you to stay calm and Don’t Panic… 🤡

Gett Engineering
