SaaS DevOps: Incident Response 101

Jaroslav Gergic
5 min read · Oct 21, 2022

A couple of years ago, on a cloudy winter afternoon, I was lying in bed with a fever, having a nap and trying to get better. Suddenly the phone rang and woke me up. I picked up the call and spoke to a technical support engineer who said there was an issue with our production system. He had filed a P1 ticket an hour earlier, but no one had responded yet. He also could not reach the engineering manager, so he decided to escalate the issue…

What went wrong?

I got up from bed, turned on my laptop and connected to the VPN. I found the relevant JIRA ticket and reached out over IM to the R&D team responsible for the data ingest service. I found out that they were already testing the fix in the staging environment and were about to deploy the hotfix to production within the next hour.

So, what went wrong? When they received the ticket from the technical support engineer, they simply swarmed around the problem and started working on resolving it immediately. What they did not do, though, was accept the ticket in JIRA and provide a status update to the reporter. Thus, he concluded that we were not aware of the reported issue and started escalating over the phone.

Communication!

This lesson clearly demonstrates that, when it comes to incident response, communication is as important as action. It is so easy for R&D teams to jump right into problem solving while forgetting about communication. Not surprisingly, it is often the quality of communication that determines how customers and stakeholders experience and perceive the success or failure of an incident response.

To be fair, it is not easy to communicate amid the adrenalin rush while trying to resolve a production emergency as fast as possible. It is also an issue of focus: one can’t focus 100% on troubleshooting and problem solving while switching context to communicate with internal and external stakeholders. Inevitably, the person in charge might feel that communication will slow down incident resolution and therefore consciously decide to avoid this distraction.

Therefore, the incident response play needs to be a carefully orchestrated team effort. Part of the team needs to focus 100% on resolving the incident while the other part of the team needs to communicate and provide updates.

Pair programming (Wikipedia, en.wikipedia.org)

Pair programming in agile software development bears many analogies to incident response.

The Magic Number

Obviously, the minimum number of people needed to form an incident response team is two. By analogy with the pair programming paradigm, one person would be the incident responder (driver), while the other would be the observer / reporter.

However, based on my experience, the magic number for handling small-scale production incident response in a professional manner is three:

  1. responder (a.k.a. driver / pilot)
  2. observer (a.k.a. navigator)
  3. reporter / coordinator

All three team members work in tandem as outlined in the following section.

Simplified schema of roles and communication channels during incident response.

Responder

The responder is the primary person who troubleshoots and eventually fixes the issue, either by producing a hotfix or by implementing a configuration change in the production system.

Observer

The observer, also known as the navigator in pair programming, observes the incident responder’s work, either physically, over the shoulder, or virtually via a screenshare and voice chat. The observer provides a second pair of eyes to help with troubleshooting but, even more importantly, to prevent fatal mistakes during a hectic, often stressful incident response that usually takes place outside business hours.

The observer also has a second role in incident response: capturing what’s happening during the response and relaying that mostly technical information, unfiltered, to an internal communication channel. That way anyone on the team can see what is happening, and more team members can join the effort and get up to speed quickly if needed.

Reporter / Coordinator

The reporter / coordinator (sometimes also called the manager on duty) is the third person involved in the incident response. One of the reporter’s roles is to take care of stakeholder communication: to stay on top of the unfiltered event stream produced by the observer in the internal channel and, based on the full picture of the situation, to compile and publish periodic status updates for stakeholders in the external communication channel.

The aim is to keep the company management team, other functional units (such as customer support), and eventually customers up to date. The reporter’s role is also to keep all the relevant tracking systems, such as JIRA or the support CRM, updated with the latest status to prevent duplicate filings and redundant status inquiries from customers and support.
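To make the split between the two channels concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any real tooling: the `Incident` class, its method names, and the ticket key `OPS-1` are all hypothetical. The observer appends raw notes to the internal log, while the reporter publishes short, human-written summaries to the external channel.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    """One unfiltered technical note written by the observer."""
    timestamp: datetime
    author: str
    text: str


@dataclass
class Incident:
    ticket_id: str  # hypothetical tracker key, e.g. a JIRA issue
    internal_log: list[Event] = field(default_factory=list)
    external_updates: list[str] = field(default_factory=list)

    def observe(self, author: str, text: str) -> None:
        """Observer: append a raw note to the internal channel."""
        self.internal_log.append(Event(datetime.now(timezone.utc), author, text))

    def publish_status(self, summary: str) -> str:
        """Reporter: record a stakeholder-facing update in the external channel.

        The summary itself is written by a human who has read the internal
        log; this method only numbers it and stores it.
        """
        number = len(self.external_updates) + 1
        seen = len(self.internal_log)
        update = (
            f"[{self.ticket_id}] update #{number} "
            f"(based on {seen} internal notes): {summary}"
        )
        self.external_updates.append(update)
        return update


if __name__ == "__main__":
    incident = Incident("OPS-1")
    incident.observe("observer", "ingest workers restarting in a loop")
    incident.observe("observer", "fix candidate deployed to staging")
    print(incident.publish_status("Issue acknowledged; a fix is being tested in staging."))
```

The point of the sketch is that the two lists never mix: technical detail flows into the internal log at whatever pace the observer can manage, and stakeholders only ever see the reporter’s condensed updates.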

The second role of the coordinator is to assess the business impact of the incident and drive the response (and communication) with a clear understanding of that impact in mind. In other words, while the incident responder and observer primarily consider technical aspects, the coordinator should represent the business aspects and help the responder(s) prioritize actions according to their business impact. Eventually, the coordinator might decide to involve additional team members as needed, or to escalate the incident and invoke a full-blown business continuity / disaster recovery (BC/DR) plan with rotating shifts and a follow-the-sun schema if the situation requires such measures.
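The coordinator’s decision can be pictured as a small triage rule. The sketch below is a toy illustration under loudly stated assumptions: the function `next_action`, its three inputs, and every threshold in it are made-up placeholders, not recommendations. A real team would substitute its own business-impact criteria.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue with the three-person team"
    ADD_RESPONDERS = "involve additional team members"
    INVOKE_BCDR = "escalate and invoke the BC/DR plan"


def next_action(customers_affected: int, hours_elapsed: float, fix_in_sight: bool) -> Action:
    """Toy decision rule; all thresholds are hypothetical placeholders."""
    # Broad, long-running impact with no fix in sight: full escalation.
    if customers_affected >= 100 and hours_elapsed >= 4 and not fix_in_sight:
        return Action.INVOKE_BCDR
    # No fix in sight after the first hour: bring in more people.
    if not fix_in_sight and hours_elapsed >= 1:
        return Action.ADD_RESPONDERS
    # Otherwise the small team keeps going.
    return Action.CONTINUE


if __name__ == "__main__":
    print(next_action(customers_affected=5, hours_elapsed=2, fix_in_sight=False).value)
```

Whatever the actual criteria are, the useful property is that they are expressed in business terms (who is affected, for how long) rather than in purely technical ones.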

Takeaways

Communication is as important as action during incident response. While communicating, it is important to separate detailed internal communication from externally facing status updates, which require a different cadence and level of detail. It is also important to balance the technical and business aspects during an incident response. My experience shows that, given the above, a three-person team is a reasonable minimum for handling a production incident in a professional manner.

Past Articles

Consider checking out past articles from my series on software development:

Why SaaS and Cloud Computing make IT fun again (jgergic-tech.blogspot.com): Why SaaS and cloud attract the best talent in the IT industry.

Agile on Overdrive (jgergic-tech.blogspot.com): When “agile” goes wrong and becomes a hindrance to the actual progress.

Version 2.0 Syndrome: Why the Software Architecture Matters (jgergic-tech.blogspot.com): About the fallacy of starting from scratch rather than evolving an existing product offering.

Jaroslav Gergic

Always busy building the next big thing, now living in the confluence of cybersecurity, machine learning, and cloud computing.