The Mon-ifesto Part 3: Alert Response and Post-Mortem

A 3-Part Guide to Better Application Monitoring

Part 3 of a 3-part series on Monitoring. You can read Part 1 and Part 2 here.

How you respond to alerts is at least as important as how you receive them. It is the quick and proper reaction of engineers to issues that keeps uptime high and users happy. Every app will crash at some point because every app has bugs; your code quality scanner just missed them because it has bugs too.

For this section, instead of giving guidelines on how and when to do things, I am going to lay out a few ideas on how to respond to alerts and leave it up to you to decide what methods work best for your app and your organization.

There are two essential and closely related concepts I want to lay out here — the Formal Incident Response and the Post-Mortem.

1. Formal Incident Response (FIR)

Formal Incident Responses (FIRs) are typically reserved for high-priority alarms because they involve multiple people and are highly formalized. Going through this process for medium- and low-priority issues can be tiresome, and in some cases is simply not possible. What is a Formal Incident Response? Simply put, it’s a method of responding to alarms that gives each person participating in the response a clearly defined role within a structured hierarchy of command and control. This is extremely important: the more people you have trying to fix an issue, the harder it is to fix. This is a fact of life, not just operations.

Let’s take a look at the standard FIR roles:

Incident Manager

The Incident Manager is the person in charge of the FIR. This person does not have to be a manager in title, but for the purposes of the FIR this person is in charge and is the final authority for the entire FIR. Their job is to coordinate efforts for the FIR, ensure that work is not duplicated, bring in third parties where necessary, and sign off on closing the incident.

Incident Reporter

The Incident Reporter has one job in the FIR: record everything. They are responsible for recording all important information that is written, spoken, or discovered during the FIR. A tool like Confluence or even GitHub is invaluable for recording the events of the FIR and for performing a post-mortem, which we will discuss in the next section.

Bridge Runner

The Bridge Runner is responsible for the conference bridge (if there is one). They are the one who sets it up, communicates the bridge information to all necessary parties, and engages third parties to join the bridge if needed.

Primary Engineer

The Primary Engineer is the technical point of authority and coordination for the duration of the FIR and should be considered the “second in command” to the Incident Manager. Like the Incident Manager, this person’s job function or title is irrelevant for the purpose of the FIR. However, it is helpful to place a senior engineer in this role as they are responsible for investigating resolutions to the incident and delegating tasks to the standby engineers.

All technical work flows through the Primary Engineer — other members of the FIR are not allowed to perform technical tasks of any kind without the explicit permission of the Primary Engineer. This prevents a “too many cooks in the kitchen” situation where too many people are trying to resolve the incident at a time.

Standby Engineers

Standby Engineers are part of the FIR to assist the Primary Engineer in resolving the incident. Their work is given to them by the Primary Engineer and they should work on a per-task basis, meaning they perform the task given to them and no more. This ensures they don’t duplicate work or cause unnecessary delays in resolving the incident.
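
The delegation rule described for the Primary Engineer and Standby Engineers can be sketched in a few lines of Python. This is only an illustration, not a tool the series prescribes: the `TaskBoard` class, its method names, and the engineer names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TaskBoard:
    """Hypothetical tracker: only the Primary Engineer may hand out technical tasks."""
    primary_engineer: str
    assignments: Dict[str, str] = field(default_factory=dict)  # task -> engineer

    def assign(self, assigned_by: str, engineer: str, task: str) -> None:
        # All technical work flows through the Primary Engineer.
        if assigned_by != self.primary_engineer:
            raise PermissionError(f"{assigned_by} cannot assign technical tasks")
        # One owner per task, so work is not duplicated.
        if task in self.assignments:
            raise ValueError(f"{task!r} is already assigned to {self.assignments[task]}")
        self.assignments[task] = engineer


# Made-up names: "dana" is the Primary Engineer, "lee" a Standby Engineer.
board = TaskBoard(primary_engineer="dana")
board.assign("dana", "lee", "restart the cache tier")
```

Anyone other than the Primary Engineer who tries to assign work gets a `PermissionError`, and a second assignment of the same task is rejected, which mirrors the "perform the task given to you and no more" rule.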

Third Party Members

Third Party Members are outside teams or groups brought in to help resolve the incident. They are typically brought in when either the issue is outside the scope of the app engineering team or the issue is with a vendor-supplied product inside the app team’s scope. Third Party Members should coordinate through the Incident Manager.

It is important that during the FIR all parties are actively involved in the resolution effort. FIRs should be conversationally informal and a “no dumb questions” space. Sometimes the most improbable suggestions end up being right!

Now, I realize that filling all of these roles with different people may not be feasible for some teams. Some roles may be combined or eliminated as fits your team. What is important is to establish a clear chain of command and communication system between all parties in the Formal Incident Response.
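
For a small team, combining roles might look something like the sketch below. It assumes (my assumption, not a rule from this series) that the Incident Manager and Primary Engineer are the two roles you always fill, since they anchor the chain of command. The `Role` enum, `build_roster` function, and names are hypothetical.

```python
from enum import Enum


class Role(Enum):
    INCIDENT_MANAGER = "Incident Manager"
    INCIDENT_REPORTER = "Incident Reporter"
    BRIDGE_RUNNER = "Bridge Runner"
    PRIMARY_ENGINEER = "Primary Engineer"


# Assumption: these two roles anchor the chain of command and are always filled.
REQUIRED = {Role.INCIDENT_MANAGER, Role.PRIMARY_ENGINEER}


def build_roster(assignments):
    """Map each role to its holder; reject gaps and roles held by two people."""
    holders = {}
    for person, roles in assignments.items():
        for role in roles:
            if role in holders:
                raise ValueError(f"{role.value} is held by both {holders[role]} and {person}")
            holders[role] = person
    missing = REQUIRED - set(holders)
    if missing:
        names = ", ".join(sorted(r.value for r in missing))
        raise ValueError(f"unfilled required roles: {names}")
    return holders


# A three-person team: the manager also runs the bridge.
roster = build_roster({
    "ana": {Role.INCIDENT_MANAGER, Role.BRIDGE_RUNNER},
    "raj": {Role.PRIMARY_ENGINEER},
    "kim": {Role.INCIDENT_REPORTER},
})
```

One person can hold several roles, but no role has two holders, so everyone on the response knows exactly who to go to for what.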

2. Post-Mortem

The post-mortem takes place after a Formal Incident Response has been closed. It doesn’t need to happen immediately afterwards, but it should be held soon enough that the details of the FIR are still fresh. Scheduling the post-mortem for the next business day, or the day after, is a good starting point. But what exactly is a post-mortem?

Post-mortems are meetings, usually conducted only by the application team (no third parties), that review the events of the FIR and aim to accomplish a few goals:

  1. Identify the root cause of the incident.
  2. Generate and assign tasks to ensure the incident doesn’t happen again.
  3. Review the FIR process and discuss improvements.

It should be noted that nowhere in that list is “place blame”: accountability should not be doled out in post-mortems. If it is, they can easily devolve into a public humiliation forum, which does not encourage your engineers to show up and participate. If individuals need to be counseled for their role in an incident, it should be done in private by their supervisor.

Importantly, these post-mortems should be recorded and appended to the FIR record that was generated by the Incident Reporter. These reports form the basis for a very powerful and searchable knowledge base that your engineers and administrators can reference if a similar incident should arise in the future.
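
As a rough sketch of what such a record could look like, the Python below attaches a post-mortem to the FIR notes and supports a simple keyword search. In practice this would live in whatever tool you already use (a wiki, a repository, a ticketing system). The `FirRecord` and `PostMortem` classes, their fields, and the sample incident are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PostMortem:
    root_cause: str
    action_items: List[str]          # tasks to keep the incident from recurring
    process_improvements: List[str]  # changes to the FIR process itself


@dataclass
class FirRecord:
    title: str
    timeline: List[str] = field(default_factory=list)  # notes kept by the Incident Reporter
    post_mortem: Optional[PostMortem] = None

    def text(self) -> str:
        """Flatten the record into one lowercase string for searching."""
        parts = [self.title, *self.timeline]
        if self.post_mortem:
            pm = self.post_mortem
            parts += [pm.root_cause, *pm.action_items, *pm.process_improvements]
        return "\n".join(parts).lower()


def search(records: List[FirRecord], keyword: str) -> List[FirRecord]:
    """Return every record whose notes or post-mortem mention the keyword."""
    return [r for r in records if keyword.lower() in r.text()]


# Hypothetical incident used only to exercise the sketch.
record = FirRecord("Checkout API outage", ["14:02 alarm fired", "14:20 rolled back deploy"])
record.post_mortem = PostMortem(
    root_cause="Connection pool exhausted after a config change",
    action_items=["Add pool saturation alarm"],
    process_improvements=["Share bridge details in the alert itself"],
)
matches = search([record], "connection pool")
```

The three `PostMortem` fields map directly onto the three post-mortem goals listed earlier, and there is deliberately no field for who was at fault.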

Hopefully over this series I have provided you with a powerful and liberating framework for how to monitor, respond to, and assess your application and its infrastructure. While specifics will vary for different companies, teams, and apps, this three-part series can be considered a solid base for exploring how to improve your processes.

DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.