Incident Management: Taming the Unpredictable Beast in a Predictable Manner

Christoforus Waspodo Bhakti Widodo
Kargo Technologies
Jul 15, 2022

Personal Story

It was the beginning of 2022. I had just joined Kargo Tech as an Associate Staff Engineer to oversee the overall engineering team, which was itself relatively new. I was still assessing the technology in use and getting acquainted with all the different squads.

One fine day, the Kargo engineering team got notified of an incident by the operations team, which came directly from our client: it turned out one of our services was running slow. On-call engineers from product engineering sprang into action, and I shadowed the process.

Back then, there was no standard yet on how to handle issues. Everyone was doing their own thing to find the problem. As the newest guy in the bunch, I did not know where to look during this incident; all the observability tools Kargo used were still off my radar. Hence, I asked the on-call engineers which parts needed further checking and worked with them. Much to my surprise, not all of them knew where to look either. The problem had to be escalated to the core team and the infra team, the ones with the expertise on this issue.

After hours of debugging and fixing, the problem was at last resolved. The interesting part I noticed was that the fix itself took under an hour; finding the root cause, however, took an eternity because of all the back-and-forth.

Realization


“Anything that can go wrong will go wrong, and at the worst possible time.” — Murphy’s Law

For a software engineer, incidents always happen unexpectedly. Anything can happen in the real world: a network-related issue, a software bug that surfaces only once in a blue moon, or hardware that crashes because a cockroach (a real bug) somehow slipped into some data center's electrical wiring.

Two things can be noticed from the problem mentioned above:

  • Core & Infra team has the expertise but does not know the business context
  • On-call engineers from the product engineering team have domain knowledge, but not expertise.

Finding problems in a software system should not be a specialized skill that only some engineers possess. This skill should be democratized. At a minimum, there should be a baseline skill set that every engineer learns so they can identify the problem by themselves in the first minutes or hours.

Every new engineer will get hit by this problem, and incident handling often cannot wait for new engineers to gain experience. It was therefore a necessity to expedite this process with proper guidelines.

Measuring Success

First, we defined the metrics:

  • Time to Acknowledge: the duration from when the incident first occurs until it is acknowledged by an engineer for further analysis
  • Time to Recover: the duration from when the incident first occurs until it is resolved by an engineer.

The less time for both, the better; still, we set a target time for each of these metrics.

Then how do you measure this? Given our limited capacity, not everything can be automated, so our team provided guidelines on how software engineers can handle incidents while measuring these metrics. A sketch of the measurement itself is shown below.
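As an illustration only (this is not our actual tooling, and the target values are placeholders), the two metrics boil down to simple differences between incident timestamps, compared against a target per metric:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Placeholder targets; real values depend on severity and the team's commitments.
TTA_TARGET = timedelta(minutes=15)
TTR_TARGET = timedelta(hours=4)

@dataclass
class Incident:
    started_at: datetime       # when the incident (or its first signal) began
    acknowledged_at: datetime  # when an engineer acknowledged it for analysis
    resolved_at: datetime      # when the incident was resolved

def time_to_acknowledge(incident: Incident) -> timedelta:
    return incident.acknowledged_at - incident.started_at

def time_to_recover(incident: Incident) -> timedelta:
    return incident.resolved_at - incident.started_at

# Example incident with made-up timestamps
incident = Incident(
    started_at=datetime(2022, 1, 10, 9, 0),
    acknowledged_at=datetime(2022, 1, 10, 9, 12),
    resolved_at=datetime(2022, 1, 10, 11, 30),
)
print("TTA within target:", time_to_acknowledge(incident) <= TTA_TARGET)
print("TTR within target:", time_to_recover(incident) <= TTR_TARGET)
```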

Taming the Beast

Flow diagram of the incident management standard operating procedure

Incident Signal

This signal typically comes from different sources, ranked here from good to bad:

  1. Alert Notification (via Pager Duty)
  2. Internal QA/Engineer
  3. Kargo User

Our target is for incidents to be caught in the upper layers as much as possible; reaching number 3 is a bad experience.

Acknowledgement by On-Call + Inform Product

We use on-call management, rotated every week. Ideally, our alert manager is also integrated with PagerDuty, whose escalation algorithm should kick in if the PIC (person in charge) is not responsive.

The on-call engineer should inform the product team regarding the issue and move on to the next step.
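PagerDuty handles this through its escalation policies; as a rough sketch of the idea only (hypothetical names and timeout, not the PagerDuty API), escalation simply moves down a chain of people until someone acknowledges the page:

```python
from datetime import timedelta

# Hypothetical escalation chain: on-call PIC first, then their backup, then the lead.
ESCALATION_CHAIN = ["oncall-pic", "oncall-backup", "engineering-lead"]
ACK_TIMEOUT = timedelta(minutes=10)  # placeholder timeout per level

def escalate(notify, wait_for_ack):
    """Notify each level in turn until someone acknowledges the page.

    `notify(person)` sends the page; `wait_for_ack(person, timeout)` returns
    True if that person acknowledged within the timeout.
    """
    for person in ESCALATION_CHAIN:
        notify(person)
        if wait_for_ack(person, ACK_TIMEOUT):
            return person  # this person now owns the incident
    raise RuntimeError("Nobody acknowledged the incident; page the whole team")
```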

Severity + Priority Assessment

The engineer should notify the product team and decide on the severity. Severity is defined by functional impact, whereas priority is defined subjectively by the team. There may be cases where the severity is low but the priority is high.

Example

Say Company A sells software to Company B with a white-label option. Company B then wants to package Company A's software as its own and sell it to Company C. Somehow, a bug shows Company A's logo instead of Company B's when Company C opens the software. This bug is low severity (no functional issue), but it has a high priority to be fixed, since Company B may lose Company C's trust.

Severity Table
Severity table and examples
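To make the distinction concrete, here is a minimal sketch of recording severity and priority as two independent fields (the SEV levels below are illustrative; the real definitions live in the severity table above):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    # Illustrative levels, defined by functional impact
    SEV1 = 1  # critical functionality broken for many users
    SEV2 = 2  # major functionality degraded
    SEV3 = 3  # minor or cosmetic issue

class Priority(Enum):
    # Decided subjectively by the team
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class Assessment:
    severity: Severity
    priority: Priority

# The white-label logo bug from the example above: no functional impact,
# but Company B's trust is at stake, so severity and priority diverge.
logo_bug = Assessment(severity=Severity.SEV3, priority=Priority.HIGH)
```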

Signal Other Stakeholder

This stage is about overcommunicating with related stakeholders. This typically includes:

  • Other product teams
  • Engineers of the impacted service (ideally its on-call engineer)

Which may be escalated to:

  • Infra Team
  • Core Backend Engineer

Create Emergency Conference Room

For SEV 1 and SEV 2 incidents, an emergency conference room is created by inviting the related stakeholders.

Identifying Issue & Quick Fix

This is the playbook for incident investigation. These are the skills that should be democratized to all engineers so that everyone is able to handle incidents.

Service Analysis Playbook

Flow diagram of actionable items for engineers during the incident: diagnostic (rectangle) + quick fix (circle)

The diagram above needs to be described in more detail, with related links, so that engineers can take action immediately. Keep the explanation concise. The explanation for the diagram can be separated into two groups: diagnostic (rectangle) and quick fix (circle).

Diagnostic

Diagnostic details: activity, links, and expected time to do an activity

Having an approximate time helps estimate how long identifying the issue and finding the proper fix should take. Each of these tools may also need to be taught to engineers, so be prepared to hold a sharing session on each of these items in detail or refer engineers to teaching materials.
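As a sketch of how such a playbook can be written down (the step names, links, and durations below are placeholders, not Kargo's actual playbook), each diagnostic activity carries a link and an expected time so the total identification time can be estimated:

```python
from dataclasses import dataclass

@dataclass
class DiagnosticStep:
    activity: str
    link: str              # runbook or dashboard link for this activity
    expected_minutes: int  # approximate time to perform the activity

# Placeholder entries; real ones would point at the team's own tooling.
PLAYBOOK = [
    DiagnosticStep("Check service dashboards for error/latency spikes", "<dashboard-link>", 10),
    DiagnosticStep("Inspect recent deployments and rollbacks", "<ci-cd-link>", 10),
    DiagnosticStep("Search application logs around the incident window", "<logging-link>", 15),
]

expected_total = sum(step.expected_minutes for step in PLAYBOOK)
print(f"Expected time to identify the issue: ~{expected_total} minutes")
```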

Quick Fix

Quick fix details: activity, causes, pre-requisites, and expected time

Now, here’s another crucial part. Often, the decision to do a quick fix is frowned upon for several reasons:

  1. Not scalable
  2. Introducing more tech debt

In the context of an incident, a quick fix is the best way to go. The business is at stake here, whether it has been built over years or only a few months. If the incident drags on long enough, clients may lose trust in the company. Rebuilding trust is more difficult than gaining it in the first place, so from a risk-management perspective, it is wiser to opt for the faster resolution.

However, immediate action still needs to be taken toward a permanent fix. A quick fix is, most of the time, only temporary and should be treated as such.

Post Mortem Analysis (PMA)

A Post Mortem Analysis document is required for SEV 1 and SEV 2 incidents and optional for SEV 3+. A PMA document is useful for these reasons:

  • Record of the incident for auditability
  • Learning from past incidents
  • Retrospective improvement
  • Follow-up actions for a long-term fix

The format of PMA documents can differ, but they usually follow this general guideline.

Most of the time, incidents recur periodically. It is also beneficial to share these PMA docs and present them to the engineering team.

Culture

  • Learning opportunity
    Not everything can be anticipated. Incidents provide a learning opportunity for engineers to prevent future issues.
  • Blameless culture
    Focus on problem-solving rather than pointing fingers. This allows people to be truthful in explaining the incident and its sequence of events.
  • Log everything
    Write down details of the incident from acknowledgement until long-term resolution. If necessary, link the Post Mortem Analysis document with the Bug ticket so long-term and short-term resolution can be traced. This also allows people to learn about incident trends.
  • Set guidelines, but improvise case-by-case
    This document shows a general guideline, but different cases may need additional actions or allow skipping steps. Above all, it is a guide for engineers who have no idea where to start.
  • One point of contact, but shared responsibility
    On-call engineers are rotated, and the assigned engineer may not be the best person to fix the problem. Hence, it is better for the on-call engineer to ask other engineers for help while remaining the point of contact for the incident.
