Meeting Our SLAs by Leveling Up Our Incident Management Process

Kamil Sindi
JW Player Engineering
5 min read · Apr 21, 2020

Many publishers and broadcasters rely on JW Player to help reach their millions of viewers and monetize their content. As such, it is imperative that JW Player meet its Service Level Agreements (SLAs) and resolve issues in a timely fashion.

Without clearly defined channels of communication, key stakeholders can be left in the dark when an incident arises. Different Slack conversations can happen in multiple places and important information can be buried or siloed.

A couple of years ago, we were experiencing this issue of siloed communication. In response, we created a cross-team working group to define our incident response protocols and improve resolution time.

Instilling a culture of incident response and communication

From the working group’s experience at previous companies, we knew that defining a process was not enough: incident response had to be ingrained in our organization’s culture.

We outlined the key tenets of our process:

  1. Every breach of a Service Level Objective (SLO) should be treated with the highest priority.
  2. Technical resolution and communication should be two separate roles since they are equally time consuming and valuable.
  3. A good incident process should be simple enough for on-call responders to follow under stress.
  4. Post-mortem meetings should be blameless; they are intended to help the business identify and resolve gaps in processes.
  5. During an incident, everyone should bias towards over-communicating since context can easily get lost.
  6. It is better to start the incident process even if the severity is not clear.
  7. Communication around incident status updates should be consolidated in a pre-defined channel that stakeholders can subscribe to.

Defining the incident roles

There are two major tasks in incident management: technical resolution and communication. Both are time consuming and require very different skill sets. As such, we define two roles in our process: the Technical Owner and Incident Delegate:

  • The Technical Owner (TO) is responsible for correcting the issue at hand or providing a workaround which enables functionality that approximates normal service. They ensure that all incidents received are reported to the Incident Delegate.
  • The Incident Delegate (ID) is responsible for gathering information from the TO, communicating to the rest of the organization (especially Support), and helping the Technical Owner escalate to other team members or vendors. They become the highest-ranking individual on any major incident, regardless of their title, and have the authority to make sure the Technical Owner is not interrupted by others, even if the question comes from the CTO.

Incident Delegates are volunteers. Everyone in the company is welcome to join. We target 10 to 15 members so that each member’s week-long on-call shift comes up only about once a quarter. Many members find it a good opportunity to learn how other teams’ systems work.

Keeping things simple with runbooks

A good incident process should be simple enough for people to follow under stress, but broad enough to work for the variety of incident types you will encounter.

To that end, we created runbooks for each role so that someone in the midst of an incident doesn’t have to wade through paragraphs of text or complicated flow charts.

Our runbooks read like an algorithm:

  1. Acknowledge the incident.
  2. Does the incident impact the customer experience with no acceptable workaround?
  3. If Yes, join the #war-room channel in Slack and type `@delegate` to start the incident and escalate to the Incident Delegate (a sketch of how this page might be automated follows the runbook).
  4. If Slack is down, join the Google Meet channel.
  5. Etc.
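
The `@delegate` mention is all a responder has to remember, but the same page can also be raised programmatically. Below is a minimal sketch using the PagerDuty Events API v2; the routing key, function name, and example values are hypothetical placeholders rather than part of our actual setup.

```python
# Minimal sketch: page the Incident Delegate rotation via the
# PagerDuty Events API v2. DELEGATE_ROUTING_KEY is a hypothetical
# integration key; replace it with the key for your own service.
import requests

DELEGATE_ROUTING_KEY = "YOUR_EVENTS_V2_ROUTING_KEY"

def page_incident_delegate(summary: str, source: str, severity: str = "critical") -> str:
    """Trigger a PagerDuty incident and return its dedup key."""
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": DELEGATE_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,    # e.g. "Playback error rate spiking"
                "source": source,      # the system reporting the problem
                "severity": severity,  # "critical", "error", "warning", or "info"
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dedup_key"]

if __name__ == "__main__":
    key = page_incident_delegate(
        summary="SLO breach: delivery error rate above threshold",
        source="delivery-monitoring",
    )
    print(f"Incident triggered, dedup key: {key}")
```

Either way, the goal is the same: starting the incident should be a single, low-friction action rather than a judgment call made under stress.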

Moreover, at the beginning of an incident, our Incident Delegate asks the Technical Owner the following questions:

  1. What is the impact to customers? Please describe it in a way that a customer would understand.
  2. Which services are impacted?
  3. When, to the best of our knowledge, did the event first occur?
  4. How can we quantify the impact/severity of the incident? Example: percentage of requests returning 5XX errors (a sample calculation follows this list).
  5. What is our best estimate of time to resolution?
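
Question 4 asks for a number rather than an adjective. As a rough illustration of what quantifying impact can look like (the request records and field names below are hypothetical, not our production schema), the 5XX example can be computed like this:

```python
# Rough illustration: quantify impact as the percentage of requests
# returning 5XX errors over the incident window. The sample data is
# made up; in practice this comes from your metrics or log pipeline.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Request:
    timestamp: datetime
    status_code: int

def error_rate(requests: List[Request], start: datetime, end: datetime) -> float:
    """Return the percentage of requests in [start, end] that returned a 5XX."""
    window = [r for r in requests if start <= r.timestamp <= end]
    if not window:
        return 0.0
    errors = sum(1 for r in window if 500 <= r.status_code < 600)
    return 100.0 * errors / len(window)

# Example: 3 of 5 requests in the window failed -> 60% impact.
sample = [
    Request(datetime(2020, 4, 21, 12, 0), 200),
    Request(datetime(2020, 4, 21, 12, 1), 503),
    Request(datetime(2020, 4, 21, 12, 2), 502),
    Request(datetime(2020, 4, 21, 12, 3), 200),
    Request(datetime(2020, 4, 21, 12, 4), 500),
]
print(error_rate(sample, datetime(2020, 4, 21, 12, 0), datetime(2020, 4, 21, 12, 5)))  # 60.0
```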

Other operational considerations

We also documented the following requirements to make sure our PagerDuty setup was consistent and it was easy to figure out which team’s services were impacted:

  1. The PagerDuty mobile app is installed on every team member’s phone.
  2. Every on-call schedule has a tier-2 escalation policy for redundancy; escalation triggers to tier 2 after 15 minutes without acknowledgement.
  3. Every team has an incident escalation email address so that Support and other stakeholders can page someone by email without having to navigate PagerDuty.
  4. Monitors that page a team should include runbook links so that the person on call can triage easily.
  5. We have a “list of services” that Support can reference to figure out ownership.
  6. Every service should have an availability SLO as well as other metrics that describe the expected user experience (a sketch of the error-budget arithmetic follows this list).
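
The availability SLO in item 6 is what makes the first tenet measurable. Here is a minimal sketch of the error-budget arithmetic behind such an objective, assuming an illustrative 99.9% target over a 30-day window (not our actual SLOs):

```python
# Minimal sketch of the error-budget arithmetic behind an availability SLO.
# The 99.9% target and 30-day window are illustrative values only.

SLO_TARGET = 0.999             # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60  # a 30-day rolling window

def error_budget_minutes(slo_target: float = SLO_TARGET,
                         window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of downtime the SLO allows over the window."""
    return (1.0 - slo_target) * window_minutes

def budget_remaining(downtime_minutes: float) -> float:
    """Fraction of the error budget still available (can go negative)."""
    budget = error_budget_minutes()
    return (budget - downtime_minutes) / budget

budget = error_budget_minutes()
print(f"Allowed downtime: {budget:.1f} minutes per 30 days")                # ~43.2 minutes
print(f"Budget left after a 20-minute outage: {budget_remaining(20):.0%}")  # ~54%
```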

Learning from our mistakes with a blameless post-mortem culture

At the end of the incident, the Incident Delegate and Technical Owner must write a Post-Incident Technical Summary (PTS) within a business day of the event. The purpose of this document is to provide internal stakeholders with the answers to questions that they might be asked before a formal root cause document is published.

It is very important that these reports are not about placing blame. They are intended to help the business identify gaps in its processes and to improve the organization’s response to incidents. The more detail the better, as these documents can be referred to during future incidents.

Key items in our PTS document:

  1. Incident description.
  2. Incident duration.
  3. Customer impact.
  4. A detailed timeline of the incident.
  5. Action items with linked Jira tickets.
  6. Whether the outage is third-party related.
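
As a rough illustration only, the items above map naturally onto a simple structured template. The dataclass below is a hypothetical sketch of such a skeleton, not the tool we actually use:

```python
# Hypothetical sketch: generate a Post-Incident Technical Summary (PTS)
# skeleton from the fields listed above.
from dataclasses import dataclass
from typing import List

@dataclass
class PostIncidentSummary:
    description: str
    duration_minutes: int
    customer_impact: str
    timeline: List[str]       # chronological bullet points
    action_items: List[str]   # linked Jira ticket keys
    third_party_related: bool

    def to_markdown(self) -> str:
        timeline = "\n".join(f"- {entry}" for entry in self.timeline)
        actions = "\n".join(f"- {ticket}" for ticket in self.action_items)
        return (
            f"# Post-Incident Technical Summary\n\n"
            f"**Description:** {self.description}\n\n"
            f"**Duration:** {self.duration_minutes} minutes\n\n"
            f"**Customer impact:** {self.customer_impact}\n\n"
            f"**Timeline:**\n{timeline}\n\n"
            f"**Action items:**\n{actions}\n\n"
            f"**Third-party related:** {'Yes' if self.third_party_related else 'No'}\n"
        )
```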

Results

Since rolling out this process, we have seen a decrease in the number of incidents reported as well as incident duration. During the most recent quarter, we saw a 30% decrease in the number of incidents and a 40% decrease in the median incident duration compared to the quarter when we started the process. While correlation is not causation, we think some of the improvement can be attributed to making sure we completed post-mortem action items.

Moreover, our Support team has been very happy with the level of communication. We have not had an incident since starting this process where Support was unaware of an ongoing issue that Engineering was in the middle of resolving.

Thank you

I want to thank all our Incident Delegates, on-call engineers, the Support team, the working group, and the former JW Players who pushed for an incident response culture.
