Written By: James Novino & Anthony Hobbs

At Jet, we use several monitoring solutions, including NewRelic, Splunk, and Prometheus. We do this to have consistent alerting across our systems and to be aware of problems as they arise. We then integrate all of our monitoring/alerting platforms with our PagerDuty instance.

Having a single alerting system provides many inherent benefits: issue tracking, notifications, and maintenance windows are all simplified. This post details how we manage being paged or alerted at Jet and the responsibilities of our on-call engineers. The way we do incident management has evolved over the last few years, and our practices and procedures have evolved with it as the organization has grown.

Overview

We manage how we get alerted based on a simple principle: an alert is something that requires a human to perform an action. This principle, that an alert should always be actionable, is fundamental to our process.

Anything else is just a notification, not an alert: something we cannot control and cannot take any action to affect. Notifications are useful, but under no circumstance should they wake people up in the middle of the night. This document outlines simple policies and procedures for how teams should respond to pages; the official PagerDuty documentation influenced this blog post and the corresponding internal document.
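To make the distinction concrete, here is a minimal sketch in Python of that routing rule, assuming a hypothetical `actionable` flag on each event and placeholder PagerDuty/Slack endpoints: anything actionable triggers a page, everything else becomes a channel notification that never wakes anyone up.

```python
import requests

# Placeholders -- not real keys or URLs.
PAGERDUTY_ROUTING_KEY = "<service-integration-key>"
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/<webhook-path>"

def route_event(summary: str, source: str, actionable: bool) -> None:
    """Page a human only when the event requires a human to act."""
    if actionable:
        # Actionable -> alert: trigger a PagerDuty incident via the Events API v2.
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": PAGERDUTY_ROUTING_KEY,
                "event_action": "trigger",
                "payload": {"summary": summary, "source": source, "severity": "critical"},
            },
            timeout=10,
        )
    else:
        # Not actionable -> notification: post to a channel, never page anyone.
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"[notification] {summary} ({source})"},
            timeout=10,
        )
```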

Types of Pages

PagerDuty supports three incident notification settings and two urgency settings. This gives us a total of six possible configurations, but not all of them make sense, so we don't use them all (e.g., a high-urgency email). Historically, we have used only two combinations:

[Table: the two notification/urgency combinations used historically]

Issues that impact developer productivity or business functions were inconsistently given high or low priority for various historical reasons. To define this better (and due to developer demand), we expanded to three PagerDuty service configurations. Each alert specifies which PagerDuty configuration (integration) is called:

[Table: the three PagerDuty service configurations]
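As a rough illustration of an alert specifying which integration is called, the sketch below (Python; the routing keys and category names are assumptions standing in for the table above) picks one of three PagerDuty Events API v2 integrations based on the alert's class.

```python
import requests

# Hypothetical routing keys for the three service configurations (placeholders).
INTEGRATIONS = {
    "customer-impacting": "<high-urgency-routing-key>",        # pages immediately, any hour
    "developer-productivity": "<business-hours-routing-key>",  # pages during working hours
    "informational": "<low-urgency-routing-key>",               # low-urgency notification
}

def trigger(alert_class: str, summary: str, source: str) -> None:
    """Send the alert to whichever PagerDuty integration its class maps to."""
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": INTEGRATIONS[alert_class],
            "event_action": "trigger",
            "payload": {"summary": summary, "source": source, "severity": "error"},
        },
        timeout=10,
    )

# Example: a productivity issue goes to its own integration, not the high-urgency one.
# trigger("developer-productivity", "CI queue backed up > 30 min", "build-farm")
```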

On-Call Responsibilities

Below are the responsibilities for any on-call engineer at Jet: Prepare, Triage, Fix, Improve, and Support. Each of these responsibilities is broken down below.

Prepare

Triage

3. Check Slack for current activity. Often (but not always) actions that could potentially cause alerts will be announced there.

4. Does the alert and your initial investigation indicate a general problem, or an issue with a specific service that the relevant team should look into? If so, page them (see the sketch below)! This is why we have on-call rotations. If they don't respond, escalate.
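Paging the owning team can be as simple as opening an incident against their PagerDuty service. A minimal sketch, assuming a REST API token and the target service ID (both placeholders):

```python
import requests

PD_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=<api-token>",  # placeholder token
    "From": "oncall@example.com",                # requester email required by the API
    "Content-Type": "application/json",
}

def page_team(service_id: str, title: str, details: str) -> str:
    """Open a PagerDuty incident against another team's service so they get paged."""
    resp = requests.post(
        f"{PD_API}/incidents",
        headers=HEADERS,
        json={
            "incident": {
                "type": "incident",
                "title": title,
                "service": {"id": service_id, "type": "service_reference"},
                "body": {"type": "incident_body", "details": details},
            }
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["incident"]["id"]
```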

Fix

Improve

Support

Note: Production Review is a weekly meeting where we discuss the previous week's incidents and go over any action items. It is also the forum we use to discuss any major releases/deployments in the upcoming week that may have cross-team implications.

RCA Process

The goals of our RCA process, discussed above, are to ensure a Root Cause Analysis (RCA) is completed for all production incidents (all high-urgency PagerDuty incidents), to track and report on the status of RCAs across all teams, and to link each RCA to its repair items (Jira tickets). For all high-urgency alerts, on-call engineers are expected to follow the process below for the Jira tickets that are created.
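As one possible shape for linking an RCA to its repair items, here is a small sketch against Jira's REST API; the project key, issue type, and linkage convention are assumptions, not our actual configuration.

```python
import requests

JIRA_BASE = "https://<your-jira-instance>"  # placeholder
AUTH = ("<user>", "<api-token>")            # placeholder credentials

def create_repair_item(incident_id: str, root_cause: str, action: str) -> str:
    """Create a Jira ticket recording the root cause and the follow-up repair work."""
    resp = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue",
        auth=AUTH,
        json={
            "fields": {
                "project": {"key": "OPS"},      # assumed project key
                "issuetype": {"name": "Task"},
                "summary": f"[RCA {incident_id}] {action}",
                "description": f"Root cause: {root_cause}\nPagerDuty incident: {incident_id}",
                "labels": ["rca", "repair-item"],
            }
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "OPS-123"
```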

Note: Any significant incidents (having lasting customer impact) have a dedicated post-mortem meeting to walk through the RCA.

Incident Commander

The purpose of the Incident Commander is to be the decision maker during a major incident, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. If an issue is determined to be severe by the DevOps team and requires teams to get on the bridge or go to the war room, an Incident Commander must be appointed and should follow these guidelines:

Shifting Tier 1 traffic to either region and back to hot/hot can be performed by a DevOps engineer. The Incident Commander has approval power over shifting traffic away from degraded regions.
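A highly simplified sketch of what a region shift might look like; the `set_region_weights` helper, region names, and weights are all hypothetical stand-ins for whatever traffic-routing layer is actually in use, and the only point is that draining a region is a weighted change gated on Incident Commander approval.

```python
REGIONS = ("east", "west")

def set_region_weights(weights: dict) -> None:
    """Placeholder: push new weights to the traffic-routing layer (DNS, LB, etc.)."""
    print(f"routing weights -> {weights}")

def shift_away_from(degraded_region: str, ic_approved: bool) -> None:
    """Drain a degraded region; requires Incident Commander approval."""
    if not ic_approved:
        raise PermissionError("Incident Commander approval required to drain a region")
    healthy = [r for r in REGIONS if r != degraded_region]
    weights = {degraded_region: 0, **{r: 100 // len(healthy) for r in healthy}}
    set_region_weights(weights)

def restore_hot_hot() -> None:
    """Return to hot/hot: both regions take an even share of traffic."""
    set_region_weights({r: 50 for r in REGIONS})
```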

Tiers

At Jet, we consider services and teams to be on certain tiers depending on the urgency of fixing problems and the effect on the customer.

[Table: service tiers and how they are defined]

Alert SLAs By Tier

Each tier has its own set of SLAs, which represent the maximum time before you are expected to be actively responding to an incident (be at the computer and have started working on the issue). These SLAs need to be appropriately configured in the corresponding PagerDuty escalation policies.
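As an illustration of keeping escalation policies in line with those SLAs, the sketch below compares a tier-to-SLA map (the minute values are placeholders, not our real SLAs) against the first escalation delay returned by PagerDuty's REST API.

```python
import requests

PD_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=<api-token>",            # placeholder token
    "Accept": "application/vnd.pagerduty+json;version=2",
}

# Placeholder response SLAs per tier, in minutes -- not our real numbers.
TIER_SLA_MINUTES = {"tier-1": 5, "tier-2": 15, "tier-3": 60}

def first_escalation_delay(policy_id: str) -> int:
    """Minutes before a PagerDuty escalation policy escalates past the first responder."""
    resp = requests.get(f"{PD_API}/escalation_policies/{policy_id}", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    rules = resp.json()["escalation_policy"]["escalation_rules"]
    return rules[0]["escalation_delay_in_minutes"]

def policy_meets_sla(tier: str, policy_id: str) -> bool:
    """True if the policy escalates at least as fast as the tier's SLA requires."""
    return first_escalation_delay(policy_id) <= TIER_SLA_MINUTES[tier]
```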

If you like the challenges of building distributed systems and are interested in solving complex problems, check out our job openings.

James Novino is an OMS (Order Management System) engineer who is championing QA (Quality Assurance) practices.

Anthony Hobbs is a DevOps architect who is championing complete automation through CodeOps.
