Written By: James Novino & Anthony Hobbs
At Jet, we use several monitoring solutions, including NewRelic, Splunk, and Prometheus. We do this to have consistent alerting across our systems and so we are aware of problems as they arise. We then integrate all of our monitoring/alerting platforms with our PagerDuty instance.
Having a single alerting system provides many inherent benefits: issue tracking, simplified notifications, and simpler maintenance windows. This post details how we manage being paged or alerted at Jet and the responsibilities of our on-call engineers. Our incident management practices and procedures have evolved over the last few years as we have grown our organization.
We manage how we get alerted based on a simple principle: an alert is something which requires a human to perform an action. This principle, that an alert should always be actionable, is fundamental to our process.
Anything else is just a notification, not an alert: something that we cannot control and cannot act on. Notifications are useful, but they shouldn't be waking people up in the middle of the night under any circumstance. This document outlines simple policies and procedures regarding how teams should respond to pages; the official PagerDuty documentation influenced this blog and the corresponding internal document.
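As a concrete illustration of the distinction, a monitoring rule can carry a severity label that routing uses to decide between paging and merely notifying. The sketch below uses Prometheus alerting-rule syntax; the job names, label values, and thresholds are hypothetical, not our actual rules.

```yaml
groups:
  - name: example-rules
    rules:
      # Actionable: a human must intervene, so this pages via PagerDuty.
      - alert: ServiceDown
        expr: up{job="checkout"} == 0
        for: 5m
        labels:
          severity: page          # routed to PagerDuty as high urgency
        annotations:
          summary: "checkout service has been down for 5 minutes"
      # Informational: no immediate human action possible or useful.
      - alert: DeployInProgress
        expr: deployment_in_progress == 1
        labels:
          severity: notification  # routed to Slack/email, never wakes anyone
```

The routing layer (e.g., Alertmanager) then matches on the `severity` label, so the "actionable vs. notification" decision is made once, in the rule, rather than by a sleepy engineer at 3 a.m.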
Types of Pages
PagerDuty supports three types of incident notification settings and two urgency settings. This gives us a total of six possible configurations; however, not all of them make sense, so we don't use all of them (e.g., high-urgency email). Historically, we used only two combinations.
Issues that impact developer productivity or business functions were inconsistently given high or low priority for various historical reasons. To better define this (and due to developer demand), we expanded to three PagerDuty service configurations. The alert specifies which PagerDuty configuration (integration) is called.
Below are the responsibilities for any on-call engineer at Jet:
Each of these responsibilities is broken down below:
- Have your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc).
- Team alert escalation happens; set/stagger your notification timeouts (push, SMS, phone, etc.) accordingly. Note: Make sure PagerDuty texts and calls can bypass your "Do Not Disturb" settings. For iPhones, make sure your Do Not Disturb settings do not cover the hours you are on PagerDuty.
- Be prepared (your environment is set up, a current working copy of the necessary repositories is local and functioning, environments on your workstation are configured and tested, your credentials for third-party services are current, and so on).
- Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc.
1. Acknowledge (via our PagerDuty Slack channel and in the PagerDuty service) and act on alerts whenever you can (see the Prepare section for details).
2. Determine the urgency of the problem:
- Is it something that should be worked on right now or escalated into a major incident ("production server on fire" situations, security alerts)? If so, do so.
- Is it tactical work that doesn't have to happen during the night (for example, a disk utilization high watermark, but there's plenty of space left and the trend does not indicate impending doom)? Snooze the alert until a more suitable time (working hours, the next morning) and get back to fixing it then. Note: If an engineer decides to snooze an alert to work on it "later", that alert is misconfigured. Really, the only time an engineer should have to assess the urgency of a problem is during a manual page or when deciding whether to raise it to "all hands on deck in the Warridor" level.
- If the incident is having a serious impact on the business, go to the Warridor. The Warridor is a special room that we have in each office for dealing with major production issues; these rooms allow engineering teams to coordinate efficiently with remote engineering locations. Engineers can join these rooms remotely through a PagerDuty bridge. It's called a "Warridor" because it used to be a hallway that served as a war room for significant production incidents.
3. Check Slack for current activity. Often (but not always) actions that could potentially cause alerts will be announced there.
4. Does the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? If so, page them! This is why we have on-call rotations. If they don’t respond, escalate.
- Follow the playbooks. Teams are expected to have playbooks for every alert that triggers a PagerDuty incident. These playbooks provide information about how to deal with the incident, remediate it, and report on it.
- Engineers are empowered to dive into any problem and act to fix it.
- Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service/alert is something you have not tackled before.
- If the issue is not very time-sensitive and you have other priority work, create a JIRA ticket to keep track of it (with an appropriate severity).
- Update the PagerDuty Slack channel on resolution/mitigation.
- If a playbook doesn’t exist yet, write a draft, share it with the team, and walk through the steps to ensure the next engineers have some guidelines for the issue you already solved.
- When a particular issue keeps happening, or an alert turns out to be a non-issue, triage it as a longer-term task to fix the alert or the underlying issue.
- If the information is difficult/impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki/codebase does not match the way it is currently organized.
- When your on-call “shift” ends, let the next on-call know about issues that have not been resolved yet and other experiences of note.
- At the end of your on-call rotation, you must attend the following Production Review and come prepared to speak on any major issue from your rotation. Note: Production Review is detailed in the sections below.
- All high-priority production pages generate a Prod Incident ticket in Jira. Closing out Prod Incident tickets by providing RCAs is part of an engineer's duties when on call.
- If you make a change that impacts the schedule (adding or removing yourself, for example), let others know, since many of us make arrangements around the on-call schedule well in advance.
- Support each other: when doing activities that might generate plenty of pages, it is courteous to “take the page” away from the on-call by notifying them and scheduling an override for the duration.
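The playbook expectation above can be sketched as a minimal skeleton. The headings here are illustrative assumptions, not Jet's actual internal template:

```markdown
# Playbook: <alert name>

## Severity / Tier
Which tier of service this alert belongs to, and its response SLA.

## Dashboards & Logs
- Links to the relevant Splunk / NewRelic / Prometheus views.

## Diagnosis
1. What to check first, and what "healthy" looks like.

## Remediation
1. Step-by-step actions, including any rollback commands.

## Escalation
- Who to page if the steps above do not resolve the incident.

## Reporting
- What to record in the Jira incident ticket for the RCA.
```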
Note: Production Review is a weekly meeting that we have to discuss the previous week's incidents and go over any action items. This is also the forum we use to discuss any major releases/deployments in the upcoming week that may have cross-team implications.
The goals of the RCA process discussed above are to ensure that a Root Cause Analysis (RCA) is completed for all production incidents (all high-urgency PagerDuty incidents), to track and report on the status of RCAs across all teams, and to link each RCA to its repair items (Jira tickets). For all high-urgency alerts, on-call engineers are expected to follow the process below for the Jira tickets that are created.
- Each high-urgency PagerDuty page creates a Jira Ticket.
- The Jira ticket is automatically assigned to the on-call individual of the respective team.
- The individual on-call investigates, takes any remedial action, and updates the ticket with status, history, etc.
- The individual on-call marks the PagerDuty incident as Resolved within PagerDuty; the Jira incident is automatically updated to “Mitigated” status by the PagerDuty-to-Jira integration. (This functionality is handled through custom instrumentation by our DevOps group.)
- The individual on-call marks the item as “It’s a Duplicate” in Jira when the incident is a duplicate. RCAs are not required for Duplicate incidents.
- The individual on-call must move each Jira incident assigned to them to RCA Completed within 3 days of mitigation. Completing the RCA means either writing a summary for issues that have no customer impact or creating a formal RCA using the Jet RCA template, which contains the following required fields:
- Timeline (History of the incident).
- Impact (Was there any customer impact? What was the impact on internal or external stakeholders? Were all users impacted for the whole event? Was all functionality impacted?)
- Findings and Root Cause (What investigations were performed, and what was the ultimate root cause found to be?)
- Mitigation and Resolution (What was done, by whom, and when to mitigate the impact of the issue; what was done, by whom, and when to resolve the issue?)
- Repair Items (Links to the JIRA Repair items that address findings from the RCA)
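The status flow in the steps above can be sketched as a simple mapping from PagerDuty webhook event types to Jira statuses. The event names and status values here are illustrative assumptions; the real integration is custom instrumentation maintained by our DevOps group.

```python
# Sketch of the PagerDuty -> Jira status mapping described above.
# Event names and Jira statuses are illustrative, not the actual integration.

PD_EVENT_TO_JIRA_STATUS = {
    "incident.triggered": "Open",
    "incident.acknowledged": "In Progress",
    "incident.resolved": "Mitigated",  # Resolved in PagerDuty == Mitigated in Jira
}

def jira_status_for(pd_event: str) -> str:
    """Return the Jira status a PagerDuty webhook event should transition to."""
    # Unknown or irrelevant events leave the ticket untouched.
    return PD_EVENT_TO_JIRA_STATUS.get(pd_event, "Unchanged")
```

Note that "Mitigated" is deliberately not "RCA Completed": closing the ticket still requires the on-call engineer to finish the RCA within the 3-day window.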
Note: Any significant incidents (having lasting customer impact) have a dedicated post-mortem meeting to walk through the RCA.
The purpose of the Incident Commander is to be the decision maker during a major incident: delegating tasks and listening to input from subject-matter experts in order to bring the incident to resolution. If an issue is determined by the DevOps team to be severe and requires teams to get on the bridge or go to the Warridor, an Incident Commander must be appointed and should follow these guidelines:
- Follow martial law: you can make suggestions/recommendations, but the Incident Commander has final say.
- Do not take any action without confirming with the Incident Commander. (repeat after me)
- Incident Commander will determine the order in which issues are to be handled.
- Any major incident should have one Incident Commander online.
Shifting Tier 1 traffic to either region and back to hot/hot can be performed by a DevOps engineer. The Incident Commander has approval power over shifting traffic away from degraded regions.
At Jet, we consider services and teams to be on certain tiers depending on the urgency of fixing problems and the effect on the customer.
Alert SLAs By Tier
Each tier has its own set of SLAs, which represent the maximum time within which you are expected to actively respond to an incident (be at the computer and have started working on the issue). These SLAs need to be appropriately configured in the corresponding PagerDuty escalation policies.
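As an illustration of how the "stagger your notification timeouts" advice from the on-call responsibilities interacts with these SLAs, the sketch below spreads personal notification delays evenly within an SLA window so the loudest channel fires with time to spare. The channel order and the 15-minute example are assumptions, not our actual tier values.

```python
# Illustration: stagger personal notification timeouts (push, SMS, phone)
# inside a tier's response SLA. Channel order and SLA values are assumptions.

def stagger_notifications(sla_minutes: int, channels=("push", "sms", "phone")):
    """Spread notification delays evenly inside the SLA window.

    Returns (channel, delay_minutes) pairs; the last channel fires well
    before the SLA expires, leaving time to get to a computer.
    """
    step = sla_minutes // (len(channels) + 1)
    return [(ch, step * i) for i, ch in enumerate(channels, start=1)]

# For a hypothetical 15-minute SLA:
# stagger_notifications(15) -> [("push", 3), ("sms", 6), ("phone", 9)]
```

The point is simply that the per-person notification rules and the team escalation policy must be configured together: if your phone call only fires after the SLA has elapsed, escalation to the backup has already begun.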
If you like the challenges of building distributed systems and are interested in solving complex problems, check out our job openings.