It’s 3am. I must be lonely.

Michal
Published in PagerTeam
3 min read · Mar 5, 2019

You are about to dock your boat at your private island in the Bahamas when the wheel makes a strange buzzing noise you haven’t heard before. You make a mental note to have your mechanic check it out before your next sailing. It buzzes again. Stops. Buzzes. Stops. Buzzes, and shakes this time. There is no boat. You slowly come to the realization that it’s your phone. It’s been ringing for a few minutes now. You wake up from your pleasant dream and it’s still ringing. There’s an emergency at work and you’re getting paged. It’s 3am. You stumble out of bed and work to get online. It’s just you. What do you do?

The happy path

  1. Acknowledge the incident to stop it from escalating
  2. Follow your team’s runbook that has been meticulously crafted with detailed step-by-step instructions on how to resolve the very issue you’re currently seeing
  3. Make a mental note to ask at the next standup why the runbook hasn’t been automated yet, then go back to bed

The unhappy path

  1. Acknowledge the incident to stop it from escalating
  2. Remain calm
  3. Formulate a hypothesis around a likely cause. Apply Occam’s razor: simpler explanations are more likely to be correct.
  4. Validate your hypothesis with metrics
  5. Attempt to restore functionality, even if degraded (your goal here should be restoring critical functionality at the expense of nice-to-haves), by doing one or more of:
  • rebooting a machine or fleet
  • scaling (deploy additional machines or adjust provisioned capacity on services such as DynamoDB)
  • changing configuration settings (e.g., turn off feature flags or adjust connection pools)
  • rolling back recent deployments
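As one way to picture steps 3 to 5, suppose your hypothesis is "the latest deployment broke something." A minimal sketch of checking that against metrics might look like the following. The function names, the sample format, the 3x threshold, and the baseline floor are all assumptions made for illustration; your monitoring system will have its own way of exporting error rates.

```python
from statistics import mean


def error_rate_change(samples, deploy_time):
    """Return (mean error rate before, mean error rate after) a deployment.

    samples: list of (timestamp_seconds, error_rate) pairs (assumed format).
    deploy_time: timestamp of the suspect deployment, in the same units.
    """
    before = [rate for ts, rate in samples if ts < deploy_time]
    after = [rate for ts, rate in samples if ts >= deploy_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the deployment")
    return mean(before), mean(after)


def deployment_looks_guilty(samples, deploy_time, factor=3.0, floor=0.001):
    """True if the error rate rose by at least `factor` after the deployment.

    `factor` and `floor` are arbitrary illustrative values, not recommendations.
    The floor keeps a near-zero baseline from making any blip look like a spike.
    """
    before, after = error_rate_change(samples, deploy_time)
    return after >= max(before, floor) * factor
```

If the check comes back true, rolling back the deployment is a reasonable band-aid to try first. If it comes back false, the hypothesis is probably wrong and it is time to form another one.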

You should be applying band-aid fixes, not addressing the root cause. Before you take any action, make sure it won’t make things worse (“do no harm”). As a rule of thumb, you should not:

  • write new code
  • change existing code
  • address the root cause

You will be paged for areas outside of your expertise. Your fundamental role should be to triage the incident, keep things running as best you can, and engage others as appropriate.

When to escalate further

  • If your team’s SLA will be exceeded before you can resolve the incident (many teams will have two different SLAs: a shorter one for really serious problems, and a longer one for less serious issues). If you’re not sure or your team doesn’t have one, use 30 minutes as a rule of thumb
  • If customers or other teams are actively asking for updates, and you are too busy fixing the issue to respond. (Plural: if just one customer or one other team is inquiring, this rule does not apply)
  • If you are unable to formulate a hypothesis as to what might be wrong, or if you are unable to resolve the issue without writing new code
  • If you think you need help or a second set of eyes
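The rules above can be written down as a small checklist function, which is handy if you want to encode them in tooling or simply see them side by side. This is only a sketch: the function name and parameters are made up for illustration, and "SLA will be exceeded" is approximated as time elapsed plus your own estimate of the time remaining.

```python
def escalation_reasons(
    minutes_elapsed,
    estimated_minutes_remaining,
    sla_minutes=30,  # the 30-minute rule of thumb when no SLA is defined
    parties_asking_for_updates=0,
    too_busy_to_respond=False,
    has_hypothesis=True,
    needs_new_code=False,
    want_second_opinion=False,
):
    """Return the list of reasons to escalate; an empty list means keep going."""
    reasons = []
    if minutes_elapsed + estimated_minutes_remaining > sla_minutes:
        reasons.append("SLA will be exceeded")
    # Plural: a single customer or team asking does not trigger this rule.
    if parties_asking_for_updates >= 2 and too_busy_to_respond:
        reasons.append("cannot keep up with update requests")
    if not has_hypothesis:
        reasons.append("no hypothesis")
    if needs_new_code:
        reasons.append("fix requires new code")
    if want_second_opinion:
        reasons.append("want a second set of eyes")
    return reasons
```

For example, `escalation_reasons(20, 15)` returns `["SLA will be exceeded"]` under the default 30-minute limit, while `escalation_reasons(10, 5)` returns an empty list.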

Additional tips

Keep notes. Write down what you see, what you try, and what the result is. You might be asked to explain why you took the steps you did, and you should be able to justify your thought process. Notes will also help someone get up to speed quickly if you need to escalate.
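A text file or chat channel works perfectly well for this, but if you prefer something structured, a timestamped log could be as small as the sketch below. The `IncidentLog` class and its three entry kinds (`saw`, `tried`, `result`) are hypothetical names that mirror the advice above, not part of any existing tool.

```python
from datetime import datetime, timezone


class IncidentLog:
    """A tiny timestamped notebook for an incident (illustrative only)."""

    KINDS = ("saw", "tried", "result")

    def __init__(self):
        self.entries = []

    def note(self, kind, text, when=None):
        """Record one observation, action, or outcome with a UTC timestamp."""
        if kind not in self.KINDS:
            raise ValueError(f"kind must be one of {self.KINDS}")
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, kind, text))

    def render(self):
        """Return the notes as plain text, one line per entry, oldest first."""
        return "\n".join(
            f"{when.strftime('%H:%M:%S')}Z [{kind}] {text}"
            for when, kind, text in sorted(self.entries, key=lambda e: e[0])
        )
```

Pasting the output of `render()` into the escalation message gives whoever joins you the timeline without a verbal recap.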

Avoid blaming others. You’re all in this together, and next time it could be your code that breaks. Focus on a solution instead.

Michal
PagerTeam

Founder @PagerTeam, formerly @MSFT, @AMZN, @IMDb, @HBO, @BuiltForMe