Incidents, fixes, and the day after

Stefano Baccianella
Booking.com Infrastructure
Dec 4, 2017

At Booking.com, our engineers have lots of freedom and can deploy changes 24/7 directly in the production environment.

This approach results in fast iterations and empowers everyone working at Booking.com to own their work.

However, with great power comes great responsibility. Sometimes, things go wrong. Then, we escalate.

Scaling up our response

Back in 2012, our department was small. Everyone knew what to do. We had a couple of pages on our wiki and that was enough.
Now the department consists of almost 2000 people worldwide.

Earlier this year, we refreshed our incident response guidelines. We first looked at what other companies do. Then, we realized that we needed to answer a fundamental question before anything else.

What is our guiding principle?

Escalations are cheap

Basically, we behave like a bunch of rock ants. When the colony senses danger, the colony rearranges into a tight formation that is ready to defend the nest.

In the same way, we encourage everybody to escalate a problem. If you are not sure whether a problem is under control, escalate it, and we assemble a rapid response group: the firefighters.

We say “Escalations are cheap”. In this way, we respond as fast as possible to any problem. Nobody needs to hesitate about whether to escalate.

However, too many escalations create a work overload. We counter this with clear communication and the “freedom to leave”.

Even in an all-hands-on-deck situation, you can decide whether you can help or not. You are free to join and assess the situation. You are free to leave when you can’t help or when the situation is under control.

We trust our people to make the right call and to manage their time in the best possible way.

Lifecycle of an incident

Our guiding principle teaches us how an incident evolves over time and what to do during the lifecycle of an incident.

In short, our escalations go through the following phases:

  1. Starting the escalation
  2. Gathering the firefighters
  3. Going back to business
  4. Closing the escalation
  5. Working towards the permanent solution
  6. Postmortem retrospective
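
As an illustration only, the six phases above can be sketched as a small state machine. The names `Phase` and `Incident` are hypothetical and not part of any Booking.com tooling, and the strictly linear order is a simplifying assumption: in practice, phases can overlap.

```python
from enum import Enum


class Phase(Enum):
    # Our own labels for the six phases listed above.
    ESCALATION = 1
    GATHER_TEAM = 2
    MITIGATION = 3
    DE_ESCALATION = 4
    CLEANUP = 5
    POSTMORTEM = 6


class Incident:
    """Tracks which lifecycle phase an incident is in (hypothetical helper)."""

    def __init__(self, title):
        self.title = title
        self.phase = Phase.ESCALATION
        self.history = [Phase.ESCALATION]

    def advance(self):
        """Move to the next phase; raise if the lifecycle is already complete."""
        if self.phase is Phase.POSTMORTEM:
            raise RuntimeError("incident lifecycle is already complete")
        # Assumption: phases are visited strictly in order, one at a time.
        self.phase = Phase(self.phase.value + 1)
        self.history.append(self.phase)
        return self.phase


incident = Incident("example incident")
while incident.phase is not Phase.POSTMORTEM:
    incident.advance()
print([p.name for p in incident.history])
```

Keeping the full history, rather than only the current phase, makes it easier to reconstruct a timeline later for the postmortem.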

Escalation

Whatever the situation is, everything starts with someone noticing a problem, something weird, something suspicious. Maybe it’s just a bad commit, or a DDoS attack, or an internet outage that cripples an entire country. You name it.

Whatever the channel is, everybody can alert our firefighters to initiate an emergency response process.

Gather the team

The firefighters connect to a voice link and text chat to assess and triage the problem.

They appoint an Incident Leader and a Comms. Once the firefighters have chosen the Incident Leader, they follow her. The Incident Leader only leads the team and delegates all work, even the smallest job.

The Incident Leader has the following tasks:

  • Coordinating task assignments
  • Listening to firefighters to make informed decisions
  • Showing leadership when the situation gets tough

It’s important that the rest of the firefighters focus on the issue rather than writing emails.

For these communication tasks, we have the Comms. The Comms fulfills the following purposes:

  • Updating internal stakeholders
  • Informing the team about external feedback while the incident progresses
  • Engaging more people in the incident if necessary
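
To make the split of roles concrete, here is a minimal sketch of a response team in which the Incident Leader delegates every task. `ResponseTeam` and its methods are hypothetical names chosen for illustration; the rule that the leader never takes a task is our reading of "delegates all work, even the smallest job".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResponseTeam:
    firefighters: set = field(default_factory=set)
    incident_leader: Optional[str] = None
    comms: Optional[str] = None
    tasks: dict = field(default_factory=dict)  # task description -> assignee

    def join(self, name):
        """Anybody can join the call."""
        self.firefighters.add(name)

    def appoint(self, leader, comms):
        """Pick the two roles among the firefighters who are present."""
        if leader not in self.firefighters or comms not in self.firefighters:
            raise ValueError("roles must go to firefighters on the call")
        if leader == comms:
            raise ValueError("Incident Leader and Comms are separate roles")
        self.incident_leader = leader
        self.comms = comms

    def delegate(self, task, assignee):
        """The leader hands out work and never takes a task herself."""
        if assignee == self.incident_leader:
            raise ValueError("the Incident Leader delegates all work")
        if assignee not in self.firefighters:
            raise ValueError("assignee is not on the call")
        self.tasks[task] = assignee


team = ResponseTeam()
for name in ("alice", "bob", "carol"):
    team.join(name)
team.appoint(leader="alice", comms="bob")
team.delegate("roll back the last deploy", "carol")
```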

Mitigation: back in business

The biggest pitfall is to start investigating the root cause immediately. First, the doctor must stabilize the patient. Then, work on the wound.

Now, the Incident Leader comes in. The Incident Leader primarily coordinates the response to restore the service for our customers.

De-escalation

After the mitigation is in place, the Incident Leader starts the de-escalation. When the Incident Leader sends the end-of-incident message, everyone who is not involved with the cleanup can leave the call. This is a relief for people around the world who worked on the incident outside their working hours.

Cleanup and stable fix

An incident is over only when your systems are “back to normal”. This means that the Incident Leader keeps enough firefighters engaged until the team replaces the mitigation with a more stable fix.

What if the fix takes a long time? Then, the owning team works on the cleanup as a critical task. The Incident Leader then closes the incident.

Postmortem: the retrospective

Incidents are like presents: You love them as long as you don’t get the same present twice.

This is why we have an extra process: the postmortem.

Just before closing the incident, the Incident Leader assigns the postmortem to the team that owns the area where the root cause most probably lies. Don’t see this assignment as a punishment. See this assignment as an acknowledgment that you are the best in that area.

The final deliverable of the postmortem process is the Reason For Outage (RFO) document, which respects the following guidelines:

  • Nobody gets the blame: the language is neutral and there are no names anywhere, not even team names!
  • The content is clear enough that a person with only a basic knowledge of the background, systems, and context can understand the RFO.
  • It contains actionable follow-ups that are already assigned.
  • It has a clear timeline with all the important events, but not more than that.
  • It clearly states the impact, regardless of the loss we had.
  • It breathes retrospective thoroughness.
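
Some of these guidelines can be checked mechanically. The sketch below is one possible way to do it, not our actual tooling: `RFO`, `FollowUp`, and `check_rfo` are hypothetical names, the `ticket` field assumes follow-ups are tracked in an issue tracker, and the list of names to look for must be supplied by the caller. Clarity and thoroughness still need a human reviewer.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FollowUp:
    action: str
    ticket: str = ""  # hypothetical tracker reference; empty means unassigned


@dataclass
class RFO:
    summary: str
    impact: str
    timeline: list = field(default_factory=list)    # (time, event) pairs
    follow_ups: list = field(default_factory=list)  # FollowUp items


def check_rfo(rfo, forbidden_names):
    """Return a list of guideline violations; an empty list means none found."""
    problems = []
    if not rfo.impact.strip():
        problems.append("missing impact statement")
    if not rfo.timeline:
        problems.append("missing timeline")
    if not rfo.follow_ups:
        problems.append("no follow-ups")
    for follow_up in rfo.follow_ups:
        if not follow_up.ticket:
            problems.append(f"unassigned follow-up: {follow_up.action}")
    # Blameless check: scan all free text for person or team names.
    text = " ".join(
        [rfo.summary, rfo.impact]
        + [event for _, event in rfo.timeline]
        + [follow_up.action for follow_up in rfo.follow_ups]
    )
    for name in forbidden_names:
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            problems.append(f"names a person or team: {name}")
    return problems


draft = RFO(
    summary="A config change by Alice broke the search page.",
    impact="",
    timeline=[("10:02", "error rate rises"), ("10:15", "change reverted")],
    follow_ups=[FollowUp("add a config validation step")],
)
print(check_rfo(draft, forbidden_names=["Alice"]))
```

In this example the draft fails three checks: the impact statement is empty, the follow-up has no ticket, and the summary names a person.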

Without a postmortem, you might make the following mistakes:

  • You fail to see what you’re doing right.
  • You fail to see opportunities to improve.
  • You make the same mistakes next time.

In short

The explosive growth of Booking.com led us to change the way we handle escalations, especially the most urgent ones.

We believe in the following values.

  • Putting customers’ business first
    We don’t put hurdles in the way of submitting escalations. Our first priority is to bring relief to the customers.
  • Trusting our colleagues
    People know their responsibility.
  • Learning from mistakes
    The postmortem follows up, analyzes, and explains; it doesn’t accuse, discredit, or prosecute.

Around these values, we built our escalation lifecycle.

Your mileage might vary!

How do you organize incidents and outages in your company? Do you have any questions, doubts, or something to add?

Please leave a comment. We’d love to hear from you!

Would you like to be one of our firefighters? Take a look here!

Stefano Baccianella
Booking.com Infrastructure

Manager of Infrastructure @ Booking.com | Developer | Space enthusiast | Rick and Morty lover