How a couple of characters brought down our site

TL;DR Summary — We accidentally submitted an incorrect change to the templating of our infrastructure provisioning system. This deleted all of the microservices responsible for serving skyscanner.net and data to our mobile app from the underlying infrastructure across the entire globe, causing a four-hour outage. We’re really sorry for the impact that caused our travellers and partners and we’ve put in several mitigations to prevent it from happening again.

Introduction

On the 25th of August, 2021 Skyscanner suffered a four-and-a-half hour full global outage. The website was unavailable and apps were unable to function, meaning that travellers and our business partners couldn’t use Skyscanner to do what they wanted to do. We’re very sorry for the problems that this incident undoubtedly caused for many people across the globe.

This document is a technical description of what actually happened with the hope that you might learn from our mistakes, find it interesting and learn a bit more about the culture within Skyscanner when things go wrong.

What is the Cells Architecture?

Our cells architecture is an approach to leverage resilience and systems theory in order to save money and allow engineers to patch and maintain the infrastructure without requiring downtime on the site.

We run all of these services on spot instances within AWS, saving us a massive amount of money.

What Happened?

At around 1530GMT on the 25th of August, an engineer submitted the following change to our templating system in preparation for a wider change in the future. This was intended to be a no-op (not intended to change any system behaviour).

A simple change with big implications

The change was peer reviewed, merged to master, and automatically rolled out. This file was at the root of the cells configuration and very rarely changed — but due to it’s function rolling out regional configuration, it was pushed by our system globally and immediately.

However, the lack of {{ }} meant that templating no longer applied and all namespaces which used this configuration (all of them) were reapplied and corrupted.

At 1600GMT on the 25th of August, our deployment system ArgoCD attempted to reconcile the configuration of the clusters. With no valid namespaces in the new configuration, it began the mass deletion of all services (478) in all namespaces across all AZs and regions across the world, ultimately because we told it to.

We appreciate that this error adversely impacted travellers and our partners around the world, and we worked quickly to fix this with utmost priority.

How did we resolve things?

Fortunately, we make use of GitOps to get clusters to reconcile themselves rather than needing us to push changes from a central system.

Once we got the configuration back in place for a cluster, it would reconfigure itself to match the correct state and brought itself back up. The team were well-versed in the backup and restoration process and once the scope was understood and the problem mitigated, we were able to restore reasonably quickly.

The team focused on restoring a single region and prioritising critical services so by around 2030GMT skyscanner.net was serving travellers again with all traffic being served out of a single region. At that point the engineering staff were sent home to rest and start again the next day. Over the course of Thursday and Friday, the rest of the regions and non-critical services were brought back online and verified.

Here’s a picture of our traffic in one of the affected regions to give you a sense of the impact and timescale (overplayed with previous days)

What did we learn going forward?

Once we had mitigated the problem and had got things back to normal (including catching up on any missed sleep…) we try to learn from our mistakes.

Our Incident Learning process is of several stages and is focused primarily on evaluating the technology and processes we have in place to understand what happened and how we can prevent it in the future.

After writing our Incident Learning Debrief (ILD) we shared our conclusions with the wider business. Here were some of the main points we wanted to convey which might be useful to you and your own systems…

Hopefully this quick document explains what happened during that incident and perhaps gives you some ideas on how you might avoid doing the same thing.

Any questions, please reach out to us on @SkyscannerEng and we’d be happy to chat.

About the author

Stuart Davidson is a Engineering Director in Skyscanner’s Production Platform Tribe. The tribe builds and operates a range of the systems that support Skyscanner’s product engineering, including large scale Kubernetes clusters, core AWS/web infrastructure, operational monitoring, core libraries, CI/CD and developer tools.
We aim to provide a powerful and resilient base that enables other engineers to focus on delivering awesome features to our customers: travellers.

We are the engineers at Skyscanner, the company changing how the world travels. Visit skyscanner.net to see how we walk the talk!