How a couple of characters brought down our site

TL;DR Summary — We accidentally submitted an incorrect change to the templating of our infrastructure provisioning system. This deleted all of the microservices responsible for serving skyscanner.net and data to our mobile app from the underlying infrastructure across the entire globe, causing a four-hour outage. We’re really sorry for the impact that caused our travellers and partners and we’ve put in several mitigations to prevent it from happening again.

Introduction

On the 25th of August, 2021 Skyscanner suffered a four-and-a-half hour full global outage. The website was unavailable and apps were unable to function, meaning that travellers and our business partners couldn’t use Skyscanner to do what they wanted to do. We’re very sorry for the problems that this incident undoubtedly caused for many people across the globe.

This document is a technical description of what actually happened with the hope that you might learn from our mistakes, find it interesting and learn a bit more about the culture within Skyscanner when things go wrong.

What is the Cells Architecture?

Our cells architecture is an approach to leverage resilience and systems theory in order to save money and allow engineers to patch and maintain the infrastructure without requiring downtime on the site.

  • A cell is in a single region and consists of several kubernetes clusters in each availability zone.
  • Traffic is prioritised inter-AZ, then inter-region before going cross-region in a failure mode.
  • Services within a cell are deployed in a “n+2” configuration — meaning we should be able to serve 100% of traffic with one cluster down due to failure, and one drained for maintenance.

We run all of these services on spot instances within AWS, saving us a massive amount of money.

What Happened?

At around 1530GMT on the 25th of August, an engineer submitted the following change to our templating system in preparation for a wider change in the future. This was intended to be a no-op (not intended to change any system behaviour).

A simple change with big implications

The change was peer reviewed, merged to master, and automatically rolled out. This file was at the root of the cells configuration and very rarely changed — but due to it’s function rolling out regional configuration, it was pushed by our system globally and immediately.

However, the lack of {{ }} meant that templating no longer applied and all namespaces which used this configuration (all of them) were reapplied and corrupted.

At 1600GMT on the 25th of August, our deployment system ArgoCD attempted to reconcile the configuration of the clusters. With no valid namespaces in the new configuration, it began the mass deletion of all services (478) in all namespaces across all AZs and regions across the world, ultimately because we told it to.

We appreciate that this error adversely impacted travellers and our partners around the world, and we worked quickly to fix this with utmost priority.

How did we resolve things?

Fortunately, we make use of GitOps to get clusters to reconcile themselves rather than needing us to push changes from a central system.

Once we got the configuration back in place for a cluster, it would reconfigure itself to match the correct state and brought itself back up. The team were well-versed in the backup and restoration process and once the scope was understood and the problem mitigated, we were able to restore reasonably quickly.

The team focused on restoring a single region and prioritising critical services so by around 2030GMT skyscanner.net was serving travellers again with all traffic being served out of a single region. At that point the engineering staff were sent home to rest and start again the next day. Over the course of Thursday and Friday, the rest of the regions and non-critical services were brought back online and verified.

Here’s a picture of our traffic in one of the affected regions to give you a sense of the impact and timescale (overplayed with previous days)

What did we learn going forward?

Once we had mitigated the problem and had got things back to normal (including catching up on any missed sleep…) we try to learn from our mistakes.

Our Incident Learning process is of several stages and is focused primarily on evaluating the technology and processes we have in place to understand what happened and how we can prevent it in the future.

  • First, we provide a very quick summary of the incident and the impact to the business so that other areas of the business can communicate with external stakeholders if required.
  • Second, we prepare the timeline of what happened and when, which really facilitates understanding. This is critical to do immediately so that we don’t lose key data through automatic retention clean-up.
  • Third, we investigate and write-up our findings in a document called an Incident Learning Debrief (ILD). Some squads will use Ishikawa thinking to determine the root causes of what happened. https://en.wikipedia.org/wiki/Ishikawa_diagram
  • Fourth, we run an ILD review with an external facilitator (usually a senior engineer from another area) to dig into the potential solutions to those problems and scope them out.

After writing our Incident Learning Debrief (ILD) we shared our conclusions with the wider business. Here were some of the main points we wanted to convey which might be useful to you and your own systems…

  • Don’t do global config deploys : Duh, right? Well, this isn’t quite as obvious as it seems. k8s is a complex system and there are many different ways to apply changes to it — in many cases we don’t do global configuration changes and have spent a great deal of time and effort to prevent them, but we didn’t perceive this particular change scenario because it happens so infrequently.
  • When you use templates and logic in configuration, it becomes code : This configuration evolved in complexity over time, with templating and logic being introduced to make things easier. However, we did not introduce testing (or even linting!) when we increased the complexity of what we were doing because we didn’t think about these configuration files as anything but config.
  • Plan for the worst disaster scenario : Our scenarios and runbooks just didn’t get aggressive enough in the scope and scale of failure. Wargaming more drastic situations would have given us an opportunity to walk through some of the “what ifs?” and make some decisions around risk mitigation. That being said, you can’t plan for everything — we just don’t think we were pessimistic enough in our planning and runbooks.
  • Verify your back-up and restore processes : Any decent systems administrator will tell you that a back-up isn’t a back-up until you’ve restored it. Thankfully our back-ups were ready to go but an IAM policy change had made them difficult to obtain at a critical time. When was the last time you restored your service from backup? And what if <enter region here> is down?
  • Refactor your runbooks : Runbooks are living documents which need constant care and attention alongside the code. On top of that though, consider the UX of documents which will be read at early o’clock in the morning by a stressed engineer. Is the context clear? Are the steps clear — even idempotent where appropriate?
  • You can go too far with automation : Did we really need to template this configuration against the regions it was to be rolled out to? If we didn’t, there was a chance of configuration drift but if we did, there’s a chance of our automation rolling out across the many regions. What’s the best balance? How might you mitigate the risk?
  • Incident Commanders rock! : In the event of an incident, someone will take on the role of an incident commander but for this incident in particular, our most experienced incident commander was on hand to manage the situation and it made such a difference. Here’s a direct quote from one of the engineers on the night…
  • “I’m frequently a cynic, but the positivity and calmness to give us the space to triage and recover from even an outage as catastrophic as this without any hint of blame was a real testament to Skyscanner’s culture.
  • I don’t think I’ve been as proud of anything during my time at Skyscanner as the full response on Wednesday night to get us back to serving travellers.”

Hopefully this quick document explains what happened during that incident and perhaps gives you some ideas on how you might avoid doing the same thing.

Any questions, please reach out to us on @SkyscannerEng and we’d be happy to chat.

About the author

Stuart Davidson is a Engineering Director in Skyscanner’s Production Platform Tribe. The tribe builds and operates a range of the systems that support Skyscanner’s product engineering, including large scale Kubernetes clusters, core AWS/web infrastructure, operational monitoring, core libraries, CI/CD and developer tools.
We aim to provide a powerful and resilient base that enables other engineers to focus on delivering awesome features to our customers: travellers.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Skyscanner Engineering

Skyscanner Engineering

4.5K Followers

We are the engineers at Skyscanner, the company changing how the world travels. Visit skyscanner.net to see how we walk the talk!