How a couple of characters brought down our site

Introduction

What is the Cells Architecture?

  • A cell is in a single region and consists of several Kubernetes clusters in each availability zone.
  • Traffic is prioritised intra-AZ, then intra-region, before going cross-region in a failure mode.
  • Services within a cell are deployed in an “n+2” configuration, meaning we should be able to serve 100% of traffic with one cluster down due to failure and one drained for maintenance (see the sizing sketch below).
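
To make the “n+2” point concrete, here is a minimal sketch of the sizing rule it implies: with two clusters out of action at once (one failed, one drained), each remaining cluster has to absorb the cell’s full peak. The figures and function below are purely illustrative, not Skyscanner’s actual capacity model.

```python
# A minimal sketch of the "n+2" sizing rule described above.
# Numbers here are illustrative, not Skyscanner's real figures.

def required_cluster_capacity(peak_rps: float, clusters_in_cell: int) -> float:
    """Peak RPS each cluster must absorb so the cell still serves 100%
    of traffic with one cluster failed and one drained for maintenance."""
    surviving = clusters_in_cell - 2  # two clusters out at once: one failed, one drained
    if surviving < 1:
        raise ValueError("n+2 needs at least three clusters per cell")
    return peak_rps / surviving

# Example: a cell of 4 clusters serving a hypothetical 30,000 RPS peak
# must size every cluster for 15,000 RPS, not the naive 7,500.
print(required_cluster_capacity(peak_rps=30_000, clusters_in_cell=4))
```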

What Happened?

A simple change with big implications

How did we resolve things?

  • First, we provide a very quick summary of the incident and its impact, so that other areas of the business can communicate with external stakeholders if required.
  • Second, we prepare the timeline of what happened and when, which really facilitates understanding. This is critical to do immediately so that we don’t lose key data through automatic retention clean-up.
  • Third, we investigate and write up our findings in a document called an Incident Learning Debrief (ILD). Some squads will use Ishikawa thinking (https://en.wikipedia.org/wiki/Ishikawa_diagram) to determine the root causes of what happened.
  • Fourth, we run an ILD review with an external facilitator (usually a senior engineer from another area) to dig into the potential solutions to those problems and scope them out.

What did we learn?

  • Don’t do global config deploys: Duh, right? Well, this isn’t quite as obvious as it seems. k8s is a complex system and there are many different ways to apply changes to it. In many cases we don’t do global configuration changes, and we have spent a great deal of time and effort to prevent them, but we hadn’t anticipated this particular change scenario because it happens so infrequently. (A sketch of a staged, per-region rollout follows this list.)
  • When you use templates and logic in configuration, it becomes code: This configuration evolved in complexity over time, with templating and logic being introduced to make things easier. However, we did not introduce testing (or even linting!) as we increased the complexity of what we were doing, because we didn’t think of these configuration files as anything but config. (A sketch of what such a test might look like also follows this list.)
  • Plan for the worst disaster scenario: Our scenarios and runbooks just didn’t get aggressive enough in the scope and scale of failure. Wargaming more drastic situations would have given us an opportunity to walk through some of the “what ifs?” and make some decisions around risk mitigation. That being said, you can’t plan for everything; we just don’t think we were pessimistic enough in our planning and runbooks.
  • Verify your back-up and restore processes: Any decent systems administrator will tell you that a back-up isn’t a back-up until you’ve restored it. Thankfully our back-ups were ready to go, but an IAM policy change had made them difficult to obtain at a critical time. When was the last time you restored your service from back-up? And what if <enter region here> is down?
  • Refactor your runbooks: Runbooks are living documents which need constant care and attention alongside the code. On top of that, though, consider the UX of documents which will be read at early o’clock in the morning by a stressed engineer. Is the context clear? Are the steps clear, even idempotent where appropriate?
  • You can go too far with automation: Did we really need to template this configuration against the regions it was to be rolled out to? If we didn’t, there was a chance of configuration drift; if we did, there was a chance of our automation rolling a change out across many regions at once. What’s the best balance? How might you mitigate the risk?
  • Incident Commanders rock! In the event of an incident, someone will take on the role of incident commander, but for this incident in particular our most experienced incident commander was on hand to manage the situation, and it made such a difference. Here’s a direct quote from one of the engineers on the night:
    “I’m frequently a cynic, but the positivity and calmness to give us the space to triage and recover from even an outage as catastrophic as this without any hint of blame was a real testament to Skyscanner’s culture. I don’t think I’ve been as proud of anything during my time at Skyscanner as the full response on Wednesday night to get us back to serving travellers.”
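
On the “Don’t do global config deploys” lesson, the sketch below illustrates the general shape of a staged, per-region rollout with a health gate between regions, so a bad change stops at its first blast radius instead of reaching everywhere at once. It is illustrative only: the region names are examples, and apply_config / region_is_healthy are hypothetical placeholders for whatever actually applies and verifies a change, not Skyscanner tooling or a kubectl API.

```python
# Minimal sketch of a staged, per-region rollout with a health gate.
# apply_config and region_is_healthy are hypothetical placeholders.
import time

REGIONS = ["eu-west-1", "eu-central-1", "ap-southeast-1"]  # example regions only

def apply_config(region: str, manifest: str) -> None:
    """Placeholder for whatever applies the rendered config to one region."""
    print(f"applying {manifest} to {region}")

def region_is_healthy(region: str) -> bool:
    """Placeholder for post-deploy checks (error rates, pod readiness, ...)."""
    return True

def staged_rollout(manifest: str, soak_seconds: int = 300) -> None:
    for region in REGIONS:
        apply_config(region, manifest)
        time.sleep(soak_seconds)  # let the change soak before judging it
        if not region_is_healthy(region):
            raise RuntimeError(f"halting rollout: {region} looks unhealthy")
        # only move on to the next region once this one looks good

staged_rollout(manifest="rendered-config.yaml", soak_seconds=0)
```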
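
And on the templated-configuration lesson: once a config file grows templates and logic, it can be exercised like code before it ships anywhere. The sketch below renders an example template for every target region and checks that the output still parses and keeps the keys it must have. Jinja2 and PyYAML are assumed purely for illustration; the template, keys and regions are invented and are not the configuration involved in this incident.

```python
# Minimal sketch: treat templated configuration as code by rendering it for
# every target and validating the result before deploying anything.
# jinja2/PyYAML and this template are illustrative choices only.
import jinja2
import yaml

TEMPLATE = """
region: {{ region }}
cells:
  enabled: true
  clusters_per_cell: {{ clusters }}
"""

def render_and_validate(region: str, clusters: int) -> dict:
    rendered = jinja2.Template(TEMPLATE, undefined=jinja2.StrictUndefined).render(
        region=region, clusters=clusters
    )
    config = yaml.safe_load(rendered)                  # fails loudly on malformed YAML
    assert config["region"] == region                  # the template used its inputs
    assert config["cells"]["clusters_per_cell"] >= 3   # n+2 needs at least three clusters
    return config

# Run the check for every region before the change is applied anywhere.
for region in ["eu-west-1", "eu-central-1", "ap-southeast-1"]:
    render_and_validate(region, clusters=4)
print("all rendered configs are valid")
```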

About the author


We are the engineers at Skyscanner, the company changing how the world travels. Visit skyscanner.net to see how we walk the talk!
