The Amazon S3 Debacle

Time for a postmortem…

I appreciate this will sound harsh, especially to those who have just pulled all-nighters dealing with it whilst I was watching Netflix (yes, really), but, contrary to a lot of the noise on Twitter and indeed in the press, the S3 outage didn’t need to be nearly as damaging as it was:

  1. Only one AWS region had issues; if you’d been using S3 in any other region, you’d have slept easy. Some are reporting that S3 was globally down; this is untrue.
  2. S3 is not a single server, contrary to what the BBC has just reported on Business Live.
  3. There is no need to be dependent on a single S3 region.

The sites that went down aren’t blameless: S3 (and many other AWS services) provides facilities that would have allowed them to deal smoothly with this disruption, had their engineering teams used them.

A solution outline for some of the simple cases:

  1. S3 has cross-region replication; use it to keep a copy of your objects in a second region (see the first sketch below).
  2. Combine this with CloudFront and Route 53 so that origin switching can be performed (see the second sketch below).
  3. Stand up new EC2 instances via CloudFormation at the point of failover.

[ You might even have gotten away with just tweaking image TTLs at your CDN ]
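
To make item 1 of the outline concrete, here’s a minimal sketch of switching on cross-region replication with boto3. The bucket names, regions and IAM role ARN are hypothetical placeholders, and the sketch assumes you’re happy to enable versioning on both buckets (replication requires it):

```
# Minimal sketch: enable S3 cross-region replication with boto3.
# Bucket names, regions and the IAM role ARN are illustrative placeholders.
import boto3

SOURCE_BUCKET = "example-assets-us-east-1"    # hypothetical primary bucket
REPLICA_BUCKET = "example-assets-eu-west-1"   # hypothetical replica bucket
REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-replication-role"

# Replication requires versioning to be enabled on both buckets.
for bucket, region in [(SOURCE_BUCKET, "us-east-1"), (REPLICA_BUCKET, "eu-west-1")]:
    boto3.client("s3", region_name=region).put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every new object written to the source bucket into the replica bucket.
boto3.client("s3", region_name="us-east-1").put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",                # empty prefix = all objects
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::" + REPLICA_BUCKET},
            }
        ],
    },
)
```

Once that’s in place, new objects written to the primary bucket are copied asynchronously to the second region, giving you a replica origin to fail over to.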
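
And for item 2, a sketch of the DNS side: Route 53 failover routing between two endpoints, for example two CloudFront distributions fronting the primary and replica buckets. The hosted zone ID, record name, distribution domain names and probe path below are all hypothetical:

```
# Minimal sketch: Route 53 DNS failover between a primary and a secondary
# endpoint. All identifiers and domain names are hypothetical.
import uuid
import boto3

HOSTED_ZONE_ID = "Z1EXAMPLE"                          # hypothetical hosted zone
RECORD_NAME = "assets.example.com."
PRIMARY_ENDPOINT = "d111111abcdef8.cloudfront.net"    # fronts the primary bucket
SECONDARY_ENDPOINT = "d222222abcdef8.cloudfront.net"  # fronts the replica bucket

route53 = boto3.client("route53")

# Health check against the primary endpoint; Route 53 serves the SECONDARY
# record automatically once this starts failing.
health_check_id = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": PRIMARY_ENDPOINT,
        "ResourcePath": "/healthcheck.txt",   # hypothetical probe object
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def failover_change(identifier, role, endpoint, check_id=None):
    """Build an UPSERT for one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": RECORD_NAME,
        "Type": "CNAME",
        "SetIdentifier": identifier,
        "Failover": role,                      # "PRIMARY" or "SECONDARY"
        "TTL": 60,                             # keep low so failover is quick
        "ResourceRecords": [{"Value": endpoint}],
    }
    if check_id:
        record["HealthCheckId"] = check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            failover_change("primary", "PRIMARY", PRIMARY_ENDPOINT, health_check_id),
            failover_change("secondary", "SECONDARY", SECONDARY_ENDPOINT),
        ]
    },
)
```

With a low TTL on the record, clients pick up the secondary endpoint within a couple of minutes of the health check failing rather than waiting for a human to intervene.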

It’s not at all an issue of one provider hosting too much of the internet, as some have suggested; it’s an issue of too many sites not having engineered appropriate reliability (and not just through code, by the way) into their systems. We’ve seen this with previous cloud outages, and indeed with organisations not hosted on the cloud (consider the airline-related problems of recent times).

Achieving this level of reliability costs money, and I’ve no doubt some sites will have made a conscious decision to just ride out failures rather than pay that additional price; fair enough. Equally, there are others:

  1. Who have not published an SLA and allowed their customers to develop unrealistic expectations of reliability.
  2. Who have engineering teams unaware of the methods of achieving robustness in the face of simple failure scenarios such as this one.
  3. Who have engineering teams warning of reliability issues, choose to ignore those warnings, and yet expect the same folks to lose sleep over pager calls.
  4. Who act as though good customer experience is solely about feature-set.

Amazon haven’t failed (they publish SLAs and blueprints for reliability); we have. This has long been an industry weakness, and it’s time we sharpened our game, because the stakes are getting higher with the likes of IoT. The necessary knowledge is increasingly available.

We need to stop making excuses and educate ourselves and our customers (you have a status dashboard, right?). It wouldn’t do any harm for the press to upgrade their knowledge a little as well. What they’re publishing isn’t fake news, but it is grossly inaccurate, leading to focus on the wrong things and to lost progress. There’s also a small group of vendors spreading FUD in order to promote their own offerings. Their questionable ethics aside, many of the arguments being used are factually incorrect and/or less than rigorous. Professional engineering organisations should be above this sort of thing.

[ To be completely impartial: AWS did have some problems with the accuracy of their status dashboard, seemingly related to the S3 US-EAST-1 issues. It will be interesting to see what the postmortem says on this subject when it arrives ]


Update 2/3/17 — The postmortem is up just 48 hours after the event; likely that’s due to the relatively straightforward root cause in this case. We’ve all fat-fingered something at some point or another! There are many lessons in it, perhaps most notably that even simple things can create a great deal of havoc. All the more motivation for those sites caught out to set about putting some basic contingencies in place.