Yes, S3 was down for hours. Don’t make expensive decisions because of it.

Ben Kehoe
HackerNoon.com
3 min read · Mar 1, 2017

As always happens when a major cloud provider has a significant outage, the #apocalypS3 has people claiming everyone should migrate to a different provider, adopt a multi-cloud strategy, or at the very least implement multi-region failover logic. These are all flawed, short-sighted suggestions.

First, migrating to a different cloud provider doesn’t solve anything. Every cloud provider has outages. There are ways to mitigate the impact of these outages (more on that in a bit), but if you’re reliant on a single cloud provider and your application goes down when subject to a regional service outage, you aren’t reducing risk by porting that architecture to a different provider. And remember that SLAs aren’t performance guarantees, they are financial guarantees — they define how much less you’ll owe if the performance isn’t met.
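
To put numbers on that distinction, here is a back-of-envelope sketch (every figure below is a hypothetical placeholder, not any provider’s actual SLA terms) of what a service credit covers versus what an outage costs you:

```python
# Back-of-envelope: what an SLA credit covers vs. what the outage cost you.
# All numbers are illustrative assumptions, not any provider's actual SLA terms.

HOURS_IN_MONTH = 30 * 24

outage_hours = 4
availability = 1 - outage_hours / HOURS_IN_MONTH   # ~99.4% for the month

monthly_bill = 5_000            # hypothetical monthly spend on the affected service, USD
sla_credit_rate = 0.10          # hypothetical credit tier for missing the availability target
lost_revenue_per_hour = 2_000   # hypothetical business impact of being down, USD

credit = monthly_bill * sla_credit_rate
lost_revenue = lost_revenue_per_hour * outage_hours

print(f"monthly availability: {availability:.2%}")
print(f"SLA credit:           ${credit:,.0f}")       # refunds a slice of the bill
print(f"lost revenue:         ${lost_revenue:,.0f}")  # the part the SLA never covers
```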

Given that, you might think adopting a multi-cloud strategy is a good bet. And it absolutely mitigates the risk of even a global outage by one provider. But multi-cloud architectures are hard. If you’re running everything in VMs, then sure, EC2 and GCE are mostly equivalent, but you’re still doubling the ops burden of monitoring the underlying services. And the minute you start using PaaS, FaaS, or SaaS, all of a sudden you’ve got different APIs with different feature sets to deal with, and you lose the ability to exploit any functionality they don’t have in common. It’s true that GCS has an S3-compatible API, but what about lifecycle policies, event hooks, and logs? This isn’t to say a multi-cloud strategy is never a good idea. Netflix has very good reasons to implement it — but chances are, your company has neither the same challenges nor the same resources.
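
To make the lowest-common-denominator point concrete, here is a minimal sketch using boto3 against GCS’s S3-compatible XML endpoint (the bucket name and HMAC interoperability keys are placeholders, and it assumes interoperability access is already enabled on the GCS side):

```python
# Minimal sketch: pointing boto3 at GCS's S3-compatible XML API.
# Assumes a GCS bucket with HMAC interoperability keys already set up;
# the bucket name and credentials below are placeholders.
import boto3

gcs = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="GOOG_HMAC_ACCESS_KEY",   # placeholder
    aws_secret_access_key="GOOG_HMAC_SECRET",   # placeholder
)

# Plain object operations translate cleanly across the compatibility layer...
gcs.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hi")
print(gcs.get_object(Bucket="my-bucket", Key="hello.txt")["Body"].read())

# ...but provider-specific features don't. S3-only wiring like bucket event
# notifications to Lambda/SQS/SNS has no equivalent through this endpoint:
# gcs.put_bucket_notification_configuration(...)  # works against S3, not here
```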

So, you should probably stick with your current cloud provider. But you should at least implement a multi-region strategy, right? This all depends. Maintaining a multi-region deployment that you can swap out live is less work than a multi-cloud deployment, but is still non-trivial. And it comes with associated costs! Live cross-region replication of a DynamoDB table isn’t cheap. It’s almost surely worth it to implement cross-region disaster recovery (i.e., backups, etc.), but before going the route of multi-region live deployments, consider the business impact of major cloud outages. Yesterday, half the internet was barely functioning because of the S3 outage. Did you lose business because of it? Would you have gained new customers or traffic if you had been up during those few hours? How much is that worth? Then compare that to the development and operations costs of a multi-region deployment, and make an informed decision.
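
As a sketch of that comparison (every number is a made-up placeholder; substitute your own revenue and infrastructure figures):

```python
# Rough framing: expected annual loss from riding out rare regional outages
# vs. the recurring cost of a live multi-region deployment.
# Every number is a made-up placeholder; plug in your own.

outages_per_year = 1                  # major regional outages you'd otherwise eat
outage_duration_hours = 4
lost_revenue_per_hour = 2_000         # USD

replication_cost_per_month = 1_500    # cross-region DynamoDB/S3 replication, second stack, USD
extra_ops_hours_per_month = 20        # failover drills, monitoring two regions
ops_hourly_rate = 100                 # USD

expected_outage_loss = outages_per_year * outage_duration_hours * lost_revenue_per_hour
multi_region_cost = 12 * (replication_cost_per_month
                          + extra_ops_hours_per_month * ops_hourly_rate)

print(f"expected annual outage loss:  ${expected_outage_loss:,.0f}")
print(f"annual multi-region overhead: ${multi_region_cost:,.0f}")
# If the second number dwarfs the first, backups plus a practiced recovery
# plan may be the better spend than live multi-region failover.
```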

An alternative to multi-region is simply moving out of us-east-1 and into one of the other big-four regions (us-west-2, eu-west-1, ap-northeast-1), or even a region closer to your customers (assuming it has all the services you need). While it feels like those regions experience fewer incidents, that may just be a matter of visibility, and I don’t have the data to say one way or the other.

I understand the urge to freak out. S3 was down in us-east-1! It took down half of AWS’s other services with it! Everything was on fire! But it’s worth taking a step back and evaluating the business impact, the low frequency of these events, and the cost of mitigation strategies before choosing to implement mitigation measures — and remember that being unavailable during a major service outage can be a valid choice, too.
