S3 Outage? (we hardly noticed!)

Truly · Published in Truly · 2 min read · Mar 2, 2017

Feb 28th was a rough day for the cloud. Amazon Web Services experienced a major outage in one of its oldest, most reliable, and most popular services, S3, taking out large chunks of the internet, including both consumer and enterprise applications.

According to TechCrunch:

“Affected websites and services include Quora, newsletter provider Sailthru … filesharing in Slack, and many more. Connected lightbulbs, thermostats and other IoT hardware [were also] impacted, with many unable to control these devices as a result of the outage.”

Scary.

Other victims of the outage were communication apps built on PaaS providers like Twilio. For much of the day, apps that rely on web-based communication technologies couldn’t make or receive calls. Fortunately, we made it through the day largely unscathed, thanks to some decisions we made several years ago.

How did we do it?

We’ve always been skeptical about using PaaS solutions because communication apps are 10x more fragile than typical web apps. When a call is made, it needs to reach the other party and stay connected throughout its lifetime with a very high level of performance. There are no retries, there are no ‘try reloading the page’ alerts… it just needs to work. With so many points of failure and such a narrow performance band, we decided to bring our voice infrastructure in-house and build things the right way:

  • Load Balancing Across Data Centers: this might seem ‘obvious’ to web developers, but resilience is surprisingly hard in communications, especially once you enable complex features like conferencing in third parties (a minimal failover sketch follows this list).
  • Independent Microservices: because we removed dependencies within our application, S3 was never a point of failure in any of our call flows (see the second sketch after this list).
  • 100% Feature Parity Across Devices: many communication apps offer support across multiple devices, but few truly deliver 100% of the functionality on all of them (mobile is no longer just the device you use in the car). Because our customers have true redundancy across devices, the small handful who were affected had a fallback option.
  • Monitoring & Status Communication: we knew something was wrong the moment S3 had issues, thanks to our monitoring setup in Datadog. We also proactively messaged our clients across multiple channels via StatusPage, before AWS had even acknowledged the failure (see the third sketch after this list).
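
To make the first point concrete, here is a minimal sketch of health-check-based failover across regions, assuming an ordered list of per-region endpoints. The hostnames and the /healthz path are hypothetical placeholders, not our actual topology.

```python
# Minimal cross-data-center failover sketch (hypothetical endpoints).
import urllib.request

# Ordered by preference: primary region first, then fallbacks.
REGION_HEALTH_CHECKS = [
    "https://voice-us-east.example.com/healthz",
    "https://voice-us-west.example.com/healthz",
    "https://voice-eu-west.example.com/healthz",
]

def pick_healthy_region(timeout: float = 2.0) -> str:
    """Return the base URL of the first region whose health check passes."""
    for url in REGION_HEALTH_CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url.rsplit("/", 1)[0]  # strip the /healthz path
        except OSError:
            continue  # region unreachable or slow; try the next one
    raise RuntimeError("no healthy voice region available")

print("routing new calls via", pick_healthy_region())
```

In practice the same idea usually lives in DNS or in the load balancer itself; the point is simply that call routing never depends on a single data center.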
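On the second point, the essence is that nothing a live call waits on ever touches S3: recordings land on local disk synchronously, and archival to S3 happens on a background worker, so an S3 outage only delays uploads. The class and method names here are illustrative, and the actual upload step is stubbed out.

```python
# Sketch of keeping S3 off the live call path (illustrative names).
import queue
import shutil
import threading
import time
from pathlib import Path

class RecordingStore:
    def __init__(self, local_dir: str):
        self.local_dir = Path(local_dir)
        self.local_dir.mkdir(parents=True, exist_ok=True)
        self.upload_queue: "queue.Queue[Path]" = queue.Queue()
        threading.Thread(target=self._upload_worker, daemon=True).start()

    def save(self, call_id: str, source_path: str) -> Path:
        """Synchronous, local-only write; this is all the call flow waits on."""
        dest = self.local_dir / f"{call_id}.wav"
        shutil.copy(source_path, dest)
        self.upload_queue.put(dest)  # archive later, off the critical path
        return dest

    def _upload_worker(self):
        while True:
            path = self.upload_queue.get()
            try:
                self._upload_to_s3(path)      # may fail during an S3 outage
            except Exception:
                time.sleep(30)                # back off, then retry later
                self.upload_queue.put(path)   # calls themselves are unaffected
            finally:
                self.upload_queue.task_done()

    def _upload_to_s3(self, path: Path):
        # Placeholder for the real archival step (e.g. an S3 put_object call).
        pass
```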
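And for the last point, the “tell customers before the vendor does” step can be as small as opening a StatusPage incident the moment a monitor fires. The page ID and token below come from hypothetical environment variables, and the payload shape should be double-checked against the current StatusPage API docs rather than taken as gospel.

```python
# Sketch: open a StatusPage incident when monitoring detects a problem.
import os
import requests

PAGE_ID = os.environ["STATUSPAGE_PAGE_ID"]  # hypothetical configuration
TOKEN = os.environ["STATUSPAGE_TOKEN"]

def open_incident(name: str, body: str) -> dict:
    resp = requests.post(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
        headers={"Authorization": f"OAuth {TOKEN}"},
        json={"incident": {"name": name, "status": "investigating", "body": body}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

open_incident(
    "Degraded call recording uploads",
    "We are seeing elevated errors from an upstream storage provider. "
    "Calls are unaffected; updates will be posted here.",
)
```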

While incidents can happen to anyone, they should always serve as a time for reflection and honest dialogue with customers. What went wrong? Why didn’t we catch this? How can we assure you that this class of problem won’t happen again in the future? If this incident served as a tipping point for you to go down our path, feel free to reach out. We’d love to help!
