AWS Outage + UberOps
In case you haven’t heard, a major outage from the leading cloud experts at Amazon Web Services (AWS) has us on our toes. We can’t help but think it went something like this:
So What Happened?
According to Amazon representatives, a command was executed with the intent to remove a “small number of servers” from an S3 subsystem related to AWS billing (1). Long story short, more than a small number of servers were removed. They say to err is human.
For a little backstory about us and why this matters, UberOps is a company of Data Integration, Cyber Security, and Cloud Computing experts, specializing in Healthcare. We are an AWS Public Sector Partner company and a certified AWS GovCloud Partner, and we manage hundreds of healthcare data exchange servers within the AWS cloud.
Throughout this fiasco, we experienced about 4 hours of downtime.
Let’s Go Back to Tuesday, February 28th.
A small and manageable outage was felt near the end of the week of February 20th, impacting several client health data systems. Looking back with present-day hindsight, Blake and Andrew, Data Protection Officers from the UberOps Security Network Operations team, chuckle fondly at its triviality. As for the outage that earned national attention, when and how did our SecOps team notice something was wrong? They admit that the two outages have started to merge in their memory, but thanks to comprehensive logs and issue tracking, it only took a few searches to get the full rundown.
12:37PM EST: The fatal S3 server removal command. See graphic above.
12:45PM EST: Philip Morrison, UberOps Chief Security Officer, alerts the troops, who are out to lunch, to a problem with a log server via text message. Check, please.
1:17PM EST: Phil alerts the company internally and creates a static page for push updates on the situation for clients. You rock, Phil!
1:23PM EST: Network Operations receives a ticket informing them that all transmissions received today within a nationwide statistics and documentation network were being truncated. Not ideal.
2:00PM EST: GoToMeeting conference lines begin failing, and Virtru encrypted emails stop sending. Uh oh.
2:36PM EST: Website for our Newborn Screening platform Bloomdot goes down. Don’t panic!
2:50PM EST: The first phone call from an UberOps family member: “Is this cloud outage I’m seeing on the news affecting you at work?” Yes it is, honey.
It is remarkable how quickly the Roman columns begin to tumble when everything is connected through reliance on something like S3.
So After All Is Said and Done, Is All of This OK?
Thankfully, we are forward-thinking about this kind of event. Though this is the first outage of this magnitude, it is not the first of its kind, so we have already had chances to learn a thing or two about reliability and availability.
“What’s the worst that could happen?” is not just a sitcom catchphrase; it’s a question that the folks responsible for hardening and guaranteeing these systems ask themselves on a regular basis. Chief Technology Officer Frans de Wet shares a cautionary example,
“This time it was a four-hour outage, but who knows what kind of future typo could wipe out the us-east-1 region for good. That is why we have all of our infrastructure replicated in active or passive mode in other regions, depending on availability and recovery requirements”.
He explains that events like this force us to learn how to “balance between being ‘always available’ and being prepared to take a hit for an outage”.
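To make the replication idea concrete, here is a minimal sketch of what configuring S3 cross-region replication looks like through the S3 API, one common way to keep a passive copy of bucket data in a second region. The bucket names, ARNs, and account ID are hypothetical placeholders, not UberOps’s actual infrastructure; the configuration dictionary matches the shape expected by S3’s PutBucketReplication operation (boto3’s `put_bucket_replication`).

```python
import json

# Hypothetical names -- substitute your own buckets, role, and account.
SOURCE_BUCKET = "example-healthdata-us-east-1"
DEST_BUCKET_ARN = "arn:aws:s3:::example-healthdata-us-west-2"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/example-s3-replication"

# Replication configuration in the shape S3's PutBucketReplication expects.
# Versioning must already be enabled on both buckets for replication to work.
replication_config = {
    "Role": REPLICATION_ROLE_ARN,
    "Rules": [
        {
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix = replicate the whole bucket
            "Destination": {
                "Bucket": DEST_BUCKET_ARN,
                "StorageClass": "STANDARD",
            },
        }
    ],
}

def build_put_replication_kwargs(bucket: str, config: dict) -> dict:
    """Keyword arguments you would pass to boto3's
    s3_client.put_bucket_replication(...) to apply the rule."""
    return {"Bucket": bucket, "ReplicationConfiguration": config}

kwargs = build_put_replication_kwargs(SOURCE_BUCKET, replication_config)
print(json.dumps(kwargs, indent=2))
```

Replication like this gives you a warm copy of the data in another region; it does not by itself fail traffic over, which is where the active/passive decision Frans describes comes in.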
Let’s Dive Even Further.
While the outage affected some of the world’s most beloved platforms (Netflix, MailChimp, and Twitter), when it comes to the systems for whose integrity we are responsible, the stakes are a little higher than movies and coupons. After all, a downtime of four hours is not ideal, but it is certainly manageable. For some of the systems running within the environments we manage, a downtime lasting longer than 8 hours can mean delayed results, which shape healthcare decisions for people who do not have time to wait around for S3 to begin functioning again. When asked about the pressures of that, Network Operations did not hesitate to share their confidence in their work; they had no control over how long the outage would be, but no critical data was ever at risk.
What Happens With Client Work When Outages Like This Happen?
For good and obvious reasons, UberOps is always working to improve scalability and reliability for clients through researching and implementing the newest available tools and best practices. For the same reasons, clients always have questions about how things can be made better for them. Frans de Wet reflects on his experiences,
“I can remember on several occasions being asked, ‘How can we expand availability?’ You have to remind them that four hours of downtime in a year is about as good as you can ask for”.
Network Operations echoes this, confirming that we have always done what customers have wanted and have been willing to pay for. Further risk analysis will show what some organizations just don’t want to hear: there is always more that can be done, but data redundancy comes at a cost.
So What is the Big Picture?
Diversifying and safeguarding your cloud environment doesn’t have to mean hugging servers, as some would suggest. We have all of our infrastructure replicated in other regions in preparation for events like, and worse than, this one. AWS is still the best option out there for security, reliability, and cost, so treat this outage as a learning experience to start asking the right architectural questions that will protect your data and operations.
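One of those architectural questions is how your systems decide where to send traffic when a whole region goes dark. The toy sketch below shows active/passive failover logic in its simplest form: try regions in priority order and use the first one whose health probe passes. The endpoints are hypothetical, and the health probe is passed in as a callable so a real check (an HTTP ping, a Route 53 health check, etc.) can be swapped in; this is an illustration of the pattern, not a description of UberOps’s production setup.

```python
# Hypothetical endpoints -- substitute your own primary and standby regions.
ENDPOINTS = [
    ("us-east-1", "https://api.example.com"),       # active
    ("us-west-2", "https://api-west.example.com"),  # passive standby
]

def pick_endpoint(is_healthy):
    """Return (region, url) for the first healthy endpoint in priority order.

    `is_healthy` is a callable taking a URL and returning a bool, so the
    actual probe can be an HTTP check, a DNS health check, or a test stub.
    """
    for region, url in ENDPOINTS:
        if is_healthy(url):
            return region, url
    raise RuntimeError("no healthy region available")

# Simulate the February 28th scenario: us-east-1 down, us-west-2 up.
down = {"https://api.example.com"}
region, url = pick_endpoint(lambda u: u not in down)
print(region)  # -> us-west-2
```

Real deployments usually push this decision into DNS or a load balancer rather than application code, but the question the sketch forces you to answer is the same one the outage did: what happens to your traffic when the primary region simply isn’t there?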