How not to architect your AWS solution

Nate Aiman-Smith
RunAsCloud

--

[This is the first in what will be a series of posts about infrastructure and cloud architecture]

One of my biggest Internet time sinks is Quora; I get a ton of enjoyment out of reading informed, well-written opinions on various matters in which I have at least some interest. Besides personal interests, I also read questions and answers about startups, entrepreneurship, AWS, general tech trends, etc.

One very prominent poster on various topics is David S. Rose, who at the time of this writing has posted 5,449 answers. His answers come up a lot in my feed, and when they include advice I'm almost always in agreement with it. However, one of his recent answers caught my attention. The question was "Has Amazon S3 ever lost data permanently?", to which David's original answer was simply "yes", although after he checked with his Director of Engineering, the answer was qualified thus:

When someone uploads a document, we store it in S3 but save its location and associated metadata in the database in EBS. When we lost EBS (the database), we didn't lose the actual S3 documents but we did lose the lists of metadata associating files to specific accounts[…] Therefore, from an end user perspective, we may as well have lost S3 data when we lost EBS data. That's why we have off-Amazon backups of everything.

As an infrastructure and AWS architect, this was painful to read: the system was architected such that data stored in a highly durable and resilient medium (S3) was rendered useless by the loss of data in a much less resilient medium (EBS). From a resiliency perspective, this is like putting your most important documents inside a tamper-proof, fire-proof safe and then leaving the key on top of the safe: sure, the safe's contents are protected from fire, but the key isn't, and without the key you can't open the safe.
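To make the fix concrete, here's a minimal sketch in Python of one way to close that gap: write the account association into the S3 object's own metadata at upload time, so the mapping is as durable as the documents themselves and the EBS-backed database becomes a fast-lookup cache rather than the only copy. The bucket, key, and metadata names are hypothetical; the actual upload would go through boto3's `put_object`, which does accept a `Metadata` argument.

```python
# Sketch: keep the account-to-object mapping in S3 itself, not only in the
# EBS-backed database. Bucket/key/metadata names are illustrative.

def build_upload_kwargs(bucket, key, account_id, body):
    """Build put_object arguments that embed the owning account in the
    S3 object's metadata, so the mapping survives a database loss."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "Metadata": {"account-id": account_id},  # stored with the object
    }

def upload_document(s3_client, bucket, key, account_id, body):
    """Upload a document with its account association attached.

    The database row (for fast lookups) is now a cache that can be
    rebuilt by listing the bucket and reading object metadata.
    """
    s3_client.put_object(**build_upload_kwargs(bucket, key, account_id, body))

# With real credentials this would be called as:
#   import boto3
#   upload_document(boto3.client("s3"), "my-docs-bucket",
#                   "doc-123.pdf", "acct-42", b"...file bytes...")
```

The trade-off is that rebuilding the mapping after a failure requires a bucket listing pass, but that beats permanent loss.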

In case the description of the architecture isn't painful enough on its own, it's capped with "that's why we have off-Amazon backups of everything". To extend the fireproof-safe analogy, this is like saying: "to prevent this problem from happening again, we've made copies of everything and stored them in another fireproof safe in a different location, and every night we make a fresh set of copies and transfer them to the new safe." I suppose it's effective, but it's also unnecessarily complicated and expensive.

This is a simple example in which architecting the system correctly would have saved a LOT of pain and expense down the road. Simply adding a scheduled backup of the less resilient data using AWS's own tools would have prevented the vast majority of the data loss. Without knowing the particulars of the system, I can think of three strategies offhand to prevent any data loss. Of course, I can come up with those ideas quickly because I've seen many similar cases; the Director of Engineering is obviously a smart guy who understands what the services do, but he didn't have quite enough specific knowledge of and familiarity with AWS to see the flaw in the design.
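For instance, the "scheduled backup using AWS's own tools" could be as simple as a nightly EBS snapshot of the database volume. A hedged sketch follows: the volume ID is a placeholder, and in practice the schedule would usually come from cron, an EventBridge rule, or an Amazon Data Lifecycle Manager policy rather than hand-rolled code.

```python
# Sketch: nightly EBS snapshot of the database volume, tagged so old
# snapshots can be found and expired later. Volume ID is hypothetical.
from datetime import datetime, timezone

def build_snapshot_request(volume_id, purpose="nightly-db-backup"):
    """Build the arguments for the EC2 CreateSnapshot API call."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return {
        "VolumeId": volume_id,
        "Description": f"{purpose} {stamp}",
        "TagSpecifications": [{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "purpose", "Value": purpose}],
        }],
    }

# With real credentials this would be invoked as:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.create_snapshot(**build_snapshot_request("vol-0123456789abcdef0"))
```

Snapshots are incremental and stored in S3 behind the scenes, so a nightly schedule like this is far cheaper and simpler than maintaining a parallel off-Amazon copy of everything.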

If your company wants to build a solution in AWS, make sure that your architecture has been reviewed (or, even better, designed) by a certified AWS Solutions Architect. If you want to discuss your use case or application, feel free to contact me and I’ll be happy to go over it with you.

--