Everything is on fire…and that’s OK: HA architecture and disaster recovery with AWS

Jedd Shneier
Published in WeBill · Aug 27, 2019 · 5 min read

How to leverage the high availability of AWS and the ease of automation to allow poor devops engineers to sleep soundly at night.

Single points of failure or the redheaded stepchildren of architecture

Note: I had asked our graphics team for a cute duck mascot to accompany this article. They said no, so please just imagine a cute duck.

Let’s look at a scenario. Mr Duck is a successful businessman who wishes to take his booming egg business to the cloud. After some research he opts to go with AWS as his cloud computing platform. He does some more research and identifies the perfect EC2 instance types to host his website’s web servers and database respectively. He even goes so far as to separate his servers into public and private subnets. (All good so far, you may think to yourself as you enjoy your morning coffee and perhaps a French pastry).

But then Mr Duck shows you his architecture diagram and that fluffy, buttery croissant turns to ashes in your mouth. The coffee, once so delicious, is now soured with the taste of disappointment and lies.

What have you done, Mr Duck???

Never speak to me or my architect ever again

It would seem that Mr Duck has put all his eggs in one basket as it were. The problem with the above design is simple: with no redundancy each instance has become a single point of failure.

Single points of failure (or SPOFs, as all the cool kids call them) are just what they sound like: a part of the system that, if it fails, takes the entire system down with it.

With each service running only a single instance, the loss of any instance will bring Mr Duck’s entire business empire to the ground. And this, unfortunately, is not an unlikely event. Instances go down all the time for all sorts of reasons, so if you are betting everything on a single stable server, sooner or later you’re going to find yourself a duck out of luck.

Check yourself before you wreck your bank account

A week later you run into Mr Duck at the local vegan sombrero market. Mr Duck has taken your advice to heart and pulled out all the stops in making his system highly available and fault tolerant. I’m talking Auto Scaling groups, Elastic Load Balancers, health checks, everything a healthy young business needs to grow up big and strong. You take a look at his new architecture diagram and feel a tear of joy in your eye. You couldn’t be more prou-

But wait, what’s this?

So close, yet so far

Oh no, Mr Duck has deployed his system into a single Availability Zone. And even though the AZ is big and well managed and feels like a safe place, in truth it’s just another SPOF for the business. What we need to realise is that entire AZs can (and do) go down. If you are deployed in a single zone and that zone happens to suffer a major interruption, or worse a catastrophic disaster, then you’re going to be just as dead in the water as you would have been if your single-instance architecture failed.

Build for failure, Scale for success

Now, AWS is very good and very quick at fixing AZ failures; these interruptions are not that common, and your servers (and your data) will still be there when the zone inevitably comes back up. But to paraphrase the famous saying: “Those who do not plan for failure are doomed to fail…and then get called in on a Saturday night to fix everything”.

What Mr Duck needs to understand is that there are two reasons we scale out. Firstly, we scale out for performance. In the above diagram, adding the ELB and Auto Scaling group is a performance (and probably a cost) improvement.

Secondly, we scale out to create highly available and highly reliable systems. The key to which is the magic word we all know and love: redundancy.

To have a system free of single points of failure you need redundancy. There is no escaping that. And yes, there will be added costs, but if you’re smart about it you can lump many of those costs in with performance improvements.

A cluster of load-balanced, auto-scaled instances across multiple AZs. Multi-AZ read replicas of your databases with automatic failover. Routing based on geoproximity while ignoring unhealthy nodes. These are all relatively easy-to-implement design changes that will bless you with both the high performance and the high availability you and your client and your boss so desperately crave.

And, most importantly, your precious Saturday nights will be safe.
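For the curious, here is a rough sketch (using Python and boto3) of what the first two of those changes could look like: an Auto Scaling group spread across multiple AZs behind a load balancer, plus Multi-AZ enabled on the database. Every name, subnet, ID and ARN below is a made-up placeholder rather than a real setup, so treat it as an illustration, not a recipe.

```python
import boto3

REGION = "eu-west-1"  # placeholder region

autoscaling = boto3.client("autoscaling", region_name=REGION)
rds = boto3.client("rds", region_name=REGION)

# Spread the web tier across three AZs and register it with the load
# balancer's target group, so the ELB only routes traffic to healthy instances.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="mr-duck-web",              # placeholder name
    LaunchTemplate={
        "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder ID
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    # One private subnet per Availability Zone: this is what spreads the
    # instances across AZs and removes the single-AZ point of failure.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
        "targetgroup/mr-duck-web/0123456789abcdef"   # placeholder ARN
    ],
    # Let the load balancer's health checks decide when an instance is sick.
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)

# Turn on Multi-AZ for the database so AWS keeps a synchronous standby in
# another AZ and fails over to it automatically if the primary goes down.
rds.modify_db_instance(
    DBInstanceIdentifier="mr-duck-eggs",             # placeholder identifier
    MultiAZ=True,
    ApplyImmediately=True,
)
```

The important bits are the list of subnets in different AZs and the ELB health check type: together they mean a sick instance (or a sick zone) simply stops receiving traffic while the group replaces what was lost.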

That’ll do duck, that’ll do

End of Part 1

So in summary, your first line of defence in disaster recovery should always be…not to have a disaster in the first place.

Now let’s not kid ourselves: that’s not going to happen, but we can still be ready when everything does go up in flames. Expect things to go wrong, expect everything to go wrong, and prepare accordingly.

Remember at every step that servers and clusters and AZs and whole regions might fail. Realise that this is a risk everyone has to face, from small startups to the Goliaths of cloud computing such as Netflix. And on that note, here is what we in the business call a segue.

Netflix knows better than anyone that things at scale fail at scale. But they don’t just prepare for failure: they make it happen. Though no longer actively maintained, the Simian Army is a set of tools for injecting chaos into your infrastructure at any scale.

From randomly stopping running instances to simulating the loss of whole AZs and even entire regions, this allows Netflix to force failures and test whether services keep working regardless. You can do this too, and I highly recommend you try it out some time to make sure that your HA design is as HA as you think.
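If you want to dip a toe in without adopting the whole Simian Army, a home-grown chaos monkey can be surprisingly small. Here is a minimal boto3 sketch, assuming an Auto Scaling group like the one above (the group name is a placeholder): it picks a random in-service instance, terminates it, and leaves the desired capacity alone so the group is forced to heal itself.

```python
import random

import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")  # placeholder region


def unleash_the_monkey(group_name: str) -> None:
    """Terminate one random in-service instance from an Auto Scaling group."""
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"No Auto Scaling group named {group_name!r}")

    in_service = [
        i["InstanceId"]
        for i in groups[0]["Instances"]
        if i["LifecycleState"] == "InService"
    ]
    if not in_service:
        print("Nothing to break today.")
        return

    victim = random.choice(in_service)
    print(f"Terminating {victim}. The group should replace it on its own.")
    # Keep the desired capacity unchanged so the group has to launch a replacement.
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=victim,
        ShouldDecrementDesiredCapacity=False,
    )


if __name__ == "__main__":
    unleash_the_monkey("mr-duck-web")  # placeholder group name
```

Run it against a staging environment first, watch whether the load balancer stops sending traffic to the victim and whether a replacement comes up, and only then decide how brave you feel about production.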

But even if you have built a reliable, resilient system, problems can still arise.

In part two of this three-part series I’ll be looking at automatic disaster recovery for when things do go wrong. It will be a more technical, in-depth discussion and I do hope you join me again for it.

Until then: Don’t be a duck. Be lekker.
