Disaster Recovery for Cloud Solutions Is Obsolete

First in a series on designing for “Disaster Tolerance” in the cloud

Ken Robbins
CloudPegboard
9 min read · Apr 24, 2019


Photo by Yosh Ginsu on Unsplash

Thirty years ago today, I was deep into procurement for my new content distribution startup, Weather Fax, Inc. (an aviation weather service). I'm now at nearly the same phase of my new startup, Cloud Pegboard (a tool for AWS practitioners). This time, however, because we now have the cloud, I was done with compute procurement in about 30 seconds, and try as I might, I can't seem to spend more than a couple of dollars a month beyond the AWS free tier.

The present economics of compute infrastructure, for startups and enterprises alike, are dramatically different from those of my 1989 venture. Take a look at the fragment from my inventory for my first "computer room." One server (an amusingly underpowered 20 MHz '286 with 2 MB of RAM and an 80 MB hard disk) cost $3100 (equivalent to $6500 in today's dollars). Our ten such servers therefore represented a $65k-equivalent investment (which would buy over 16M Amazon EC2 t3.micro hours today!). At least my dumpster diving scored me a free 19" rack!

Inventory list from my 1989 startup to provide aviation weather to pilots via phone/voice and fax

I bootstrapped my small company, but it was common for tech startups at the time to require significant investment to build or lease data center space and infrastructure. Of course, if scarce funding is being spent on computers, then less is available for engineering, marketing, and sales. Given this backdrop, there was a very high barrier to the idea of creating a disaster recovery (DR) site. The disaster risk was quite real, since all of your tech business's eggs were in one location and disasters really do happen, but the cost to mitigate that risk was also quite high.

Weather Fax had failures. Ironically, they were weather-induced (Hurricane Bob in one case, a microburst in another). We could not afford to deploy a DR capability. We had four hours of UPS battery, and we exceeded that more than once. I still have a cold-sweat Pavlovian response whenever I hear the iconic musical-scale sound of my old Skytel pager.

If you could get away with a risk calculation and not build a DR site, you would just close your eyes and hope for the best (interestingly, this is also how my mom used to merge into the Boston airport tunnel, where eight lanes squeeze down to two). If not, and you were sufficiently funded, then you built a DR site but tried to minimize its cost as much as possible (still today's practice). Many DR plans include a DR site that has just enough capacity to get by; others hedge their bets and even plan on emergency procurement for some elements; still others cycle their old hardware to the DR site. In all cases, DR sites are deemed necessary insurance, but no one really wants to pay for insurance, do they?

Shifting how we think about disaster mitigation

DR concepts have been around in tech for a long time. They're part of our IT algebra and embedded in our common vernacular when we talk about redundancy using terms such as N+1 or 2N redundancy, DR sites, DR plans, and DR testing. All of these concepts are important and apply well to traditional IT infrastructure. In fact, disaster risks are arguably far greater for traditional (non-cloud) infrastructure, since a single backhoe, fire, earthquake, or flood can easily knock out any given site.

Modern cloud-based solutions are different, though. In the past, the quality and resiliency of your data center were directly proportional to your investment. You could build or lease a Tier 4 (99.995% uptime) data center, but a significant number of businesses were (and are) in Tier 1–3 data centers, and I suspect a great uncounted number were running out of server rooms or off servers under desks (I've seen quite a few businesses running substantial operations on servers slid into those chrome wire shelves; shall we call these Tier minus 1?).

In contrast, any modern cloud data center is equal to or better than any traditional Tier 4 data center. Therefore, the risk of failure is much lower, and the types of threats that can cause a major outage are increasingly esoteric (it's going to take a lot more than a contractor forgetting to call Dig Safe to knock an AWS data center, availability zone, or region offline). Add to that, capacity in a region is flexibly distributed among multiple (sometimes dozens of) data center buildings in a given geography.

Perhaps the most significant difference in how to think about DR in a cloud context is that cloud resources are elastic.

Perhaps the most significant difference in how to think about DR in a cloud context is that cloud resources are elastic. If I want to build a DR site, I don't need to dedicate a fixed reserve of capacity. Terms like 2N redundancy don't quite apply anymore: if I'm using some form of autoscaling, my DR site doesn't equal N. It's more like a potential N, while the primary is a kinetic N. In the same way that we've had to shift our mindset, behaviors, and language to accommodate the benefits of Agile methods after decades of Waterfall approaches, and to shift to DevOps from siloed approaches, we need to start changing how we think about DR concepts.

In fact, I'll start by saying that the term "Disaster Recovery" itself is an outdated concept for many cloud-based solutions. It presupposes that we are working with expensive, static, physical gear that needs to be brought online as a discrete "recovery" activity in the event of a disaster. This is antithetical to the realities of an elastic cloud, where resources can be scaled up dynamically in diverse regions, and where all of this can be done in software by treating infrastructure as code. Instead of disaster recovery, I suggest that we start talking about Disaster Tolerance (DT). The distinction may seem subtle, but the difference is similar to treating your servers as cattle instead of pets. We should handle disaster events as automatic failovers, or simply as load shifting that is part of the intrinsic design, instead of requiring manual intervention.

We should handle disaster events as automatic failovers, or simply as load shifting that is part of the intrinsic design, instead of requiring manual intervention.

By thinking about Disaster Tolerance at the outset, we can make early design decisions that cost very little but yield significant benefits for a reliable, resilient, and cost-effective technical operation. It's always hard to retrofit non-functional requirements after the fact. And it's a shame to have to, since with just a little extra patience at the start, the incremental effort and cost can be minimal.

As a quick example of how Cloud Pegboard handles DT for its underlying API and microservices: we use AWS CloudFormation to create a stack that includes an Amazon API Gateway and a set of AWS Lambda functions (I'll defer discussion of the rest of the stack until my next post). It takes a single command and under a minute to deploy this to a given region (passed as a parameter), so creating a secondary region takes just another minute with a different region parameter. In front of this, we use Amazon Route 53 to route traffic to the different API Gateways based on health and latency. Since API Gateway and Lambda, like many AWS serverless services, charge only for what is actually used, we have a 2N potential but still only pay for N (which is itself a dynamic value based on demand). As you can see, with a properly parameterized CloudFormation template there is no extra work to get the full benefit (and continuous-deployment updates are just as easy). More importantly, there is no "DR plan" to execute: if a region fails, or even if I break one region with some bad code, traffic will seamlessly and automatically fail over to the alternate region. This same architecture can also be used for blue/green deployments, giving even more utility from some pretty simple architectural choices.

Simplified view of Cloud Pegboard’s disaster tolerant design for its APIs (persistence not shown)
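
To make that concrete, here is a minimal sketch of the pattern. This is not Cloud Pegboard's actual template: the resource names, domain names, runtime, and health-check path are illustrative, and details such as API Gateway custom domain names and certificates are omitted. The first template is region-agnostic and is deployed once per region.

```yaml
# regional-api.yaml -- the same template is deployed to every region
# (after `aws cloudformation package` has uploaded the function code), e.g.:
#   aws cloudformation deploy --template-file packaged.yaml --stack-name api \
#       --capabilities CAPABILITY_IAM --region us-east-1
#   aws cloudformation deploy --template-file packaged.yaml --stack-name api \
#       --capabilities CAPABILITY_IAM --region us-west-2
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Resources:
  ApiFunction:
    Type: AWS::Serverless::Function   # Lambda function fronted by API Gateway
    Properties:
      Handler: app.handler
      Runtime: python3.7
      CodeUri: src/
      Events:
        GetItems:
          Type: Api                   # implicit Amazon API Gateway REST API
          Properties:
            Path: /items
            Method: get

Outputs:
  ApiEndpoint:
    Description: Invoke URL for this region's copy of the API
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod"
```

A second, global template (deployed once) defines the Route 53 latency-based records and health checks, so an unhealthy region simply stops receiving answers:

```yaml
# dns.yaml -- latency-based routing with health checks (illustrative values)
AWSTemplateFormatVersion: "2010-09-09"

Parameters:
  HostedZoneId:
    Type: AWS::Route53::HostedZone::Id
  UsEast1Target:
    Type: String        # e.g. the us-east-1 API's regional domain name
  UsWest2Target:
    Type: String

Resources:
  UsEast1HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: !Ref UsEast1Target
        ResourcePath: /Prod/health    # hypothetical health endpoint
        RequestInterval: 30
        FailureThreshold: 3

  UsEast1Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZoneId
      Name: api.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: us-east-1
      Region: us-east-1               # latency-based routing
      HealthCheckId: !Ref UsEast1HealthCheck
      ResourceRecords:
        - !Ref UsEast1Target

  # A matching health check and record set would be defined for us-west-2,
  # referencing UsWest2Target with SetIdentifier/Region set to us-west-2.
```

Because the regional template takes no region-specific values, standing up the secondary region really is just rerunning the same deploy command with a different --region flag; from there, Route 53's health checks make the failover decision with no runbook involved.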

Why design for Disaster Tolerance up front

Disaster Tolerance confers the same benefits as any fault-tolerant or self-healing design. Your customers are not impacted in the face of inevitable failures, and your business does better as a result. If you are on the tech front lines, you are not having anxiety attacks when things might go south or when they actually do fail. Being the hero who brings a dead system back online may be an experience worth collecting once, but I bet that after one you'll be glad not to get the jumps whenever you hear the alert tone your phone uses for an operational disaster. It's nice to be needed, but it's also nice not to hear, "enjoy your vacation; please remember that you must be able to get to your laptop and internet access within 15 minutes of a critical incident alert."

I believe in Lean approaches as well as the architecture principle of YAGNI (You Aren't Gonna Need It). If you are building a rapid throwaway prototype (that you absolutely, positively promise will not become v1 of your production code), then you can of course skip the DT. However, once you start building the first production iteration, the core design should have DT as a first-class citizen. This allows you to make those easy decisions now that can be very hard to retrofit later. Even if you reason that a disaster is unlikely to occur for a while, if you are going to sell to an enterprise or government customer, then well before you are ready they are sure to say that they'd love to sign up just as soon as they can review your security and operations. You want to be able to say, "Sure thing, how soon can we meet?" and not, "Well…, DR is in our backlog." If DR required $3 million and months to build out, that would be a different story. But with the cloud, building DT can cost just a few person-days or person-weeks. In return, you'll be able to tell your CEO and your customers that you've got it covered, you'll be able to take risks and vacations knowing that it won't take heroics when a disaster occurs, and your disaster testing will be minimal, with little risk of customer impact even during a full game-day exercise in the middle of a Wednesday.

And here’s how. It’s easy… or is it?

In my next post, I'm going to dive down from this abstract level to specific, practical advice for designing for Disaster Tolerance using serverless approaches on the AWS cloud. Hold onto your hat for a second, because here's my big hypocrisy (sorry): in concept, adding DT at the outset is minimal extra effort. However, based on my recent clean-slate experience of doing this for Cloud Pegboard, I discovered that there are enough gotchas and subtle limitations in AWS features and capabilities that converging on a good solution after going down several dead ends can cost a lot of time, which tarnishes some of the beauty of what should be possible. My goal is to share what I've learned so that you can benefit from my experience, avoid those costly tangents, and ultimately be able to say that DT was easy to add and that I'm not really a hypocrite after all.

Conclusion

If you are building a cloud solution that is generating value, you will need to mitigate the risk of disasters and other major failures. They will happen at some point, and even if they don't, someone is going to require you to demonstrate that you can effectively survive a disaster (even if they don't tell you so up front). As with many non-functional requirements, by making some smart and deliberate choices up front, you can gracefully meet your future requirements and gain correlated benefits for very little additional upfront time and cost.

About Cloud Pegboard

Cloud Pegboard helps AWS practitioners be highly effective and efficient even in the face of the tremendous complexity and rate of change of AWS services. We want to keep AWS fun, so that it feels more like dancing in a sprinkler on a hot summer day and less like being blasted with a fire hose. We make the AWS information you need amazingly easy to access, and with personalization features we help you stay up to date on the AWS services and capabilities that matter to you and your specific projects.
