Crashing our Regional Deployments (and surviving!)

Alberto Sanz
Published in adidoescode · Mar 15, 2021 · 9 min read

Resiliency is one of those “fancy words” all engineers like to mention from time to time. But do we really know what “resiliency” stands for? Do we know how to “achieve resiliency”?

We say that our systems (and we ourselves) are resilient to certain events when they are able to adapt to and overcome them.

Of course, resiliency is not absolute. There’s a wide spectrum in which our systems move, and it’s highly dependent on the infrastructure running our application. New deployment models, such as Kubernetes, have proven to provide a noticeable improvement in stability over Virtual Machine based deployments. And adopting the technologies of certain Cloud Providers is a game changer, as we have nearly unlimited computing power at hand.

But how can we increase the resiliency of our systems? Netflix knows the answer pretty well:

But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these “once in a blue moon” failures.

The Netflix Simian Army

Chaos engineering to the rescue! Resiliency is one of those properties that you only confirm when something goes wrong. So the best way to improve your system’s resiliency is to embrace chaos and force that “something” to go wrong. What something? Anything that may happen!

Chaos variables reflect real-world events. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.

Principles of Chaos Engineering

Now that we are on the same page, we can start thinking about those events. What could be the worst possible scenario? Well, an Internet outage. But that’s something you can hardly do anything about. Next one: a data center failure. That’s more interesting: we can start thinking about redundancy, active-active configurations and some more fancy words. As mentioned before, Cloud Providers are a game changer in this sense. So we’ll assume that we’re working with AWS.

Within AWS, you can deploy your services in several regions, AKA geographical groupings of AWS data centers. Each one consists of at least 3 Availability Zones, AKA data centers (disclaimer: an Availability Zone may consist of more than one data center, but for this article let’s assume that one Availability Zone corresponds to one data center). Best practice dictates that you should deploy your services across multiple Availability Zones, but what would happen if an entire region crashes? Is that a realistic scenario?

It is. Uncommon, yes, but feasible. In November 2020, AWS suffered an outage in the North Virginia region affecting several services. In September 2015, another issue in DynamoDB caused a regional outage. So yes, those kinds of events are unlikely to happen, but they definitely can. In the following sections, we will analyze a use case and how we can design a system in a way that a regional outage doesn’t translate into a loss of business revenue.

Our requirements

For the sake of simplicity, we will build a link shortener application. It should provide two features:

  • Given a long URL, it should generate a short one
  • Given a shortened URL, it should redirect to the long one

A real example of this behavior would be:
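For instance (illustrative values, not a real deployment): a long URL like https://www.example.com/catalogue/shoes/ultraboost?campaign=spring&lang=en could be shortened to https://short.example.com/a1B2c3, and opening https://short.example.com/a1B2c3 would redirect you back to the long URL.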

Last, but not least, the core requirement for this article: an outage in an AWS region should have minimal impact on our users.

Designing our architecture

Let’s start designing our system! We will go for a quite standard tech stack here:

System architecture V1

It’s a pretty simple architecture. With Amazon Route 53, we will give our users a simple name to reach our service. If you, like us, have a wonderful platform team providing a shared Kubernetes cluster, you will deploy your service in your namespace. We could go for a Java-based backend or another technology; it’s just a regular deployment. For our DB, we will rely on the good ol’ MySQL. As we make intensive use of AWS, we will spin up an Aurora MySQL-compatible cluster. Everything runs in the eu-west-1 (Ireland) AWS region.
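Whatever stack you pick, the core logic of the service is tiny. Here is a minimal, framework-agnostic sketch of how a short code could be derived from a long URL; the hash-and-base62 scheme and the 7-character code length are assumptions for illustration (collision handling and persistence are left to the backend):

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def shorten(long_url: str, length: int = 7) -> str:
    """Derive a deterministic short code from the long URL (hash + base62)."""
    digest = int.from_bytes(hashlib.sha256(long_url.encode()).digest(), "big")
    code = []
    while digest and len(code) < length:
        digest, remainder = divmod(digest, 62)
        code.append(ALPHABET[remainder])
    return "".join(code)


print(shorten("https://www.example.com/catalogue/shoes/ultraboost?campaign=spring"))
```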

Cool! Our service is up & running. We’re finished with the project. But… Are we?

Evolving our architecture

Yes, of course our system is working. But let’s check some improvement points:

  • Do we need all of MySQL’s power to run our service? It seems a bit too much…
  • Do we need a pod running 24x7, probably sitting idle at night? It seems like a waste of resources…
  • Spring Boot is a fairly standard stack, but its memory and CPU requirements are a bit high. Can we do something to improve that?
  • This article is about regional resiliency. Will this setup survive a regional outage?

The first point is easily addressable. Our use case is perfect for a key-value based DB, so let’s put one in there!

System architecture V2

Amazon DynamoDB is a fully managed key-value and document database. It supports our use case perfectly and has a particular feature that we will leverage for our final resilient setup. Also, we can use its serverless capabilities to make sure that we only pay for our real usage! But we’re still paying for the Kubernetes resources. Well, we introduced serverless technologies at our DB layer, so let’s do the same for our application layer!
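Before moving on, here is a minimal sketch of how the application could read and write the mappings in DynamoDB; the table name, attribute names and region are illustrative assumptions, not the real ones:

```python
from typing import Optional

import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")


def save_mapping(short_code: str, long_url: str) -> None:
    """Store the short code -> long URL mapping."""
    dynamodb.put_item(
        TableName="short-urls",
        Item={"short_code": {"S": short_code}, "long_url": {"S": long_url}},
    )


def resolve(short_code: str) -> Optional[str]:
    """Look up the long URL behind a short code, or None if unknown."""
    response = dynamodb.get_item(
        TableName="short-urls",
        Key={"short_code": {"S": short_code}},
    )
    item = response.get("Item")
    return item["long_url"]["S"] if item else None
```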

System architecture V3

AWS Lambda is AWS’s serverless compute service. It allows us to run our functions on demand. We can integrate them with Amazon API Gateway so we expose a REST interface to our users. Both services are billed on a per-request and execution-time basis, so we’re saving money! Also, being able to separate different endpoints into different Lambdas allows us to code them in different languages in a framework-agnostic way; you only need to specify the Lambda entrypoint. While you could keep the Spring annotations, I highly recommend not doing so: the rules of the serverless world are a bit different from those of the microservices one.
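For illustration, the redirect endpoint could boil down to a plain handler like the sketch below (assuming API Gateway’s Lambda proxy integration; the path parameter, table name and environment variable are placeholder choices):

```python
import os

import boto3

dynamodb = boto3.client("dynamodb")


def handler(event, context):
    """Redirect endpoint: GET /{code} -> 301 to the stored long URL."""
    # Path parameter, table and attribute names are illustrative assumptions.
    short_code = event["pathParameters"]["code"]
    response = dynamodb.get_item(
        TableName=os.environ.get("TABLE_NAME", "short-urls"),
        Key={"short_code": {"S": short_code}},
    )
    item = response.get("Item")
    if not item:
        return {"statusCode": 404, "body": "Unknown short URL"}
    return {
        "statusCode": 301,
        "headers": {"Location": item["long_url"]["S"]},
        "body": "",
    }
```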

We have changed our architecture quite a bit, but we still have room for improvement. Yes, of course, using serverless technologies is an improvement. But do we really need the Lambda functions? If we could remove them, we could decrease our response times (and our bill!). Luckily for us, Amazon API Gateway can integrate directly with other AWS services. It also provides input validation features, so we have everything we need.
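For illustration, this is roughly what wiring such a direct integration could look like; the REST API id, resource id, role ARN and mapping template are placeholder assumptions, and a response mapping template would still be needed to turn the GetItem result into a 301 redirect:

```python
import boto3

apigateway = boto3.client("apigateway", region_name="eu-west-1")

# Wire the GET method straight to DynamoDB's GetItem action, no Lambda in between.
apigateway.put_integration(
    restApiId="a1b2c3d4e5",
    resourceId="abc123",
    httpMethod="GET",
    type="AWS",
    integrationHttpMethod="POST",  # DynamoDB API calls are always POST requests
    uri="arn:aws:apigateway:eu-west-1:dynamodb:action/GetItem",
    credentials="arn:aws:iam::123456789012:role/apigw-dynamodb-read",
    requestTemplates={
        "application/json": """{
            "TableName": "short-urls",
            "Key": {"short_code": {"S": "$input.params('code')"}}
        }"""
    },
)
```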

System architecture V4

Finally, we have a super simple setup fulfilling our requirements.

Resiliency through replication

Our new setup could not be simpler. Adopting a cell-based architecture, our cell will consist of the API Gateway and the DynamoDB table.

A cell is a collection of components, grouped from design and implementation into deployment. A cell is independently deployable, manageable, and observable.
Cell-Based Architectures reference

Each cell is capable of providing the required features on its own. Plus, by using DynamoDB Global Tables we make sure that the data is synchronized across up to five replicas with less than one second of latency, while providing an active-active configuration. So we can use the table in any region as if it were our main database, and all changes will be propagated to the rest of the regions. Now we have everything we need to set up our globally resilient system!
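Assuming the table already exists in eu-west-1 (and uses the current Global Tables version), adding a replica is a single call; the table name is, again, an illustrative assumption:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Turn the existing Ireland table into a global table by adding an Oregon replica.
# DynamoDB takes care of the active-active replication between the two regions.
dynamodb.update_table(
    TableName="short-urls",
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)
```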

System Architecture V5

Replication! That’s key for a resilient system. Beware that at any point in time a region may fail, so we need at least two available regions. Since our use case is simple enough, we can segment our users based on their location so they’ll use the cell that is closest to them (latency-based routing). Periodic health checks will make sure that all cells are alive at any point in time and, if one of them dies, it will be discarded and all traffic will be sent to the living one.
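This routing and health-checking setup could be sketched roughly as follows; the hosted zone, domain names, endpoints and thresholds are illustrative assumptions, not production values (a twin record with Region "us-west-2" would point at the Oregon cell):

```python
import uuid

import boto3

route53 = boto3.client("route53")

# Health-check the Ireland cell; Route 53 stops routing to it while the check fails.
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "ireland.api.example.com",
        "ResourcePath": "/health",
        "RequestInterval": 10,  # seconds between checks
        "FailureThreshold": 2,  # consecutive failures before the cell is considered dead
    },
)

# Latency-based record: nearby users normally hit the Ireland cell,
# but only while its health check is passing.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "go.example.com",
                "Type": "CNAME",
                "SetIdentifier": "ireland-cell",
                "Region": "eu-west-1",
                "TTL": 30,
                "ResourceRecords": [{"Value": "ireland.api.example.com"}],
                "HealthCheckId": health_check["HealthCheck"]["Id"],
            },
        }],
    },
)
```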

Verifying our assumptions

In theory, we have a system resilient to regional failures. I say “in theory” because unless I can prove it, it’s not. We should not claim to have a resilient system in production unless we really confirm that it is. Don’t forget that resiliency is not black or white; there are a lot of questions to answer. Does it really work? How long does it take to fail over to the other region? Is one of our cells capable of handling the load of both cells? Do we have automatic self-healing procedures?

Without data, you’re just another person with an opinion

— W. Edwards Deming

Without data, you’re just another person with an opinion. And I’m your boss. So my opinion is more valid than yours

— Boss I’ve worked with

Everything is about data. Data is objective, data can’t be discussed. Data provides meaningful information that should drive decisions. We need data. So let’s start breaking things 😈

You know. For data. Not for fun. Serious purposes here 😬

For the sake of the article, I’ll just focus on the first two questions:

  • Is the system really resilient to regional outages?
  • How long does it take for the system to failover?

Those two questions will give us some hints to determine the blast radius of a regional outage. In order to answer them, we will simulate traffic to our Ireland cell and then shut it down. The easiest way is to directly delete the deployed API Gateway. That will show us how long our European users suffer an outage while the failover occurs. Afterwards, we will redeploy everything in Ireland to see the effect on our users. Remember that we have two cells running: one in Ireland and another in Oregon. The following load was run from a laptop located in Spain.
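For illustration, the chaos action and a naive load loop could look something like the sketch below; the REST API id and URL are placeholders:

```python
import time

import boto3
import requests

# The chaos action: simulate the regional outage by deleting the Ireland API Gateway.
boto3.client("apigateway", region_name="eu-west-1").delete_rest_api(restApiId="a1b2c3d4e5")

# A naive load loop run from the laptop in Spain, printing status code and latency.
while True:
    start = time.monotonic()
    try:
        status = requests.get(
            "https://go.example.com/a1B2c3", allow_redirects=False, timeout=5
        ).status_code
    except requests.RequestException:
        status = "ERR"
    print(f"{status}\t{(time.monotonic() - start) * 1000:.0f} ms")
    time.sleep(1)
```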

Validating our assumptions

Let’s analyze what happened there:

  • At first, our API works fine: 230 ms latency, not bad for a Spain-Ireland connection. No user faces any error
  • We delete the API Gateway. Users start facing errors at ~1:10
  • The Route 53 health check detects that the Ireland cell is dead. It removes it from the list of available cells and redirects all traffic to Oregon at ~1:30
  • Our traffic is now Spain-Oregon. That explains the increase in response times to 820 ms and the drop in the error rate
  • We manually redeploy our Ireland cell. Route 53 detects it as a valid cell and includes it in the list of available cells
  • Our traffic is Spain-Ireland again. Back to the usual 220 ms latency

With this information, we can confirm it: yes, our system is resilient to regional outages. If one of them happens, users closer to that region will suffer an outage of ~20 seconds. Of course, we can refine our solution to reduce those numbers even more, and by having more cells across the globe the latency will decrease.

Don’t forget that we just ran one test. We should keep an eye on the topic and run these tests on a regular basis to evaluate how the numbers evolve.

Conclusion

Resiliency is not something we can simply address at the end of our development. Architectures should be designed with this concept in mind, as not doing so might mean rewriting our app several times. Iterate over your architecture’s design and keep improving it until most questions have an answer.
