The Cloud helps us move fast, reduces time to develop & ship, and eventually saves us time and money. But we often forget about how the cloud helps us build reliable apps. Modern applications need to be always-on and the cloud plays an important role in achieving this critical requirement.
Adrian Hornsby covers this topic pretty well in his recent blog posts on operational excellence at Amazon.
In this blog post, as an avid AWS user and advocate, I want to take a closer look at a specific, often overlooked, aspect of the Cloud’s reliability promise: Global Infrastructure.
We, at Opsgenie, started our life on AWS in 2012. We have used a lot of different AWS services and often written about them. SQS, EC2, VPC, Lambda, SNS, RDS, DynamoDB, and other AWS services help us offer a service with five nines of availability. Having access to global infrastructure plays an important role in this. Let’s see how.
All highly available apps operate on multiple availability zones (AZs)
In AWS, a region represents a physical location around the world with a cluster of data centers. These data centers are grouped into availability zones (AZs), each made up of one or more data centers. Each region has at least two, often three, availability zones. AZs within a region are interconnected through high-bandwidth, low-latency networking.
Multiple data centers within a region enable users to architect applications that automatically fail over without interruption. This mechanism is the foundation of fault-tolerant applications on the cloud. The Cloud doesn’t promise to run a machine 24/7 without any problems. It gives you the infrastructure, low-latency networking, and failover capabilities so you can architect your apps to meet your needs.
For example, almost all critical applications in Opsgenie ran on EC2 machines in production for a long time. Even if an app didn’t get a lot of requests, we deployed it to three availability zones. This is important for continuity. The difference between using three machines in one availability zone versus using three machines in three different availability zones is practically life and death for your apps. If a hardware or software problem hits an AZ and all of your machines run in that AZ, chances are high that your whole app goes down with it.
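The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration, not how Opsgenie provisions machines: the zone names and the `place_instances` and `survivors_after_zone_failure` helpers are hypothetical, and the sketch assumes a failure takes out exactly one whole zone.

```python
from collections import Counter
from itertools import cycle


def place_instances(instance_count, zones):
    """Assign instances to zones round-robin so they are spread as evenly as possible."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def survivors_after_zone_failure(placement, failed_zone):
    """Count the instances still running after every instance in failed_zone is lost."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone names, used only for illustration.
ZONES = ["zone-a", "zone-b", "zone-c"]

single_zone = place_instances(3, ["zone-a"])  # three machines, one AZ
multi_zone = place_instances(3, ZONES)        # three machines, three AZs

print(Counter(multi_zone))
print(survivors_after_zone_failure(single_zone, "zone-a"))  # 0
print(survivors_after_zone_failure(multi_zone, "zone-a"))   # 2
```

With the same number of machines, the single-zone layout loses everything when its zone fails, while the spread layout keeps two of three machines serving traffic.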
What if you have higher availability requirements? That is when you need to go beyond operating on multiple zones and start considering region-wide failover.
Go beyond AZs and do region-wide failover
Global infrastructure matters a great deal for building better failover mechanisms. When you need more nines, you need to plan for region-wide failures, as Netflix does.
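To make "more nines" concrete, here is a rough back-of-the-envelope calculation in Python. It is a sketch under strong assumptions: failures of the redundant deployments are treated as fully independent and failover as instant and flawless, which real zones and regions only approximate. The function names are my own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability (e.g. 0.999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single, copies):
    """Availability of `copies` redundant deployments.

    Assumes independent failures and instant failover: the system is down
    only when every copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** copies


three_nines = 0.999
five_nines = 0.99999

print(round(downtime_minutes_per_year(three_nines), 1))  # roughly 525.6 minutes
print(round(downtime_minutes_per_year(five_nines), 2))   # roughly 5.26 minutes
print(combined_availability(three_nines, 2))             # close to 0.999999
```

Under these idealized assumptions, two independent "three nines" deployments reach about six nines on paper. In practice, shared dependencies and failover time eat into that figure, which is one reason the rest of this section treats multi-region work as hard.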
Building a multi-region architecture is hard in many ways. One can even argue that if you go “active-active”, running applications in multiple regions at the same time instead of just keeping a second region ready for failover, your reliability can end up worse than with your original zone-based structure. This kind of architecture requires a lot of operational and cultural maturity within your organization.
One of the big deal breakers in these types of architectures is latency. Most applications won’t be able to support the required quality of service if they access their data across multiple regions. That is why most companies start with having a backup/failover region instead of an active-active approach.
Multi-region as a backup approach still means thinking about a lot of requirements and edge cases. You want your backup (failover) region to be close to your master region so you can reduce replication time. At the same time, you do not want your regions to be too close to each other, so that a single natural disaster is less likely to hit both. And these are just _some_ of the considerations you have to account for when architecting this kind of system. Let’s see how this works in the real world.
When we decided to deploy Opsgenie to another region as a backup, we needed to replicate our data in DynamoDB tables in near real-time. We spent a lot of time architecting a solution that replicates data, resolves conflicts, and migrates and updates data across multiple regions. You can find our blog post on this here.
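To give a feel for the conflict-resolution part, here is a minimal Python sketch of one common strategy, last-writer-wins. It is not necessarily what we built at Opsgenie: the `Version` record, the `resolve` and `merge_tables` helpers, and the region names are all hypothetical, and the sketch assumes write timestamps from different regions are comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    updated_at: float  # time of the write, seconds since the epoch
    region: str        # region that accepted the write


def resolve(local, remote):
    """Pick the winner: newest timestamp wins, region name breaks ties."""
    return max(local, remote, key=lambda v: (v.updated_at, v.region))


def merge_tables(local, remote):
    """Merge two key -> Version maps so both replicas converge on the same state."""
    merged = dict(local)
    for key, remote_version in remote.items():
        if key in merged:
            merged[key] = resolve(merged[key], remote_version)
        else:
            merged[key] = remote_version
    return merged


# Hypothetical replicas that accepted conflicting writes for the same key.
primary = {"alert-1": Version("acknowledged", 100.0, "region-a")}
backup = {
    "alert-1": Version("closed", 105.0, "region-b"),
    "alert-2": Version("open", 90.0, "region-b"),
}

print(merge_tables(primary, backup)["alert-1"].value)  # closed
```

Because the tie-breaker is deterministic, merging in either direction yields the same result, so both regions end up with identical data. The cost is that the older of two concurrent writes is silently dropped, which is why a real system has to decide carefully where this trade-off is acceptable.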
I remember the time when AWS announced DynamoDB Global Tables. AWS started offering the exact same thing we developed internally as a solution. We were furious with AWS and ourselves — a lesson learned: talk with your account managers more :) But we were also happy because managing something like this is a big commitment and we always prefer to outsource these to our cloud provider.
Our work on building a failover region consisted of replicating S3 buckets, DynamoDB tables, Lambda functions, SQS queues, SNS topics, and more. For some managed tools like DynamoDB, this is easier. Some services like S3 and EC2 are also relatively easy and cheap to replicate using Direct Connect.
When you need region-wide failover, you realize the importance of having access to great global infrastructure with high-speed networking and bandwidth. You also realize the importance of your cloud provider’s capabilities. AWS can get complicated but it has great infrastructure with a lot of great tools behind it.
Follow regulations, deploy wherever you need
When you are cloud-native, you can deploy your whole app to another region with minimal installation and operational costs. Three years ago, a lot of our customers in the EU, especially in Germany, started asking for a way to keep their data within the EU. Based on the demand, we decided to deploy Opsgenie in an EU region.
It didn't take much for the Opsgenie SRE team to deploy our apps to another region. The key was to find a region that offered the same services as our original region. As an older, mature region, Frankfurt had everything we needed. After a few weeks, we were operating in Europe and were able to offer the same level of quality, scalability, and reliability. We didn’t have to make any upfront commitment when we needed to deploy on the other side of the ocean.
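The "same services" check is essentially a set comparison. A minimal Python sketch follows, assuming you already have a list of services per region from somewhere. The catalog below is made-up placeholder data, not real AWS availability, and the helper names are hypothetical.

```python
def missing_services(required, available):
    """Return the required services that a region does not offer, sorted by name."""
    return sorted(set(required) - set(available))


def candidate_regions(required, catalog):
    """Return the regions from catalog that offer every required service."""
    return [
        region
        for region, services in catalog.items()
        if not missing_services(required, services)
    ]


# Services the app depends on (taken from the list earlier in this post).
REQUIRED = ["SQS", "EC2", "Lambda", "SNS", "DynamoDB"]

# Placeholder catalog for illustration only; not real availability data.
CATALOG = {
    "region-x": ["SQS", "EC2", "Lambda", "SNS", "DynamoDB", "RDS"],
    "region-y": ["SQS", "EC2", "SNS", "DynamoDB"],
}

print(candidate_regions(REQUIRED, CATALOG))             # ['region-x']
print(missing_services(REQUIRED, CATALOG["region-y"]))  # ['Lambda']
```

Running a check like this before committing to a region makes gaps visible early, while they are still a planning problem rather than a migration problem.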
This was further proof of the power of having access to great global infrastructure.
I have spent the last four years at Opsgenie, the first half as an engineer working on challenging problems on the cloud.
In this blog post, I shared my view on the importance of global infrastructure and how AWS's global infrastructure enabled us through our journey.
To summarize, global infrastructure matters because:
- Operating always-on services requires deploying apps to multiple data centers with high-speed networking and bandwidth.
- Critical operations need a region-wide failover mechanism. Services with built-in multi-region features and supporting tools are critical for architecting such resilient systems.
- Regulations bring both risks and opportunities. Having access to a global infrastructure creates business opportunities for cloud-native companies.