Architecting for Reliability Part 3: High Availability Architectures

Published in becloudy, Mar 23, 2018

This is Part 3 of the Architecting for Reliability series.

Availability Goal Scenarios

In this section, we will review a sample application and lay out how the deployment architecture varies for different availability goals. The sample application is a typical web application with a reverse proxy, static content in S3, an application server, and a SQL database. The availability design remains the same whether we deploy these components in containers or VMs.

Choice of Services

We will use EC2 for compute and Amazon RDS for the relational database, taking advantage of Multi-AZ deployments. We will use Route 53 for DNS, ELB for distributing load, and S3 for backups and static content.

99% (2 9s) Scenario

Application Characteristics

According to the availability chart, these applications can be down for about 3 days and 15 hours per year.
These applications are helpful to the business, and it is an inconvenience (they are not mission critical) if they are unavailable.
Most internal systems fall into this category, along with experimental customer features.
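
The downtime figures quoted from the availability chart follow directly from the target percentage. As a rough sketch (assuming a 365-day year; the function name `downtime_budget_minutes` is ours, not from any library), the yearly downtime budget for each goal discussed in this article can be computed like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year allowed by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Print the budget for each availability goal covered in this article.
for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    total_minutes = round(downtime_budget_minutes(target))
    hours, minutes = divmod(total_minutes, 60)
    days, hours = divmod(hours, 24)
    print(f"{target}% -> {days}d {hours}h {minutes}m of downtime per year")
```

For a 99% goal this works out to 5,256 minutes, which is the "about 3 days and 15 hours" quoted above.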

Deployment Design

Single Region
One availability zone
Single Instance
Backup data is sent to S3 for recovery, with versioning enabled for objects and deletion disabled for backups; lifecycle policies archive or delete old data
CloudFormation defines the infrastructure as code and is used to speed up reconstruction of the entire infrastructure in case of failure
During failures, a DNS change routes traffic to a static website
Deployment pipeline is scheduled with basic unit/black box/white box testing
Software updates are manual and need downtime
Monitoring looks for a 200 OK status on the home page
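
The home page check in the last item can be as small as a single HTTP probe. The following is a minimal sketch using only the Python standard library; the function name `home_page_is_healthy` and the `opener` parameter (there only so the probe can be exercised without a network) are our own assumptions, and a real setup would more likely rely on a managed monitoring service.

```python
import urllib.error
import urllib.request


def home_page_is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True only if a GET on `url` answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError for 4xx/5xx responses; DNS failures,
        # refused connections and timeouts also land here. All of them
        # count as "unhealthy" for this simple probe.
        return False
```

A scheduler would call something like `home_page_is_healthy("https://www.example.com/")` (a placeholder URL) every minute and raise an alert after a few consecutive failures.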

Availability Calculation

In this design, each failure takes about 70 minutes to recover from, and each deployment or software update takes 4 hours. Estimating about 4 failures and 6 other changes per year, the total downtime is roughly 29 hours, which keeps availability above 99%.
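
The arithmetic behind this estimate is easy to reproduce. The sketch below assumes a 365-day year and that every failure and every change causes a full outage for its whole duration; the function name `estimated_availability` is ours.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def estimated_availability(failures, minutes_per_failure,
                           changes, minutes_per_change):
    """Return (availability in percent, total downtime in minutes) for one year."""
    downtime = failures * minutes_per_failure + changes * minutes_per_change
    availability = 100 * (1 - downtime / MINUTES_PER_YEAR)
    return availability, downtime


# 99% scenario: 4 failures at 70 minutes each, 6 changes at 4 hours each.
availability, downtime = estimated_availability(4, 70, 6, 4 * 60)
print(f"downtime: {downtime} min, availability: {availability:.2f}%")
```

With these inputs the downtime is 1,720 minutes and the estimate lands at about 99.67%, comfortably inside the 99% goal. The same function can be reused for the other scenarios in this article by plugging in their failure and update assumptions.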

99.9% (3 9s) Scenario

Application Characteristics

According to the availability chart, these applications can be down for about 8 hours and 45 minutes per year.
It is important for these applications to be highly available, but they can tolerate brief periods of unavailability.
Examples are critical internal applications and low-revenue customer-facing applications.

Deployment Design

We will leverage AWS services that take advantage of multiple Availability Zones (ELB, ASG, RDS Multi-AZ)
The load balancer will be configured with an application health check that reflects the actual health of the application on each instance
The ASG will replace instances that fail health checks, and RDS will fail over to a second AZ if the primary AZ fails
The application will be split into separate tiers (reverse proxy, application server) to improve availability. Application resiliency patterns will ensure that brief DB unavailability during AZ failover does not impact application availability
Software updates are automated using an in-place method, with rollback procedures documented in case of failures
Software delivery on a fixed schedule every 2–4 weeks
Monitoring will check for a 200 OK status on the home page and verify replacement of web servers, DB failovers, and static content availability in S3
Logs will be aggregated for root cause analysis
Runbooks exist for recovery and reporting
Playbooks exist for common DB-related issues, security-related incidents, failed deployments, and root cause analysis.
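
The article does not name the resiliency patterns it has in mind. One common choice for riding out a brief database outage during AZ failover is retrying with exponential backoff and jitter. The sketch below is an illustration under that assumption; `call_with_retries` and its parameters are hypothetical names, and the retried exception types would need to match whatever the database driver actually raises.

```python
import random
import time


def call_with_retries(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Run `operation`, retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling on every attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # A random wait below the ceiling keeps many clients from
            # retrying in lockstep once the database comes back.
            sleep(random.uniform(0, ceiling))
```

The application would wrap its database calls, for example `call_with_retries(lambda: run_query(sql))`, where `run_query` stands in for the real data access function. The attempt count and delays should be tuned so that the total retry window covers the expected failover time.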

Availability Calculation

Assuming 2 failures that need manual intervention at 60 minutes per incident, the impact will be 2 hours. Assuming automated software updates that require 15 minutes of downtime each and 10 such updates, we will need 150 minutes of downtime. The total of 270 minutes (4.5 hours) gives us 99.9% availability.

99.99% (4 9s) Scenario

Application Characteristics

According to the availability chart, these applications can be down for about 52 minutes per year.
These applications must be highly available and tolerant of component failures, absorbing failures without needing to procure additional resources.
Examples are e-commerce applications and B2B web services.

We should design these applications to be statically stable within a region. That means being able to tolerate the loss of one AZ without needing to provision new capacity, change DNS, and so on.

Deployment Design

Deploy the application in 3 AZs with 50% of the required capacity in each AZ, so that losing one AZ still leaves 100%
For content that can be cached, add CloudFront to reduce load on the system
Implement software/application resiliency patterns in all layers
Engineer for read availability over write availability of primary content
Leverage a fault isolation zone deployment strategy
Deployment pipeline must also include performance, load and failure injection testing
Deployment should be fully automated, with automatic rollback if KPIs are not met
Monitoring should report success as well as alert when problems occur
Playbooks must exist for undiscovered issues and security incidents
Test failure procedures using game days
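
The "3 AZs with 50% capacity each" rule is what makes this design statically stable, and the check is simple enough to write down. In this sketch capacity is expressed as a fraction of the required (peak) load, and both function names are ours:

```python
def remaining_capacity(num_azs, capacity_per_az, azs_lost=1):
    """Fraction of required load still served after losing `azs_lost` AZs.

    `capacity_per_az` is each AZ's capacity as a fraction of required load.
    """
    return max(num_azs - azs_lost, 0) * capacity_per_az


def is_statically_stable(num_azs, capacity_per_az, azs_lost=1):
    """True if the surviving AZs can carry the full load with no scaling."""
    return remaining_capacity(num_azs, capacity_per_az, azs_lost) >= 1.0


print(is_statically_stable(3, 0.5))  # 3 AZs at 50% each, one AZ lost
print(is_statically_stable(2, 0.5))  # 2 AZs at 50% each, one AZ lost
```

Three AZs at 50% each leave exactly 100% after one AZ is lost, at the cost of running 150% of the required capacity in normal operation. Two AZs at 50% each would leave only half the required capacity and would need new instances during the failure, which is what static stability is meant to avoid.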

Availability Calculation

Assuming 2 failures that need manual intervention at 15 minutes per incident, the impact will be 30 minutes. Automated software updates should not require downtime. This gives us 99.99% availability.

Multi-Region Deployments

Using multiple geographical regions provides greater control over recovery time, at the cost of increased expenditure. Regions provide a very strong isolation boundary.

Multi-Region Deployment Courtesy of http://harish11g.blogspot.com

99.95% (3.5 9s) Scenario using Multi-Region Deployment

Application Characteristics

According to the availability chart, these applications can be down for about 4 hours per year.
These applications must be highly available and require very short downtimes and little loss of data.
Examples are banking, investing, and emergency services.

Deployment Design

Use a hot standby across two regions
The passive site is scaled to receive the same traffic as the active site and is kept eventually consistent with it
Both regions should be statically stable, able to handle all capacity requirements even during the failure of one AZ
Implement software/application resiliency patterns in all layers
A lightweight routing component is needed to monitor application health and regional dependencies. The routing component will automate failover and stop replication
Requests will be routed to a static website during failover
Software updates will use Blue/Green or Canary deployment methodologies
Deployment pipeline must also include performance, load and failure injection testing
Monitor server, DB, static content, and region failures, and alert on them
Validate architecture through game days using runbooks
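
To make the role of the routing component more concrete, here is one possible sketch of its decision logic. It is only an illustration of the behaviour described in the list above: the names `RoutingDecision` and `decide_route`, the three boolean inputs, and the exact ordering of the rules are our assumptions, not a description of any AWS service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    target: str             # "active", "standby" or "static-site"
    stop_replication: bool  # True once the active region is considered failed


def decide_route(active_healthy, standby_healthy, standby_ready):
    """Pick where to send traffic based on the latest health checks."""
    if active_healthy:
        # Normal operation: serve from the active region, keep replicating.
        return RoutingDecision("active", stop_replication=False)
    if standby_healthy and standby_ready:
        # Failover complete: the standby has been promoted and takes traffic.
        return RoutingDecision("standby", stop_replication=True)
    # Failover still in progress (or both regions down): serve the static site.
    return RoutingDecision("static-site", stop_replication=True)


print(decide_route(active_healthy=False, standby_healthy=True, standby_ready=False))
```

Keeping the decision in a small pure function like this makes it easy to exercise during game days, since every combination of health signals can be tested without touching DNS.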

Availability Calculation

Assuming 2 failures that need manual intervention at 30 minutes per incident, the impact will be 60 minutes. Automated software updates should not require downtime. This keeps us comfortably within the 99.95% availability goal.

99.999% (5 9s) or higher Scenario

Application Characteristics

According to the availability chart, these applications can be down for about 5 minutes per year.
These applications must be highly available and allow no downtime or loss of data.
Examples are high-revenue banking, investing, and critical government functions.

Deployment Design

Strongly consistent data stores
Complete redundancy in all layers
Use NoSQL databases where possible to improve partitioning strategy
Leverage an Active/Active Multi-Region approach. Each region must be statically stable
Routing layer will send traffic to healthy sites and stop replication during failures
Implement software/application resiliency patterns in all layers
Deployment pipeline must also include performance, load and failure injection testing
Software updates will use Blue/Green or Canary deployment methodologies
Deployment should be fully automated, with automatic rollback if KPIs are not met
Datastore replication techniques should resolve conflicts automatically

Availability Calculation

Assuming all recovery procedures are automated and backed by redundancy, the impact of each failure will be less than a minute, and we expect 4 such occurrences per year. Automated software updates should not require any downtime. This gives us 99.999% availability.

Written by Sathiya Shunmugasundaram