Architecting for Reliability Part 3: High Availability Architectures

Sathiya Shunmugasundaram
Published in becloudy
Mar 23, 2018

This is part 3 of the Architecting for Reliability series.

Availability Goal Scenarios

In this section, we will review a sample application and lay out how the deployment architecture varies for different availability goals. The sample application is a typical web application with a reverse proxy, static content in S3, an application server and a SQL database. The availability design remains the same whether we deploy these components in containers or VMs.

Choice of Services

We will use EC2 for compute and Amazon RDS for the relational database, taking advantage of Multi-AZ deployments. We will use Route 53 for DNS, ELB for distributing load, and S3 for backups and static content.

99% (2 9s) Scenario

Application Characteristics

  • According to the availability chart, these applications can have downtime of about 3 days and 15 hours/year.
  • These applications are helpful to the business, and their unavailability is an inconvenience rather than a mission-critical problem.
  • Most internal systems fall into this category, along with experimental customer features.
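
The downtime figures quoted from the availability chart follow directly from the availability percentage. As a minimal sketch, assuming a 365-day year (the function names here are illustrative and not part of the original article), the yearly budget can be computed like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the yearly downtime allowed by an availability goal, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def format_budget(minutes: float) -> str:
    """Render a number of minutes as 'Xd Yh Zm', rounded to the nearest minute."""
    total = round(minutes)
    days, rem = divmod(total, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours}h {mins}m"


if __name__ == "__main__":
    # The five availability goals discussed in this article.
    for goal in (99.0, 99.9, 99.95, 99.99, 99.999):
        print(f"{goal}% -> {format_budget(downtime_budget_minutes(goal))}")
```

For the 99% goal this works out to 5,256 minutes, or a little over 3 days and 15 hours a year.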

Deployment design

  • Single Region
  • One availability zone
  • Single Instance
  • Backup data sent to S3 for recovery, with versioning enabled for objects, deletion disabled for backups, and lifecycle policies to archive or delete old data
  • CloudFormation defines the infrastructure as code and is used to speed up reconstruction of the entire infrastructure in case of failure
  • During failures, a DNS change routes traffic to a static website
  • Deployment pipeline is scheduled with basic unit/black box/white box testing
  • Software updates are manual and need downtime
  • Monitoring looks for 200 OK status for home page
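
The monitoring bullet above amounts to a probe that fetches the home page and checks for a 200 OK response. The following is a minimal sketch using only the Python standard library. The function name and the injectable `fetch` parameter are assumptions made for illustration and testing, and a real setup would more likely rely on a managed health check.

```python
import urllib.request


def home_page_is_healthy(url: str, timeout: float = 5.0,
                         fetch=urllib.request.urlopen) -> bool:
    """Return True only if the home page answers with HTTP 200 within the timeout."""
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError and HTTPError, connection errors and timeouts
        # are all OSError subclasses; any of them counts as unhealthy.
        return False
```

In this scenario a failed probe would alert an operator, who then applies the DNS change that routes traffic to the static website.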

Availability Calculation

In this design, each failure will take about 70 mins to recover from, and each deployment or software update will take 4 hours. Estimating about 4 failures and 6 other changes per year, total downtime is roughly 29 hours, which keeps availability above the 99% goal.
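
The arithmetic behind this estimate can be sketched as follows. The helper name `estimated_availability` is illustrative, and the calculation assumes a 365-day year and that outages do not overlap.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assuming a non-leap year


def estimated_availability(events: list[tuple[int, float]]) -> float:
    """Estimate yearly availability in percent.

    events holds (occurrences per year, downtime minutes per occurrence) pairs.
    """
    downtime = sum(count * minutes for count, minutes in events)
    return 100 * (1 - downtime / MINUTES_PER_YEAR)


# 99% scenario from the text: 4 failures at 70 min, 6 changes at 4 hours each.
two_nines = estimated_availability([(4, 70), (6, 4 * 60)])
print(f"{two_nines:.2f}%")  # prints 99.67%
```

The 1,720 minutes of estimated downtime leave some headroom against the 5,256 minutes a 99% goal allows.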

99.9% (3 9s) Scenario

Application Characteristics

  • According to the availability chart, these applications can have downtime of about 8 hours and 45 mins/year.
  • These applications need to be highly available but can tolerate brief periods of unavailability.
  • Examples are critical internal applications and low revenue customer facing applications.

Deployment Design

  • We will leverage AWS services that take advantage of multiple Availability Zones. (ELB/ASG/RDS MultiAZ)
  • Load balancer will be configured with an application health check that reflects the actual health of the application on each instance
  • ASG will replace instances that fail health checks, and RDS will fail over to a second AZ if the primary AZ fails
  • Application will be split into separate tiers (Reverse Proxy/Application Server) to improve availability. Application resiliency patterns will ensure that brief DB unavailability during AZ failover doesn’t impact the application availability
  • Automated software updates using the in-place method, with rollback procedures documented in case of failures
  • Software delivery on a fixed schedule every 2–4 weeks
  • Monitoring will check for 200 OK status on the home page, verify replacement of web servers and DB failovers, and check static content availability in S3
  • Logging will be aggregated for Root Cause Analysis
  • Runbooks exist for recovery and reporting
  • Playbooks exist for common database-related issues, security-related incidents, failed deployments and root cause analysis.
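
One application resiliency pattern that helps ride out a brief DB outage during AZ failover is retrying with exponential backoff and jitter. The sketch below is a generic helper and not the author's implementation. The function name, the choice of `ConnectionError` as the retriable error, and the delay values are all assumptions.

```python
import random
import time


def retry_with_backoff(operation, retriable=(ConnectionError,), attempts=5,
                       base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation(); on a retriable error, wait with backoff plus jitter and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except retriable:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Exponential backoff capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads out retries from many clients.
            sleep(random.uniform(0, delay))
```

A database call would be wrapped as `retry_with_backoff(lambda: run_query(sql))`, where `run_query` stands in for whatever hypothetical data access function the application uses.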

Availability Calculation

Assuming 2 failures a year that need manual intervention at 60 min per incident, the impact will be 2 hours. Assuming automated software updates that require 15 min of downtime per occurrence and 10 such updates a year, we will need another 150 min of downtime. The total of 270 min is within the yearly budget, which gives us 99.9% availability.

99.99% (4 9s) Scenario

Application Characteristics

  • According to the availability chart, these applications can have downtime of about 52 mins/year.
  • These applications must be highly available and tolerant of component failures, absorbing failures without needing to procure new capacity.
  • Examples are e-commerce applications and b2b web services.

We should design this application to be statically stable within a region. That means we need to be able to tolerate the loss of one AZ without provisioning new capacity, changing DNS, or similar.

Deployment Design

  • Deploy the application in 3 AZs with 50% capacity in each AZ
  • For content that can be cached, add CloudFront to reduce load on the system
  • Implement software/application resiliency patterns in all layers
  • Engineer read availability over write availability of primary content
  • Leverage fault isolation zones deployment strategy
  • Deployment pipeline must also include performance, load and failure injection testing
  • Deployment should be automated fully with automatic rollback in case KPIs are not met
  • Monitoring should report success as well as alert when problems occur
  • Playbooks must exist for undiscovered issues and security incidents
  • Test failure procedures using game days
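
The 50% figure in the first bullet follows from static stability: with three AZs, the two that survive an AZ loss must carry the full load between them. A small sketch of that sizing rule, with illustrative function names:

```python
def capacity_per_az(az_count: int, tolerated_az_failures: int = 1) -> float:
    """Fraction of peak load each AZ must serve so the survivors cover 100%."""
    survivors = az_count - tolerated_az_failures
    if survivors < 1:
        raise ValueError("at least one AZ must survive")
    return 1 / survivors


def total_provisioned(az_count: int, tolerated_az_failures: int = 1) -> float:
    """Total provisioned capacity as a fraction of peak load."""
    return az_count * capacity_per_az(az_count, tolerated_az_failures)


print(capacity_per_az(3))    # 0.5 -> each AZ sized for 50% of peak load
print(total_provisioned(3))  # 1.5 -> 150% of peak load provisioned in total
```

The extra 50% of capacity is the price of not having to provision anything during an AZ failure. With only two AZs, each would need to carry 100% of peak load.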

Availability Calculation

Assuming 2 failures a year that need manual intervention at 15 min per incident, the impact will be 30 mins. Automated software updates should not require downtime. This gives us 99.99% availability.

Multi-Region Deployments

Using multiple geographical regions provides greater control over recovery time at the cost of increased expenditure. Regions provide a very strong isolation boundary.

Multi-Region Deployment Courtesy of http://harish11g.blogspot.com

99.95% (3.5 9s) Scenario using Multi-Region Deployment

Application Characteristics

  • According to the availability chart, these applications can have downtime of about 4 hours and 23 mins/year.
  • These applications must be highly available and require very short downtimes and little loss of data
  • Examples are banking, investing and emergency services

Deployment Design

  • Use Hot standby across two regions
  • Passive site is scaled to receive the same traffic as the active site and is kept eventually consistent with it
  • Both regions should be statically stable to handle all capacity requirements even during 1 AZ failure
  • Implement software/application resiliency patterns in all layers
  • A lightweight routing component will monitor application health and regional dependencies. The routing component will automate failover and stop replication
  • Requests will be routed to static website during failover
  • Software updates will use Blue Green/Canary deployment methodologies
  • Deployment pipeline must also include performance, load and failure injection testing
  • Monitor server/db/static content and region failures and alert
  • Validate architecture through game days using runbooks
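
A canary deployment sends a small share of traffic to the new version and compares its metrics with the current version before promoting it. The sketch below shows one way such a gate could look. The KPI names and thresholds are hypothetical, and a real pipeline would pull these values from its monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Kpis:
    error_rate: float      # fraction of failed requests, e.g. 0.002
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def canary_decision(baseline: Kpis, canary: Kpis,
                    max_error_increase: float = 0.001,
                    max_latency_ratio: float = 1.2) -> str:
    """Return 'promote' if the canary stays within tolerance of the baseline, else 'rollback'."""
    error_ok = canary.error_rate <= baseline.error_rate + max_error_increase
    latency_ok = canary.p99_latency_ms <= baseline.p99_latency_ms * max_latency_ratio
    return "promote" if error_ok and latency_ok else "rollback"
```

With a baseline of `Kpis(0.002, 200)`, a canary at `Kpis(0.0025, 210)` would be promoted, while one at `Kpis(0.01, 210)` would be rolled back.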

Availability Calculation

Assuming 2 failures a year that need manual intervention at 30 min per incident, the impact will be 60 mins. Automated software updates should not require downtime. This keeps us within the 99.95% availability goal.

99.999% (5 9s) or higher Scenario

Application Characteristics

  • According to the availability chart, these applications can have downtime of about 5 mins/year
  • These applications must be highly available and allow no downtime or loss of data
  • Examples are high revenue banking, investing and critical government functions

Deployment Design

  • Strongly consistent data stores
  • Complete redundancy in all layers
  • Use NoSQL databases where possible to improve partitioning strategy
  • Leverage an Active/Active Multi-Region approach. Each region must be statically stable
  • Routing layer will send traffic to healthy sites and stop replication during failures
  • Implement software/application resiliency patterns in all layers
  • Deployment pipeline must also include performance, load and failure injection testing
  • Software updates will use Blue Green/Canary deployment methodologies
  • Deployment should be automated fully with automatic rollback in case KPIs are not met
  • Datastore replication techniques should resolve conflicts automatically
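
The routing layer's core decision can be sketched as a function from region health to traffic weights. This is a simplified illustration rather than a production design. The region names are examples, and the `static-site` fallback is an assumption borrowed from the lower-availability scenarios.

```python
def route_weights(region_health: dict[str, bool],
                  fallback: str = "static-site") -> dict[str, float]:
    """Split traffic evenly across healthy regions.

    If no region is healthy, send everything to the fallback target.
    """
    healthy = [region for region, ok in region_health.items() if ok]
    if not healthy:
        return {fallback: 1.0}
    share = 1.0 / len(healthy)
    return {region: share for region in healthy}


print(route_weights({"us-east-1": True, "eu-west-1": False}))
# {'eu-west-1': ...} is dropped; all traffic goes to us-east-1
```

Because each region is statically stable, the surviving region can absorb the full load without provisioning new capacity.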

Availability Calculation

Assuming all recovery procedures are automated and redundant, the impact of each failure will be less than a minute, and we expect 4 such occurrences a year. Automated software updates should not require any downtime. This gives us 99.999% availability.
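
Turning the calculation around shows why recovery has to be automated at this level. The sketch below, assuming a 365-day year and an illustrative function name, computes the longest average outage per incident that still meets the goal.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assuming a non-leap year


def max_recovery_seconds(availability_pct: float, incidents_per_year: int) -> float:
    """Longest average outage per incident that still meets the yearly availability goal."""
    budget_minutes = MINUTES_PER_YEAR * (1 - availability_pct / 100)
    return budget_minutes * 60 / incidents_per_year


print(round(max_recovery_seconds(99.999, 4)))  # about 79 seconds per incident
```

With 4 incidents a year, each one has to be detected and resolved in under about 79 seconds, which leaves no room for manual intervention.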

Sathiya Shunmugasundaram is a freelance writer in DevOps, Cloud, Resiliency, MicroServices and Containers.