Architecting for Reliability Part 1: Concepts

Published in becloudy, Jan 28, 2018

Reliability

This story summarizes reliability best practices for managing your cloud environment. Although it refers to AWS throughout, the principles apply equally to other cloud providers and to on-premises environments.

This story has three parts.

Part 1: Concepts, describing the various terminologies and best practices

Part 2: Resiliency and Availability Design Patterns for the Cloud

Part 3: High Availability Architectures to achieve specific SLAs

This story covers Part 1.

AWS published the Well-Architected Framework in 2015 and described five pillars of a well-architected system, namely

Reliability
Security
Performance Efficiency
Cost Optimization
Operational Excellence

This story focuses on the Reliability pillar. It collects the salient points of Reliability for quick reference and is based on the AWS white papers and the Microsoft Cloud Design Patterns.

Reliability is focused on the following aspects

Ability to recover from infrastructure or service failures
Ability to acquire resources quickly to meet demand
Ability to mitigate configuration and transient network issues

Availability

Availability = Normal Operation Time / Total Time, expressed as a percentage of uptime such as 99.99%. It is often abbreviated as a number of 9s; for example, 99.999% is called five 9s.

For quick reference, 99% availability means about 3 days and 15 hours of downtime per year, and 99.999% means about 5 minutes of downtime per year. To calculate specific values, use https://uptime.is/99.9 or check out https://en.wikipedia.org/wiki/High_availability
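The downtime figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a 365-day year; the function name `downtime_per_year` is chosen here for illustration.

```python
def downtime_per_year(availability_pct: float) -> float:
    """Return the allowed downtime in minutes for a 365-day year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1 - availability_pct / 100) * minutes_per_year


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} minutes per year ({minutes / 60:.1f} hours)")
```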

If there is a dependency, the availability of the aggregate system is the product of the availability of the system and of the system it depends on. For example, if System A has 99.9% availability, System B has 99.9%, and A depends on B, the availability of the aggregate system is 99.9% × 99.9% ≈ 99.8%
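As a small sketch of the dependency rule, the function below multiplies the availabilities of every component in a chain. It assumes the components fail independently; the name `serial_availability` is illustrative.

```python
from math import prod


def serial_availability(*availabilities_pct: float) -> float:
    """Availability (in %) of a chain in which every component must be up."""
    return prod(a / 100 for a in availabilities_pct) * 100


# System A depends on System B, each at 99.9%
print(f"{serial_availability(99.9, 99.9):.4f}%")
```

Adding more dependencies to the chain only lowers the result, which is why long call chains of "three 9s" services rarely deliver three 9s end to end.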

If a system has redundant components, the availability is 100% minus the product of the component failure rates. For example, if each of two independent redundant components has 99.9% availability, the failure rate of each is 0.1%, and the resulting availability is 100% minus (0.1% × 0.1%) = 100% minus 0.0001% = 99.9999%, which shows the increase in availability due to redundancy.
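The redundancy rule can be sketched the same way. The calculation assumes the redundant components fail independently and that any single one of them is enough to serve traffic; under that assumption two 99.9% components give roughly 99.9999%. The name `redundant_availability` is illustrative.

```python
from math import prod


def redundant_availability(*availabilities_pct: float) -> float:
    """Availability (in %) when any one of the independent components suffices."""
    # The system is down only when every redundant component is down at once
    all_down_probability = prod(1 - a / 100 for a in availabilities_pct)
    return (1 - all_down_probability) * 100


print(f"{redundant_availability(99.9, 99.9):.4f}%")
```

Note that the failure rates are multiplied as fractions (0.001 × 0.001), not as percentage numbers (0.1 × 0.1).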

Sometimes the availability of dependencies is not known. If we know the Mean Time Between Failures (MTBF) and the Mean Time to Recover (MTTR), then Availability = MTBF / (MTBF + MTTR). For example, an MTBF of 150 days and an MTTR of 1 hour give an availability of 99.97%. Use the link for a deeper dive on this.
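A quick sketch of the MTBF/MTTR formula follows. Both values must be in the same unit (hours here); the function name is an assumption made for this example.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability (in %) estimated from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100


# MTBF of 150 days (3,600 hours) and MTTR of 1 hour
print(f"{availability_from_mtbf(150 * 24, 1):.2f}%")
```

The formula also shows that availability can be improved from either side: by failing less often (higher MTBF) or by recovering faster (lower MTTR).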

Achieving the highest levels of availability means higher cost and more complex systems, for example in deployments, capacity addition and rollback. It is not unusual to pay for established high-availability systems even if not all of their features are used. Care must be taken to set the right availability goals with cost in mind.

Resilient Network Design — Key guidelines

Size VPC and subnet CIDR blocks with the future in mind, considering all peering possibilities, overlaps with on-premises ranges, and the number of IPs needed for Lambdas, ELBs etc.
Use redundancy for VPN connectivity and Direct Connect
Leverage AWS services like AWS Shield, WAF and Route 53 to mitigate DDoS attacks
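The CIDR planning guideline can be checked before anything is provisioned. The sketch below uses Python's standard `ipaddress` module; all address ranges are made-up examples, and the only AWS-specific fact used is that AWS reserves five addresses in every subnet.

```python
import ipaddress

# Hypothetical ranges, for illustration only
vpc = ipaddress.ip_network("10.0.0.0/16")
on_prem = ipaddress.ip_network("10.1.0.0/16")
peer_vpc = ipaddress.ip_network("10.0.128.0/17")


def overlapping(candidate, existing):
    """Return the existing networks that overlap with the candidate range."""
    return [net for net in existing if candidate.overlaps(net)]


conflicts = overlapping(vpc, [on_prem, peer_vpc])
print("Conflicts:", [str(net) for net in conflicts])

# Split the VPC range into /20 subnets and count usable addresses per subnet
subnets = list(vpc.subnets(new_prefix=20))
usable_per_subnet = subnets[0].num_addresses - 5  # AWS reserves 5 IPs per subnet
print(f"{len(subnets)} subnets with {usable_per_subnet} usable IPs each")
```

An overlap found at this stage is cheap to fix. The same overlap found after peering or a VPN is in place usually means re-addressing a network.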

Application Design for High Availability

Designing for HA of five 9s or higher requires extreme measures. Many service providers and software libraries are not designed to offer such numbers, so reaching them requires using multiple service providers, avoiding single points of failure and custom development, along with extreme automation of operations and exhaustive failure testing.

Some of the types of interruptions that applications have to handle are listed below

Hardware failures
Deployment failures
Increased load
Unexpected input errors
Credentials/Certificates expiration
Dependency Failures
Infrastructure failures such as power outages

All these failures must be handled, and recovery must be automated. This means that the work of achieving HA is not trivial, which in turn leads to refining the availability goals. Some transactions may have to be more reliable than others, and requirements may differ by peak times, geography and so on.

Understanding Availability needs

The whole system doesn’t need to be engineered to be highly available. For example, real-time operations may be more important than batch jobs, and non-peak hours may tolerate lower availability. Data planes may matter more than control planes: keeping existing EC2 instances healthy and operating may be more important than being able to launch new instances.

Application Design for Availability

By taking advantage of proven practices, application availability can be improved significantly. The following principles guide application design for HA

Fault Isolation Zones: By leveraging AWS Availability Zones and Regions, the impact of a failure can be contained to specific areas. Replicating data across regions reduces the risk of data loss
Redundancy: Avoiding single points of failure is key. It is important to build software that is resilient to the failure of a single fault isolation zone
MicroServices: With a MicroServices architecture, we can set different availability requirements for different components. There are trade-offs in this architecture; check out https://martinfowler.com/articles/microservice-trade-offs.html
Recovery Oriented Computing: Reducing the time to recover is critical. Recovery procedures are based on the impact rather than on the type of issue that occurred. In some cases, such as EC2, terminating the instance and letting the ASG spawn a new one may be better than trying to identify and fix the exact issue. Testing recovery paths regularly is critical to confirm that the procedures remain valid
Distributed Systems best practices: Some of the best practices for distributed systems are: throttling (rate limits); retrying a limited number of times before returning a failure; failing fast under certain conditions; idempotency tokens to ensure exactly-once processing; keeping the service warmed up with constant work; circuit breakers to check dependency availability; and static stability, so that the system does less work during failures rather than overloading itself further
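Two of the distributed systems practices above, bounded retries and circuit breakers, can be sketched in plain Python. This is an illustrative sketch, not a production library: the class and function names, the thresholds and the timeout are all assumptions made for this example, and the breaker is deliberately simple (it is not thread-safe).

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and a call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of adding load to a struggling dependency
                raise CircuitOpenError("dependency marked unavailable")
            # Timeout elapsed: let one trial call through (half-open state)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry func a bounded number of times with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker says stop, so do not keep retrying
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: return the failure to the caller
            # Random jitter keeps many clients from retrying in lockstep
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

A caller would wrap a dependency call as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` stands for any function that talks to the dependency. Retries absorb transient errors, while the breaker stops the retries once the dependency looks down.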

Operational Considerations for Availability

It’s important to plan the automated or human workflows used across the full lifecycle of the application. Testing is an important part of the delivery pipeline. Apart from unit and functional testing, performance testing, sustained load testing and failure injection testing are important. An operational readiness review must evaluate the completeness of testing and monitoring, and confirm that the application’s performance can be audited against its SLA

Automate Deployments

Use advanced deployment techniques like

Canary: Introduce changes incrementally, then move forward or roll back based on the monitored impact
Blue-Green: Establish a parallel new stack and switch traffic at once (and roll back quickly if necessary)
Feature Toggle: Use configuration options to deploy new features, turning them on or off as needed
Failure Isolation Zone deployment: Use fault isolation zones to isolate deployments and plan capacity around the failure of zones
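The canary technique above boils down to a simple control loop. The sketch below is one possible shape for it. The traffic steps, the error-rate threshold and the `error_rate_at` hook are assumptions for illustration; in a real pipeline the hook would shift traffic to the new version and read the resulting error rate from monitoring.

```python
from typing import Callable, Sequence


def canary_rollout(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Shift traffic to the new version step by step.

    Returns the percentage of traffic left on the new version:
    the last step if every step stayed healthy, 0 after a rollback.
    """
    current = 0
    for percent in steps:
        # Hypothetical hook: route `percent` of traffic and observe the error rate
        if error_rate_at(percent) > max_error_rate:
            return 0  # roll back: all traffic returns to the old version
        current = percent
    return current


# A release that only misbehaves under heavier load is caught at the 50% step
print(canary_rollout(lambda percent: 0.05 if percent >= 50 else 0.001))
```

Blue-Green can be seen as the degenerate case with a single step of 100%, which is why it relies on a fast rollback path.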

Testing

Testing must be aligned with your availability goals. Unit testing, load testing, performance testing, failure testing and external dependency testing are necessary.

Monitoring and Alerting

Monitoring needs to detect failures effectively and raise alerts. The last thing you want is for the customer to know about an issue before you do. Monitoring and alerting systems should be decoupled from the main services, so that service disruptions don’t impact monitoring and alerting.

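Monitoring has five phases: generation, aggregation, real-time processing and alarming, storage, and analytics. Storage and analytics are covered together below.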

Generation

Determine the services that require monitoring, define metrics, and create thresholds and corresponding alarms. Almost all AWS services make a large amount of monitoring and log information available for consumption. For example, Amazon ECS and AWS Lambda stream logs to CloudWatch Logs, and VPC Flow Logs can be enabled on any or all ENIs in a VPC

Aggregation

Amazon CloudWatch and S3 are the primary aggregation layers. Some services, like ASGs and ELBs, provide metrics out of the box. Other streaming sources, like VPC Flow Logs and CloudTrail event data, are forwarded to CloudWatch Logs, where they can be filtered to extract metrics. This provides time series data from which alarms can be triggered.
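To make the idea of filtering logs into metrics concrete, here is a plain-Python sketch of what a metric filter does conceptually. It does not call any AWS API. The log format (timestamp, level, message) and the threshold are made up for this example.

```python
from collections import Counter

# Hypothetical log lines in the form "<ISO timestamp> <LEVEL> <message>"
LOG_LINES = [
    "2018-01-28T10:00:05Z INFO request served",
    "2018-01-28T10:00:41Z ERROR upstream timeout",
    "2018-01-28T10:01:02Z ERROR upstream timeout",
    "2018-01-28T10:01:17Z ERROR connection reset",
    "2018-01-28T10:01:48Z INFO request served",
]


def errors_per_minute(lines):
    """Turn raw log lines into a time series: minute -> number of ERROR lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        timestamp, level, _message = parts
        if level == "ERROR":
            counts[timestamp[:16]] += 1  # truncate to YYYY-MM-DDTHH:MM
    return dict(sorted(counts.items()))


def breaching_minutes(series, threshold):
    """Return the minutes whose error count reaches the alarm threshold."""
    return [minute for minute, count in series.items() if count >= threshold]


series = errors_per_minute(LOG_LINES)
print(series)
print("Alarm for:", breaching_minutes(series, threshold=2))
```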

Real-time processing and Alarming

Alarms can be integrated with SNS to reach multiple subscribers, sent to SQS for third-party integration, or handled by AWS Lambda to act immediately.

Storage and Analytics

CloudWatch Logs can send logs to Amazon S3, and EMR can be used to gain further insight into the data. Third-party tools like Splunk or Logstash can be used for aggregation, processing, storage and analytics. Data retention requirements are key, and older data can be moved to Amazon Glacier for long-term archival.
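Moving older data to Glacier is typically expressed as an S3 lifecycle rule. The sketch below only builds the rule as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The prefix, the day counts and the helper name `archive_rule` are assumptions for illustration, and the API call itself is left as a comment so the example runs without AWS credentials.

```python
def archive_rule(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build an S3 lifecycle configuration that archives, then expires, old objects."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to Glacier once they are no longer queried often
                "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
                # Delete objects when the retention requirement has been met
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = archive_rule("logs/", archive_after_days=90, expire_after_days=365)
print(config)

# With boto3 installed and credentials configured, the rule could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-log-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```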

Operational Readiness Reviews (ORRs)

ORRs are important to ensure that applications are ready for production. Teams need an initial ORR checklist, and the review must be repeated to validate its accuracy. One team’s ORR must incorporate lessons learned from other applications.

Auditing

Auditing the monitoring data to validate availability goals is critical. Root cause analysis requires the ability to discover what happened. AWS provides the following services to track the state of services during an incident

CloudWatch Logs: Store and inspect logs
AWS Config: Find the state of the infrastructure at any point in time
AWS CloudTrail: Find which AWS APIs were invoked and by whom

Stay tuned for Part 2: Resiliency and Availability Design Patterns for the Cloud

References

AWS Well-Architected Framework