Architecting for Reliability Part 2 — Resiliency and Availability Design Patterns for the Cloud
This is part 2 of the Architecting for Reliability Series
Refer to https://en.wikipedia.org/wiki/Software_design_pattern for a detailed overview of what a design pattern is and for references to several software design patterns. This story reviews some of the popular design patterns relevant to resiliency and availability in the context of the cloud.
This story draws on the links below, and all images are courtesy of the respective content owners.
- https://docs.microsoft.com/en-us/azure/architecture/patterns/
- http://en.clouddesignpattern.org/index.php/Main_Page
Availability Patterns
Availability represents the time the system is functional and working. It can be affected by system maintenance, software updates, infrastructure issues, malicious attacks, system load, and dependencies on third-party providers. Availability is typically measured against an SLA and expressed in nines. For example, five nines means 99.999% availability, which allows the system to be down for only about 5 minutes in a year. Check https://uptime.is/ to find the allowed downtime for a specific SLA.
Health Endpoint Monitoring
In the cloud, an application can be impacted by several factors such as latency, provider issues, attackers, and application issues. It is necessary to monitor at regular intervals that the application is working.
Solution Outline
- Create a Health Check Endpoint
- The endpoint must perform a useful health check that covers subsystems such as storage, databases, and third-party dependencies
- Report application availability through the response status code and content
- Monitor the endpoint at appropriate intervals, measuring latency from locations close to the customer
- Secure the endpoint to prevent attacks
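The outline above can be sketched as a small handler that aggregates subsystem checks into one HTTP-style response. The subsystem checks and function names here are illustrative placeholders, not part of any particular framework:

```python
# Minimal health-check endpoint sketch. The subsystem checks are
# placeholders; a real implementation would run a cheap database query
# (e.g. SELECT 1) or read/write a small probe object in storage.

def check_database():
    # Hypothetical: replace with a real connectivity check.
    return True

def check_storage():
    # Hypothetical: replace with a real probe read/write.
    return True

def health_endpoint():
    """Aggregate subsystem checks into an (HTTP status, detail) pair."""
    checks = {"database": check_database(), "storage": check_storage()}
    healthy = all(checks.values())
    status = 200 if healthy else 503  # 503 signals "unavailable" to monitors
    return status, checks

status, detail = health_endpoint()
print(status, detail)
```

In a real service this function would be wired to an HTTP route (and secured, per the outline), and the monitoring system would alert on non-200 responses or high latency.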
For a complete overview, refer to the Microsoft link.
The following diagram depicts the same pattern in an AWS-specific implementation. It illustrates how deep the health check should go, rather than merely serving a static page at the front.
Refer to the link for more details.
Other references
http://microservices.io/patterns/observability/health-check-api.html
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/health-checks-creating.html
Queue-Based Load Leveling
A service can be overloaded by heavy load or frequent requests, which can affect its availability. Queuing such requests and processing them asynchronously helps improve the stability of the system.
Solution Outline
- Introduce a Queue between the task and service
- The tasks are placed in the Queue
- The service processes the tasks at its desired pace. In more advanced implementations, the service can be autoscaled based on queue size.
- If a response is expected, the service must provide a suitable mechanism to deliver it; however, this pattern isn't suitable for low-latency response requirements
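The steps above can be sketched with a plain in-process queue: producers enqueue a burst of tasks at any rate, while a single worker drains them at its own pace. In the cloud the queue would be a managed service (e.g. SQS or Azure Queue Storage); this is only a minimal local illustration:

```python
import queue
import threading

# Queue-based load leveling sketch: the queue absorbs a burst of
# requests; the worker processes them at its own pace.

task_queue = queue.Queue()
results = []

def worker():
    while True:
        task = task_queue.get()
        if task is None:          # sentinel value tells the worker to stop
            break
        results.append(task * 2)  # stand-in for real processing
        task_queue.task_done()

t = threading.Thread(target=worker)
t.start()

for i in range(5):                # burst of requests lands in the queue
    task_queue.put(i)
task_queue.put(None)              # signal shutdown after the burst
t.join()

print(results)  # [0, 2, 4, 6, 8]
```

The producer never blocks on a slow consumer; scaling out simply means adding more worker threads (or instances) reading from the same queue.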
For a complete overview, refer to the link.
Other references
https://msdn.microsoft.com/library/dn589781.aspx
http://soapatterns.org/design_patterns/asynchronous_queuing
Throttling
Limit the resources consumed by a service, its components, or its clients so that the service can continue to function and meet its SLAs even during extreme load.
Solution Outline
- Set limits on individual user access, monitor metrics, and reject requests when a limit is exceeded
- Disable or degrade nonessential services so that critical services can continue to function; for example, a video call can switch to audio only during bandwidth issues
- Prioritize certain users and use load leveling to satisfy high-impact customers' requirements
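One common way to implement the per-user limit in the first bullet is a token bucket: each user gets a bucket of tokens that refills at a fixed rate, and a request is rejected when the bucket is empty. A minimal sketch (capacity and rate values are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket throttle sketch: allows bursts up to `capacity`
    requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: caller should reject or queue

bucket = TokenBucket(capacity=3, rate=1)
decisions = [bucket.allow() for _ in range(5)]
print(decisions)  # the first 3 calls are allowed, later ones rejected
```

In practice one bucket is kept per user or API key, and rejected requests return HTTP 429; managed gateways (e.g. Amazon API Gateway, per the link below) provide this out of the box.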
For a complete overview, check the link.
Other references
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html
Resiliency Patterns
Resiliency is the ability of a system to recover from failures gracefully. Detecting failures and recovering quickly and efficiently is key.
Bulkhead
Isolate application components such that the failure of one doesn't impact the others. A bulkhead is a sectioned partition in a ship's hull: if one section is breached, water fills only that section, saving the ship from sinking.
Solution Outline
- Partition service instances into groups and allocate resources to each group individually, so that a failure cannot consume resources outside its own pool
- Define partitions according to business and technical requirements, for example, high priority customers may get more resources
- Leverage frameworks like Polly or Hystrix, and use technologies like containers to provide the isolation. For example, containers can be given hard CPU/memory limits so that the failure of one component doesn't run away with shared resources.
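A minimal in-process version of this pattern gives each downstream dependency its own bounded pool of call slots, so a flood of calls to one dependency cannot exhaust the capacity reserved for another. The pool names and sizes here are illustrative:

```python
import threading

# Bulkhead sketch: one bounded semaphore per downstream dependency.
# Exhausting the "payments" pool leaves the "reports" pool untouched.

pools = {
    "payments": threading.BoundedSemaphore(2),
    "reports": threading.BoundedSemaphore(2),
}

def call(service, fn):
    """Run fn() inside the service's pool; fail fast if it is full."""
    sem = pools[service]
    if not sem.acquire(blocking=False):
        return "rejected"  # pool exhausted: fail fast instead of queuing
    try:
        return fn()
    finally:
        sem.release()

# Simulate the payments pool being saturated by stuck calls
# (acquired and never released).
pools["payments"].acquire(blocking=False)
pools["payments"].acquire(blocking=False)

print(call("payments", lambda: "ok"))  # rejected
print(call("reports", lambda: "ok"))   # ok
```

Libraries like Polly implement the same idea (a bounded slot count per dependency) with richer policies such as queuing and fallbacks.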
For a complete overview, refer to the link.
Other Resources
Polly — a .NET fault-tolerance library
Circuit Breaker
When a service is deemed to have failed and could negatively impact other applications if it continued to be called, calls to it should fail fast with an exception. Later, when the problem appears to be fixed, calls to the service can be resumed.
Solution Outline
The following diagram shows an implementation of the circuit breaker as a state machine.
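The state machine can also be sketched in code. This is a minimal, illustrative version with the usual three states: Closed (calls pass through), Open (calls fail fast), and Half-Open (one trial call decides whether to close again). Thresholds and names are assumptions, not a specific library's API:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: Closed -> Open after
    `max_failures` consecutive errors; after `reset_timeout` seconds
    the next call is a Half-Open trial that decides the new state."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"       # allow one trial call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.state == "half-open":
                self.state = "open"            # trip the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                      # success resets the count
        self.state = "closed"
        return result

breaker = CircuitBreaker(max_failures=2, reset_timeout=60)

def flaky():
    raise ValueError("downstream failure")

for _ in range(2):
    try:
        breaker.call(flaky)
    except ValueError:
        pass

print(breaker.state)  # open: subsequent calls fail fast for 60s
```

Production implementations (Hystrix, Polly, Resilience4j) add failure-rate windows, metrics, and fallbacks on top of this basic state machine.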
For a complete overview, please check the link.
Other References
http://microservices.io/patterns/reliability/circuit-breaker.html
https://spring.io/guides/gs/circuit-breaker/
Compensating Transaction
In a distributed system, strong consistency is not always optimal. Eventual consistency often yields better performance and easier integration of components. But when a step fails, it becomes necessary to undo the previously completed steps.
Solution Outline
A compensating transaction records each step of the workflow and undoes the completed operations when a failure occurs.
The following diagram depicts a sample use-case with sequential steps.
A compensating transaction doesn't have to undo the steps in the exact reverse order; in some cases it may be possible to execute the compensating calls in parallel.
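The idea can be sketched as a workflow runner that pairs each step with an undo action: completed steps register their undo, and on failure the recorded undos run in reverse. The travel-booking steps below are a hypothetical example:

```python
# Compensating-transaction sketch: each completed step records an
# undo action; on failure, the undos run in reverse order.

log = []

def run_workflow(steps):
    """steps: list of (do, undo) callables. Returns True on success,
    False after compensating for a failure."""
    done = []
    for do, undo in steps:
        try:
            do()
            done.append(undo)          # record how to undo this step
        except Exception:
            for compensate in reversed(done):
                compensate()           # undo completed work, newest first
            return False
    return True

# Hypothetical workflow: the second step fails, so the first is undone.
def book_flight():  log.append("flight booked")
def cancel_flight(): log.append("flight cancelled")
def book_hotel():   raise RuntimeError("no rooms available")

ok = run_workflow([(book_flight, cancel_flight), (book_hotel, lambda: None)])
print(ok, log)  # False ['flight booked', 'flight cancelled']
```

Note that the compensations are new forward operations (a cancellation), not a rollback to a snapshot; they must themselves be retried safely if they fail.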
For a detailed overview, check the link.
Leader Election
Coordinate the actions performed by multiple similar instances. For example, several instances may be doing similar tasks that need coordination, or need to avoid contention for shared resources. In other cases, the results of work from several similar instances may need to be aggregated.
Solution Outline
A single task instance is elected as the leader and coordinates the actions of the other, subordinate instances. Since all instances are similar peers, a robust leader election process is required.
The leader election process can use one of several strategies:
- Select lowest ranked instance or process ID
- Acquire a Mutex — Care should be taken to release the Mutex when a leader is disconnected or fails
- Implement common leader election algorithms like Bully or Ring
Also, take advantage of third-party solutions like Apache ZooKeeper to avoid developing a complex internal solution.
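The first strategy above (lowest-ranked instance) can be sketched in a few lines: if every peer knows the full membership list, each can deterministically compute the same leader with no messages at all. The member IDs are illustrative:

```python
# Lowest-ID leader election sketch: every peer knows the member list,
# so all peers independently agree that the smallest ID is the leader.

def elect_leader(member_ids):
    """Run locally by every member; all reach the same answer."""
    return min(member_ids)

members = {17, 4, 42, 9}
leader = elect_leader(members)
print(leader)  # 4

# Every node, running the same function on the same view, agrees.
assert all(elect_leader(members) == leader for _ in members)
```

The hard part in practice is not this computation but agreeing on the membership list when nodes fail or the network partitions, which is why algorithms like Bully/Ring, or coordination services like ZooKeeper, exist.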
For a detailed overview, check the link.
Retry
Enable the application to handle transient failures by retrying failed operations, improving the stability of the application.
Solution Outline
Common approaches are:
- Cancel if the fault is deemed not transient and unlikely to be fixed
- Retry immediately if the fault seems unusual and an immediate retry might succeed
- Retry after a delay if the fault looks like a temporary issue and likely to be fixed after a short interval, for example, an API rate limit issue.
Retry attempts should be applied with caution and should not further strain an already loaded application. Also, confirm that the operation is safe to retry (idempotent).
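The "retry after a delay" approach, with the caution above built in, is usually implemented as exponential backoff with jitter: each attempt waits roughly twice as long as the previous one, plus a small random offset so that many clients don't retry in lockstep. A minimal sketch (the exception type and delays are illustrative):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a fault that is expected to clear, e.g. a rate limit."""

def retry(fn, attempts=3, base_delay=0.01):
    """Retry fn on TransientError with exponential backoff + jitter.
    fn must be idempotent, since it may run multiple times."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise                          # out of attempts: give up
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, base_delay))  # jitter

# Hypothetical operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary glitch")
    return "ok"

result = retry(flaky)
print(result)  # "ok", after two backed-off retries
```

Non-transient errors are deliberately not caught here; per the first bullet above, those should cancel immediately rather than retry.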
For a complete reference, check the link.
Scheduler Agent Supervisor
Coordinate a larger distributed operation. When individual steps fail, try to recover, for example using the Retry pattern; if the step is not recoverable, undo the completed work so that the entire operation fails or succeeds in a consistent fashion.
Solution Outline
The solution involves three actors:
- Scheduler — arranges the execution of the steps in the workflow and orchestrates the operation, recording the state of each step. The scheduler communicates with agents to execute steps; this communication typically happens asynchronously via queues or messaging platforms
- Agent — encapsulates the logic to call the remote service referenced by a step. Each step might use a different agent
- Supervisor — monitors the status of each step in the task performed by the scheduler. On failure, it requests the appropriate recovery, which is performed by an agent or orchestrated by the scheduler
The following diagram shows a typical implementation
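A toy version of the three actors' interaction can be sketched as follows. Everything here is illustrative: the scheduler runs steps via agents and records state, and the supervisor later re-requests any step left in a failed state (in this sketch, step "b" fails on its first attempt):

```python
# Scheduler/Agent/Supervisor sketch (names and structure illustrative).
# state maps each step to "pending", "failed", or "done".

def agent(step):
    """Agent: executes one step. To simulate a transient fault,
    the first attempt at step 'b' fails."""
    if step == "b" and state[step] == "pending":
        state[step] = "failed"
        return
    state[step] = "done"

def scheduler(steps):
    """Scheduler: records each step's state and dispatches it to an agent."""
    for step in steps:
        state.setdefault(step, "pending")
        agent(step)

def supervisor(steps):
    """Supervisor: scans recorded state and requests recovery
    (here, a simple re-run) for any failed step."""
    for step in steps:
        if state[step] == "failed":
            agent(step)

steps = ["a", "b", "c"]
state = {}
scheduler(steps)   # step "b" ends up "failed"
supervisor(steps)  # recovery re-runs "b"
print(state)       # all steps end up "done"
```

A real implementation would persist the state store, run the supervisor periodically, bound the number of recovery attempts, and fall back to a compensating transaction when a step is unrecoverable.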
For a complete reference, check the link.
Other References
- Microsoft Azure Scheduler
- Process Manager pattern
- Cloud Architecture: The Scheduler-Agent-Supervisor Pattern
AWS Specific Patterns
The following section describes some patterns showing specific AWS Implementations. Some of them are relatively simple but worth checking out.
Multi-Server Pattern
In this approach, you provision additional servers behind a load balancer to improve availability within a datacenter/availability zone.
You need to be watchful of shared data and sticky sessions; leverage other data access patterns to address such issues.
For a detailed overview, check the link.
Multi-DataCenter Pattern
This expands on the Multi-Server Pattern to address Datacenter failures by creating servers in multiple datacenters/availability zones
Data sharing concerns still remain as described in the previous section.
For a complete overview, check the link.
Floating IP Pattern
In this pattern, the application assigns the server a floating IP that can be reassigned to another working server in case of failure. While the idea is primitive and depends on the Elastic IP feature, the pattern can be expanded into more advanced architectures.
For a complete overview, check the link.