Architecting for Reliability Part 2 — Resiliency and Availability Design Patterns for the Cloud

Sathiya Shunmugasundaram · Published in becloudy · 7 min read · Jan 31, 2018

This is part 2 of the Architecting for Reliability series.

Refer to https://en.wikipedia.org/wiki/Software_design_pattern for a detailed overview of what a design pattern is, along with references to several software design patterns. This story specifically reviews some of the popular design patterns relevant to resiliency and availability in the context of the cloud.

The references for this story come from the links below, and all images are courtesy of the respective content owners.

Availability Patterns

Availability represents the time a system is functional and working. It can be affected by system maintenance, software updates, infrastructure issues, malicious attacks, system load, and dependencies on third-party providers. Availability is typically measured against an SLA and expressed in nines. For example, five nines means 99.999% availability, which allows the system to be down for only about 5 minutes in a year. Check https://uptime.is/ to find the allowed downtime for a specific SLA.
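As a quick sanity check on those numbers, here is a small Python calculation of how much downtime each level of nines allows per year:

```python
# Allowed downtime per year for common availability targets.
minutes_per_year = 365.25 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = minutes_per_year * (1 - availability)
    print(f"{availability:.3%} -> {downtime:,.1f} minutes of downtime per year")
```

Running this confirms the figure above: five nines leaves roughly 5.3 minutes of downtime per year, while two nines allows more than three and a half days.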

Health Endpoint Monitoring

In the cloud, an application can be impacted by several factors such as latency, provider issues, attackers, and application bugs. It is necessary to verify at regular intervals that the application is working.

Solution Outline

  • Create a health check endpoint
  • The endpoint must perform a meaningful health check that covers subsystems such as storage, databases, and third-party dependencies (a minimal sketch follows this list)
  • Return application availability using the response status code and content
  • Monitor the endpoint at appropriate intervals, measuring latency from locations close to the customer
  • Secure the endpoint to prevent attacks
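A minimal sketch of such an endpoint using Flask; the check_database and check_storage helpers are hypothetical stand-ins for real subsystem checks:

```python
# Minimal health-check endpoint sketch (Flask).
# check_database() and check_storage() are hypothetical placeholders
# for real checks against your subsystems.
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # e.g., run "SELECT 1" against the primary database
    return True

def check_storage() -> bool:
    # e.g., write and read back a small marker object
    return True

@app.route("/health")
def health():
    checks = {"database": check_database(), "storage": check_storage()}
    healthy = all(checks.values())
    # 200 signals healthy; 503 tells the monitor or load balancer
    # to take this instance out of rotation.
    return jsonify(status="ok" if healthy else "degraded", checks=checks), (
        200 if healthy else 503
    )

if __name__ == "__main__":
    app.run(port=8080)
```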

For a complete overview, refer to the Microsoft link.

The following diagram depicts the same pattern in an AWS-specific implementation. It illustrates how deep the health check should go, rather than stopping at a static page at the front.

Refer to the link for more details.

Other references

http://microservices.io/patterns/observability/health-check-api.html

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html

https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/health-checks-creating.html

Queue-Based Load Leveling

A service can be overloaded by heavy load or frequent requests, which can affect its availability. Queuing such requests and processing them asynchronously helps improve the stability of the system.

Solution Outline

  • Introduce a queue between the task and the service
  • The tasks are placed in the queue
  • The service processes the tasks at its desired pace; in some advanced implementations, the service can be autoscaled based on queue size (a sketch follows this list)
  • If a response is expected, the service must provide a suitable mechanism to deliver it; however, this pattern isn’t suitable for low-latency response requirements
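A minimal load-leveling sketch using only the standard library; the worker’s processing rate and the burst size are illustrative:

```python
# Queue-based load leveling: producers enqueue at any rate, while a
# single worker drains the queue at a steady pace, shielding the
# downstream service from bursts.
import queue
import threading
import time

task_queue: "queue.Queue[int]" = queue.Queue()

def service_worker():
    while True:
        task = task_queue.get()
        time.sleep(0.1)  # simulate the service's sustainable pace
        print(f"processed task {task}")
        task_queue.task_done()

threading.Thread(target=service_worker, daemon=True).start()

# A burst of 20 requests arrives at once: they are buffered, not dropped.
for i in range(20):
    task_queue.put(i)

task_queue.join()
```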

For a complete overview, refer to the link.

Other references

https://msdn.microsoft.com/library/dn589781.aspx

http://soapatterns.org/design_patterns/asynchronous_queuing

Throttling

Limit the resources consumed by a service, its components, or its clients so that the service can continue to function and meet its SLAs even under extreme load.

Solution Outline

  • Set a limit on individual user access, monitor metrics, and reject requests when the limit is exceeded (a token-bucket sketch follows this list)
  • Disable or degrade nonessential services so that critical services can keep functioning; for example, a video call can switch to audio only during bandwidth issues
  • Prioritize certain users and use load leveling to satisfy high-impact customers’ requirements
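A minimal token-bucket throttling sketch; the rate and capacity values are illustrative assumptions:

```python
# Token-bucket throttle: tokens refill at a fixed rate, each request
# consumes one, and requests are rejected once the bucket is empty.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would typically return HTTP 429

bucket = TokenBucket(rate=5, capacity=10)
print([bucket.allow() for _ in range(12)])  # the last calls are throttled
```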

For a complete overview, check the link.

Other references

https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html

Resiliency Patterns

Resiliency is the ability of a system to recover from failures gracefully. Detecting failures and recovering quickly and efficiently is key.

Bulkhead

Isolate application components such that the failure of one doesn’t impact the others. A bulkhead is one of the sectioned partitions of a ship’s hull: if one partition is compromised, water fills only that partition, saving the ship from sinking.

Solution Outline

  • Partition service instances into groups and allocate resources to each group individually, so that a failure cannot consume resources outside its own pool
  • Define partitions according to business and technical requirements; for example, high-priority customers may get more resources
  • Leverage frameworks like Polly or Hystrix, and use technologies like containers to provide the isolation (a sketch follows this list). For example, containers can be given hard CPU/memory limits so that one failing component doesn’t run away with resources.
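A minimal bulkhead sketch using bounded semaphores; the per-dependency limits are illustrative:

```python
# Bulkhead: each downstream dependency gets its own bounded pool of
# concurrent calls, so a slow or failing dependency can exhaust only
# its own pool, never the whole application's capacity.
import threading

class Bulkhead:
    def __init__(self, max_concurrent: int):
        self.sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        if not self.sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args)
        finally:
            self.sem.release()

# Separate pools per dependency (illustrative limits).
payments_bulkhead = Bulkhead(max_concurrent=10)
reporting_bulkhead = Bulkhead(max_concurrent=2)
print(payments_bulkhead.call(lambda: "payment ok"))
```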

For a complete overview, refer to the link.

Other Resources

Netflix Hystrix

Polly — a .NET fault-handling library

Circuit Breaker

When a service is deemed to have failed, and continuing to call it would negatively impact other applications, calls to it should fail fast by throwing an exception instead. Once the problem appears to be fixed, calls to the service can be resumed.

Solution Outline

The following diagram shows an implementation of the circuit breaker as a state machine with Closed, Open, and Half-Open states.
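A minimal sketch of that state machine; the failure threshold and reset timeout are illustrative assumptions:

```python
# Circuit breaker as a state machine: Closed -> Open -> Half-Open.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"  # allow a single trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"  # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"  # success closes the circuit again
        return result
```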

For a complete overview, please check the link.

Other References

http://microservices.io/patterns/reliability/circuit-breaker.html

https://spring.io/guides/gs/circuit-breaker/

Compensating Transaction

In a distributed system, strong consistency is not always optimal; eventual consistency often yields better performance and looser coupling between components. When a step fails, however, it becomes necessary to undo the steps that have already completed.

Solution Outline

A compensating transaction records each step of the workflow and, if there’s a failure, undoes the operations that have already completed.

The following diagram depicts a sample use-case with sequential steps.

A compensating transaction doesn’t have to undo the steps in exact reverse order; it may be possible to execute some compensation calls in parallel.
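A minimal sketch of the idea: each step pairs an action with its compensation, and on failure the completed steps are undone (here, in reverse order). The booking actions below are hypothetical:

```python
# Compensating transaction: record a compensation for every completed
# step, and run the compensations if a later step fails.
def run_workflow(steps):
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()  # undo the work that already succeeded
        raise

run_workflow([
    (lambda: print("book flight"), lambda: print("cancel flight")),
    (lambda: print("book hotel"), lambda: print("cancel hotel")),
    (lambda: print("charge card"), lambda: print("refund card")),
])
```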

For a detailed overview, check the link.

Leader Election

Coordinate the actions performed by multiple similar instances. For example, several instances may be doing similar tasks that need coordination, or contention for shared resources must be avoided. In other cases, the results produced by several similar instances may need to be aggregated.

Solution Outline

A single task instance should be elected as the leader, which then coordinates the actions of the other, subordinate instances. Since all instances are similar peers, there must be a robust leader election process.

The leader election process can use several strategies:

  • Select the lowest-ranked instance or process ID
  • Acquire a mutex — care should be taken to release the mutex when the leader is disconnected or fails (a lease-based sketch follows below)
  • Implement common leader election algorithms like Bully or Ring

Also, take advantage of third-party solutions like ZooKeeper to avoid developing a complex internal solution.
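As a rough illustration of the mutex/lease strategy, here is a minimal sketch; the in-process store dict is a hypothetical stand-in for a real coordination service such as ZooKeeper or a database row:

```python
# Lease-based leader election: the leader must keep renewing its
# lease; if it dies, the lease expires and another instance can win.
import time
import uuid

class LeaderLease:
    store = {}  # hypothetical shared store: {key: (owner, expires_at)}

    def __init__(self, key: str = "leader", ttl: float = 10.0):
        self.key, self.ttl = key, ttl
        self.me = str(uuid.uuid4())

    def try_acquire(self) -> bool:
        owner, expires = self.store.get(self.key, (None, 0.0))
        now = time.monotonic()
        # Take the lease if it is free, expired, or already ours.
        if owner in (None, self.me) or now > expires:
            self.store[self.key] = (self.me, now + self.ttl)
            return True
        return False

lease = LeaderLease()
if lease.try_acquire():
    print("elected leader: coordinating the subordinate instances")
```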

For a detailed overview, check the link.

Retry

Enable the application to handle transient failures by retrying a failed operation, improving the stability of the application.

Solution Outline

There are various approaches:

  • Cancel if the fault is deemed not transient and unlikely to be fixed by retrying
  • Retry immediately if the fault appears unusual or rare and an immediate retry might succeed
  • Retry after a delay if the fault looks like a temporary issue that is likely to be fixed after a short interval, for example, an API rate limit issue (a backoff sketch follows below)

Retry attempts should be made with caution and should not add strain to an already loaded application. Also, make sure the operation is safe to retry, i.e., idempotent.
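A minimal retry sketch with exponential backoff; the TransientError type, attempt budget, and delays are illustrative assumptions, and only idempotent operations should be retried this way:

```python
# Retry with exponential backoff and a bounded attempt budget.
import time

class TransientError(Exception):
    """Stand-in for a fault worth retrying (e.g., a rate-limit error)."""

def retry(fn, attempts: int = 4, base_delay: float = 0.5):
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the failure
            # 0.5s, 1s, 2s, ...; adding random jitter here would also
            # help avoid synchronized retries from many clients.
            time.sleep(base_delay * (2 ** attempt))
```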

For a complete reference, check the link.

Scheduler Agent Supervisor

Coordinate the steps of a larger operation. When a step fails, try to recover, for example using the Retry pattern; if the step is not recoverable, undo the completed work so that the entire operation either succeeds or fails in a consistent fashion.

Solution Outline

The solution involves three actors:

  • Scheduler — arranges for the execution of the various steps in the workflow and orchestrates the operation, recording the state of each step. The scheduler communicates with agents to execute steps, typically asynchronously via queues or messaging platforms
  • Agent — encapsulates the logic to call the remote service referenced by a step. Each step might use a different agent
  • Supervisor — monitors each step of the task performed by the scheduler. During failures, it requests the appropriate recovery, which is performed by an agent or orchestrated by the scheduler (a compressed code sketch follows below)

The following diagram shows a typical implementation
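As a compressed, in-memory illustration of the three roles (the step table, timeout, and synchronous calls are illustrative assumptions; real implementations use a durable state store and queues between the scheduler and agents):

```python
# Scheduler records step state; agents perform the remote calls;
# the supervisor detects stuck steps and requests recovery.
import time

steps = {
    "reserve-stock": {"state": "pending", "started": None},
    "charge-card": {"state": "pending", "started": None},
}

def agent(step_name):
    # Agent: encapsulates the remote call referenced by one step.
    print(f"calling remote service for {step_name}")
    steps[step_name]["state"] = "complete"

def scheduler():
    for name, info in steps.items():
        if info["state"] == "pending":
            info.update(state="running", started=time.monotonic())
            agent(name)  # in practice, dispatched via a message queue

def supervisor(timeout: float = 30.0):
    for name, info in steps.items():
        if info["state"] == "running" and time.monotonic() - info["started"] > timeout:
            info["state"] = "pending"  # reschedule, or trigger compensation

scheduler()
supervisor()
```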

For a complete reference, check the link.

AWS Specific Patterns

The following section describes some patterns with specific AWS implementations. Some of them are relatively simple but worth checking out.

Multi-Server Pattern

In this approach, you provision additional servers behind a load balancer to improve availability within a datacenter/Availability Zone.

You need to be watchful of shared data and sticky sessions; leverage other data access patterns to address such issues.

For a detailed overview, check the link.

Multi-DataCenter Pattern

This expands on the Multi-Server Pattern to address datacenter failures by creating servers in multiple datacenters/Availability Zones.

Data sharing concerns still remain as described in the previous section.

For a complete overview, check the link.

Floating IP Pattern

Here, the application assigns the server a floating IP that can be reassigned to another working server in case of failure. While the idea is simple and depends on the Elastic IP feature, the pattern can be extended to achieve more advanced architectures.
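A minimal sketch of the failover step using boto3; the region, allocation ID, and instance ID are hypothetical placeholders, and in practice this would run from a health-check or failover script with appropriate AWS credentials:

```python
# Reassociate an Elastic IP with a healthy standby server.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # illustrative region

def fail_over(allocation_id: str, standby_instance_id: str):
    ec2.associate_address(
        AllocationId=allocation_id,       # the Elastic IP's allocation ID
        InstanceId=standby_instance_id,   # the healthy standby instance
        AllowReassociation=True,          # detach from the failed server
    )

fail_over("eipalloc-0123456789abcdef0", "i-0123456789abcdef0")
```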

For a complete overview, check the link.
