Patterns for the Distributed Systems in Cloud — Part 2

Sathiya Shunmugasundaram
7 min read · Jan 2, 2020


This story is a continuation of https://medium.com/@vagees/patterns-for-the-distributed-systems-in-cloud-part-1-b51dc454f0c7

The primary reference for this article is https://aws.amazon.com/builders-library/, and the goal is to provide a quick reference to the patterns.

Part 1 covered 3 architectural patterns around distributed and asynchronous systems.

Timeouts, retries, and backoff with jitter (Architectural)

Failures can happen for many reasons: servers, networks, and load balancers can fail, and operators sometimes make mistakes. It is impossible to design systems that never fail. The following solutions help design systems that handle failures gracefully.

Timeouts

It’s a good practice to set a timeout on every remote call, and on any request that crosses process boundaries even within the same server. Arriving at a timeout value is usually challenging. A timeout that is too high is not very useful, because the system still consumes resources while waiting. A timeout that is too low may cause too many retries and could lead to a complete outage.

Latency metrics are a good starting point. Choosing an acceptable false-timeout rate (say 0.1%) and then setting the timeout at the corresponding point in the latency profile (here, the p99.9 latency) is a useful way to arrive at a reasonable value. This approach has downsides when clients are spread across the globe, so that network latencies are high and vary significantly. It also struggles where latency is tightly bounded (the 99th percentile is very close to the 50th percentile), because even small increases in latency can then cause spurious timeouts. Also, certain operations such as DNS lookups and TLS handshakes may not be covered by the timeout.

It’s preferable to use well-tested clients that already implement timeouts rather than building the mechanism from scratch. Also, some latency is associated with application startup (for example, establishing connections) rather than steady-state operation. By designing systems that establish their connections at startup, the timeouts on the main request path can be set low.
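As a rough illustration, here is a minimal Python sketch of setting separate connect and read timeouts on a remote call using the requests library. The endpoint and the specific values are hypothetical; in practice they would be derived from the dependency's latency profile as described above.

```python
import requests

# Illustrative values only: derive the read timeout from the dependency's
# latency profile (e.g., close to its p99.9 latency) so that roughly 0.1%
# of healthy requests time out falsely.
CONNECT_TIMEOUT_S = 1.0   # covers establishing the connection
READ_TIMEOUT_S = 2.0      # covers waiting for the response

def get_product(product_id: str) -> dict:
    # Hypothetical internal endpoint; requests accepts a (connect, read)
    # timeout tuple so that neither phase can hang indefinitely.
    response = requests.get(
        f"https://catalog.internal.example.com/products/{product_id}",
        timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S),
    )
    response.raise_for_status()
    return response.json()
```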

Retries and Backoff

While a retry provides a higher chance of success, it helps only when the problem is transient. If failures are caused by overload, retries can make the situation worse and delay recovery significantly.

The preferred solution is backoff: between retries, the client waits progressively longer rather than retrying aggressively. The exponential backoff pattern increases the backoff time exponentially after each retry. Clients typically also cap the backoff at a maximum value, known as capped exponential backoff. Even then, the client should give up and surface the failure after a bounded number of retries.
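As an illustration, here is a minimal Python sketch of capped exponential backoff with a bounded number of attempts. Both `call` and `TransientError` are hypothetical stand-ins for your dependency call and its retryable error type.

```python
import time

class TransientError(Exception):
    """Hypothetical retryable error raised by the dependency call."""

def call_with_backoff(call, max_attempts=4, base_delay_s=0.1, max_delay_s=2.0):
    """Retry `call` with capped exponential backoff, then give up."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # surface the failure after a bounded number of retries
            # Exponential backoff, capped so the wait never exceeds max_delay_s.
            time.sleep(min(max_delay_s, base_delay_s * (2 ** attempt)))
```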

When there are multiple layers, retrying at each layer multiplies the load on downstream services. It’s a good practice to implement retries at a single point in the stack.

Even with a single layer, traffic can increase significantly when errors start. Circuit breakers are used to stop traffic to a degraded service entirely, but they can make systems difficult to test. A token bucket technique is widely employed instead, limiting retries locally.
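A minimal sketch of the local token bucket idea follows; the capacity and refill rate are illustrative, and the class name is hypothetical.

```python
class RetryTokenBucket:
    """Locally rate-limit retries: spend a token per retry, earn a fraction
    back per successful call, and suppress retries once the bucket is empty."""

    def __init__(self, capacity=10.0, refill_per_success=0.1):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_success = refill_per_success

    def record_success(self):
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

    def may_retry(self):
        # During sustained failures no tokens are earned back, so retries
        # are quickly suppressed instead of amplifying the overload.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```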

Retries are not safe unless the operation is idempotent. Read-only operations are naturally idempotent. Systems that perform writes can use a token-based approach (idempotency tokens) to make retries safe.
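For writes, the token-based approach can look roughly like the following sketch; `client.put_order` is a hypothetical API that stores the token alongside the result so a retried request is not applied twice.

```python
import uuid

def create_order_with_retry_safety(client, order_details):
    # The client generates one idempotency token per logical operation and
    # reuses it on every retry. The server remembers the token and returns
    # the original result instead of creating a duplicate order.
    token = str(uuid.uuid4())
    return client.put_order(order_details, idempotency_token=token)
```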

It is vital to know which failures are worth retrying. For example, client errors (such as HTTP 4xx) are likely to fail again, whereas server errors (such as HTTP 5xx) may succeed on a later attempt.

It is good to remember that a retry is selfish: it spends more of the server’s capacity to improve a single client’s chance of success, and too much of it makes an overload worse.

Jitter

When all failed calls are backed off and retried at the same time, the synchronized waves of retries can cause problems again. The solution is jitter. Jitter adds randomness to the backoff, spreading the retries out over time.

Jitter can also be applied to timers, periodic jobs, and other delayed work so that the work is spread out over time.
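A minimal sketch of one common variant ("full jitter") is shown below; the delays are illustrative.

```python
import random
import time

def sleep_with_full_jitter(attempt, base_delay_s=0.1, max_delay_s=2.0):
    # "Full jitter": sleep for a random duration between zero and the capped
    # exponential backoff, so that many clients that failed at the same time
    # spread their retries out instead of retrying in synchronized waves.
    capped = min(max_delay_s, base_delay_s * (2 ** attempt))
    time.sleep(random.uniform(0, capped))
```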

For more details on these solutions, see the AWS Builders Library article on timeouts, retries, and backoff with jitter.

Avoiding fallback in distributed systems (Architectural)

Critical failures can cause severe business disruptions, such as a failed database query causing a website to fail to display the product page for customers. There are four broad categories of strategies to handle critical failures:

  • Retry: Perform the operation again, either immediately or after some delay
  • Proactive retry: Perform the operation several times in parallel and use the first one to finish
  • Failover: Perform the operation against an alternate endpoint, or against multiple copies in parallel
  • Fallback: Use a different mechanism to achieve the same result

While fallback strategies may seem necessary, depending on them can cause serious issues. Often, the error conditions appear only long after the system has been running, and they can be challenging to test or simulate.

Single Machine Applications

A fallback may itself fail, or it can be delayed due to a latent bug. The fallback can also cause additional load or other issues that are equally detrimental. For example, an error-handling routine might log a stack trace during fallback, which, if the error rate exceeds a threshold, can fill up the disk.

For a single-machine application, it is usually better to fail fast than to attempt a fallback. The cause of the problem lives on the same machine, so the application and the failure go down together and the blast radius stays small. Failing fast is a good strategy for single-machine apps, and the trade-off is reasonable.

Distributed fallback

Critical systems (e.g., AWS services like EC2) have to keep running even when customer applications have issues. Implementing a fallback may seem like a way to make such a service more reliable, but distributed fallback has many problems when it comes to critical failures:

  • Distributed fallback strategies are hard to test
  • Distributed fallback strategies can themselves fail
  • Distributed fallback strategies often can make the outage worse
  • Distributed fallback strategies are usually not worth the risk
  • Distributed fallback strategies may have latent bugs

Averting the Fallbacks

The following approaches describe how to design around the need for fallback strategies:

  • Improve the reliability of non-fallback cases — Making the non-fallback code more robust will likely provide more reliability than a fallback. For example, rather than using a database with a cache layer as a fallback, use a robust database service like DynamoDB
  • Let the caller handle the errors — Expect the caller to retry the request. For this to be effective, a lot of effort has to go into designing the system, and a fallback is unlikely to improve reliability further
  • Push data proactively — If a system needs data continuously, pushing the data to it proactively provides better reliability than having the system ask for the data on demand. For example, IAM credentials are pushed to EC2 instances frequently, so a disruption in pushing the data is unlikely to affect requests that use already-delivered credentials
  • Convert fallback into failover — One of the pitfalls of fallback is that it may not be exercised in production for weeks or months, so both the typical and the fallback paths should run regularly. If not entirely, a random sample of requests should be routed through the fallback path to validate it continuously
  • Ensure that retries and timeouts do not become fallbacks — Services might run a very long time without needing to retry, so the retry path is itself rarely exercised. Proactive retry (also called constant work) makes redundant requests all the time and uses the successful responses, which ensures that retries never show up as a sudden extra load (see the sketch after this list)
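As a rough sketch of proactive retry for a read path (assuming Python and a hypothetical `fetch` function that reads from one replica), the same request can be sent to several replicas and the first completed response used:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_read(fetch, replicas):
    """Send the same read to every replica and use the first one to complete."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fetch, replica) for replica in replicas]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()  # raises if the fastest call failed
    # The slower requests are left to finish in the background; that redundant
    # work is the constant cost that keeps retries from becoming a load spike.
    pool.shutdown(wait=False)
    return result
```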

As a final note, if fallback is an essential strategy in your system, exercise it often. For more details on this section, see the AWS Builders Library article on avoiding fallback in distributed systems.

Static stability using Availability Zones (Architectural)

Availability Zones (AZs)

AZs are isolated sections of an AWS Region. They are separated by a meaningful distance and have independent power and infrastructure, so failures in one zone do not impact the others, while high-speed connectivity is still provided between the AZs.

Static Stability

Static stability is an approach in which a system keeps working during an impairment without having to react to it, because the hard work is done ahead of time. It’s a best practice to design systems around control planes and data planes.

  • Control plane — contains the services that make changes, such as launching a new EC2 server, setting up its permissions and security rules, attaching a volume, and applying the necessary network configuration.
  • Data plane — this is where the actual server operates, doing things like writing to disk and routing packets across the network.

If the control plane is impaired, it shouldn’t impact the data plane. For example, not being able to launch a new server shouldn’t affect the functionality of a running server.

If we decide to replace a server (in the data plane) in response to an AZ-level issue, it is very likely that the control plane in the same AZ has been impaired by the same issue, and we may be unable to launch a new EC2 instance.

This situation implies that we lack static stability. Relying on the control plane to address a capacity issue adds risk when a single issue affects both the control and data planes.

Active-active on Availability Zones example: A load-balanced service

Consider a simple example of a load-balanced HTTPS service. The service is deployed using an Auto Scaling group (ASG) across three AZs. If the service is impaired in one AZ, health checks mark the instances in that AZ unavailable. By overprovisioning each AZ’s capacity by 50%, the remaining AZs can serve the full load.
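The arithmetic behind the 50% figure can be sketched with illustrative numbers:

```python
# Illustrative numbers: the service needs 30 instances in total at peak,
# spread evenly across 3 AZs.
total_needed = 30
azs = 3

per_az_nominal = total_needed / azs        # 10 instances per AZ
per_az_provisioned = per_az_nominal * 1.5  # 50% overprovisioning -> 15 per AZ

# If one AZ is lost, the two healthy AZs still hold 2 * 15 = 30 instances,
# which is the full required capacity, with no scaling action needed.
surviving = (azs - 1) * per_az_provisioned
assert surviving >= total_needed
```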

The same pattern works for batch systems as well. By leveraging ASGs across multiple AZs, the healthy zones take on the extra load when an AZ is impaired.

Active-standby on Availability Zones example: A relational database

In this case, a primary node serves customers, and a passive secondary node in a different AZ is kept on standby with synchronous replication. When the primary is impaired, RDS flips the DNS record to the secondary, which takes over as the primary.

This pattern can also be used for other active-standby systems and for cluster architectures that elect a leader. When a leader fails, another existing node takes over as leader instead of needing to bring up a new node.

Static Stability Considerations

When designing services that provide higher availability to other services, it is vital to consider zonal dependencies. It may be necessary to make such core services zonal. A classic example is the AWS NAT gateway, which is a zonal service. Since the NAT gateway is part of the EC2 instance’s data plane, a zone-independent (regional) NAT service impaired by an AZ failure could affect EC2 instances in every AZ. Because the NAT gateway is zonal and AZ-specific, any AZ failure remains localized.

For more details, see the AWS Builders Library article on static stability using Availability Zones.



Sathiya Shunmugasundaram

Freelance writer in DevOps, Cloud, Resiliency, MicroServices and Containers