Service Resilience: Strategies to Prevent Cascading Failures in Large-Scale Distributed Systems

Harri Faßbender
Apr 23, 2023


In the fast-paced world of technology, large Internet service providers like Google and Amazon strive to expand rapidly while maintaining a widely distributed system. Ensuring reliable service operations is exceptionally relevant to these industry leaders, yet they occasionally experience significant setbacks. System failures can lead to widespread outages that ultimately impact the customer experience and weaken confidence in their system capabilities.

This has already affected Gmail (2012) [1], AWS DynamoDB (2015) [2], and Facebook (2021) [3], to name a few examples. In this context, one may encounter so-called cascading failures, which cause complications that go well beyond ordinary system malfunctions. But how is it that even the big players in the online business, with their budgets and technical expertise, can only partially avoid such breakdowns? And what practical approaches to risk mitigation can you use for your own system?

With that said, the goal of this article is to show how you can increase the resilience of your large distributed system by preventing the propagation of failures.

Cascading failures

A cascading failure is a failure that grows over time due to a positive feedback loop. The typical behavior starts with a single node or subsystem failing. This spreads the load across the remaining nodes, which increases the likelihood of further failures and results in a vicious circle or snowball effect [4]. Cascading failures are highly critical for three reasons: First, they can take down an entire service quickly. Second, the affected system does not return to normal on its own, as it does with more commonly encountered problems, but gets progressively worse, ultimately depending on human intervention. Finally, in the worst case, cascading failures strike seemingly without warning, because the load redistribution, and consequently the failures, happen rapidly [4][5].

Although this blog article focuses on cascading failures in the context of distributed computing, they also occur in a variety of other domains, e.g., power transmission, finance, biology, and ecosystems. They are a reasonably widespread phenomenon with parallels in natural systems [5]. To better understand what a cascading failure in computer science looks like, let’s look at a specific case study.

Case Study: The AWS DynamoDB Outage in 2015

AWS DynamoDB is a highly scalable non-relational database service distributed across multiple data centers that offers strongly consistent read operations and ACID transactions [6]. It is used, and at the time of the incident was already being used, by popular internet services such as Netflix, Airbnb, and IMDb [7]. The incident we will look at as an example of a cascading failure occurred on September 20, 2015, when DynamoDB was unavailable in the US-East region for over four hours. Two subsystems were involved: storage servers and a metadata service, both replicated across multiple data centers. The storage servers request their so-called membership, which includes their data partition assignments, from the metadata service. This is shown in Figure 1.

Figure 1: Storage servers and metadata service
(based on [8])

For the membership (and thus also for the allocation of data partitions), there are timeouts within which the request must succeed. If a timeout is exceeded, the corresponding storage server excludes itself from the service and retries the request.

An unfortunate precondition for the incident was a newly introduced DynamoDB feature called Global Secondary Index (GSI). It gives customers better access to their data but has the downside of significantly increasing the size of the metadata tables, so processing membership requests took much longer. Unfortunately, neither the capacity of the metadata service nor the timeouts for membership requests were adjusted accordingly [9].

The problem began when a brief network issue caused a few storage servers (those dealing with large metadata tables) to fail their membership requests. As a result, these servers became unavailable and kept retrying. This overloaded the metadata service, which slowed down its responses and caused more servers to resubmit their membership requests because they, too, had exceeded their timeouts. The state of the metadata service thus deteriorated even further. Despite several attempts to add capacity, the system remained caught in the failure loop for hours. Ultimately, the problem could only be solved by interrupting requests to the metadata service, i.e., taking the service offline [9].

The result was a widespread DynamoDB outage in the US-East region and an excellent example of a cascading failure. But what are the underlying concepts and patterns of systems that get caught in such an error loop?

Reasons for cascading failures

First, the trigger points for cascading breakdowns can be diverse: new rollouts, maintenance, traffic drains, cron jobs, distributed denial-of-service attacks (DDoS), throttling, and so on. What they all have in common is that they operate within a finite set of resources, potentially leading to effects such as server overload, resource exhaustion, and unavailability of services [4][10]. Let’s look at these in detail:

Server overload

The most common cause is server overload, or a consequence of it. When it occurs, the drop in performance often affects other areas of the system. As shown in Figure 2, in the initial scenario (left), the load coming from two reverse proxies is distributed between clusters A and B so that cluster A operates at its assumed maximum capacity of 1000 requests per second. In the second scenario (right), cluster B fails and the entire load hits cluster A, which can overload it. Cluster A now has to process 1200 requests per second, starts to misbehave, and its throughput drops below the original 1000 requests per second [4].

Figure 2: Clusters A and B receiving load according to capacity (left) and cluster A receiving overload if cluster B fails (right). (based on [4])

Resource Exhaustion

The resources of a server are limited. If the load exceeds a certain threshold, the server’s performance metrics, such as latency or error rates, deteriorate. This translates into a higher risk of a crash. The subsequent effects depend on the type of resource that is causing the bottleneck; for instance,

  • If CPU is insufficient, a variety of issues can occur, including slower requests, excessive queuing effects, or thread starvation.
  • If memory/RAM is overused, tasks may crash or cache hit rates may drop.
  • Thread starvation itself may directly cause errors or lead to health-check failures [4].

Troubleshooting the root cause in this context is often painful, because the components involved are interdependent and the actual cause may be hidden behind a complex chain of events [4]. For example, less memory may be available for caching, resulting in fewer cache hits and thus a higher load on the backend, and similar combinations of effects [10].

Service Unavailability

When resource exhaustion causes a server to crash, traffic spreads to the remaining servers, increasing the likelihood that they will crash as well. A cycle of crashing servers establishes itself. What makes this worse is that these problems persist: some machines are still down or restarting, while the increased traffic prevents them from ever fully recovering [4].

In general, the risk of cascading failure is present whenever traffic is redistributed from unhealthy to healthy nodes, for example by orchestration systems, load balancers, or task-scheduling systems [5]. To resolve cascading failures, we therefore need to look closely at the relationships between components.

Getting out of the loop: How to fix cascading failures

As seen in the case of DynamoDB, fixing cascading failures is tricky. Especially from the perspective of a large tech company, distribution adds a lot of complexity to a system, which makes it even more challenging to keep track of the many interconnections. One useful way to illustrate the (cascading) relationships is the so-called Causal Loop Diagram (CLD), a modeling approach that visualizes feedback loops in complex systems. Figure 3 shows the CLD for the AWS DynamoDB outage. It can be read as follows: an arrow represents the dynamic between two variables; for instance, if the latency of the metadata service increases, the number of timeouts and the number of retries increase as well. If the effects within a loop are highly unbalanced, i.e., the number of pluses and minuses differs by a large margin, the loop is reinforcing, which means the system can be sensitive to cascading failures [5].

Figure 3: The Causal Loop Diagram for the AWS DynamoDB outage in 2015
(based on [5])

Now, various measures can be taken to address the cascading scenario. The first and most intuitive option is to add resources. In the diagram above, you can see the minus introduced into the loop by the capacity of the metadata service: increasing it works against the reinforcing cycle. However, this alone may not be enough, as the AWS case showed. In addition to adding resources, you may therefore need other strategies:

  • Try to avoid health-check failures/deaths, i.e., prevent your system from killing servers through excessive health checking.
  • Restart your servers in the case of thread-blocking requests or deadlocks.
  • Drop traffic significantly and then slowly increase the load so that the servers can gradually recover.
  • Switch to a degraded mode by dropping certain types of traffic.
  • Eliminate batch/bad traffic to reduce the system load caused by non-critical or faulty work [4].

Since these measures ultimately mean that parts of the system are unavailable, and this becomes visible to the customer, it is better to avoid cascading failures in the first place.

Avoiding cascading failures

There are numerous approaches to render distributed systems robust against cascading failures.

On the one hand, large internet companies have already put a lot of thought into keeping a system from falling into a cascade of errors, e.g., by isolating failures, and tools and frameworks have been developed for this purpose. Two examples are Hystrix (from Netflix), a latency and fault tolerance library, and Sentinel [11][12]. Regarding the former, Netflix has since moved on to adaptive concurrency limits. In general, these kinds of tools wrap external calls in some sort of data structure that abstracts the critical points.

On the other hand, and this is where much of the current attention is going, there are more comprehensive solutions such as so-called sidecar proxies and service meshes like Istio; examples of proxy technologies are Envoy or HAProxy [10][13].

In addition to these solutions, there are certain system design concepts you can keep in mind. For example, reduce the number of synchronous calls in your system. This can be done by moving from an orchestration pattern to a choreography pattern, e.g., by applying a publish–subscribe design using Kafka, as sketched below. Such a solution often turns out to be more robust in the face of increasing traffic. Other approaches, such as capacity planning (depending on the use case), can also help; this often implies implementing automatic provisioning and deployment, automatic scaling, and automatic healing. Finally, close monitoring of SLAs and SLOs is important [10][4].
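To make the orchestration-to-choreography idea more concrete, here is a minimal sketch that publishes an event instead of calling each downstream service synchronously. It assumes the kafka-python client; the broker address, topic name, and the place_order function are purely illustrative.

```python
# Minimal sketch: publish an event (choreography) instead of making blocking
# calls to each downstream service (orchestration). Assumes the kafka-python
# client; broker address and topic name are illustrative assumptions.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def place_order(order: dict) -> None:
    # Instead of synchronously calling billing, shipping, etc. and waiting for
    # each of them, emit a single event; downstream services consume it and
    # react at their own pace, so a slow consumer does not block this call.
    producer.send("orders.placed", value={"order_id": order["id"]})
```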

Now, to better understand the underlying solution approaches, we can look at typical antipatterns in distributed systems that should be avoided in the context of cascading failures. Laura Nolan proposes six of them, which are discussed below together with possible risk-mitigation strategies.

Antipattern 1: Acceptance of an unrestricted number of requests

The number of tasks in the queue/thread pool should be limited. This allows you to control when and how the server slows down in the case of too many requests. The limit should be set so that the server can handle peak load, but not so high that it blocks. In this case, failing fast is better than hanging for a long time, for both the system and the user [5]. On the proxy or load-balancer side, this is frequently implemented with rate-limiting strategies, e.g., to fend off DDoS attacks and other forms of server overload [11]. There is more to consider in terms of queue management, as most servers have a queue in front of a thread pool to handle requests. If the number of requests grows beyond the queue’s capacity, requests are rejected. However, many requests waiting in the queue consume memory and increase latency. If the number of requests is close to constant, a small queue or no queue at all is sufficient, which means requests are rejected immediately when traffic increases. A longer queue should be used if larger fluctuations are expected [4]. A minimal sketch of a bounded queue in front of a worker pool follows below.
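The following sketch illustrates the idea of a bounded queue in front of a fixed worker pool. The queue size, worker count, and function names are illustrative assumptions, not values from the article.

```python
# Minimal sketch of a bounded request queue in front of a fixed thread pool.
# Queue size, worker count, and function names are illustrative assumptions.
import queue
import threading

MAX_QUEUE = 100     # requests waiting beyond this are rejected (fail fast)
NUM_WORKERS = 16    # fixed pool size caps concurrent work

request_queue = queue.Queue(maxsize=MAX_QUEUE)

def handle_request(req):
    ...  # the actual business logic would go here

def worker():
    while True:
        req = request_queue.get()
        try:
            handle_request(req)
        finally:
            request_queue.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

def accept(req) -> bool:
    """Admit a request, or reject immediately when the queue is full."""
    try:
        request_queue.put_nowait(req)
        return True
    except queue.Full:
        return False  # the caller can translate this into e.g. HTTP 503
```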

In addition, load shedding and graceful degradation are viable options to protect servers from excessive load. Load shedding aims to maintain the server’s performance as well as possible during overload; approaches range from simply returning an HTTP 503 (Service Unavailable) status code to prioritizing requests and dropping the less important ones. A more sophisticated variant is graceful degradation, which incrementally switches to lower-quality responses that can be produced faster or more cheaply. This should, however, be a well-considered decision, because it can add a lot of complexity to your system [4]. A sketch of both ideas follows below.
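Below is a minimal sketch of both ideas: requests are shed by priority once an assumed in-flight threshold is crossed, and a degraded (cheaper) response path is used as a fallback. All thresholds, the priority labels, and the helper functions are illustrative assumptions.

```python
# Minimal sketch of priority-based load shedding with a degraded fallback.
# Thresholds, the "CRITICAL" label, and the helper functions are assumptions.
import threading

MAX_INFLIGHT = 200     # assumed hard capacity of this server
SHED_THRESHOLD = 150   # above this, best-effort traffic is shed first

_inflight = 0
_lock = threading.Lock()

def full_quality_response(request):
    ...  # expensive, full-quality path (placeholder)

def degraded_response(request):
    ...  # cheaper path, e.g. cached or partial data (placeholder)

def handle(request, priority="BEST_EFFORT"):
    global _inflight
    with _lock:
        if _inflight >= MAX_INFLIGHT:
            return 503, "overloaded"                     # shed everything
        if _inflight >= SHED_THRESHOLD and priority != "CRITICAL":
            return 503, "shedding best-effort traffic"   # shed low priority
        _inflight += 1
    try:
        try:
            return 200, full_quality_response(request)
        except TimeoutError:
            # graceful degradation: answer with a cheaper response instead
            return 200, degraded_response(request)
    finally:
        with _lock:
            _inflight -= 1
```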

Antipattern 2: Dangerous (client) retry behavior

To reduce the system’s workload, excessive retry behavior must be avoided. Exponential backoff is a suitable approach, in which the time interval between retries is successively increased. You should also use so-called jitter, i.e., add random noise to the retry intervals. This prevents your system from being hit by accumulating “load waves,” an effect also known as retry amplification (see Figure 4) [5][10]. A backoff-with-jitter sketch follows after the figure.

Figure 4: Typical pattern of retry amplification (based on [10]).
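The sketch below shows retries with capped exponential backoff and full jitter. The call_backend parameter and the retry/backoff values are illustrative assumptions.

```python
# Minimal sketch of retries with capped exponential backoff and full jitter.
# The retry counts and time constants are illustrative assumptions.
import random
import time

def call_with_backoff(call_backend, max_attempts=5, base=0.1, cap=10.0):
    for attempt in range(max_attempts):
        try:
            return call_backend()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the error
            # exponential backoff: 0.1 s, 0.2 s, 0.4 s, ... capped at `cap`
            backoff = min(cap, base * (2 ** attempt))
            # full jitter: sleep a random time in [0, backoff] so that many
            # clients do not retry in synchronized "load waves"
            time.sleep(random.uniform(0, backoff))
```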

There is also a design pattern called the circuit breaker. A circuit breaker can be thought of as a kind of switch: in its initial (closed) state, requests from an upstream service pass through to the downstream service. If errors increase, the circuit breaker switches to an open state and the system fails fast: the upstream service gets an error immediately, which gives the downstream service time to recover. After a certain time, requests are gradually let through again. The Hystrix library mentioned above, for instance, implements such a circuit-breaker pattern [11]. A minimal sketch follows below.
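Here is a minimal sketch of this closed/open/half-open behavior. The failure threshold and reset timeout are illustrative assumptions; real implementations such as Hystrix additionally track error rates over sliding windows.

```python
# Minimal sketch of a circuit breaker (closed -> open -> half-open).
# failure_threshold and reset_timeout are illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, downstream):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # half-open: after the timeout, let a trial request through
        try:
            result = downstream()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        # a success closes the circuit and resets the failure counter
        self.failures = 0
        self.opened_at = None
        return result
```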

Another approach to mitigating dangerous retry behavior is a server-side retry budget, meaning that only a certain number of retries per minute is allowed; everything beyond the budget is dropped (a sketch follows below). In all cases, however, a global view is essential: retries on multiple levels of the software architecture should be avoided at all costs, as the load they generate can grow exponentially [4].
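A per-minute retry budget could look like the following sketch; the window length and budget size are illustrative assumptions.

```python
# Minimal sketch of a retry budget: retries beyond the per-minute budget are
# dropped instead of amplifying the load. All numbers are assumptions.
import time

class RetryBudget:
    def __init__(self, max_retries_per_minute=60):
        self.max_retries = max_retries_per_minute
        self.window_start = time.monotonic()
        self.used = 0

    def allow_retry(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60.0:
            self.window_start = now  # start a fresh one-minute window
            self.used = 0
        if self.used < self.max_retries:
            self.used += 1
            return True
        return False  # budget exhausted: drop the retry
```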

Finally, it should be noted that retried operations should be idempotent and free of side effects. Keeping calls stateless can also reduce system complexity [10].

Antipattern 3: Crashing on bad input

The system should ensure that servers do not crash on bad input. Such crashes, combined with retry behavior, can have catastrophic consequences, e.g., one server after another crashing. Inputs from outside in particular should be carefully checked in this regard. Fuzz tests are an excellent way to detect these problems [5]; a minimal sketch follows below.
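As a minimal illustration, the following sketch throws random bytes at a hypothetical parse_request function and only tolerates a controlled rejection; dedicated fuzzing tools go much further, but the principle is the same.

```python
# Minimal sketch of a random-input ("fuzz") test. parse_request() is a
# hypothetical stand-in for your own input handling; the only assertion is
# that arbitrary junk never crashes the process.
import os
import random

def parse_request(raw: bytes):
    ...  # real parsing/validation logic would go here

def test_parser_never_crashes(iterations=10_000):
    for _ in range(iterations):
        blob = os.urandom(random.randint(0, 1024))  # arbitrary junk bytes
        try:
            parse_request(blob)   # any well-defined result is acceptable ...
        except ValueError:
            pass                  # ... including a controlled rejection
        # any other unhandled exception propagates and fails the test
```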

Antipattern 4: Proximity-based failover

Ensure that not all of your traffic is redirected to the nearest data center, because it can become overloaded as well. The same logic applies as with the failures of individual servers in a cluster, where one machine can fail after the other. To increase the resilience of your system, load must therefore be redirected in a controlled manner during failover, i.e., the maximum capacity of each data center has to be taken into account. DNS based on IP anycast, which simply forwards traffic to the closest data center, can be problematic in this respect [5]. A capacity-aware redistribution sketch follows below.
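The following sketch redistributes the load of a failed data center according to the spare capacity of the surviving ones, instead of simply sending everything to the nearest site. The data structure and all numbers are illustrative assumptions.

```python
# Minimal sketch of capacity-aware failover: the failed site's traffic is
# split in proportion to the survivors' spare capacity. Purely illustrative.
def redistribute(failed_dc, datacenters):
    """datacenters: dict of name -> {"capacity": rps, "load": rps}."""
    moved = datacenters[failed_dc]["load"]
    survivors = {name: dc for name, dc in datacenters.items() if name != failed_dc}
    spare = {name: dc["capacity"] - dc["load"] for name, dc in survivors.items()}
    total_spare = sum(spare.values())
    if moved > total_spare:
        # not enough headroom anywhere: shed load instead of overloading peers
        raise RuntimeError("insufficient capacity, load shedding required")
    for name, dc in survivors.items():
        dc["load"] += moved * spare[name] / total_spare  # proportional share
    datacenters[failed_dc]["load"] = 0
    return datacenters

# Example: "eu" fails; its 300 rps are split between "us" and "ap" by headroom.
capacities = {
    "eu": {"capacity": 1000, "load": 300},
    "us": {"capacity": 1000, "load": 600},   # 400 rps spare -> takes 200 rps
    "ap": {"capacity": 1000, "load": 800},   # 200 rps spare -> takes 100 rps
}
redistribute("eu", capacities)
```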

Antipattern 5: Work prompted by a failure

Failures often cause additional work in the system. In particular, a failure in a system with only a few nodes can create a lot of additional work (e.g., replication) for the remaining nodes, which can lead to a harmful feedback loop. A common mitigation strategy is therefore to delay or limit the amount of replication [5].

Antipattern 6: Long startup times

In general, processes are often slower at the beginning, for instance because of initialization work and runtime optimizations [10]. After a failover, services and systems therefore often collapse under the heavy load. To prevent this, prefer systems with a fast startup time [5]. Also, caches are often empty at startup, which makes queries more expensive because they have to go to the origin. The risk of crashing is therefore higher than in stable operation, so make sure to keep caches available [4].

In addition to these six antipatterns, other system components and parameters should be checked. For example, look at your deadlines for requests or RPC calls. Setting good deadlines is generally hard, but one common problem in the context of cascading failures is that servers keep working on requests whose clients have already given up because their deadline expired, which means that many resources are wasted [4]. This was also a factor in the AWS DynamoDB example from the beginning. In general, a server should check whether there is still time left before the deadline to avoid working for nothing. A common strategy is so-called deadline propagation: there is an absolute deadline at the top of the request tree, and servers further down only get the time that is left after the previous server has done its work. Example: server A has a deadline of 20 seconds and needs 5 seconds for its calculation, so server B gets a deadline of 15 seconds, and so on [4]. A small sketch follows below.
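The sketch below follows the 20-second example above: the caller fixes an absolute deadline, does its own work, and only passes the remaining time down. Function names and the RPC stub are illustrative assumptions.

```python
# Minimal sketch of deadline propagation: only the remaining time budget is
# handed to the next hop. Function names and timings are assumptions.
import time

def do_local_work(request):
    ...  # placeholder for server A's own computation (~5 s in the example)

def rpc_to_server_b(request, timeout):
    ...  # placeholder: `timeout` becomes server B's deadline (~15 s left)

def handle_upstream(request, timeout_s=20.0):
    deadline = time.monotonic() + timeout_s   # absolute deadline at the top
    do_local_work(request)
    remaining = deadline - time.monotonic()   # budget left for downstream calls
    if remaining <= 0:
        # the client has already given up; doing more work would be wasted
        raise TimeoutError("deadline exceeded before calling downstream")
    rpc_to_server_b(request, timeout=remaining)
```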

Key Takeaways: Learn, Anticipate, and Build

Cascading failures are a dreaded and, at the same time, special phenomenon in distributed systems, because avoiding them sometimes requires counterintuitive measures: changes that are actually intended to reduce errors, such as seemingly intelligent load balancing, can increase the risk of a total failure. Sometimes it is simply better to show an error message to your customers instead of implementing sophisticated retry logic and risking a DDoS against your own system. Compromises often have to be made here. Testing, capacity planning, and applying certain patterns in system design can help to improve the resilience of your system.

In summary, the insights gained from the analyses and postmortems of large technology companies serve as a valuable guide for implementing preventive measures against cascading failures. However, it is important to recognize that such analyses offer only a partial understanding of the complicated dynamics of cascading failures. The real challenge is to understand and anticipate the impact of changes at all possible levels.

May this article serve as a useful resource to help and empower you to make informed decisions as you navigate the complex world of cascading failures. Embrace the lessons of the past to build a more resilient and robust future.

List of Sources

[1] http://static.googleusercontent.com/media/www.google.com/en/us/appsstatus/dashboard/ir/plibxfjh8whr44h.pdf

[2] SentinelOne. (2015, September 27). Irreversible Failures: Lessons from the DynamoDB Outage. https://www.sentinelone.com/blog/irreversible-failures-lessons-from-the-dynamodb-outage-2/

[3] Beckett, L. (2021, October 5). Facebook platforms back online — as it happened. The Guardian. https://www.theguardian.com/technology/live/2021/oct/04/facebook-down-instagram-whatsapp-not-working-latest-news-error-servers

[4] Murphy, N. R., Beyer, B., Jones, C., & Petoff, J. (2016). Site Reliability Engineering: How Google Runs Production Systems (1st ed.). O’Reilly Media.

[5] Nolan, L. (2021, July 11). Managing the Risk of Cascading Failure. InfoQ. https://www.infoq.com/presentations/cascading-failure-risk/

[6] Amazon DynamoDB — Frequently Asked Questions | NoSQL Key-Value Database | Amazon Web Services. (2021). Amazon Web Services, Inc. https://aws.amazon.com/de/dynamodb/faqs/

[7] Patra, C. (2019, April 19). The DynamoDB-Caused AWS Outage: What We Have Learned. Cloud Academy. https://cloudacademy.com/blog/aws-outage-dynamodb/

[8] Nolan, L. (2020, February 20). How to Avoid Cascading Failures in Distributed Systems. InfoQ. https://www.infoq.com/articles/anatomy-cascading-failure/

[9] Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region. (2015). Amazon Web Services, Inc. https://aws.amazon.com/de/message/5467D2/

[10] The Anatomy of a Cascading Failure. (2019, August 5). YouTube. https://www.youtube.com/watch?v=K3tgWsMxaAU

[11] Osman, P. (2018). Microservices Development Cookbook: Design and build independently deployable, modular services. Packt Publishing.

[12] Arya, S. (2020, January 23). Hystrix: How To Handle Cascading Failures In Microservices. Cars24 Blog. https://www.cars24.com/blog/hystrix-how-to-handle-cascading-failures-in-microservices/

[13] Architecture. (2020). Istio. https://istio.io/latest/docs/ops/deployment/architecture/
