Resiliency and Chaos Engineering — Part 3

Pradip VS
5 min read · Mar 16, 2022


In this part, we will cover the butterfly effect, or what is called cascading failures, and the techniques that help avoid it. Kindly go through Part 1 and Part 2 if you haven't already, where I set the context and covered resiliency at the infrastructure layer in detail.

The butterfly effect, popularly known as cascading failure. Source

Note: Even though a cascading failure is different from the butterfly effect, I use the terms interchangeably here because when a fault is injected into one part of a system it can manifest as a totally different, unexpected behaviour elsewhere and pull down the entire system.

Let's talk about this in detail.

A cascading failure will one day manifest as a disaster if small details are not thought through in the architecture.

A cascading failure often happens when one part of a system experiences a local fault that takes down the entire system through interconnections and failure propagation (the butterfly effect).

A classic example of a cascading failure is overload. Say there are 3 systems sharing roughly 33% of the workload each. If system #3 fails, the remaining two must absorb its 33%, pushing each of them to about 50% of the total load. If a new instance of system #3 is not spun up in time, systems #1 and #2 may also fail due to overload.

Overload causing the overall system to fail, a perfect example of the butterfly effect.

There are two common recommendations for this scenario. One is to have a failover strategy where a new instance takes over when one fails; the second is to scale out with more instances. For example, if there are 10 instances handling 10% of the workload each and 2 instances fail, the remaining 8 can absorb the extra 20% (each going from 10% to 12.5% of the total) without issues.

By scaling out with more instances, the overload is reduced, and with it the risk of cascading failures.

While this is one aspect, there are other techniques one should adopt to avoid cascading failures:

1. Back-off Algorithms — Exponential back-off algorithms gradually decrease the rate at which retries are performed, thus avoiding network congestion.

This principle is at the heart of the ecommerce application. Even though the Azure Cosmos DB SDK supports 9 retries, we will not issue a retry immediately. The same is followed at the application layer: one example worth mentioning is that when there are throttling issues (429s) in Cosmos DB, the application does not retry immediately but follows an exponential back-off algorithm.
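As a rough illustration, here is a minimal sketch of exponential back-off with jitter in Python. This is not the actual SDK or ecommerce code; the function and parameter names are made up for the example.

```python
import random
import time

def call_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=10.0):
    """Retry an operation with exponential back-off and jitter.

    `operation` is any callable that raises an exception on a retryable
    failure (e.g., a throttled 429 response).
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            # Exponential delay: 0.1s, 0.2s, 0.4s, ... capped at max_delay,
            # with random jitter so many clients don't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))
```

The jitter is the important detail: without it, all clients that were throttled at the same moment retry at the same moment, recreating the congestion the back-off was meant to avoid.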

2. Timeouts — time out a call to a service instead of holding on to it indefinitely (e.g., several insert operations on a database).

This matters when the application holds on to connections: once the number of concurrent connections exceeds the database's threshold, the database fails to serve requests from other services. To avoid this, it is better to time such calls out. The timeout has to be handled gracefully, with the necessary alerting mechanisms in place.
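A minimal sketch of the idea using Python's asyncio (illustrative only; the coroutine and parameter names are hypothetical):

```python
import asyncio

async def insert_with_timeout(insert_coro, timeout_seconds=2.0):
    """Await a database insert, but give up if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(insert_coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The pending insert is cancelled and the connection released instead
        # of being held indefinitely; log/alert here, then re-raise or fall back.
        raise
```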

3. Idempotent operations — operations that can be repeated over and over again without side effects or failure of the application (use unique traceable identifiers and/or a cache). Caching will be discussed in detail in Part 4.
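A small sketch of the idea; purely illustrative, and in a real system the deduplication store would be a shared cache such as Redis rather than an in-memory dict:

```python
processed_requests = {}  # in production: a shared, expiring cache (e.g., Redis)

def handle_payment(request_id, amount, charge_fn):
    """Process a payment at most once per request_id.

    If the client retries (after a timeout or back-off), the retry becomes a
    no-op and the original result is returned instead of charging twice.
    """
    if request_id in processed_requests:
        return processed_requests[request_id]
    result = charge_fn(amount)
    processed_requests[request_id] = result
    return result
```

Idempotency is what makes the retry and timeout patterns above safe: a retried request must not duplicate an order or a charge.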

4. Service degradation & fallbacks — prefer degrading over failing. Either offer a variant of the service or drop unimportant traffic.

Offering a variant service can be achieved using a cache (discussed in Part 4), where the site is served with slightly stale data momentarily.
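A minimal sketch of such a cache fallback (the function names are hypothetical, not from the actual platform):

```python
def get_product_details(product_id, fetch_live, cache):
    """Prefer live data, but fall back to (possibly stale) cached data."""
    try:
        details = fetch_live(product_id)
        cache[product_id] = details       # refresh the cache on success
        return details
    except Exception:
        if product_id in cache:
            return cache[product_id]      # serve stale data momentarily
        raise                             # nothing to degrade to
```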

5. Rejection — the final act of self-defense. Start by dropping unimportant traffic, but if the situation doesn't improve, drop the important traffic too.
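One simple way to express this as a sketch; the request types and load thresholds are made up for illustration:

```python
CRITICAL = {"checkout", "payment"}

def admit(request_type, current_load, soft_limit=0.8, hard_limit=0.95):
    """Decide whether to accept a request based on current load (0.0 to 1.0).

    Above the soft limit, unimportant traffic is rejected; above the hard
    limit, even important traffic is rejected as the final act of self-defense.
    """
    if current_load >= hard_limit:
        return False
    if current_load >= soft_limit:
        return request_type in CRITICAL
    return True
```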

6. Intermittent & transient errors — handle by collecting statistics & defining a threshold.

One of the key things to note here is the use of thresholds. We shouldn't react to every error; it is better to collect statistics about intermittent errors, baseline them, and then define a threshold that triggers a reaction. This takes practice to get right.

An important aspect of transient exceptions is that they are often hidden inside outer exceptions of a different type. Therefore, make sure you inspect all inner exceptions of any given exception object recursively before deciding on an action.
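A small sketch of that recursive inspection in Python, where the "inner exception" lives in `__cause__`/`__context__` (in .NET it would be Exception.InnerException). The transient types listed are just examples:

```python
def find_transient(exc, transient_types=(TimeoutError, ConnectionError)):
    """Walk the exception chain looking for a transient root cause."""
    while exc is not None:
        if isinstance(exc, transient_types):
            return exc                          # transient cause found
        # Move to the wrapped (inner) exception, if any.
        exc = exc.__cause__ or exc.__context__
    return None                                 # no transient cause in the chain
```

If `find_transient` returns something, the error counts toward the intermittent-error statistics and may be retried; otherwise it is treated as a genuine failure.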

7. Circuit Breaking — apply circuit breakers to potentially failing method calls. The circuit breaker monitors consecutive failures between the producer and the consumer. If the number of failures hits the threshold, the circuit breaker fails the request immediately and does not pass it on to the consumer for some time. Later it lets a trial request through; if it succeeds and the consumer responds positively, the breaker allows all requests to pass again, otherwise it keeps failing them until the consumer is healthy again.

Circuit breaker to avoid cascading issues. Source: Adrian Hornsby Blog
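A bare-bones sketch to make the states concrete (illustrative only; in practice you would reach for an existing library such as Polly or resilience4j rather than hand-rolling this):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, then allow a trial call
    after a cool-down period (half-open)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, let one trial request through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None                   # close the circuit
            return result
```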

Other practices followed by the ecommerce giant that are worth mentioning here:

  • No queuing of requests when a downstream system slows down or restarts. Here we follow the timeout or fallback/rejection patterns rather than queuing the requests.
  • Issues need to be self-contained. This is the core objective of avoiding cascading failures, and it is achieved by failing only the specific service. Since the ecommerce platform is fully microservice-based, if, say, the recommendations or personalization service is not working, it is better to remove that section from the home page than to let the entire page go down because of one service (a small sketch of this follows below). Hence, architecting the services and classifying them as critical and non-critical plays a key role in resiliency engineering.
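A sketch of that last point, with hypothetical page sections and fetch functions: a failing non-critical service removes only its own section instead of taking the whole page down.

```python
def render_home_page(fetch_catalog, fetch_recommendations):
    """Build the home page; a non-critical failure degrades, not destroys."""
    page = {"catalog": fetch_catalog()}            # critical: let failures surface
    try:
        page["recommendations"] = fetch_recommendations()
    except Exception:
        page["recommendations"] = None             # non-critical: drop the section
    return page
```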

In the next part, we will talk about caching as a resiliency pattern and how different caching techniques help achieve resiliency, beyond the common opinion that caching merely speeds up content delivery.

Till then, thank you and stay tuned…

Pradip

Cloud Solution Architect — Microsoft

(Views are personal and not of my employer)

