Patterns for Resilient Architecture — Part 2
The art of avoiding cascading failures
Last week, I published the first part of my series on patterns for resilient architecture— with a focus on the infrastructure layer, embracing redundancy, immutability and the concept of infrastructure as code.
In this post, I will focus on cascading failures, or what I like to call The Punisher— that super hero living in the darkness of your architecture taking revenge on those responsible for not thinking enough about the small details. I have been a victim several times, and The Punisher is bad, real bad!
Avoiding cascading failures
One of the most common triggers for outages is cascading failure, where one part of a system experiences a local failure and takes down the entire system through inter-connections and failure propagation. Often, that propagation follows the butterfly effect, where a seemingly small failure ripples out to produce a much larger one.
A classic example of a cascading failure is overload. This occurs when traffic load distributed between two clusters brutally changes due to one of the clusters…