Patterns for Resilient Architecture — Part 2

The art of avoiding cascading failures

Adrian Hornsby
The Cloud Architect

--

Last week, I published the first part of my series on patterns for resilient architecture, with a focus on the infrastructure layer: embracing redundancy, immutability and the concept of infrastructure as code.

In this post, I will focus on cascading failures, or what I like to call The Punisher: that superhero living in the darkness of your architecture, taking revenge on those responsible for not thinking enough about the small details. I have been a victim several times, and The Punisher is bad, real bad!

Call me The Punisher — or cascading failure — either way, you are in trouble! (photo: Netflix)

Avoiding cascading failures

One of the most common triggers for outages is cascading failure, where one part of a system experiences a local failure and takes down the entire system through inter-connections and failure propagation. Often, that propagation follows the butterfly effect, where a seemingly small failure ripples out to produce a much larger one.
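To make the propagation idea concrete, here is a minimal sketch in Python. It is only an illustration: the service names and the dependency map are hypothetical, and it assumes the worst case where there are no timeouts, fallbacks or circuit breakers, so a caller goes down as soon as one of its dependencies does.

```python
from collections import deque

# Hypothetical dependency map (not from any real system):
# each service lists the services it calls.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
    "metrics": [],
}


def impacted_by(failed, depends_on):
    """Return every service that goes down when `failed` fails.

    Assumption: no protection at all, so a failure spreads to every
    direct and transitive caller of the failed service.
    """
    # Invert the map: for each service, which services call it?
    callers = {}
    for service, deps in depends_on.items():
        callers.setdefault(service, [])
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first walk from the failed service up through its callers.
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in down:
                down.add(caller)
                queue.append(caller)
    return down


if __name__ == "__main__":
    for service in ("metrics", "auth", "db"):
        print(service, "->", sorted(impacted_by(service, DEPENDS_ON)))
```

In this toy map, losing `metrics` hurts nothing else, while losing `db` takes every service that depends on it, directly or not, down with it. That gap between the size of the trigger and the size of the outage is what makes cascading failures so painful.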

A classic example of a cascading failure is overload. This occurs when the traffic load distributed between two clusters changes abruptly due to one of the clusters…


Principal System Dev Engineer @ AWS ☁️ I break stuff .. mostly. Opinions here are my own.