The Resilient Architecture Collection

Adrian Hornsby
Nov 8 · 2 min read

A list of my resiliency-related blog posts.


Series on Resilient Architecture

Resilient systems embrace the idea that failures are normal, and that it’s entirely OK to run applications in what we call partially failing mode. While not suitable for life-critical applications, running in a partially failing mode is a viable option for most web applications. Of course, I’m not saying it doesn’t matter if your system fails. It does, and it might result in lost revenue. But it’s probably not life-critical.

Building resilient architectures has had its ups and downs, some 1 a.m. wake-up calls, some Christmases spent debugging, some “I’m done, I quit” … but most of all, it’s been an incredible learning experience and journey.

This blog post is a collection of tips and tricks that have served me well throughout this journey, and I hope they will serve you well too.


Part 1: Embracing failure at scale

In part 1 of this series, I focus on the infrastructure layer, redundancy, immutability, and the concept of infrastructure as code.


Part 2 — Avoiding Cascading Failures

In part 2, I focus on preventing cascading failures. Cascading failures happen when one part of a system experiences a local failure and takes down the entire system through interconnections and failure propagation.


Part 3 — Preventing Service Failures with Health Check

In part 3, I discuss the importance and the challenge of health checks — striking a balance between failure detection and reaction.
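To make that balance concrete, here is a minimal sketch of one common approach: only change a target's state after several consecutive probe results disagree with it. This is an illustration under my own assumptions, not code from the post; the class name `HealthTracker` and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one target's health from a stream of probe results (hypothetical helper)."""

    unhealthy_threshold: int = 3  # consecutive failures before marking unhealthy
    healthy_threshold: int = 2    # consecutive successes before marking healthy again
    healthy: bool = True
    _streak: int = 0              # consecutive results that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # The probe agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            # Enough consecutive contradicting probes: flip the state.
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

Low thresholds detect failures faster but react to transient blips; higher thresholds tolerate blips but keep sending traffic to a failing target for longer.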


Part 4 — Caching for Resiliency

In part 4, I talk about caching. While caching is often associated with accelerating content delivery, it is also essential from a resiliency standpoint.
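As a rough illustration of caching for resiliency rather than speed, the sketch below serves the last known value when the backend call fails. It is a simplified example under my own assumptions, not code from the post; `StaleOnErrorCache`, `fetch`, and `ttl_seconds` are hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Read-through cache that falls back to stale entries when the backend fails."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # hypothetical backend call, may raise
        self._ttl = ttl_seconds    # how long an entry counts as fresh
        self._clock = clock        # injectable clock, which makes testing easier
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: no backend call needed
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # backend failed: serve the stale value
            raise  # nothing cached for this key, so the failure surfaces
        self._entries[key] = (now, value)
        return value
```

The trade-off is that callers may see outdated data during an outage, which is often acceptable for a web application running in partially failing mode.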

Written by Adrian Hornsby

Principal Evangelist, Architecture @awscloud ☁️ I break stuff .. mostly. Opinions here are my own.