An Architect’s Introduction to Chaos Engineering

David Mooter
Published in The Startup
7 min read · May 7, 2020

[Image: chaotic colors, like an explosion. Photo by ActionVance on Unsplash]

Your infrastructure will fail. It’s not a question of if but when. As the rise of microservices and serverless makes apps more distributed, the number of potential fault points multiplies. We may engineer our systems to handle certain failures only to make things worse: well-intended retry logic, for example, can pile extra load onto an already stressed server and cause failures to cascade across the enterprise.
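To put rough numbers on the retry example: when a request passes through several service layers and each layer retries failed calls on its own, the attempts multiply rather than add. The Python sketch below computes that worst case under the simplifying assumption that every attempt fails and every layer retries immediately. The function name and parameters are illustrative and not taken from any library.

```python
def worst_case_amplification(retries_per_layer: int, layers: int) -> int:
    """Requests reaching the deepest service for one user request.

    Assumption for illustration: every call fails, so each layer makes
    1 original attempt plus `retries_per_layer` retries for every call
    it receives from the layer above.
    """
    if retries_per_layer < 0 or layers < 0:
        raise ValueError("retries_per_layer and layers must be non-negative")
    attempts_per_layer = 1 + retries_per_layer
    return attempts_per_layer ** layers


# One layer with 3 retries sends at most 4 requests downstream.
print(worst_case_amplification(retries_per_layer=3, layers=1))  # 4
# Three stacked layers, each with 3 retries: 4 * 4 * 4.
print(worst_case_amplification(retries_per_layer=3, layers=3))  # 64
```

Under these assumptions, a single user request can turn into 64 requests against the deepest service in a three-layer call chain, which is how a local slowdown can grow into a wider cascade.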

In the old days you had a person or a team who understood your system so well that they could engineer against most failures, or immediately diagnose and fix the unexpected production failures that did slip through. That was possible with a monolithic app. With a microservices architecture, those days are coming to an end. We now need an approach to system resilience that assumes the system is too complex for humans to understand and that things will break in ways we cannot predict. Chaos engineering is a methodology that takes that approach.

What is Chaos Engineering?

Chaos engineering is a methodology that discovers your system’s faults by intentionally injecting problems into production systems in a controlled manner. The injected problems range widely, from added latency and simulated disk failure to a node outage or even the simulated outage of an entire region.
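As a rough illustration of what "injecting problems in a controlled manner" can look like at the smallest scale, here is a minimal Python sketch of a decorator that adds latency and random errors to a function call. Everything in it (`inject_faults`, `InjectedFault`, and the `enabled` switch) is a hypothetical name invented for this example, not the API of any chaos engineering tool. Real tools typically inject faults at the infrastructure level rather than in application code.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised when the experiment deliberately fails a call."""


def inject_faults(latency_s=0.0, error_rate=0.0, enabled=lambda: True,
                  rng=random.random, sleep=time.sleep):
    """Wrap a function so calls can be delayed or failed on purpose.

    latency_s:  seconds of delay added to every call while enabled.
    error_rate: probability (0.0 to 1.0) that a call raises InjectedFault.
    enabled:    callable acting as a stop switch; when it returns False,
                calls pass through untouched.
    rng, sleep: injectable for deterministic tests (an assumption of this
                sketch, defaulting to the standard library functions).
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if latency_s > 0:
                    sleep(latency_s)  # simulate a slow dependency
                if rng() < error_rate:
                    raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call: 200 ms of extra latency, 10% injected errors.
@inject_faults(latency_s=0.2, error_rate=0.1)
def fetch_inventory(item_id: str) -> dict:
    return {"item_id": item_id, "in_stock": True}
```

The `enabled` callable is the "controlled" part of this sketch: wiring it to an operator-controlled flag would let the experiment be halted the moment it causes more harm than expected, while `error_rate` limits how many calls are affected.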

Benefits of Chaos…

With over 20 years of experience in IT, David is an analyst at Forrester Research covering modern application architecture. Articles here are his own opinion.