Why is Chaos Engineering Critical?

Published in STUFF.TECHNOLOGY · 2 min read · Jul 16, 2021

Chaos Engineering: doesn’t the term sound interesting? The practice has gained importance at large-scale internet companies.

Why do we need Chaos Engineering?

Are we confident that the applications and services we own will never have an outage? No, right? So what happens when an outage or performance degradation that we were not prepared for affects our users? We lose our customers’ confidence. Chaos Engineering helps us identify potential failures by testing how our application or system responds to stress, so that we can fix weaknesses before they cause a real incident. Think of a fire drill, which we run to find out how quickly the evacuation was completed, whether the alarm rang, whether the fire services arrived in time, and so on.

History of Chaos Engineering

2010

The engineering team at Netflix created Chaos Monkey while moving from physical infrastructure to the cloud, to make sure that losing an Amazon instance would not affect Netflix streaming.

2012

Netflix shared the Chaos Monkey source code on GitHub.

2014

Netflix created the role of Chaos Engineer.

2016

Kolton Andrus and Matthew Fornaciari founded Gremlin, the world’s first managed enterprise Chaos Engineering solution.

2020

AWS added Chaos Engineering to the Reliability pillar of its Well-Architected Framework.

2021

Gremlin published its State of Chaos Engineering report, highlighting the key benefits of the practice.

Chaos Engineering Principles

Plan an experiment: create a hypothesis about how the system should behave.
Contain the blast radius: start with small experiments.
Measure the impact: if you find an issue, stop the experiment; otherwise, increase its scope.
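
The three principles can be read as a simple loop. The Python sketch below is only an illustration under stated assumptions: `run_experiment`, `ExperimentResult`, and `simulated_fault` are hypothetical names, the hypothesis is reduced to "the error rate stays at or below a threshold", and the fault is simulated rather than injected into a real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    blast_radius: float    # fraction of targets affected in this round
    error_rate: float      # error rate observed during this round
    hypothesis_held: bool  # True if the error rate stayed within the limit


def run_experiment(
    inject_fault: Callable[[float], float],
    max_error_rate: float,
    radii: List[float],
) -> List[ExperimentResult]:
    """Run rounds with a growing blast radius; stop once the hypothesis fails."""
    results: List[ExperimentResult] = []
    for radius in radii:
        observed = inject_fault(radius)          # measure the impact
        held = observed <= max_error_rate        # check the hypothesis
        results.append(ExperimentResult(radius, observed, held))
        if not held:
            break                                # issue found: stop the experiment
    return results


def simulated_fault(radius: float) -> float:
    # Hypothetical stand-in for a real system: the error rate is assumed
    # to grow linearly with the fraction of instances taken down.
    return radius * 0.1


if __name__ == "__main__":
    # Start small (5% of targets) and widen the scope only while the hypothesis holds.
    for result in run_experiment(simulated_fault, 0.02, [0.05, 0.1, 0.25, 0.5]):
        print(result)
```

With these made-up numbers the hypothesis holds for the 5% and 10% rounds and fails at 25%, so the 50% round never runs. In a real experiment, `inject_fault` would be replaced by whatever tooling and monitoring the team actually uses.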

Chaos Engineer Role

Let’s look at the 8 fallacies of distributed computing, the false assumptions people tend to make about distributed systems:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

Fallacies like these drive the experiments in Chaos Engineering. A network outage can affect the application and its customers, and the application can run short of resources because of poor code or an infrastructure issue. A Chaos Engineer needs to test these scenarios and have a fix ready.
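
As a rough illustration of testing two of these fallacies ("the network is reliable" and latency being zero), the Python sketch below wraps a call with injected delay and random failures, then checks whether a simple retry policy copes. Everything here is an assumption made for the example: `with_chaos`, `call_with_retries`, and `InjectedFault` are hypothetical names, and the "remote call" is a plain local function rather than a real service.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real network error."""


def with_chaos(
    call: Callable[[], T],
    failure_rate: float,
    added_latency_s: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` after injecting extra latency and, at random, a failure."""
    sleep(added_latency_s)               # challenges the zero-latency assumption
    if rng.random() < failure_rate:      # challenges "the network is reliable"
        raise InjectedFault("simulated network failure")
    return call()


def call_with_retries(call: Callable[[], T], attempts: int) -> T:
    """Retry `call` up to `attempts` times when an injected fault occurs."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[InjectedFault] = None
    for _ in range(attempts):
        try:
            return call()
        except InjectedFault as err:
            last_error = err
    assert last_error is not None
    raise last_error


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    flaky = lambda: with_chaos(lambda: "ok", 0.3, 0.01, rng)
    try:
        print(call_with_retries(flaky, attempts=3))
    except InjectedFault as err:
        print("retries exhausted:", err)
```

Passing `sleep` and `rng` as parameters is a design choice that keeps the experiment repeatable and lets a test replace the real delay with a recorder.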