Why is Chaos Engineering Critical?

By Ram Iyer
Published in STUFF.TECHNOLOGY, Jul 16, 2021

Chaos Engineering: when you hear the term, doesn't it sound a bit interesting? The practice has gained importance at large-scale internet companies.

Why do we need Chaos Engineering?

Are we confident that the applications and services we own will never have an outage? No, right? So what happens when an outage or performance degradation that we were not prepared for hits our users? We lose our customers' confidence. Chaos Engineering helps us identify potential failures by testing how our application or system responds to stress, so that we can fix weaknesses before a real incident occurs. Think of a fire drill: we run one to find out how quickly the evacuation was completed, whether the alarm rang, whether the fire services arrived in time, and so on.

History of Chaos Engineering

2010

The Engineering team at Netflix created Chaos Monkey while moving from physical infrastructure to the cloud, to make sure that losing an Amazon instance would not impact Netflix streaming.

2012

Netflix shared the Chaos Monkey source code on GitHub.

2014

Netflix created the role of Chaos Engineer.

2016

Kolton Andrus and Matthew Fornaciari founded Gremlin, the world’s first managed enterprise Chaos Engineering solution.

2020

AWS added Chaos Engineering to the reliability pillar of its Well-Architected Framework.

2021

Gremlin published the State of Chaos Engineering report, highlighting the key benefits of the practice.

Chaos Engineering Principles

  1. Plan an experiment: Create a hypothesis about how the system should behave under failure
  2. Contain the blast radius: Start with small experiments
  3. Measure the impact: If you find an issue, stop the experiment; otherwise, increase the scope of the experiment
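
The three principles above can be sketched as a small loop. This is only an illustrative outline, not a real chaos tool: the names `run_experiment`, `inject_and_measure`, and `simulated_system`, as well as the error-rate threshold, are assumptions made up for this example. The hypothesis is "the error rate stays below a limit", the blast radius grows step by step, and the experiment stops as soon as the hypothesis fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    blast_radius: int      # how many targets were affected in this round
    error_rate: float      # impact measured during this round
    hypothesis_held: bool  # True if the impact stayed within the limit


def run_experiment(
    inject_and_measure: Callable[[int], float],
    max_error_rate: float,
    blast_radii: List[int],
) -> List[ExperimentResult]:
    """Run rounds with a growing blast radius; stop at the first failed round."""
    results: List[ExperimentResult] = []
    for radius in blast_radii:
        # Inject the failure on `radius` targets and measure the impact.
        error_rate = inject_and_measure(radius)
        held = error_rate <= max_error_rate
        results.append(ExperimentResult(radius, error_rate, held))
        if not held:
            # Issue found: stop here instead of widening the scope.
            break
    return results


def simulated_system(radius: int) -> float:
    """Hypothetical stand-in for a real system: it copes with up to 5 lost hosts."""
    return 0.001 if radius <= 5 else 0.2


if __name__ == "__main__":
    # Hypothesis: the error rate stays at or below 1% while hosts are lost.
    for result in run_experiment(simulated_system, 0.01, [1, 5, 25, 100]):
        print(result)
```

In a real setup, `inject_and_measure` would call whatever tooling terminates instances or degrades the network and then read the impact from monitoring. Starting with the smallest radius keeps the damage small if the hypothesis turns out to be wrong.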

Chaos Engineer Role

Let’s understand the 8 fallacies of distributed systems, the false assumptions that engineers often make:

  1. The network is reliable and has no issues.
  2. There is no latency.
  3. Bandwidth is infinite.
  4. The network is secure, with no loopholes.
  5. The topology does not change.
  6. There is only one administrator.
  7. There is no transport cost.
  8. The network is homogeneous (the same everywhere).

Many such fallacies drive the experiments in Chaos Engineering. A network outage can impact the application and its customers, and the application can run short of resources because of bad code or an infrastructure issue. A Chaos Engineer needs to test these scenarios and have a fix ready.
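
To make the first two fallacies concrete, here is a minimal sketch of fault injection at the code level. It wraps any function that stands in for a network call and randomly adds latency or raises an error. The names `with_network_chaos` and `InjectedNetworkError` and all parameters are hypothetical and exist only for this illustration; real chaos tools usually inject such faults at the infrastructure level instead.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedNetworkError(ConnectionError):
    """Raised by the wrapper to simulate a network failure."""


def with_network_chaos(
    call: Callable[..., T],
    failure_rate: float,
    max_delay_s: float,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap `call` so that it sees random latency and random failures."""
    if rng is None:
        rng = random.Random()

    def wrapped(*args, **kwargs):
        # Fallacy 2: "There is no latency" -> add a random delay.
        sleep(rng.uniform(0.0, max_delay_s))
        # Fallacy 1: "The network is reliable" -> fail some of the calls.
        if rng.random() < failure_rate:
            raise InjectedNetworkError("injected network failure")
        return call(*args, **kwargs)

    return wrapped


if __name__ == "__main__":
    def fetch_profile(user_id: int) -> dict:
        # Hypothetical stand-in for a remote call.
        return {"id": user_id}

    flaky_fetch = with_network_chaos(fetch_profile, failure_rate=0.3, max_delay_s=0.05)
    for attempt in range(5):
        try:
            print(flaky_fetch(42))
        except InjectedNetworkError as err:
            print(f"attempt {attempt}: {err}")
```

Running the calling code against such a wrapper shows whether it has timeouts, retries, or fallbacks, or whether it silently assumes that the network always works.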

How should I perform my experiment?

Ideally, experiments should be planned and run in the following order:

1. Known Knowns

2. Known Unknowns

3. Unknown Knowns

4. Unknown Unknowns

References

Gremlin is a pioneer in Chaos Engineering solutions. I would recommend going through their tutorials.

Gremlin also offers a certification in Chaos Engineering: CCEP (Certified Chaos Engineer Practitioner).
