Chaos Engineering

From Zero to Hero

Pablo Del Giudice
Globant
Aug 4, 2020

--

Principles of Chaos Engineering

Sometimes it’s better to start defining something by saying what it is not:

Well, Chaos Engineering is not about a cat biting the power cables in your datacenter.

Instead, it’s more like this:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

You can find a deeper explanation at: https://principlesofchaos.org/?lang=ENcontent

Chaos in practice

The basic approach to start practicing Chaos is:

To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses.

These experiments follow four steps:

  1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
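The four steps above can be sketched as a tiny simulation. This is only an illustration under stated assumptions, not a real chaos tool: the steady state is assumed to be the request success rate, the "system" is a random number generator, and the names `simulate_requests`, `run_experiment`, `injected_failure` and `tolerance` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control_success_rate: float
    experimental_success_rate: float
    hypothesis_holds: bool


def simulate_requests(n_requests, failure_probability, rng):
    """Return the fraction of successful requests for one (simulated) group."""
    successes = sum(
        1 for _ in range(n_requests) if rng.random() >= failure_probability
    )
    return successes / n_requests


def run_experiment(baseline_failure=0.01, injected_failure=0.0,
                   n_requests=10_000, tolerance=0.02, seed=42):
    rng = random.Random(seed)

    # Step 1: steady state is assumed to be the request success rate.
    # Step 2: hypothesis: both groups stay within `tolerance` of each other.
    control = simulate_requests(n_requests, baseline_failure, rng)

    # Step 3: introduce a variable (extra failures) in the experimental group only.
    experimental_failure = 1 - (1 - baseline_failure) * (1 - injected_failure)
    experimental = simulate_requests(n_requests, experimental_failure, rng)

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    holds = abs(control - experimental) <= tolerance
    return ExperimentResult(control, experimental, holds)


resilient = run_experiment(injected_failure=0.0)  # the fault is fully absorbed
fragile = run_experiment(injected_failure=0.3)    # 30% of requests are affected
print(resilient.hypothesis_holds, fragile.hypothesis_holds)
```

In a real experiment the two calls to `simulate_requests` would be replaced by metrics collected from actual control and experimental traffic, and the tolerance would come from what the business considers normal behavior.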

Why should we use it?

To invest or not to invest in Chaos Engineering?

As engineers, we often architect something in our heads that seems almost “unbeatable”, but sometimes nature, physics, or some other natural force puts our architecture to the test:

Again, this is not about a cat but…

This is a well-known picture of a white shark biting one of Google’s undersea fiber-optic cables. Think for a minute with me…

“What if that hits an availability zone where only one instance is running our production backend application?” Well… that is why you have to start testing your architecture, doing some pre-mortem analysis, or maybe doing some Chaos Engineering.

And always remember the following, from one of our greatest voices:

Decision-making framework

If you have decided to start experimenting, the next big question is “When do we have to use Chaos Engineering?” For that, I suggest a very powerful tool called the Cynefin framework. It guides you through the process of identifying which domain your system falls into: Simple, Complicated, Complex, or Chaotic.

Long story short: if you have a large distributed system that falls in the “Complicated-Complex” zone, you need to start deeply understanding what is under the hood, and Chaos Engineering might be a very good tool for doing so.

I strongly suggest watching Prof. Dave Snowden’s video about how this framework works and how to use it: https://www.youtube.com/watch?v=N7oz366X0-8
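The rule of thumb above can be written down as a very small decision helper. This is a sketch of the heuristic as stated in this post, not part of the Cynefin framework itself; the names `CynefinDomain` and `chaos_engineering_recommended` are made up for illustration, and classifying a system into a domain is still a human judgment call.

```python
from enum import Enum


class CynefinDomain(Enum):
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"


def chaos_engineering_recommended(domain: CynefinDomain,
                                  is_large_distributed_system: bool) -> bool:
    """Heuristic from the text: large distributed systems that sit in the
    Complicated or Complex domains are good candidates for Chaos Engineering."""
    return is_large_distributed_system and domain in {
        CynefinDomain.COMPLICATED,
        CynefinDomain.COMPLEX,
    }


print(chaos_engineering_recommended(CynefinDomain.COMPLEX, True))
print(chaos_engineering_recommended(CynefinDomain.SIMPLE, True))
```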

In the World

Here is what is happening in the world with Chaos Engineering:

Toolset

Last but not least, here are some tools to start running experiments:

  1. Gremlin, a Chaos-as-a-Service platform
  2. The very well-known Chaos Monkey, from the Simian Army (be aware that you have to use Spinnaker)
  3. Chaos Toolkit, a very nice toolset to start writing chaos experiments
  4. From Artillery.io, Chaos Lambda, a very nice Lambda that switches our EC2 instances on and off
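To give a feel for what instance-killing tools do, here is a minimal sketch of the general idea: for each group of instances, roll a die and, if it comes up, pick one random instance as the victim. This is not the actual implementation of Chaos Monkey or Chaos Lambda. The function `pick_victims`, the group names, and the instance IDs are hypothetical, and nothing is terminated; the function only returns the selection.

```python
import random
from typing import Dict, List, Optional


def pick_victims(groups: Dict[str, List[str]], probability: float,
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """For each non-empty group, choose one random instance with the given
    probability. Returns a mapping of group name to the chosen instance ID."""
    rng = rng or random.Random()
    victims = {}
    for group_name, instance_ids in groups.items():
        if not instance_ids:
            continue  # nothing to pick from
        if rng.random() < probability:
            victims[group_name] = rng.choice(instance_ids)
    return victims


# Hypothetical inventory: group name -> instance IDs.
inventory = {
    "backend": ["i-aaa", "i-bbb", "i-ccc"],
    "frontend": ["i-ddd"],
    "batch": [],
}

print(pick_victims(inventory, probability=0.5, rng=random.Random(7)))
```

A real tool would take the inventory from the cloud provider's API and then actually stop or terminate the selected instances, usually only during working hours so that someone is around to react.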

Some books I strongly recommend:

Have fun running and learning from your Chaos experiments!

Pablo!
