Chaos Engineering and the Story of the Japanese Fisherman

An easy introduction to Chaos Engineering

Suharjono
Mandiri Engineering
May 23, 2020

--

Chaos Engineering is a DevOps practice that Netflix introduced in 2010 to build resilient and robust systems. I will explain it with a simple story that many people may already know. The story is about a Japanese fisherman who had a problem with the freshness of his catch. Even though he kept the fish alive in a tank after catching them, they were no longer fresh by the time he returned to shore. Long story short, the fisherman solved this by putting a small shark in the tank to keep the fish alert and moving. And it worked: the fish stayed fresh until he got back to shore.

Illustration: Photo by Fredrik Öhlander on Unsplash

Chaos Engineering follows more or less the same idea as the story of the Japanese fisherman. Chaos is needed to motivate engineers to design and build resilient, robust systems with advanced error handling for infrastructure failures, network failures, and application failures. Without considering chaos, engineers tend to focus only on how the system runs under ideal conditions. They realize that chaos is possible only when an incident actually happens, and by then it is already too late.

Implementation: Chaos Monkey

Netflix created an application called Chaos Monkey to be the “shark” that keeps the “fish” fresh. Basically, Chaos Monkey is a service written in Go that is installed in one of the containers and randomly terminates other containers or virtual machine instances to create chaos. Netflix has shared the code in its public repository (https://github.com/Netflix/chaosmonkey). With PaaS (Platform as a Service) and IaaS (Infrastructure as a Service), a VM, container, or service can be configured to recover automatically after it is terminated. But the service itself should also implement logic that preserves data integrity and minimizes downtime for customers. Engineers should take this into account when designing and coding their services.
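To make the idea concrete, here is a minimal sketch in Python of "terminate one random instance, then let the platform bring it back". It is not how Chaos Monkey itself is implemented. The Instance class and both functions are hypothetical names invented for illustration, and flipping a flag stands in for a real terminate or restart call.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical container or VM; only its name and state matter here."""
    name: str
    running: bool = True


def terminate_random_instance(instances, rng):
    """Stop one randomly chosen running instance and return it.

    Returns None when nothing is running.
    """
    candidates = [i for i in instances if i.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False  # stands in for a real "terminate" API call
    return victim


def auto_recover(instances):
    """Stand-in for the PaaS/IaaS layer restarting terminated instances.

    Returns the names of the instances that were restarted.
    """
    recovered = [i.name for i in instances if not i.running]
    for i in instances:
        i.running = True
    return recovered
```

Passing the random generator in as a parameter keeps the sketch repeatable: a seeded random.Random makes a test run pick the same victim every time.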

Chaos Monkey illustration

5 Steps of Implementing Chaos Engineering

1. Define the steady state

In this step we define the steady state of the system. It is described by output metrics of the system such as throughput and error rate. With that definition we can tell the steady state apart from a chaos state.
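As a rough sketch, a steady state can be written down as a pair of thresholds on the metrics mentioned above. The SteadyState class and the numbers (100 requests per second, 1% errors) are assumptions chosen for illustration; real values come from observing your own system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds that describe normal behaviour (illustrative values only)."""
    min_throughput_rps: float
    max_error_rate: float

    def holds(self, throughput_rps, error_rate):
        """Return True when both metrics are inside the steady-state range."""
        return (throughput_rps >= self.min_throughput_rps
                and error_rate <= self.max_error_rate)


# Hypothetical thresholds: at least 100 requests/s and at most 1% errors.
steady = SteadyState(min_throughput_rps=100.0, max_error_rate=0.01)
```

Anything measured outside these thresholds counts as the chaos state.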

2. List all possible chaos events in the real world

This step identifies every event that could cause chaos, such as a server or a service going down. By identifying them, we understand these chaos scenarios and can prepare for them.
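One simple way to keep such a list is a small catalog grouped by the kind of failure. The sketch below is only an illustration: the ChaosEvent class, the event names, and the targets are all made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosEvent:
    """One failure scenario we want to be prepared for."""
    name: str
    category: str  # "infrastructure", "network" or "application"
    target: str    # hypothetical name of the affected component


# Example catalog; every entry here is invented for illustration.
EVENTS = [
    ChaosEvent("server down", "infrastructure", "vm-1"),
    ChaosEvent("service down", "application", "payment-service"),
    ChaosEvent("network partition", "network", "db-link"),
]


def by_category(events):
    """Group event names by failure category."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.category].append(event.name)
    return dict(grouped)
```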

3. Run the experiment

This step simulates the events listed in step 2. Here we can start using Chaos Monkey to run the experiment. Check whether the system recovers immediately or chaos really happens. If chaos really happens, enhance the affected service so that it recovers automatically.
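The check described here can be sketched as a small loop: confirm the system is steady, inject the failure, then poll until the steady state returns or a timeout expires. The function name, the result strings, and the default timeout are assumptions for illustration. In a real setup, inject would call a tool such as Chaos Monkey and is_steady would read real metrics.

```python
import time


def run_experiment(inject, is_steady, timeout_s=5.0, poll_s=0.1,
                   sleep=time.sleep, clock=time.monotonic):
    """Inject one failure and report whether the system recovers in time.

    inject    -- callable that causes the failure
    is_steady -- callable returning True while the system is in steady state
    sleep and clock are parameters so that tests can replace them.
    """
    if not is_steady():
        # Never add chaos to a system that is already unhealthy.
        return "skipped: system not steady before injection"
    inject()
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_steady():
            return "recovered"
        sleep(poll_s)
    return "chaos: no recovery within timeout"
```

A result of "chaos: no recovery within timeout" is the signal that the affected service needs better recovery handling.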

4. Automate the experiment so it runs continuously

Once the system runs steadily and recovers from the chaos experiment in step 3, this step makes the experiment run continuously. Running it continuously ensures that any change to a service still holds up under chaos.
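One possible way to automate this is to treat the experiments as a suite that runs on every change and blocks the release when any experiment fails to recover. This is a sketch under assumptions: the function names are invented, and each experiment is assumed to be a callable that returns the string "recovered" on success.

```python
def run_suite(experiments):
    """Run every experiment and return the names of those that did not recover.

    experiments -- dict mapping a name to a zero-argument callable that
                   returns "recovered" on success (an assumed convention).
    """
    failures = []
    for name, run in experiments.items():
        if run() != "recovered":
            failures.append(name)
    return failures


def release_allowed(experiments):
    """A change may ship only when every chaos experiment recovers."""
    return not run_suite(experiments)
```

A scheduler or a pipeline step could call release_allowed after each deployment to a test environment, so a change that breaks recovery is caught early.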

5. Minimize the impact of the experiment

Running chaos experiments continuously will inevitably affect the recovery time of the services. Keep in mind that the objective of the experiment is to check how ready the system is to handle chaos, not to create real chaos, so make sure there is enough buffer capacity to back up the affected service while it recovers.
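The buffer can be enforced with a simple guard that runs before every termination. The sketch below assumes we know how many healthy instances a service has and how many it needs to serve customers. Both function names and the numbers in the usage note are illustrative.

```python
def max_safe_terminations(healthy_instances, min_required):
    """How many instances can be terminated without dropping below the minimum."""
    return max(0, healthy_instances - min_required)


def safe_to_terminate(healthy_instances, min_required, kill_count=1):
    """Return True when terminating kill_count instances still leaves enough capacity."""
    return kill_count <= max_safe_terminations(healthy_instances, min_required)
```

For example, with 5 healthy instances and a minimum of 3, terminating one or two instances is allowed, but a third termination is refused until the service has recovered.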

Pros and Cons

Chaos Engineering exercises our system in the real production environment. Working in production requires a high level of discipline; otherwise it will create real chaos. Below are some drawbacks to prepare for before implementing it in production:

  1. Extra, redundant resource capacity is needed in each environment (Development, Testing, Production) as a backup buffer while an affected service is recovering.
  2. Chaos engineering has to be introduced gradually, starting in the development and testing environments.
  3. Engineers need a high level of discipline to implement advanced error handling, starting in the development and testing environments. This also adds development time.

References

  1. https://github.com/Netflix/chaosmonkey
  2. http://principlesofchaos.org
  3. https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
