Testing is sexy again! Welcome Chaos Engineering

When the focus shifts from “Does the implementation match the specification?” to “Does everything seem to be working properly?” and “Are users complaining?” — the testing paradigm changes.

How does this differ from the traditional testing paradigm? Testing initially set out to verify the functionality that a program provided. Slowly more areas were added, and functional testing was extended into performance testing. The idea was to establish and measure how the program responded under “planned load conditions”.

A couple of years back, when we first deployed our distributed application, the first question was: what if the nodes in the cluster crash, and how do we simulate such events? Interestingly, we were not the first ones to ask this. With the advent of cloud and distributed computing, more and more factors started influencing production deployments. For example, what happens if network latency increases dramatically, or what happens when one data center goes down? These questions led to the birth of Chaos Engineering and the concept of controlled failure injection.

Netflix, a pioneer in the field (and of course an awesome video streaming service too), defines Chaos Engineering as follows:

Chaos Engineering is the discipline of
experimenting on a distributed system in order to
build confidence in the system’s capability to
withstand turbulent conditions in production.

So the question is: how do we simulate chaos in a controlled manner? Chaos Engineering offers a handful of key principles to guide how you run experiments, and they will help you adopt it.

  • Establish steady state: a benchmarking process that identifies which KPI(s) represent normal behavior.
  • Build a hypothesis around steady-state behavior: how will the treatment affect the system’s steady state?
  • Create and vary real-world events: hardware crashes, network latency, failed requests between services, terminated virtual machines, a data center outage, anything! Keep in mind that 92% of catastrophic system failures are reported to stem from incorrect handling of non-fatal errors, so every kind of event is worth considering. Prioritize events by potential impact or estimated frequency.
  • Run experiments in production: as with anything this big, the best approach is crawl, walk, run. Start in a pre-production environment and carry out your experiments there. Once your practice matures, move to the production environment. Chaos Engineering strongly prefers experiments on production traffic.
  • Automate experiments to run continuously. Netflix has been working on tooling, starting with Chaos Monkey in 2012 and followed by tools that introduce network latency and so on. Today we have Chaos Monkey, Chaos Gorilla and Chaos Kong, which randomly take down an instance, an availability zone (AZ) and a region, respectively, in your AWS account. The Simian Army also has Latency Monkey, Security Monkey and 10–18 others.
  • Ensure that the blast radius is minimized and controlled. Especially when running experiments in production, keep stakeholders informed. No one wants an uncontrolled nuclear reaction!
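
To make the points above concrete, here is a minimal sketch of one experiment, written in Python against a purely in-memory "cluster". Everything in it is an assumption made for illustration: the names Instance, success_rate and run_experiment are hypothetical, the single retry stands in for a load balancer, and none of it is the API of Chaos Monkey or any other real tool. It measures a steady-state KPI, states a hypothesis as a tolerance, takes down a limited number of instances, and always rolls the failure back.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical cluster node; `healthy` is all we model here."""
    name: str
    healthy: bool = True


def success_rate(instances, rng, requests=1000):
    """Steady-state KPI: fraction of simulated requests that succeed.

    Each request goes to one random instance and is retried once on a
    different random instance (a stand-in for a load balancer retry).
    """
    ok = 0
    for _ in range(requests):
        first, fallback = rng.sample(instances, 2)
        if first.healthy or fallback.healthy:
            ok += 1
    return ok / requests


def run_experiment(instances, blast_radius=1, tolerance=0.01, seed=None):
    """Run one experiment and report whether the hypothesis held."""
    rng = random.Random(seed)

    # 1. Establish steady state before touching anything.
    baseline = success_rate(instances, rng)

    # 2. Hypothesis: the KPI stays within `tolerance` of the baseline.
    # 3. Inject a real-world event: terminate `blast_radius` random instances.
    victims = rng.sample(instances, blast_radius)
    for victim in victims:
        victim.healthy = False
    try:
        observed = success_rate(instances, rng)
    finally:
        # Always roll back, even if the measurement itself fails.
        for victim in victims:
            victim.healthy = True

    return {
        "baseline": baseline,
        "observed": observed,
        "terminated": [v.name for v in victims],
        "hypothesis_holds": observed >= baseline - tolerance,
    }


cluster = [Instance(f"node-{i}") for i in range(5)]
result = run_experiment(cluster, blast_radius=1, seed=42)
print(result)
```

With five nodes and one retry, losing a single node leaves every request with at least one healthy target, so the hypothesis should hold. Raising blast_radius to 2 lets some requests land on two dead nodes, which is exactly the kind of finding that should feed back into the architecture.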

What next?

The outcome of chaos experiments is expected to lead to architectural change! The approach helps define SLAs for the unexpected and provides a handle for testing the unexpected. The cloud will break; you need to be ready!

Conclusion

Over the years we have heard that testing is boring and sits low in the value chain. I have always contested that claim; what I believe is that it is repetitive manual testing that is useless. The discipline of Chaos Engineering is a superb example: this is a new area for seriously creative testers. On the other hand, with distributed architectures, resilience design is iterative, and you must iterate on it just as you iterate on software. That is a key opportunity for developers.

This area will see a lot of traction and interest. What do you think? Share your thoughts and experiences.

Further reading and references:

  1. http://principlesofchaos.org/
  2. K. Andrus, N. Gopalani, and B. Schmaus, “FIT: Failure Injection Testing,” Netflix Tech Blog, 23 Oct. 2014.
  3. J. Robbins et al., “Resilience Engineering: Learning to Embrace Failure,” ACM Queue, vol. 10, no. 9, 2012.
  4. https://www.infoq.com/articles/chaos-engineering

Originally published at crispanalytics.com on June 22, 2017.