Business Continuity Planning & Disaster Recovery Are Too Old: Make Way for Chaos Engineering

Photo by Tim Scharner on Unsplash

As a technologist responsible for managing large-scale systems, you always need to think about the continuity, security and manageability of the system. Recently I have been spending a lot of time trying to understand how to put together a proper Disaster Recovery (DR) and Business Continuity Plan (BCP) for a large distributed system. My quest to learn more about the topic ended up introducing me to a new stream of engineering that every system should adopt: Chaos Engineering!

I ended up reading a free book from O’Reilly called Chaos Engineering: Building Confidence in System Behavior through Experiments. I had been aware of Netflix’s Chaos Monkey for many years, but I always chose to ignore it, telling myself I was not building another Netflix, until I read this book.

The book starts with a very interesting definition of Chaos Engineering:

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
 — Principles of Chaos

Another very simple yet meaningful definition, from Gremlin, is:

Breaking things on purpose in order to build more resilient systems!

I never thought chaos could be used to bring discipline!

In a distributed system, you will never be able to prevent all failures, but you can always be prepared for them. As you experiment on your infrastructure, you get to know the weaknesses of your systems and can plan better to avoid them.

By now you must be wondering: how is Chaos Engineering different from testing? Let’s try to understand this better.

Testing Vs Chaos Engineering

Most of the time, a good test plan covers load testing, security testing and functional testing under load. Unfortunately, we run these tests only on non-production environments and hope that the system behaves the same way in production. This is where Chaos Engineering tries to prepare us, by running experiments as close to production as possible, or sometimes even in production itself.

Another difference between testing and Chaos Engineering is that the outcome of a chaos experiment is new knowledge about the system, which even its developers or testers might not have been aware of.

Chaos Experiments

Now let’s try to understand what chaos experiments look like.

  1. If you are running Hadoop environments, taking down a few nodes of the Hadoop cluster, or even taking down High Availability components.
  2. On Kafka clusters, deleting messages from a particular partition or topic.
  3. Making system clocks go out of sync.
  4. Eating up CPU/memory on Elasticsearch clusters, NoSQL clusters, web servers, etc.
  5. Injecting chaos in functions.

These are just some examples; a proper study of your own system will help you identify the right experiments for it.
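To make "injecting chaos in functions" a little more concrete, here is a minimal Python sketch. It is not taken from the book or from any chaos tool: the `chaos` decorator, the `InjectedFault` exception and the `fetch_profile` function are all hypothetical names invented for illustration. The idea is simply that a wrapped function sometimes fails or slows down on purpose, so you can see whether its callers cope.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised in place of a real failure when the experiment triggers."""


def chaos(failure_rate=0.0, max_delay_s=0.0, rng=None):
    """Wrap a function so that some calls are delayed or fail on purpose."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if max_delay_s > 0:
                # Simulated latency before the real call.
                time.sleep(rng.uniform(0, max_delay_s))
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@chaos(failure_rate=0.3, rng=random.Random(42))
def fetch_profile(user_id):
    # Hypothetical stand-in for a call to a downstream service.
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(user_id):
    # The caller is what the experiment really tests:
    # does it degrade gracefully when the dependency fails?
    try:
        return fetch_profile(user_id)
    except InjectedFault:
        return {"id": user_id, "name": None}
```

Calling `fetch_profile_with_fallback` in a loop should return a mix of full and degraded profiles, and never an unhandled exception. In a real system you would probably switch the injection on through configuration rather than hard-coding it in a decorator.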

Who practices Chaos Engineering?

Communities focused on this engineering discipline have been set up. They show that practitioners of Chaos Engineering work at companies like Google, Amazon, Uber, Yahoo!, Dropbox, Visa, New Relic, etc., and the number is growing every day.

The practice is not limited to software-based companies; it is also growing in areas like Defence, Manufacturing, Farming, Medical, Financial, etc.

Should we practice Chaos Engineering?

Quoting from the book — Is your system resilient to real-world events such as service failures and network latency spikes?

If the answer is no, then you need to practice Chaos Engineering. You should also have strong monitoring systems in place to understand the state of your system.

As we are moving more and more towards micro-services based architecture, practising Chaos Engineering is becoming essential.

Advanced Principles of Chaos Engineering

The following are the advanced principles of Chaos Engineering:

1. Build a Hypothesis around Steady State Behaviour

Focus on the measurable output of a system, rather than internal attributes of the system. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
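As a rough illustration of what a steady-state hypothesis could look like in code, here is a small Python sketch. The `SteadyStateHypothesis` class, the metric name `orders_per_second` and all the numbers are made up for this example; in practice the samples would come from your monitoring system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SteadyStateHypothesis:
    metric: str
    baseline: float
    tolerance: float  # allowed relative deviation, e.g. 0.05 for 5%

    def holds(self, samples):
        """True if the observed average stays within tolerance of the baseline."""
        observed = mean(samples)
        return abs(observed - self.baseline) <= self.tolerance * self.baseline


# Hypothetical metric and made-up samples, for illustration only.
hypothesis = SteadyStateHypothesis("orders_per_second", baseline=120.0, tolerance=0.05)
during_experiment = [118.0, 121.5, 117.2, 122.3]
print(hypothesis.holds(during_experiment))
```

Note that the check only looks at an externally visible output (orders per second), not at how the system produces it, which is the point of this principle.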

2. Vary Real-world Events

Chaos variables reflect real-world events. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
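One simple way to act on this is to keep a list of candidate events and rank them. The sketch below is only one possible approach: it combines impact and frequency into a single score by multiplying them, which is my assumption rather than something the principle prescribes, and every estimate in it is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosEvent:
    name: str
    kind: str        # "hardware", "software", or "non-failure"
    impact: int      # 1 (minor) to 5 (severe), a team estimate
    per_year: float  # estimated occurrences per year


# Made-up estimates, for illustration only.
events = [
    ChaosEvent("server dies", "hardware", impact=3, per_year=12),
    ChaosEvent("malformed response", "software", impact=2, per_year=50),
    ChaosEvent("traffic spike", "non-failure", impact=4, per_year=6),
    ChaosEvent("scaling event", "non-failure", impact=1, per_year=80),
]


def prioritize(candidates):
    """Order events by a simple score: impact times yearly frequency."""
    return sorted(candidates, key=lambda e: e.impact * e.per_year, reverse=True)


for event in prioritize(events):
    print(f"{event.name:20} score={event.impact * event.per_year:g}")
```

The events at the top of the list are reasonable candidates for your first experiments.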

3. Run Experiments in Production

Systems behave differently depending on environment and traffic patterns. Chaos strongly prefers to experiment directly on production traffic.

4. Automate Experiments to Run Continuously

Chaos experiments should run continuously rather than as one-time or periodic checks.

5. Minimise Blast Radius

Experimenting in production has the potential to cause unnecessary customer pain, so the Chaos Engineer needs to make sure that the fallout from the experiments stays contained and manageable.
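Two common ways to keep the blast radius small are to touch only a small slice of the fleet and to stop as soon as customers start to feel it. Here is a minimal Python sketch of both ideas; `pick_targets`, `should_abort`, the host names and the thresholds are all hypothetical.

```python
import math
import random


def pick_targets(hosts, fraction, max_hosts, rng=None):
    """Choose a small random subset of hosts to experiment on."""
    rng = rng or random.Random()
    # Never exceed the hard cap, however large the fleet is.
    count = min(max_hosts, math.floor(len(hosts) * fraction))
    return rng.sample(hosts, count)


def should_abort(error_rate, abort_threshold):
    """Stop the experiment once customer-facing errors exceed the limit."""
    return error_rate > abort_threshold


hosts = [f"web-{i:02d}" for i in range(40)]  # hypothetical fleet
targets = pick_targets(hosts, fraction=0.05, max_hosts=3, rng=random.Random(7))
print(targets)
```

With 40 hosts and a 5% fraction this selects two hosts. The hard cap matters more on large fleets, where even a small percentage can be a lot of machines.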

So how do I design my Chaos Experiment?

As the book suggests, a chaos experiment can be designed by following these steps:

  1. Pick a hypothesis
  2. Choose the scope of the experiment
  3. Identify the metrics you’re going to watch
  4. Notify the organization
  5. Run the experiment
  6. Analyze the results
  7. Increase the scope
  8. Automate

The steps and details might vary from system to system, but the overall design should stay pretty close to the steps above.
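To show how these steps could fit together, here is a small Python sketch of an experiment runner. It is my own simplification, not code from the book: the `Experiment` class, the `run` function and the simulated "system" (a success rate that drops when cache nodes go down) are all hypothetical, and a real setup would read metrics from monitoring and inject faults through your infrastructure tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str                      # step 1
    scope: List[str]                     # step 2: what the fault touches
    read_metric: Callable[[], float]     # step 3: the metric to watch
    inject: Callable[[List[str]], None]  # step 5: start the fault
    rollback: Callable[[List[str]], None]
    min_healthy: float                   # steady-state threshold for the metric


def run(experiment, notify=print):
    notify(f"Starting: {experiment.hypothesis} on {experiment.scope}")  # step 4
    baseline = experiment.read_metric()
    experiment.inject(experiment.scope)
    try:
        observed = experiment.read_metric()
    finally:
        experiment.rollback(experiment.scope)  # always undo the fault
    passed = observed >= experiment.min_healthy  # step 6
    notify(f"baseline={baseline} observed={observed} passed={passed}")
    return passed


# Simulated system: the success rate drops a little for every node that is down.
state = {"down": set()}


def success_rate():
    return 99.9 - 0.5 * len(state["down"])


exp = Experiment(
    hypothesis="Checkout stays above 99% success with one cache node down",
    scope=["cache-1"],
    read_metric=success_rate,
    inject=lambda nodes: state["down"].update(nodes),
    rollback=lambda nodes: state["down"].difference_update(nodes),
    min_healthy=99.0,
)
result = run(exp)
```

Steps 7 and 8 would then mean re-running with a wider `scope` once the small experiment passes, and calling `run` from a scheduler instead of by hand.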

Be Careful

Do remember to first understand any Chaos Experiment and only then implement it. You are responsible for your own actions.

If you think this is something you would like to try and learn more about, check out the following material.


If you enjoyed the article, do let me know your feedback and experiences. Happy Learning!