Chaos Engineering-Part II

Riddhi Pandya
4 min read · Jun 25, 2024

Build the system by breaking the system

In the previous blog, I discussed what Chaos Engineering is and how we create an experiment. In this blog, I will discuss Chaos experiments in detail.

If you didn’t get a chance to read the previous blog, you can find it here: Chaos Engineering Part 1

As discussed previously, Chaos Engineering happens in 4 steps:

  1. Steady state
  2. Hypothesis
  3. Experiment
  4. Adapt

Now, let us take a simple use case and create an experiment around it. Remember, the hypotheses and details we discuss will differ from scenario to scenario and from requirement to requirement.

Example Use Case

Application under test: E-commerce application

Aim: To find out how the application behaves in a disaster-like situation (here, as an example, network loss)

Let’s begin

  1. Steady state
    The steady state for our sample use case is that all sections/modules of the e-commerce application are accessible and available to the user. For example, the Carts, Login, Apparels, Electronics and Shoes modules are all available, and the user can order a product and proceed through checkout.
  2. Hypothesis
    We hypothesize that even if there is a network loss, the system will remain available, although the response time might be higher than in the normal scenario.
  3. Experiment
    We run the experiment and observe that multiple modules are down and the user cannot access them.
  4. Adapt
    The hypothesis did not hold, which means the system is vulnerable to this failure and needs to be improved to make it more reliable. We work on the fix and then start again from step 3.
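The four steps above can be sketched as a small Python harness. This is only an illustration under my own assumptions: `FakeShop`, its `probe`, `inject` and `remove` methods, and the choice of which modules fail are all hypothetical stand-ins for a real application and a real chaos tool.

```python
from typing import Callable, Dict, List

MODULES: List[str] = ["Carts", "Login", "Apparels", "Electronics", "Shoes"]


def steady_state_holds(availability: Dict[str, bool]) -> bool:
    """Steady state: every module is available to the user."""
    return all(availability.get(module, False) for module in MODULES)


def run_experiment(
    probe: Callable[[], Dict[str, bool]],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
) -> Dict[str, object]:
    # Step 1: only inject chaos if the system starts in its steady state.
    if not steady_state_holds(probe()):
        raise RuntimeError("System is not in steady state; aborting experiment")

    # Step 3: inject the fault, observe, and always clean up afterwards.
    inject_fault()
    try:
        during_chaos = probe()
    finally:
        remove_fault()

    # Step 2 (checked here): the hypothesis is that every module stays available.
    unavailable = sorted(m for m in MODULES if not during_chaos.get(m, False))
    return {"hypothesis_holds": not unavailable, "unavailable": unavailable}


class FakeShop:
    """Hypothetical simulated application; not a real system or chaos tool."""

    FRAGILE = {"Carts", "Login"}  # assumed to fail under network loss

    def __init__(self) -> None:
        self.network_loss = False

    def probe(self) -> Dict[str, bool]:
        return {m: not (self.network_loss and m in self.FRAGILE) for m in MODULES}

    def inject(self) -> None:
        self.network_loss = True

    def remove(self) -> None:
        self.network_loss = False


shop = FakeShop()
result = run_experiment(shop.probe, shop.inject, shop.remove)
print(result)
```

When the hypothesis does not hold, the `unavailable` list is the input to step 4 (Adapt): it tells you which modules to harden before running the experiment again.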

The above is a basic Chaos Engineering scenario.

But there’s more:

In complex use cases, you need a more detailed understanding of your system: what should keep working during a failure and what needs to be improved. This is where performance testing tools come in.

Why performance testing tools?

Performance testing tools help you understand how your system performs through metrics such as average response time, percentage CPU utilization, maximum/minimum response time, number of failed responses and many more.
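To make these metrics concrete, here is a minimal sketch of how some of them could be computed from raw request samples. The sample values and the `summarize` helper are made up for illustration; a real performance testing tool reports these numbers for you, and CPU utilization would come from the host rather than from request samples.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize(samples: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize (response_time_ms, http_status) samples from one test run."""
    if not samples:
        raise ValueError("at least one sample is required")
    times = [response_ms for response_ms, _ in samples]
    # Assumption: any 4xx/5xx status counts as a failed response.
    failed = sum(1 for _, status in samples if status >= 400)
    return {
        "avg_ms": mean(times),
        "min_ms": min(times),
        "max_ms": max(times),
        "failed": failed,
        "error_rate": failed / len(samples),
    }


# Hypothetical samples: three successful requests and one failure.
samples = [(120.0, 200), (180.0, 200), (150.0, 200), (950.0, 503)]
summary = summarize(samples)
print(summary)
```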

CHAOS ENGINEERING WITH PERFORMANCE TESTING

Example Use Case: E-commerce application

Steady state
To understand the steady state of an application, you first need to run a performance test against it to collect metrics such as response time, throughput, error rate, hits per second and CPU utilization.
Any performance testing tool can be used.

We note these metrics as our baseline.
As an example, let’s say the response time for the Login module is 5000ms.

Hypothesis
We hypothesize that if there is a network loss, the Login module will still work but the response time might be higher, say 15000ms against the 5000ms baseline.

Experiment
You now start the experiment and run the performance test at the same time. You observe that the response time is 50000ms. Note down the metrics.

Adapt
This means that network loss had a much larger impact on the performance of the system than we hypothesized, and the architecture needs to be improved to bring the response time down.
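The comparison in this example is simple arithmetic, and it can be sketched as below. The `evaluate` function is a hypothetical helper of mine, using the three Login response times from the example (5000ms baseline, 15000ms hypothesis, 50000ms observed).

```python
from typing import Dict, Union


def evaluate(
    baseline_ms: float, hypothesis_ms: float, observed_ms: float
) -> Dict[str, Union[bool, float]]:
    """Compare the observed response time with the baseline and the hypothesis."""
    return {
        # The hypothesis holds only if we stayed within the predicted limit.
        "hypothesis_holds": observed_ms <= hypothesis_ms,
        # How many times slower than the steady state the system became.
        "slowdown_vs_baseline": observed_ms / baseline_ms,
        # How far beyond the predicted limit the observation landed.
        "overshoot_vs_hypothesis": observed_ms / hypothesis_ms,
    }


verdict = evaluate(baseline_ms=5000, hypothesis_ms=15000, observed_ms=50000)
print(verdict)
```

Here the observed response time is 10 times the baseline and more than 3 times the hypothesized limit, so the hypothesis is rejected and we move to the Adapt step.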

Note: Tools like Prometheus and Grafana can be integrated for monitoring and reporting purposes.

Figure: Example graph from the Prometheus monitoring tool

Some of the experiments available in Litmus Chaos

  • pod-delete
  • pod-autoscaler
  • pod-cpu-hog-exec
  • pod-cpu-hog
  • pod-dns-error
  • pod-dns-spoof
  • pod-io-stress
  • pod-memory-hog-exec
  • pod-network-corruption
  • pod-network-duplication
  • pod-network-latency
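To give an idea of how one of these experiments is requested, here is a rough sketch that builds a Litmus `ChaosEngine` manifest for `pod-network-latency` as a Python dictionary and prints it as JSON. The field names follow the ChaosEngine schema as I understand it, so please verify them against the Litmus documentation for your version. The namespace `shop`, the label `app=login`, the service account name and the tunable values are all hypothetical.

```python
import json
from typing import Dict


def chaos_engine(
    experiment: str,
    app_ns: str,
    app_label: str,
    service_account: str,
    env: Dict[str, object],
) -> dict:
    """Build a ChaosEngine manifest (assumed schema: litmuschaos.io/v1alpha1)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": f"{experiment}-engine", "namespace": app_ns},
        "spec": {
            "engineState": "active",
            # Which application the experiment targets.
            "appinfo": {
                "appns": app_ns,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            # Tunables are passed as string-valued env variables.
                            "env": [
                                {"name": key, "value": str(value)}
                                for key, value in env.items()
                            ]
                        }
                    },
                }
            ],
        },
    }


manifest = chaos_engine(
    experiment="pod-network-latency",
    app_ns="shop",                              # hypothetical namespace
    app_label="app=login",                      # hypothetical label
    service_account="pod-network-latency-sa",   # hypothetical service account
    env={"NETWORK_LATENCY": 2000, "TOTAL_CHAOS_DURATION": 60},
)
print(json.dumps(manifest, indent=2))
```

Keeping the chaos duration short and the target label narrow is one way to keep the experiment small while you are still learning how the system reacts.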

For more information on Chaos and a detailed technical understanding, you can refer to the documentation: Litmus docs link

To know more about the different experiments available, check out this link: Experiments link

Important points to know before implementing Chaos

  • You need a good knowledge of the service architecture of your application and must understand the upstream and downstream dependencies well enough to know the impact an experiment might cause
  • Understand the blast radius. What is the blast radius?
    The blast radius is the set of areas of the application that would be directly or indirectly impacted during the experiments
  • Keep the blast radius under control. Make sure you are not affecting critical areas of the business in production during the tests, so that the experiment does not cause a major interruption
  • Apply site reliability engineering best practices before injecting Chaos. For example, if the service runs in a single region, injecting Chaos could take the app down completely for real users. If you have cross-region support, you can inject Chaos in one region and verify that the service stays available
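One way to estimate the blast radius before an experiment is to walk the service dependency graph. The sketch below is an assumption of mine about how this could be done: the service names and the `depends_on` map are hypothetical, and in practice you would build the map from your own architecture.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], target: str) -> Set[str]:
    """Return every service that directly or indirectly depends on `target`."""
    # Invert the graph: for each service, who depends on it?
    dependents: Dict[str, Set[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    # Breadth-first walk upstream from the service we plan to break.
    impacted: Set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != target and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


# Hypothetical e-commerce dependency map: service -> services it calls.
depends_on = {
    "checkout": ["cart", "payment"],
    "cart": ["catalog"],
    "login": ["user-db"],
    "catalog": [],
    "payment": [],
    "user-db": [],
}

catalog_radius = blast_radius(depends_on, "catalog")
user_db_radius = blast_radius(depends_on, "user-db")
print(sorted(catalog_radius), sorted(user_db_radius))
```

In this made-up map, breaking `catalog` also reaches `cart` and `checkout`, so it touches the ordering flow, while breaking `user-db` only reaches `login`.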

Hope this blog helps you in getting a detailed understanding of the Chaos experiments! Happy learning!

Thank you!
