Chaos Engineering: Part I
Build the system by breaking the system
Have you ever logged onto a website to shop during a Christmas sale? Imagine the dress you were so eager to buy is right there in your cart, and just as you begin to check out, the site goes down. Damn my luck! Now, what if many users of the application face that downtime at exactly the same time?
A big loss for the company!
And that’s where Chaos Engineering helps us.
Chaos Engineering?
Chaos Engineering means introducing disaster-like conditions that could occur in reality and checking how the system performs under them.
Possible scenarios that could happen in real time:-
Before we go further, let’s understand the term ‘Resilience’
Resilience is “The system’s ability to keep afloat when a fault happens”.
Another definition is “The ability of a system to recover from infrastructure or service disruptions.”
This is one of the most important factors to take care of while building infrastructure. Low resiliency can lead to increased vulnerability, inadequate recovery, limited scalability, dependency risks, longer restoration time and loss of reputation.
That’s exactly where Chaos Engineering will help us
Chaos Engineering is the practice of deliberately injecting faults into a system to learn how it behaves before those faults occur in reality.
The goal is to discover weaknesses in the system through controlled experiments that introduce random and unpredictable behavior.
This happens in 4 simple steps:
1) Steady state
2) Hypothesis
3) Experiment
4) Adapt
What is Steady state?
The steady state is the way your system behaves under normal conditions.
Hypothesis?
You build a hypothesis around this steady state. First, note down how your system behaves in its steady state. Then, predict how it will behave in case of an outage.
Experiments?
Based on your hypothesis, you build your experiments. For example, your experiment could simulate a network outage or network slowness.
Adapt?
Based on the results of the experiment, you decide what changes you need to make to the system to make it more resilient.
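To make the four steps concrete, here is a minimal Python sketch that runs them against a simulated service instead of a real one. Everything in it is an assumption made for illustration: the `checkout_latency_ms` function, the 200 ms retry penalty, the 200 ms threshold and the 30% packet loss are hypothetical values, not something taken from a real system or from Litmus.

```python
import random
import statistics


def checkout_latency_ms(packet_loss: float, rng: random.Random) -> float:
    """Simulated checkout request (hypothetical): every lost packet
    triggers a retry that adds 200 ms, with at most 5 retries."""
    latency = rng.uniform(40, 60)
    retries = 0
    while retries < 5 and rng.random() < packet_loss:
        latency += 200
        retries += 1
    return latency


def p95(samples):
    """95th percentile of a list of latencies."""
    return statistics.quantiles(samples, n=20)[-1]


def measure(packet_loss: float, n: int = 500, seed: int = 42) -> float:
    """Send n simulated requests and return the p95 latency."""
    rng = random.Random(seed)
    return p95([checkout_latency_ms(packet_loss, rng) for _ in range(n)])


def run_experiment(threshold_ms: float = 200.0, injected_loss: float = 0.3) -> dict:
    # 1) Steady state: how the system behaves with no fault injected.
    steady = measure(packet_loss=0.0)

    # 2) Hypothesis: even with packet loss, p95 latency stays under threshold_ms.
    # 3) Experiment: inject the fault and measure again.
    under_chaos = measure(packet_loss=injected_loss)
    hypothesis_holds = under_chaos <= threshold_ms

    # 4) Adapt: decide what to do based on the result.
    action = (
        "no change needed"
        if hypothesis_holds
        else "review retry and timeout settings"
    )
    return {
        "steady_p95_ms": steady,
        "chaos_p95_ms": under_chaos,
        "hypothesis_holds": hypothesis_holds,
        "action": action,
    }


if __name__ == "__main__":
    print(run_experiment())
```

In a real setup, the measurements would come from your monitoring system and the fault would be injected by a chaos tool, but the loop of measure, predict, inject and adapt stays the same.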
There are many tools available in the market that help us do chaos testing.
In this blog, we will see an example of a Chaos Experiment with the help of the Litmus Chaos tool
Experiment: NETWORK LOSS
How and where do you configure your experiment?
All the experiments that could be performed are available within the Litmus Chaos tool
Chaos Center provides various options to create experiments.
Option 3, ChaosHubs, has many predefined experiments. If you are a new user, start with Option 3 to browse and understand the different experiments.
Workflow settings
In this section, you provide a name for your workflow. This can be any name that helps you identify the scenario. Additionally, you can provide a description for the workflow.
Tune workflow
Now comes the main part of the workflow. Here, you add the experiments you want to perform. For this blog, let us go with the network loss experiment
Search for network-loss and you will find an experiment called generic/pod-network-loss.
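Behind the UI, a Litmus experiment is described by a Kubernetes resource called a ChaosEngine. The Python sketch below builds a rough equivalent of what selecting pod-network-loss configures and prints it as JSON. The field and environment variable names are written from my recollection of the Litmus documentation, so please verify them against the version you are running. The engine name, target label `app=nginx` and service account `litmus-admin` are hypothetical placeholders.

```python
import json


def pod_network_loss_engine(
    app_label: str,
    namespace: str = "default",
    loss_percent: int = 100,
    duration_s: int = 60,
) -> dict:
    """Build a ChaosEngine-style manifest for pod-network-loss.

    Assumption: field names follow the Litmus ChaosEngine layout as
    recalled from its docs; check them against your Litmus version.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "network-loss-demo", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which application the fault is injected into.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account name.
            "chaosServiceAccount": "litmus-admin",
            "experiments": [
                {
                    "name": "pod-network-loss",
                    "spec": {
                        "components": {
                            "env": [
                                # How long the fault lasts, in seconds.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                # Percentage of packets to drop.
                                {"name": "NETWORK_PACKET_LOSS_PERCENTAGE", "value": str(loss_percent)},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_network_loss_engine("app=nginx"), indent=2))
```

When you create the workflow from Chaos Center, you do not have to write this by hand. The sketch is only meant to show which knobs (target application, duration, loss percentage) the experiment exposes.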
Reliability score
Reliability scores let you assign a score to each experiment you have selected. In the above example, we selected only one experiment, network loss. But if, for example, we had 2 experiments, node-drain and node-cpu-hog, then this is how it would look.
Here, if the node-drain experiment is more critical, you may assign it a higher resiliency score, for example 10 for node-drain and 8 for node-cpu-hog.
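To see why these scores matter, here is a small Python sketch of how they could combine into one overall resilience figure. It assumes the overall score is a weighted average of each experiment's success percentage, which is my understanding of how Litmus calculates it, so treat the formula as an assumption. The pass and fail outcomes used below are made up for illustration.

```python
def resilience_score(results):
    """Weighted average of success percentages.

    results: list of (weight, success_percent) pairs, one per experiment.
    Assumption: the overall score is sum(weight * success) / sum(weight).
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        raise ValueError("at least one experiment needs a non-zero weight")
    return sum(weight * success for weight, success in results) / total_weight


if __name__ == "__main__":
    # node-drain (weight 10) passes, node-cpu-hog (weight 8) fails.
    print(round(resilience_score([(10, 100), (8, 0)]), 2))
    # node-drain fails, node-cpu-hog passes.
    print(round(resilience_score([(10, 0), (8, 100)]), 2))
```

With these weights, a failure in node-drain pulls the overall score down more than a failure in node-cpu-hog, which is the effect you want when node-drain is the more critical experiment.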
Schedule
You can select the schedule here depending on how frequently you would like to run your tests
Verify and Commit
Once you click on Verify and Commit, your chaos experiment is all set to run.
Awesome! This is how you can create and schedule a run in the LitmusChaos tool
I hope this blog gives you a basic understanding of Chaos Engineering and a quick view of how to create an experiment.
For more information on Chaos Engineering, kindly refer to my next blog: Chaos Engineering Part 2
Thank you!!