Understanding Chaos Engineering

Written by Anagha Pawar & Piyush Johar
Published in Globant · Dec 30, 2020

Introduction

Over the last few years we have seen major changes in the way systems are built. Enterprises continue to embrace large-scale distributed cloud architectures and microservices because of the advantages they bring in terms of scalability and ease of integration. However, this has also made systems more complex and harder to predict. Traditional testing alone cannot validate or predict the behavior of these systems. This is where chaos engineering helps build the required confidence and verify that systems are resilient and fault tolerant.

Chaos Engineering

Chaos engineering can be defined as the discipline of systematically experimenting on a system to test its capability to withstand turbulent conditions. It is not random or unsupervised experimentation. The focus is to introduce various disaster scenarios into the infrastructure in a thoughtful fashion, and to test and record the system’s ability to respond under these induced conditions.

The Principle

Chaos engineering is about proactively running experiments to reveal weaknesses in distributed systems. It is a way to build confidence in a system by exposing its weaknesses and the possible mitigations around them. This calls for a well-defined way of carrying out such experiments, which are referred to as chaos experiments.

The following steps define the typical flow of the chaos experiments:

  1. Define the steady state
    The steady state is some measurable output of the system that indicates normal behavior, such as error rate, throughput, or latency.
  2. Hypothesize about the steady state
    Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce real-world events
    Expose the experimental group to real-world events such as server crashes, network failures, power outages, hard drive malfunctions, etc.
  4. Verify the steady state and validate the hypothesis
    Test the hypothesis by comparing the steady state of the control group and the experimental group. The smaller the variance, the more confidence we have in the system (a minimal sketch of this comparison follows this list).
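
As a minimal illustration of step 4, the following Python sketch compares the error rate of a control group and an experimental group against an agreed tolerance; the metric values, request counts, and tolerance are hypothetical placeholders.

```python
# Minimal sketch: verifying a steady-state hypothesis by comparing a
# control group and an experimental group. All values are hypothetical.

def error_rate(errors: int, requests: int) -> float:
    """Error rate as a fraction of total requests."""
    return errors / requests if requests else 0.0

# Hypothetical measurements collected during the experiment window.
control = error_rate(errors=12, requests=10_000)        # group without injected faults
experimental = error_rate(errors=15, requests=10_000)   # group exposed to injected faults

TOLERANCE = 0.001  # maximum acceptable deviation from the control group

deviation = abs(experimental - control)
if deviation <= TOLERANCE:
    print(f"Hypothesis holds: deviation {deviation:.4f} is within tolerance.")
else:
    print(f"Hypothesis rejected: deviation {deviation:.4f} exceeds tolerance.")
```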

Chaos Engineering In Action

Chaos experiments should be designed considering all the services of the system rather than just testing individual components in isolation. Tracking and monitoring various metrics provides visibility into the steady state of the system and helps draw the right conclusions from the experiments. These metrics can include CPU usage, memory, disk usage, response time, etc.
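
As an illustration of this kind of metric tracking, the following sketch samples a few basic system and application metrics; it assumes the third-party psutil and requests packages, and the health-check URL is a placeholder.

```python
# Illustrative sketch: sampling a few steady-state metrics.
# Requires the third-party packages `psutil` and `requests`.
import time
import psutil
import requests

def snapshot(health_url: str) -> dict:
    """Collect a single sample of basic system and application metrics."""
    start = time.monotonic()
    status = requests.get(health_url, timeout=5).status_code
    response_time = time.monotonic() - start
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "health_status": status,
        "response_time_s": round(response_time, 3),
    }

if __name__ == "__main__":
    # "http://localhost:8080/health" is a placeholder endpoint.
    print(snapshot("http://localhost:8080/health"))
```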

The following process can be adopted when designing chaos experiments in general.

  1. Picking a hypothesis:
    The acceptable behaviors of the system should be considered its steady state. The goal is to develop a baselined model that characterizes the steady state of the system based on various metrics. The steady state can be determined by regularly testing the application and collecting the metrics that portray the healthiest state of the system. Once we have the steady-state behavior and metrics, we can derive the hypothesis and expected outcome for the experiment.
  2. Choosing the blast radius of the experiment:
    While planning the experiments we need to carefully consider how far we need to go to learn something useful about the system. The area that can be affected by an experiment is its blast radius. The scope of the experiment should be chosen to minimize the impact on real customers when running in production. Although running tests closer to production yields better results, not containing the blast radius can cause major issues. A recommended approach is to start with the smallest possible experiments in staging environments, until one gains confidence and is ready to simulate bigger events.
  3. Choosing the right metrics:
    From the set of available metrics, it is important to choose the ones that will be used to evaluate the results of the experiments. Evaluating these frequently ensures that the ongoing behavior of the system is captured accurately, helps identify potential pitfalls, and aids in aborting the experiments if there is a larger impact on the system.
  4. Involving the required teams:
    It is important to notify and keep the various module teams informed about the experiments that will be carried out so that they are prepared to respond. This is especially required in the initial stages of adopting chaos engineering. Once the teams are used to these types of experiments, they will gain confidence and start incorporating the right measures and fixes. If experiments are run in production, the concerned stakeholders should be notified and the experiments should be planned for execution during the agreed downtime window.
  5. Running the experiments:
    Here we start executing our experiments and observe the metrics for abnormal behavior. We should have an abort and rollback plan in place in case we see a large variance in critical metric readings and notice that the experiments are causing too much damage or impact on the system (a minimal sketch of such a guard follows this list). The experiments should simulate real-world events as closely as possible. Experimenting with, and fixing, events around memory overload, network latency and failure, CPU consumption and exhaustion, deadlocks and dependency failures, failures in communication between services, and of course functional defects, can increase confidence in the reliability of the system.
  6. Analyzing the results:
    Here the focus is on analyzing the metrics and results once the experiments are complete, and verifying whether the hypothesis was correct or there was a change to the system’s steady-state behavior. The results are shared with the required teams so that the weaknesses found can be fixed. The harder it is for the experiments to push the system away from its steady state, the higher the trust in its reliability.
  7. Increasing the blast radius:
    As our experiments with smaller inputs and scope start succeeding, we can start increasing the scope. This also helps us identify the breaking point of the system.
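
The abort-and-rollback guard mentioned in step 5 can be as simple as a loop that polls a critical metric while the fault is active and reverts as soon as a threshold is crossed. The sketch below is deliberately generic: inject_fault, rollback, and read_error_rate are hypothetical hooks to be wired to your own tooling, and the thresholds are placeholders.

```python
# Minimal sketch of an abort/rollback guard around a chaos experiment.
# inject_fault(), rollback() and read_error_rate() are hypothetical hooks.
import time

ERROR_RATE_ABORT_THRESHOLD = 0.05   # abort if more than 5% of requests fail
EXPERIMENT_DURATION_S = 300
POLL_INTERVAL_S = 10

def run_experiment(inject_fault, rollback, read_error_rate) -> bool:
    """Run a fault injection, aborting early if the error rate spikes."""
    inject_fault()
    try:
        deadline = time.monotonic() + EXPERIMENT_DURATION_S
        while time.monotonic() < deadline:
            rate = read_error_rate()
            if rate > ERROR_RATE_ABORT_THRESHOLD:
                print(f"Aborting: error rate {rate:.2%} exceeded threshold.")
                return False
            time.sleep(POLL_INTERVAL_S)
        return True
    finally:
        # Always revert the system to its normal state, even on abort or crash.
        rollback()
```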

A Few Sample Inputs For Experiments

Experiments vary depending on the architecture of the system, and there can be many of them. However, for a distributed, microservices-based architecture deployed on the cloud, the following are among the most common:

  • Making the instances unavailable/not reachable at random from the defined zone.
  • Simulating the failure of an entire region or zone.
  • Consuming and exhausting the CPU and Memory resources on the instances.
  • Creating deadlocks such that the resources in use are not released.
  • Injecting latency between services for a predetermined period.
  • Injecting function-based chaos that randomly causes functions to throw exceptions (see the sketch after this list).
  • Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
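
The function-based chaos in the list above can be prototyped without any external tool, for example with a decorator that randomly raises an exception or adds latency to the wrapped function; the failure rate and delay below are arbitrary placeholders.

```python
# Minimal sketch of function-based chaos: a decorator that randomly
# injects an exception or extra latency into the wrapped function.
import functools
import random
import time

def chaos(failure_rate: float = 0.1, max_delay_s: float = 2.0):
    """Randomly fail or slow down the decorated function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise RuntimeError(f"Injected failure in {func.__name__}")
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.2, max_delay_s=1.0)
def get_price(item_id: str) -> float:
    # Placeholder business logic.
    return 42.0

if __name__ == "__main__":
    for _ in range(5):
        try:
            print(get_price("sku-123"))
        except RuntimeError as exc:
            print(f"Caller observed: {exc}")
```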

Best Practices

Considering the nature of the experiments involved in chaos engineering and the impact they can have on systems, it helps to use the experience from previously executed experiments as a guide and reference point. This not only helps mitigate known risks but also guides adherence to best practices. A few pointers include:

  • Minimize the blast radius:
    Begin with small experiments to learn about unknowns. Start with a single instance, container or micro-service to reduce the potential side effects. Once we gain confidence, we can scale up the experiments.
  • Start in the staging environment:
    To be safe and get the initial confidence in tests, it would be ideal to start with staging environments. Once the tests in this environment are successful, move to production.
  • Prioritizing Experiments:
    Chaos experiments running in production can impact core business functionality, so it helps to prioritize the experiments that can be executed safely without causing business impact.

Here the approach is to categorize all the services of the system under either a “critical” or a “non-critical” bucket. This can be determined by factors such as the percentage of traffic a service receives, the output of the service, etc.

Start experiments with non-critical services first, to verify that the unavailability of these services is handled gracefully and the core business functionalities are not affected. If an attack on a non-critical service brings the system down, then that service needs to be moved to the critical category.
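
One simple, illustrative way to perform this categorization is to rank services by the share of total traffic they receive and treat everything above a chosen threshold as critical; the service names, request counts, and threshold below are hypothetical.

```python
# Illustrative sketch: bucketing services into "critical" and "non-critical"
# by their share of total traffic. Service names and counts are hypothetical.
requests_per_service = {
    "checkout": 52_000,
    "catalog": 31_000,
    "recommendations": 9_000,
    "newsletter": 1_500,
}

CRITICAL_TRAFFIC_SHARE = 0.10  # services above 10% of traffic are critical

total = sum(requests_per_service.values())
buckets = {"critical": [], "non-critical": []}
for service, count in requests_per_service.items():
    share = count / total
    bucket = "critical" if share >= CRITICAL_TRAFFIC_SHARE else "non-critical"
    buckets[bucket].append((service, round(share, 3)))

print(buckets)
# Experiments would then start with the "non-critical" bucket.
```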

The following order should be adopted when performing the experiments: start with the known-knowns, then the known-unknowns, then the unknown-knowns, and finally the unknown-unknowns.

Example: Let’s say we have a containerized environment that helps rapidly create and deploy new containers in case existing ones disappear due to a server crash or other issue.

So, mapping this example to the above order:

Known-Knowns: Things we are aware of and understand
Here we know that when one node or replica container shuts down, it will disappear from the node cluster. New replicas will be created and re-added to the cluster.

Known-Unknowns: Things we are aware of but don’t fully understand
Here we know the above, but lack knowledge of the time it will take between the destruction of one replica and the creation of a new one (the sketch after this example shows one way to measure this).

Unknown-Knowns: Things we understand but are not aware of
In this case, we don’t know the mean time for creation of new replicas on a specific date or in a certain environment, but we do know how many there were and how many will be created to replace them.

Unknown-Unknowns: Things we are neither aware of nor fully understand
Here, for instance, we don’t know what will happen when the total system shuts down, or whether the region failover will be effective, because we have no previous trials or baseline for comparison.
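
For a known-unknown such as the replica recovery time above, the answer can be measured directly. The sketch below uses the official Kubernetes Python client to delete one pod of a deployment and time how long the ready replica count takes to recover; the namespace, deployment name, and label selector are placeholders, and the calls should be verified against the client version in use.

```python
# Sketch: measuring time from pod deletion to full recovery of a deployment.
# Requires the `kubernetes` Python client and access to a cluster.
# Namespace, deployment name, and label selector are placeholders.
import time
from kubernetes import client, config

NAMESPACE = "default"
DEPLOYMENT = "my-app"
LABEL_SELECTOR = "app=my-app"

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Delete one pod of the deployment to simulate a crash.
pod = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items[0]
core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
start = time.monotonic()

# Wait until the ready replica count matches the desired replica count again.
while True:
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    if (dep.status.ready_replicas or 0) >= dep.spec.replicas:
        break
    time.sleep(1)

print(f"Recovery took {time.monotonic() - start:.1f}s")
```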

  • Be ready to kill and revert:
    Make sure we have done enough work to be able to stop or kill any experiment immediately and revert the system back to its normal state in case the experiment causes a severe outage. Further, track such instances carefully and analyze them to avoid a recurrence.
  • Estimate cost and return on investments:
    The business impact of outages varies with the nature of the business. The impact on revenue can be estimated from the number of incidents and outages, their severity, and the contractual obligations with clients. Comparing this with the cost of running the chaos experiments helps in arriving at the right conclusion (a back-of-the-envelope sketch follows this list).
  • Use tools where possible:
    With multiple tools available in the market, understand their offerings. Compare the features they provide and the scenarios they help cover against the time and effort that would be required without them. Select tools that perform thoughtful, planned, and controlled experiments and that aid in measuring specific metrics useful for the system under test.
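
As a back-of-the-envelope illustration of the cost comparison described under “Estimate cost and return on investments”, the expected annual outage cost avoided can be weighed against the cost of running the chaos program; every figure below is a made-up placeholder.

```python
# Back-of-the-envelope sketch of the cost/ROI comparison.
# Every figure here is a hypothetical placeholder.
severe_outages_per_year = 4
avg_outage_duration_hours = 3
revenue_loss_per_hour = 20_000        # lost revenue + SLA penalties, per hour
expected_reduction = 0.5              # fraction of outage cost chaos testing is expected to avoid

chaos_program_cost_per_year = 60_000  # tooling, environments, engineering time

expected_outage_cost = severe_outages_per_year * avg_outage_duration_hours * revenue_loss_per_hour
expected_savings = expected_outage_cost * expected_reduction

print(f"Expected annual outage cost: {expected_outage_cost:,.0f}")
print(f"Expected savings from chaos program: {expected_savings:,.0f}")
print(f"Net benefit: {expected_savings - chaos_program_cost_per_year:,.0f}")
```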

Challenges With Chaos Engineering

Implementing chaos experiments is not easy and does come with its own challenges. Some of these include:

  • The effort of implementing the experiments: putting together a sufficiently systematic approach to cover “enough” scenarios can be a major challenge.
  • The risks these experiments bring with them if not properly thought through. They can lead to complete disruption of the systems even before they can be aborted, and many times the harm is so severe that it cannot be undone, so a thorough planning and mitigation process needs to be in place.
  • Analyzing the results of chaos experiments can be a tedious process and takes effort. Although the experiments can be excellent at identifying weaknesses, implementing the fix can at times mean reworking large parts of the implementation.

Benefits Of Chaos Engineering

The fundamental benefit of chaos engineering is its ability to identify weaknesses before they appear as incidents in the production environment. A few other benefits include:

  • Can help prevent extremely large losses in revenue and maintenance costs. Outages can typically cost companies millions of dollars in revenue, depending on the usage of the system and the duration of the outage, not to mention the damage to the company’s reputation during this period.
  • Increases confidence in carrying out disaster recovery methods. Most teams do not have enough confidence in full-scale disaster recovery as they perform these tasks only in extreme disaster cases. If the whole team has adopted principles of chaos engineering, disaster recovery practices can be more streamlined and performed with greater confidence.
  • The issues found can help engineers to better understand the systems they develop. Engineers may also learn which parts of the system are the most critical ones, and which are less critical.

Tools

As chaos engineering continues to evolve, so do the various tools available in the market. Each of these comes with its own set of features, ease of use, system/platform support, and extensibility. The following are a few of the tools available for performing chaos experiments:

  • Chaos Monkey
    It was one of the first open-source chaos engineering tools, primarily used to terminate virtual machine instances at random. It was built as part of Netflix’s Simian Army project; many of the Simian Army tools have since been retired or rolled into other tools such as Swabbie and Spinnaker.
    https://netflix.github.io/chaosmonkey/
  • Gremlin
    Gremlin offers a failure-as-a-service tool to make chaos engineering easier to deploy. It includes a variety of safeguards built in to break infrastructure responsibly.
    https://www.gremlin.com/
  • ChaosBlade
    This tool supports a wide range of platforms including Kubernetes, cloud platforms, and bare-metal. It provides dozens of attacks including packet loss, process killing, and resource consumption. It also supports application-level fault injection for Java, C++, and Node.js applications, which provides arbitrary code injection, delayed code execution, and modifying memory values.
    https://github.com/chaosblade-io/chaosblade
  • ChaoSlingr
    This is a security chaos engineering tool focused primarily on experimentation on AWS infrastructure to bring system security weaknesses to the forefront.
    https://github.com/Optum/ChaoSlingr
  • Chaos Mesh
    This is a Kubernetes-native tool that offers 17 unique attacks including resource consumption, network latency, packet loss, bandwidth restriction, disk I/O latency, system time manipulation, and even kernel panics. Chaos Mesh is one of the few open source tools to include a fully-featured web user interface (UI) called the Chaos Dashboard.
    https://github.com/chaos-mesh/chaos-mesh
  • Litmus
    This is also a Kubernetes-native tool. It provides a large number of experiments for testing containers, Pods, and nodes, as well as specific platforms and tools.
    https://litmuschaos.io/
  • PowerfulSeal
    This is a CLI tool for running experiments on Kubernetes clusters.
    https://github.com/powerfulseal/powerfulseal
  • Toxiproxy
    This is a network failure injection tool that lets users create conditions such as latency, connection loss, bandwidth throttling, and packet manipulation. As the name implies, it acts as a proxy that sits between two services and can inject failure directly into traffic (a short sketch using its HTTP API follows this list).
    https://github.com/Shopify/toxiproxy
  • Pumba
    This is a chaos testing command-line tool for Docker containers. Pumba disturbs containers by crashing the containerized application, emulating network failures, and stress-testing container resources (CPU, memory, filesystem, IO, and others).
    https://github.com/alexei-led/pumba
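
As a small example of driving one of these tools programmatically, the sketch below talks to a locally running Toxiproxy server over its HTTP API (port 8474 by default) to put a proxy in front of an upstream service and add a latency toxic; the addresses are placeholders and the payload fields should be checked against the Toxiproxy documentation for the version in use.

```python
# Sketch: adding a latency toxic through Toxiproxy's HTTP API.
# Assumes a Toxiproxy server is running locally on its default port (8474).
# Addresses and payload fields are based on the documented API and should be
# verified against the Toxiproxy version in use.
import requests

TOXIPROXY = "http://localhost:8474"

# Create a proxy that sits between the client and the upstream service.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "redis",
    "listen": "127.0.0.1:26379",    # clients connect here
    "upstream": "127.0.0.1:6379",   # real service behind the proxy
}).raise_for_status()

# Add 1s of latency (with 100ms jitter) to traffic flowing downstream.
requests.post(f"{TOXIPROXY}/proxies/redis/toxics", json={
    "name": "extra_latency",
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,
    "attributes": {"latency": 1000, "jitter": 100},
}).raise_for_status()

# ...run the experiment, then remove the toxic to restore normal traffic.
requests.delete(f"{TOXIPROXY}/proxies/redis/toxics/extra_latency").raise_for_status()
```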

Conclusion

Chaos engineering is one of the most valuable practices to adopt as part of a resiliency strategy. The opportunities and scope that come with chaos experiments are vast. With the right strategy in place, not only will stakeholders gain the required confidence in their system, it will also help them design and architect systems with all the possible pitfalls in mind, pitfalls that would otherwise have been ignored or never thought of.

References:
https://www.gremlin.com/
https://principlesofchaos.org/
