Making Chaos Work for You: Choosing the Right Chaos Engineering Tool

Anish Kumar Chaudhary
Published in Naukri Engineering
5 min read · Aug 4, 2023

What happens if just one of your core services is no longer available? Or if a server is disconnected for whatever reason? Or if data is no longer accessible?

We must accept that every application is bound to fail at some point, regardless of how well it is built. These failures lead to downtime and ultimately hamper the user experience.

By addressing these failure modes proactively, we can reduce system failures and make the system more resilient. But if we cannot afford to wait for an actual disruption to occur, what steps can we take to achieve this?

The focus of this article is on tools that intentionally create chaos in a system so that we can find and fix its weak points. This type of testing is known as resilience testing, which evaluates how an application behaves under chaotic circumstances. By leveraging this chaos, we can conduct both functional resilience testing and performance resilience testing.

We evaluated several open-source tools for injecting faults into our system and checking its resilience. The following tools are discussed:

  1. Chaos Kube
  2. Kube Monkey
  3. Pumba
  4. Chaos Blade
  5. Chaos Mesh
  6. Litmus Chaos

Tools Comparison

1. Platform Support

2. Visualisation Capabilities

3. Method for Creating Experiment

4. Types of Fault Injection

  • Pod Fault: This simulates Pod failures, such as Pod termination, Pod’s persistent unavailability, etc.
  • Container Fault: This simulates container failures, such as main process termination, etc.
  • Network Fault: This simulates network failures, such as network latency, packet loss, packet disorder, network partitions, etc.
  • HTTP Fault: This simulates HTTP communication failures, such as HTTP communication latency, request and response replacement, etc.
  • Stress Faults: This simulates stress on the resource, such as CPU or memory.
  • Application Faults: This simulates application failures, such as JVM function call delay, an exception to be thrown, etc.

5. Access Control

Access control refers to the management of user permissions and restrictions (like tokens or credentials) for accessing certain parts of a system, such as a user interface or the scope of an experiment.

However, the tools do require access to the Kubernetes API to identify targets and inject chaos into the cluster.

Prominent Features of Chaos Mesh

Chaos Mesh is an open-source, cloud-native chaos engineering platform. Using Chaos Mesh, you can conveniently simulate various abnormalities that might occur in real development, testing, and production environments and find potential problems in the system.

RBAC Authentication for the Scope of the Experiment:

Users need proper permissions on the chaos-mesh.org API group in order to view and manage chaos experiments. This permission check can be disabled, but doing so is generally not recommended.

In the dashboard, we can select the scope of the experiment (i.e. cluster scoped or namespaced) and the access role (i.e. manager or viewer). This creates a YAML file which is used to generate a token. The generated token is then used to log into the Chaos Mesh dashboard.
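To make the scope and role choice more concrete, here is a rough Python sketch of the kind of RBAC objects such a YAML file typically contains: a ServiceAccount, a Role (or ClusterRole), and a binding between them. The function name, the object names, the "staging" namespace, and the exact verb lists are assumptions for illustration; the file generated by the dashboard may differ.

```python
import json

CHAOS_API_GROUP = "chaos-mesh.org"
RBAC_API_GROUP = "rbac.authorization.k8s.io"

# Assumed verbs on chaos-mesh.org resources for each access role.
ROLE_VERBS = {
    "viewer": ["get", "list", "watch"],
    "manager": ["get", "list", "watch", "create", "delete", "patch", "update"],
}


def build_rbac_manifests(account, role, namespace=None):
    """Build ServiceAccount, (Cluster)Role and (Cluster)RoleBinding dicts.

    namespace=None requests cluster scope; the ServiceAccount is then
    placed in "default" (an arbitrary choice for this sketch).
    """
    if role not in ROLE_VERBS:
        raise ValueError(f"unknown role: {role!r}")

    cluster_scoped = namespace is None
    account_namespace = "default" if cluster_scoped else namespace
    role_kind = "ClusterRole" if cluster_scoped else "Role"
    role_name = f"role-{account}"

    def metadata(name):
        # ClusterRole and ClusterRoleBinding are not namespaced objects.
        if cluster_scoped:
            return {"name": name}
        return {"name": name, "namespace": namespace}

    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": account, "namespace": account_namespace},
    }
    role_manifest = {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": role_kind,
        "metadata": metadata(role_name),
        "rules": [
            # Read access to the objects that experiments target.
            {"apiGroups": [""], "resources": ["pods", "namespaces"],
             "verbs": ["get", "list", "watch"]},
            # Access to the chaos experiments themselves.
            {"apiGroups": [CHAOS_API_GROUP], "resources": ["*"],
             "verbs": ROLE_VERBS[role]},
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": f"{role_kind}Binding",
        "metadata": metadata(f"bind-{account}"),
        "subjects": [{"kind": "ServiceAccount", "name": account,
                      "namespace": account_namespace}],
        "roleRef": {"kind": role_kind, "name": role_name,
                    "apiGroup": RBAC_API_GROUP},
    }
    return [service_account, role_manifest, binding]


if __name__ == "__main__":
    # Hypothetical read-only account limited to the "staging" namespace.
    for doc in build_rbac_manifests("chaos-viewer", "viewer", namespace="staging"):
        print(json.dumps(doc, indent=2))
```

The output is JSON rather than YAML only to keep the sketch free of third-party libraries; Kubernetes accepts manifests in either format.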

Types of Fault Injection:

Chaos Mesh covers almost every abnormality that might occur in our system, which makes it a comprehensive tool for chaos engineering. Its fault injection covers both Kubernetes and host-level faults.
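As a rough illustration of how a fault is declared, the following Python sketch builds a NetworkChaos manifest that adds latency to one matching pod. The mapping from this article's fault categories to Chaos Mesh resource kinds reflects our reading of the project, and the helper name, the "staging" namespace, and the "app: web" label are hypothetical.

```python
import json

# Fault categories from this article mapped to Chaos Mesh resource kinds.
FAULT_KINDS = {
    "pod": "PodChaos",
    "container": "PodChaos",  # container kill is a PodChaos action
    "network": "NetworkChaos",
    "http": "HTTPChaos",
    "stress": "StressChaos",
    "application": "JVMChaos",
}


def network_delay_experiment(name, namespace, target_labels,
                             latency="100ms", duration="30s"):
    """Return a NetworkChaos manifest that delays traffic of one target pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": FAULT_KINDS["network"],
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # affect a single randomly chosen target
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": dict(target_labels),
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Hypothetical target: pods labelled app=web in the "staging" namespace.
    manifest = network_delay_experiment("web-delay", "staging", {"app": "web"})
    print(json.dumps(manifest, indent=2))
```

The same overall shape (action, mode, selector, then fault-specific fields) applies to the other Kubernetes fault kinds, though each kind has its own fields.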

Orchestrating Serial and Parallel Experiments with Workflows:

We can combine experiments into a serial or parallel workflow in Chaos Mesh, so that the chaos experiments run efficiently and every step is executed in the intended order.
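A workflow of this kind can be sketched as a manifest with one entry template that lists its children, plus one template per experiment. The Python helper below is a minimal sketch under that assumption; the helper name, template names, deadlines, and target labels are hypothetical, and only three experiment kinds are mapped.

```python
import json

# Key under which each experiment kind embeds its spec inside a template.
EMBEDDED_SPEC_KEY = {
    "PodChaos": "podChaos",
    "NetworkChaos": "networkChaos",
    "StressChaos": "stressChaos",
}


def build_workflow(name, namespace, order, steps, deadline="240s"):
    """Return a Workflow manifest that runs `steps` in Serial or Parallel order.

    Each step is a dict with the keys "name", "kind", "deadline" and "spec".
    """
    if order not in ("Serial", "Parallel"):
        raise ValueError("order must be 'Serial' or 'Parallel'")

    # The entry template only orchestrates: it lists the child templates.
    entry = {
        "name": "entry",
        "templateType": order,
        "deadline": deadline,
        "children": [step["name"] for step in steps],
    }
    templates = [entry]
    for step in steps:
        kind = step["kind"]
        templates.append({
            "name": step["name"],
            "templateType": kind,
            "deadline": step["deadline"],
            EMBEDDED_SPEC_KEY[kind]: step["spec"],
        })

    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Workflow",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"entry": "entry", "templates": templates},
    }


if __name__ == "__main__":
    selector = {"namespaces": ["staging"], "labelSelectors": {"app": "web"}}
    steps = [
        {"name": "slow-network", "kind": "NetworkChaos", "deadline": "30s",
         "spec": {"action": "delay", "mode": "all", "selector": selector,
                  "delay": {"latency": "90ms"}}},
        {"name": "kill-one-pod", "kind": "PodChaos", "deadline": "30s",
         "spec": {"action": "pod-kill", "mode": "one", "selector": selector}},
    ]
    print(json.dumps(build_workflow("demo-flow", "staging", "Serial", steps),
                     indent=2))
```

Switching `order` from "Serial" to "Parallel" keeps the same child templates but asks for them to run at the same time instead of one after another.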

Setting the Scope and Mode of the Experiment:

We set the scope of the experiment by namespace and labels. In addition, a preview field shows all matching targets. This helps us refine the scope further and ensure that the blast radius does not touch critical components, so that the experiment is conducted in a safe and controlled manner. Chaos Mesh offers five modes, each of which determines how many of the matching targets are affected. These are:

  • Random one
  • Fixed number
  • Fixed percent
  • Random max percent
  • All
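The effect of the five modes can be sketched with a small, self-contained Python simulation. This is not Chaos Mesh code: the function name is hypothetical, the mode strings follow the names we understand the YAML manifests to use (one, fixed, fixed-percent, random-max-percent, all), and rounding percentages down is an assumption.

```python
import math
import random


def select_targets(pods, mode, value=None, rng=None):
    """Pick which of the matching pods an experiment would affect.

    `value` is the count for "fixed" and the percentage for the two
    percent-based modes. Percentages are rounded down (an assumption).
    """
    if rng is None:
        rng = random.Random()
    pods = list(pods)
    if not pods:
        return []

    if mode == "one":  # "Random one"
        return [rng.choice(pods)]
    if mode == "all":  # "All"
        return pods
    if mode == "fixed":  # "Fixed number"
        count = min(int(value), len(pods))
    elif mode == "fixed-percent":  # "Fixed percent"
        count = math.floor(len(pods) * value / 100)
    elif mode == "random-max-percent":  # "Random max percent"
        # Draw a percentage between 0 and the configured maximum.
        percent = rng.randint(0, int(value))
        count = math.floor(len(pods) * percent / 100)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Sample without replacement so that no pod is chosen twice.
    return rng.sample(pods, count)


if __name__ == "__main__":
    pods = [f"web-{i}" for i in range(10)]  # hypothetical pod names
    for mode, value in [("one", None), ("fixed", 3), ("fixed-percent", 50),
                        ("random-max-percent", 40), ("all", None)]:
        chosen = select_targets(pods, mode, value)
        print(f"{mode}: {len(chosen)} of {len(pods)} pods")
```

Starting with "one" or a small "fixed" count and widening later keeps the number of affected targets small while an experiment is still being validated.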

Conclusion

Because of the promising features it offers, its ease of implementation, and its visualization capabilities, we have selected Chaos Mesh for further implementation.

Keep in mind that the ability to inject a variety of failures effectively and repeatedly is not sufficient for successful chaos engineering. It is crucial to analyze the system's behavior and understand the impact of the injected failures in order to enhance the system's resilience.
