How to Use LitmusChaos Experiments to Prove High Availability

Murphy Conor
Published in intive Developers
May 12, 2022 · 4 min read

In recent years, technology has become so advanced that software is now used in complex areas such as remote surgery and autonomous vehicles. This means that preventing errors has never been more essential, as the ramifications of a system failure in such use cases could be dangerous, even fatal.

These advancements therefore need to be accompanied by increasingly rigorous testing so that we can be confident our systems are fail-safe. One testing framework, LitmusChaos, is considered one of the most advanced ways to detect potential risks and disruptions and to verify what is known as high availability (HA).

So, why is HA so important, how can LitmusChaos experiments safeguard it, and how can this type of testing be implemented?

Let’s take a look.

What is High Availability?

High availability (HA) is a system’s ability to continue to deliver on its key performance indicators (KPIs) even when failures occur. These failures can be internal to the application, in connected services such as a database, in the container orchestrator, or in the infrastructure itself.

To abide by HA principles, a service or product must reach a defined level of uptime. A common target is availability 99.999% of the time, also known as “five nines”, which leaves room for only about five minutes of downtime per year.
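To make that uptime figure concrete, here is a minimal sketch of the arithmetic behind a downtime budget. The function name and the 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare a few common availability targets.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which is why every untested failure mode matters.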

This means that any system which needs to achieve an HA standard must be tested robustly before any kind of release can occur. However, even the best-laid test plans will still let some bugs reach customer sites. This is where additional testing can be especially valuable, particularly for disaster scenarios that are difficult to replicate during a normal testing cycle.

What is LitmusChaos?

LitmusChaos is a framework that introduces ‘chaos’ into a system to purposefully disrupt applications, clusters, and the underlying infrastructure, even across cloud providers.

This disruption challenges the system with failure scenarios: the CPU or memory running out, a Kubernetes node or pod going down, or the disks on Google Cloud Platform virtual machines disappearing. Most importantly, it determines whether the system can still hit its KPI targets despite these disruptions.

LitmusChaos is made up of two core architectural pieces: the Chaos Control Plane and the Chaos Target Plane. The Chaos Control Plane is the central service that all the experiments are driven from and where the main Litmus operator is deployed. The Chaos Target Plane is the cluster (or clusters) in which the experiments run. The two planes communicate through the Litmus agent, instances of which are installed remotely in each of the target locations.

How Does LitmusChaos Test an Application?

LitmusChaos uses two ‘flows’: the experiment flow and the observability flow. The first describes how the ‘chaos’ is introduced into the system. An experiment begins by establishing the user inputs, such as how many pods should be deleted, which nodes to cordon off, and how long the experiment should last.
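As a rough illustration of how such user inputs might be captured, the sketch below assembles a ChaosEngine-style manifest for a pod deletion experiment as a Python dictionary. The field and environment variable names follow the ChaosEngine resource as commonly documented for Litmus, but they should be treated as assumptions and checked against the Litmus version in use. The engine name, namespace, label, and service account are hypothetical.

```python
import json


def build_pod_delete_engine(app_namespace: str, app_label: str,
                            duration_s: int, interval_s: int,
                            pods_affected_percent: int) -> dict:
    """Assemble a ChaosEngine-style manifest for a pod-delete experiment."""
    # Tunables are passed to the experiment as string environment variables.
    env = [
        {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
        {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
        {"name": "PODS_AFFECTED_PERC", "value": str(pods_affected_percent)},
    ]
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "demo-pod-delete", "namespace": app_namespace},  # hypothetical name
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",  # hypothetical service account
            "experiments": [
                {"name": "pod-delete", "spec": {"components": {"env": env}}},
            ],
        },
    }


# Hypothetical inputs: kill half of the matching pods every 10s for 60s.
engine = build_pod_delete_engine("shop", "app=checkout", 60, 10, 50)
print(json.dumps(engine, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could in principle be saved to a file and applied to a cluster where Litmus is installed.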

The chaos workflow then begins on the Kubernetes cluster through the use of an Argo workflow, which facilitates a multi-step chaos ‘pipeline’ where the ‘cluster disruptors’ are put in place, executed, and then removed from the system. Finally, post-chaos checks are performed to ensure that the desired state of chaos was reached.
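The ordering of those steps can be sketched in plain Python. This is not Argo or Litmus code; the four callables are hypothetical stand-ins, and the only point is that the disruptor is removed even if the chaos step fails, before the post-chaos check runs.

```python
from typing import Callable


def run_chaos_pipeline(install: Callable[[], None],
                       inject: Callable[[], None],
                       revert: Callable[[], None],
                       post_check: Callable[[], bool]) -> bool:
    """Run install -> inject -> revert -> post-check, always reverting."""
    install()
    try:
        inject()
    finally:
        # Clean up the disruptor even if injection raised an error.
        revert()
    return post_check()


# Hypothetical steps that only record the order in which they ran.
events = []
passed = run_chaos_pipeline(
    install=lambda: events.append("install"),
    inject=lambda: events.append("inject"),
    revert=lambda: events.append("revert"),
    post_check=lambda: events.append("post-check") or True,
)
print(events, passed)
```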

The observability flow focuses on measuring the performance of the application, either to confirm that the KPIs are maintained or to show where the system falters and falls short of its HA target. This is done by adding probes to the chaos experiment. These probes can take the form of HTTP requests, Kubernetes API commands, or metrics polling. The probes can be used individually or in combination to check HA at the granularity of a single microservice or on a system-wide scale.

For example, the Kubernetes probe can be used in a pod deletion chaos experiment to check that the service still exists, confirming that it is resilient and will be restored within a certain amount of time. The HTTP probe can check that a microservice still accepts incoming traffic when all the CPU or memory on the cluster is in use. The metrics probe is useful for any service with well-defined internal metrics, where it can verify that the service is achieving the throughput or latency needed to satisfy service level agreements for customers.
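To show the idea behind a probe that runs throughout an experiment, here is a minimal sketch in plain Python. It is not the Litmus probe implementation: the function names, the retry behaviour, and the simulated responses are all assumptions made for illustration.

```python
import time
import urllib.request
from typing import Callable


def http_returns_200(url: str, timeout_s: float = 2.0) -> bool:
    """HTTP-style check: True only if the URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:  # covers connection errors, timeouts, and HTTP error codes
        return False


def run_continuous_probe(check: Callable[[], bool], attempts: int,
                         interval_s: float, retries: int = 0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` repeatedly; fail as soon as one poll exhausts its retries."""
    for _ in range(attempts):
        # any() stops at the first successful try within this poll.
        if not any(check() for _ in range(retries + 1)):
            return False
        sleep(interval_s)
    return True


# Simulated service: one transient failure that a single retry absorbs.
responses = iter([True, False, True, True])
verdict = run_continuous_probe(lambda: next(responses), attempts=3,
                               interval_s=1.0, retries=1,
                               sleep=lambda _seconds: None)
print("probe passed:", verdict)
```

Against a real service, the check could be something like `lambda: http_returns_200("http://checkout.example/health")`, where the URL is a placeholder. Whether a transient failure should be retried or should fail the experiment outright depends on the SLA being tested.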

All of the workflows that make up the chaos test suite can be viewed within the Litmus Web Portal in the form of a user-friendly dashboard. This portal is hugely useful when debugging issues that arise within testing or while simply checking the test results.

LitmusChaos is an ideal way to test the robustness and resilience of our applications and products. A chaos-based test suite made up of several different experiments focusing on service, cluster, and infrastructure allows each piece of a system to be tested in multiple ways. This variety of testing levels creates versatile systems that are highly available and able to weather the storm of potential disruption and chaos.
