Declarative Approach to Chaos Hypothesis using Litmus Probes

Sarita Behera
Published in Litmus-Chaos
May 20, 2021

Introduction

One of the foremost principles of chaos engineering is the need to build a hypothesis around steady-state or desired system behavior, and to automate the process of gauging, analyzing & reporting it over the course of an experiment. This is indispensable to those executing the experiments, as it yields valuable insights into the resilience of their applications & deployment infrastructure. In other words, it is imperative for chaos engineering frameworks/toolsets to let users define their hypotheses & validate the system against them.

Further, in the world of cloud-native chaos engineering, the hypothesis definition is expected to be declarative, much like the chaos intent itself. Another interesting factor is the diversity in how the steady state is defined. Most of the time it is a systemic behavior pattern: metrics around error rates, latency percentiles, etc. At other times it is the liveness of a crucial downstream service, the state (values within) of a database, or even the state of Kubernetes resources. Typically, these can be mapped to operational SLOs (Service Level Objectives) that have been agreed upon. They are the parameters & conditions that verify that the system “works”. (As a discerning reader, you might have noticed the subtle difference here between chaos experimentation and standard failure testing: verifying that the system works, rather than validating how it works. Anyway, I digress.)

So, to summarize our understanding (courtesy: standard chaos principles & learnings from the awesome chaos community):

  1. Hypotheses are crucial to chaos experiments
  2. Chaos frameworks need to burn this into experiments (remember automated chaos!)
  3. Hypothesis definitions should be declarative in nature & thereby easily tuned & scaled
  4. Frameworks should accommodate the diverse nature of hypothesis definitions, with the right schema & architecture.

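Hypothesis in LitmusChaos Experiments

LitmusChaos lets users declare the hypothesis as probes in the ChaosEngine. This article covers three probe types: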

  • httpProbe: To query health/downstream URIs
  • cmdProbe: To execute any user-desired health-check function implemented as a shell command
  • k8sProbe: To perform CRUD operations against native & custom Kubernetes resources

These probes can be used in isolation or in combination to achieve the desired checks. As we will see in subsequent sections, the httpProbe & k8sProbe are fully declarative in the way they are conceived, while the cmdProbe expects the user to provide a shell command that implements checks highly specific to the application use case. Does it sound similar to the “command” or “shell” module in the Ansible world? The intent is similar too!

The probes can be set up to run in different modes:

  • SoT: Executed at the Start of Test as a pre-chaos check
  • EoT: Executed at the End of Test as a post-chaos check
  • Edge: Executed both before and after the chaos
  • Continuous: The probe is executed continuously, with a specified polling interval during the chaos injection.
  • OnChaos: The probe is executed continuously, with a specified polling interval, strictly for the duration of the chaos.

All probes share some common attributes:

  • probeTimeout: Represents the time limit for the probe to execute the check specified and return the expected data.
  • retry: The number of times a check is re-run upon failure of the first attempt, before the probe status is declared as failed.
  • interval: The period between subsequent retries.
  • probePollingInterval: The time for which a continuous probe sleeps after each iteration.
  • initialDelaySeconds: The initial waiting time, in seconds, before the probe starts.
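To make the interplay of these attributes concrete, here is a minimal Python sketch of a retry loop that reads the definitions above literally: one first attempt, followed by up to `retry` re-runs spaced `interval` seconds apart. The function `run_probe_check` and its parameters are hypothetical names for illustration and are not part of Litmus; `probeTimeout` is left out, on the assumption that each check enforces its own time limit.

```python
import time
from typing import Callable


def run_probe_check(
    check: Callable[[], bool],
    retry: int,
    interval: float,
    initial_delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `check` once, then re-run it up to `retry` times on failure."""
    sleep(initial_delay_seconds)  # initialDelaySeconds: wait before the first attempt
    for attempt in range(retry + 1):  # first attempt plus `retry` re-runs
        if check():
            return True
        if attempt < retry:
            sleep(interval)  # interval: pause between subsequent retries
    return False


# Usage: a check that fails twice and then succeeds passes with retry=2.
outcomes = iter([False, False, True])
passed = run_probe_check(lambda: next(outcomes), retry=2, interval=0,
                         sleep=lambda seconds: None)
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without real waiting.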

Let us take a look at the different probe categories in some more detail.

httpProbe

The httpProbe allows developers to specify a URL which the experiment uses to gauge health/service availability (or other custom conditions) as part of the entry/exit criteria. The received status code is compared against an expected status. It supports the HTTP GET and POST methods.

With the HTTP GET method, it sends a GET request to the provided URL and matches the response code against the given criteria (==, !=, oneOf).
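The three criteria can be illustrated with a small Python sketch. The function name `response_code_matches` is hypothetical, and the comma-separated format used for the oneOf expectation is an assumption of this sketch rather than the format Litmus itself accepts.

```python
def response_code_matches(actual: int, criteria: str, expected: str) -> bool:
    """Compare an HTTP status code against an expectation.

    `expected` is a single code for == and !=, and (as an assumption of this
    sketch) a comma-separated list of codes for oneOf.
    """
    if criteria == "==":
        return actual == int(expected)
    if criteria == "!=":
        return actual != int(expected)
    if criteria == "oneOf":
        return actual in {int(code) for code in expected.split(",")}
    raise ValueError(f"unsupported criteria: {criteria!r}")


# Usage: a 404 satisfies oneOf "200,404" but not == "200".
ok_one_of = response_code_matches(404, "oneOf", "200,404")
ok_equal = response_code_matches(404, "==", "200")
```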

With the HTTP POST method, it sends a POST request to the provided URL. The HTTP body can be provided in the body field. For a complex POST request in which the body spans multiple lines, the bodyPath attribute can be used to provide the path to a file containing the body. This file can be made available to the experiment pod via a ConfigMap resource, with the ConfigMap name being defined in the ChaosEngine or the ChaosExperiment CR. The httpProbe can be defined at the path .spec.experiments[].spec.probe inside the ChaosEngine.

NOTE: body and bodyPath are mutually exclusive.

The httpProbe is best used in the Continuous mode of operation, as a parallel liveness indicator of a target or downstream application. It uses the probePollingInterval property to specify the polling interval for the access checks.

NOTE: insecureSkipVerify can be set to true to skip the certificate checks.

cmdProbe

The cmdProbe allows developers to run shell commands and match the resulting output as part of the entry/exit criteria. The intent behind this probe was to allow users to implement a non-standard & imperative way of expressing their hypothesis. For example, the cmdProbe enables you to check for specific data within a database, parse the value out of a JSON blob being dumped into a certain path, or check for the existence of a particular string in the service logs.

To enable this, the probe supports an inline mode, in which the command runs from within the experiment image, and a source mode, in which the command runs from within a new pod whose image can be specified. Inline mode is preferred for simple shell commands, while source mode can be used when application-specific binaries are required. The cmdProbe can be defined at the path .spec.experiments[].spec.probe inside the ChaosEngine.

k8sProbe

With the proliferation of custom resources & operators, especially in the case of stateful applications, the steady state is often manifested as status parameters/flags within Kubernetes resources. The k8sProbe addresses verification of the desired resource state by allowing users to define the Kubernetes GVR (group-version-resource) with appropriate filters (field selectors/label selectors). The experiment makes use of the Kubernetes Dynamic Client to achieve this. The k8sProbe can be defined at the path .spec.experiments[].spec.probe inside the ChaosEngine.

It supports the following CRUD operations, which can be defined at probe.operation:

  • create: Creates a Kubernetes resource based on the data provided in the probe.data field.
  • delete: Deletes the Kubernetes resources matching the GVR and filters (field selectors/label selectors).
  • present: Checks for the presence of a Kubernetes resource based on the GVR and filters (field selectors/label selectors).
  • absent: Checks for the absence of a Kubernetes resource based on the GVR and filters (field selectors/label selectors).
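The present and absent checks can be sketched in Python against an in-memory list that stands in for the resources returned for a given GVR. Everything here is illustrative: `parse_label_selector` and `k8s_probe_passes` are hypothetical helpers, only equality-based label selectors (key=value pairs) are handled, and no call to the Kubernetes API is made.

```python
from typing import Dict, List


def parse_label_selector(selector: str) -> Dict[str, str]:
    """Parse an equality-based selector such as 'app=nginx,tier=web'."""
    pairs = (item.split("=", 1) for item in selector.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}


def k8s_probe_passes(operation: str, resources: List[Dict], label_selector: str) -> bool:
    """Evaluate 'present' or 'absent' against an in-memory resource list."""
    wanted = parse_label_selector(label_selector)
    matches = [
        resource for resource in resources
        if all(resource.get("labels", {}).get(k) == v for k, v in wanted.items())
    ]
    if operation == "present":
        return len(matches) > 0   # at least one matching resource exists
    if operation == "absent":
        return len(matches) == 0  # no matching resource exists
    raise ValueError(f"operation not covered by this sketch: {operation!r}")


# Usage: one nginx pod is listed, so 'present' passes for app=nginx
# and 'absent' passes for app=kafka.
pods = [{"name": "nginx-0", "labels": {"app": "nginx", "tier": "web"}}]
nginx_present = k8s_probe_passes("present", pods, "app=nginx,tier=web")
kafka_absent = k8s_probe_passes("absent", pods, "app=kafka")
```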

Probe Status & Deriving Inferences

The Litmus chaos experiments run the probes defined in the ChaosEngine and update their stage-wise success in the ChaosResult custom resource, with details including the overall probeSuccessPercentage (the ratio of successful checks to total probes) and the failure step, where applicable. The success of a probe depends on whether the expected status/results are met, and on whether it succeeds in all the experiment phases defined by the probe’s execution mode. For example, probes executed in “Edge” mode need the checks to succeed in both the pre-chaos & post-chaos phases to be declared successful.

The pass criterion for the experiment is a logical AND of all the probes defined in the ChaosEngine and the inbuilt entry/exit criteria. Failure of either indicates a failed hypothesis and is deemed an experiment failure, and an opportunity to fix the underlying problem!
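The rules described in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not the Litmus implementation: the `ProbeRun` type, the phase names, and the "Pass"/"Fail" strings are hypothetical, and the percentage is computed here as passed probes over total probes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProbeRun:
    name: str
    phase_results: Dict[str, bool]  # e.g. {"PreChaos": True, "PostChaos": False}

    @property
    def passed(self) -> bool:
        # A probe succeeds only if it succeeded in every phase it ran in.
        return all(self.phase_results.values())


def probe_success_percentage(probes: List[ProbeRun]) -> float:
    """Share of probes that passed, as a percentage."""
    if not probes:
        return 100.0  # assumption: with no probes defined, nothing failed
    return 100.0 * sum(p.passed for p in probes) / len(probes)


def experiment_verdict(probes: List[ProbeRun], inbuilt_checks_passed: bool) -> str:
    """Logical AND of the inbuilt entry/exit criteria and every probe."""
    all_passed = inbuilt_checks_passed and all(p.passed for p in probes)
    return "Pass" if all_passed else "Fail"


# Usage: an Edge-mode probe that fails post-chaos fails the whole experiment.
probes = [
    ProbeRun("http-liveness", {"DuringChaos": True}),
    ProbeRun("db-integrity", {"PreChaos": True, "PostChaos": False}),
]
percentage = probe_success_percentage(probes)
verdict = experiment_verdict(probes, inbuilt_checks_passed=True)
```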

Provided below is a chaos result snippet containing the probe status for a mixed-probe chaos engine.

Conclusion

The probes are an effective mechanism to burn hypotheses into chaos experiments and arm them with more meaning & context. In some ways, they address today's “opacity” about what constitutes an experiment pass or failure. Probes are also useful analytics aids that indicate resiliency over a period of time. For example, the probeSuccessPercentage for a given experiment (with a set of mandatory probes) against a specific application can be tracked over time to gauge the progress being made in the feature & deploy practices.

One of the questions we got from the community as we set out to build this feature was whether the probes are a replacement for application-specific chaos experiments. The answer is “Not really”. While the probes do enable reuse of the generic experiments and give them an app context, they are not intended to perform deep application-level verification, much less inject app-specific faults. Take the Kafka chaos experiments on the hub, for example. The focus there is on identifying a certain type of app replica to inject chaos on & using test downstream apps (to simulate the producer/consumer) for validation purposes. The probes can be useful in enhancing these experiments and therefore work more as aids than replacements.

Hope this feature helps you practice Chaos Engineering in an even better way. Do try it & let us know what you think.

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community in the #litmus channel on Kubernetes Slack.
Contribute to LitmusChaos and share your feedback on GitHub.
If you like LitmusChaos, become one of its many stargazers.
