Providing chaos hooks to applications through Litmus Operator
The Litmus operator allows developers and DevOps architects to introduce chaos into applications and Kubernetes infrastructure in a declarative intent format; in other words, the Kubernetes way.
Litmus is gaining traction in the community as a preferred means of injecting chaos into Kubernetes-based CI/CD pipelines (see reference use-cases for NuoDB, Prometheus & Cassandra), and as one of the contributors to this project, I find that heartening! One of the key benefits Litmus brings to the table, in simple terms, is that a chaos test/experiment can be run as a Kubernetes job, with a custom resource as the test result. As you can discern, this is a model that promises easy integration with CI systems to implement chaos-themed e2e pipelines.
Why do we need a chaos operator and a workflow?
While what we have is Kubernetes native, the community felt that the toolset should be improved further to encourage its use in the places where chaos is thriving today: deployment environments (be they Dev/Staging/Pre-Prod/Production). This doesn't mean Litmus in its current form cannot be used against such environments (visit the openebs workload dashboards to run some live chaos on active prod-grade apps!), but there are some compelling differences, or rather, needs that chaos frameworks must meet in order to operate efficiently here. Some of the core requirements identified were:
- Ability to schedule a chaos experiment (or a batch run of several experiments).
- Ability to monitor & visualize chaos results mapped to an application over a period of time, thereby ascertaining its resiliency.
- Ability to run continuous-chaos as a background service based on filters such as annotations. This also implies the need for a resilient chaos execution engine that can tolerate failures & guarantee test-run resiliency.
- Standardized specs for chaos experiments with an option to download categorized experiment bundles.
In short, chaos needs to be orchestrated!
The lifecycle of a chaos experiment
We define three steps in the workflow of chaos orchestration —
- Definition of a chaos experiment — the nature of chaos itself.
- The scheduling of such chaos — How often the chaos needs to be run.
- Predefined chaos experiments on a per-application basis as reference templates, which we call chaos charts.
We address the above requirements by making use of Kubernetes Custom Resources, Kubernetes Operators & Helm Charts, respectively.
ChaosEngine: Specifying the Chaos Intent
The ChaosEngine is the core schema that defines the chaos workflow for a given application & is the single source of truth about actions requested & performed on it, in terms of chaos injections. Currently, it defines the following:
- Application Data (namespace, labels, kind)
- List of Chaos Experiments to be executed
- Attributes of the experiments, such as rank/priority
- Execution Schedule for the batch run of the experiments
It is expected to be created and applied by the Developer/DevOps/SRE persona, with the desired effect of triggering the chaos workflow specified.
Here is a sample ChaosEngine Spec for reference:
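A minimal sketch of what such a spec could look like, based on the attributes listed above (field names here are illustrative and may differ from the actual CRD schema; consult the Litmus documentation for the authoritative version):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  # Application data: identifies the target of the chaos (illustrative keys)
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  # List of chaos experiments to execute, with per-experiment attributes
  experiments:
    - name: pod-delete
      spec:
        rank: 1
    - name: container-kill
      spec:
        rank: 2
  # Execution schedule for the batch run of the experiments (placeholder value)
  schedule:
    interval: "half-hourly"
```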
Chaos Operator: Automating the Chaos Workflow
Operators have emerged as the de-facto standard for managing the lifecycle of non-trivial & non-standard resources (read: applications) in the Kubernetes world. In essence, these are nothing but custom-controllers with direct access to the Kubernetes API, which execute reconcile functions to ensure that the desired state of a given custom resource is always met.
The Litmus Chaos Operator reconciles the state of the ChaosEngine, its primary resource, & performs specific actions upon CRUD operations of the ChaosEngine CR. It is built using the popular Operator-SDK framework, which provides bootstrap support for new operator projects, allowing teams to focus on business/operational logic. The operator, which itself runs as a Kubernetes deployment, also defines secondary resources (the engine runner pod and engine monitor service), which it creates & manages in order to implement the reconcile functions.
The Chaos Operator supports selective injection of chaos on applications, via an annotation litmuschaos.io/chaos: “true”, with the reconcile skipping applications that have chaos disabled.
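For instance, enabling chaos on an application is just a matter of adding the annotation to the deployment's metadata (the nginx deployment below is a hypothetical example):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  annotations:
    litmuschaos.io/chaos: "true"   # the operator reconciles chaos only for annotated apps
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:stable
```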
Engine Runner Pod: This pod is launched by the Chaos Operator with desired app information burned in (typically, as ENV) upon creating an instance of the ChaosEngine CR. It consists of the main runner container that either executes experiments or spawns experiment executors (litmusbooks) as well as an engine monitor sidecar, which is a custom Prometheus exporter to collect chaos metrics. The state & results of these experiments are maintained in ChaosEngine CR & ChaosResult CRs.
Engine Monitor Service: The monitor service exposes the /metrics endpoint to allow scrape functions by Prometheus or other similar supported monitoring platforms.
As described, the chaos exporter is tied to a ChaosEngine custom resource, which, in turn, is associated with a given application deployment. Two types of metrics are exposed:
- Fixed: TotalExperimentCount, TotalPassedTests & TotalFailedTests, which are derived upfront from the ChaosEngine specification & the overall experiment results.
- Dynamic: Represents individual experiment run status. The list of experiments may vary across ChaosEngines (or newer tests may be patched into a given ChaosEngine CR). The exporter reports experiment status as per the list in the ChaosEngine. Currently, the status of each experiment is represented via numerical values (Not-Executed: 0, Running: 1, Fail: 2, Pass: 3).
The metrics carry the application_uuid as a label in order to aid dashboard solutions like Grafana to filter metrics against deployed applications.
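As an illustration, a scrape of the /metrics endpoint might return something along these lines (the metric names, label values & numbers below are indicative, not the exporter's exact output):

```
# Fixed metrics derived from the ChaosEngine spec & overall results
total_experiment_count{application_uuid="abc123"} 2
total_passed_tests{application_uuid="abc123"} 1
total_failed_tests{application_uuid="abc123"} 0
# Dynamic per-experiment status (0: Not-Executed, 1: Running, 2: Fail, 3: Pass)
pod_delete_status{application_uuid="abc123"} 3
container_kill_status{application_uuid="abc123"} 1
```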
Chaos Charts: Packaging the Chaos Experiments
While the ChaosEngine defines the overall chaos intent & workflow for an application, there is still a need to specify lower-level chaos experiment parameters, and the parameter list changes on a case-by-case basis. In the non-operator Litmus world, these are specified inside the litmusbook job as ENV variables. However, with the current requirements, they need to be placed inside a dedicated spec, with similar specs packaged together to form a downloadable chaos experiment bundle.
This is achieved by defining another custom resource called “ChaosExperiment”, with a set of these Custom Resources (CRs) packaged and installed as “Chaos Charts” using the Helm Kubernetes package manager. The chaos charts group experiments belonging to a given category, such as general Kubernetes chaos, provider-specific (for ex: OpenEBS) chaos, as well as application-specific (for ex: NuoDB) chaos.
These ChaosExperiments are listed/referenced in the ChaosEngine with their respective execution priority levels & are read by the executors to inject desired chaos.
Here is a sample ChaosExperiment Spec for reference:
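Again, a rough sketch based on the description that follows (the field names, image & path shown are placeholders, not the exact CRD schema):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    # Artifact that actually executes the chaos (typically a litmusbook); placeholder path
    litmusbook: experiments/generic/pod_delete.yml
    # Experiment tunables, formerly passed as ENV on the litmusbook job
    fields:
      - name: TOTAL_CHAOS_DURATION
        value: "60"
      - name: CHAOS_INTERVAL
        value: "10"
```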
The spec.definition.fields and their corresponding values are used to construct the eventual execution artifact that runs the chaos experiment (typically, the litmusbook).
Next steps in the Chaos Operator development
The Litmus Chaos Operator is alpha today & is capable of performing batch runs of standard chaos experiments, such as random pod failures & container crashes, against applications annotated for chaos, with chaos metrics collected for these runs. As I write this, support for scheduled chaos & priority-based execution is being worked on & should be available very soon! The immediate roadmap also includes support for more useful metrics, such as an overall app resiliency metric (derived from the chaos runs), as well as additional provider- & app-specific chaos chart bundles.
As always, we welcome feedback & contributions (via issues, proposals, code, blogs, etc.) & would generally love to hear what you think about this project.