Applying Chaos Experiments using Gremlin with the Chaos Toolkit

Support for Gremlin now added to the open source Chaos Toolkit

The Chaos Toolkit now has Gremlin Integration

In a previous post I explained the mission of the open source Chaos Toolkit as being to “create a toolkit that makes it easy to try out, adopt and learn about the benefits of Chaos Engineering.”

In the first versions of the toolkit we aimed to support creating simple chaos engineering experiments against systems hosted in Kubernetes that you could then use to probe the effects of that chaos on your own systems, learning how to improve a system based on its weaknesses.

With those simple underpinnings in place, we also knew that we would quickly want to integrate with a host of systems and technologies to extend the power of the experiments that could be run.

Today I’m really excited to announce that the first new system we’ve integrated the Chaos Toolkit with is Gremlin!

Integrating Gremlin

Gremlin is a full Failure as a Service system, ready to apply Chaos to large-scale systems through a multitude of great features. Thanks in no small part to the Gremlin team’s openness and fantastic efforts to help us get to grips with all the power of the Gremlin toolset, we could quickly and easily provide an open integration so that you could run your Chaos Toolkit experiments through the power of Gremlin’s Attacks.

The Gremlin system is enabled through a Client, Agents, the core Gremlin Services and an API.

The Key Parts of a Gremlin System

For our purposes we could deploy the Gremlin Agent as a daemon set in Kubernetes on GKE to prepare the ground for Gremlin-orchestrated chaos. The daemon set of Gremlin Agents takes orders supplied via the Gremlin Services, which are in turn controlled by the Gremlin web application, the Gremlin Client, or the API. The Chaos Toolkit gains Gremlin support by integrating with the Gremlin API in order to run experiments using Gremlin’s Attacks.

Executing a Chaos Toolkit Experiment using Gremlin’s Attacks

Support for Gremlin is provided by the new chaostoolkit-gremlin project. This project’s README.md walks you through how to construct an entire experiment using the Gremlin integration.

The key part in the experiment is the new Gremlin actions where you can now specify Gremlin Attacks to use as part of your chaos experiments:

The Chaos Toolkit experiment’s action includes a command and a target, and we also accept labels and tags that match Docker container attributes.

We have kept to a clean, pass-through integration to the Gremlin APIs for these parameters and the concepts are best explained as:

In the sample’s case you can see that a CPU resource exhaustion attack is being applied using Gremlin, and that it will randomly target any of the hosts that include a Gremlin Agent:
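As a rough illustration of the shape such an action might take, the Python sketch below builds one as a dictionary and prints it as JSON, a format Chaos Toolkit experiments can be written in. The module and function names (chaosgremlin.actions, attack), the action name and the "gremlin" secrets key are assumptions recalled from the chaostoolkit-gremlin README rather than taken from this post, so check them against the project before relying on them.

```python
import json


def gremlin_cpu_attack_action(background=False):
    """Build a Chaos Toolkit action that asks Gremlin for a CPU attack.

    Assumption: "chaosgremlin.actions" / "attack" and the "gremlin" secrets
    key are recalled from the chaostoolkit-gremlin README, not from this post.
    """
    return {
        "type": "action",
        "name": "attack-on-cpu",  # hypothetical name, pick your own
        "background": background,
        "provider": {
            "type": "python",
            "module": "chaosgremlin.actions",
            "func": "attack",
            "secrets": ["gremlin"],
            "arguments": {
                # what Gremlin should do: exhaust CPU
                "command": {"type": "cpu"},
                # where to do it: any host running a Gremlin Agent
                "target": {"type": "Random"},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(gremlin_cpu_attack_action(), indent=4))
```

The command and target dictionaries are passed through untouched, which mirrors the pass-through style of integration described above.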

Secrets Support

In addition to the Gremlin integration we have also added support for secrets, which is especially useful when integrating the Chaos Toolkit’s experiments with external services such as Gremlin.

There is now a secrets block you can specify for your Chaos Toolkit experiment:

So that you don’t have to embed secrets directly into your experiments (universally considered a dangerous anti-pattern!), you can see in the snippet above that the contents of the secrets can be sourced from environment variables wherever your experiment is executed by using the env prefix.
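To make the env prefix idea concrete, here is a minimal Python sketch of a secrets block and of one way such values could be resolved at run time. The secret names (GREMLIN_EMAIL, GREMLIN_PWD, GREMLIN_ORG_NAME) and the resolve_secrets helper are hypothetical and only illustrate the mechanism; this is not the Chaos Toolkit's own implementation.

```python
import os

# Hypothetical secrets block: each value names an environment variable
# instead of embedding the secret itself in the experiment.
experiment_secrets = {
    "gremlin": {
        "email": "env.GREMLIN_EMAIL",
        "password": "env.GREMLIN_PWD",
        "org_name": "env.GREMLIN_ORG_NAME",
    }
}


def resolve_secrets(secrets, environ=os.environ):
    """Replace every "env.NAME" value with the environment variable NAME."""
    resolved = {}
    for target, entries in secrets.items():
        resolved[target] = {}
        for key, value in entries.items():
            if isinstance(value, str) and value.startswith("env."):
                name = value[len("env."):]
                if name not in environ:
                    raise KeyError(f"environment variable {name} is not set")
                resolved[target][key] = environ[name]
            else:
                # values without the prefix are kept as they are
                resolved[target][key] = value
    return resolved
```

Failing loudly on a missing variable is a design choice in this sketch: an experiment that silently runs with empty credentials is harder to debug than one that stops early.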

Background Asynchronous Actions and Probes Support

Many actions in a chaos experiment will be long-running and so it would be useful to execute many actions and probes in parallel to one another. This is now possible using the background property on Chaos Toolkit’s actions and probes:

In the sample above you can see that the action has "background": true specified so the Gremlin Attack action will be executed in parallel with any other probes or actions in the experiment. As well as specifying that an action or probe is executed in the background you can also specify a timeout.
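The effect of a background action with a timeout can be sketched in plain Python. The functions below are stand-ins (long_running_attack, probe and run_with_background_action are hypothetical names) and use a thread pool purely to illustrate the idea; the Chaos Toolkit's actual scheduling may well differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def long_running_attack(duration):
    """Stand-in for a long-running action such as a Gremlin Attack."""
    time.sleep(duration)
    return "attack finished"


def probe():
    """Stand-in for a probe that samples the system while the attack runs."""
    return "system responded"


def run_with_background_action(duration=0.2, timeout=5):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # like "background": true, start the action and move on immediately
        attack = pool.submit(long_running_attack, duration)
        # the attack is still in flight when the probe is taken
        overlapped = not attack.done()
        observation = probe()
        # wait for the background action, but no longer than the timeout;
        # result() raises concurrent.futures.TimeoutError if it is exceeded
        outcome = attack.result(timeout=timeout)
    return overlapped, observation, outcome
```

Calling run_with_background_action() returns once the attack has finished, with the probe's observation having been taken while the attack was still running.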

Learning from Chaos through Gremlin

With the new Gremlin support you can build Chaos Toolkit experiments that enable you to learn about a vast array of system weaknesses. Next on our list is to work on turning that Gremlin-instigated chaos into a collection of new probes that we can source and learn from across the system.

For example, we’re going to be looking into which probes we need in order to validate that the correct alerts are being triggered by chaos experiments. Integrations with systems such as Prometheus will likely feature high on this list!

We’re also always very interested to hear your own requirements around Chaos Engineering, and are looking for your input as we collectively extend the toolkit as a community to make it even more valuable. If you want to get involved, the best place to start is our community Slack team or, if you already know what you’d like the toolkit to do, perhaps even a PR to one of the open source projects!