Chaos Engineering with Humans in the Loop

Adding human interaction to experiments in the Chaos Toolkit

Published in Chaos Toolkit · 6 min read · May 30, 2019

Following on from the article that introduced the Chaos Toolkit Community Playground, this article walks through more sample code from the playground. It shows how you can create a new Chaos Toolkit control that enables various styles of user interaction, where needed, while an automated chaos experiment is being executed.

NOTE: This article was inspired by a conversation with Simon Lindroos on the Chaos Toolkit Community Slack. It also relates to an older issue raised against the Chaos Toolkit about the interaction needed when “dangerous” activities may be executed as part of an experiment.

The Need for User Interaction in Automated Chaos Experiments

By default, Chaos Toolkit experiments are assumed to be run from beginning to end without user interaction. This is still the most desirable strategy, as it makes it possible to schedule experiments to be executed automatically without having to notify the team to make people available. Two cases where this is useful are when your chaos experiments are being used as validating tests as part of a CI/CD pipeline, or as part of scheduled continuous chaos.

However, there are a number of cases where part or all of an experiment should involve user interaction. One example is a Game Day, where an experiment automates the setup and teardown of the turbulent conditions being explored and the person running the Game Day can manually progress each automated step. As Simon Lindroos pointed out in the original discussion on the Community Slack, another example is an experiment that mixes automated and manual steps, such as:

Check steady state hypothesis: Look at some metrics (Automated)
Alter some configuration in Legacy HR system (Manual step, waits for user to make change and confirm before proceeding)
Check metrics (Automated)
Check steady state hypothesis again (Automated)
Rollback (Manual step, restore old configuration)

To meet as many cases as possible, here you’ll see how the following user interaction styles can be implemented using the Chaos Toolkit’s powerful extension concept, the “control”:

Simple “Click to continue…” style interaction.
“Yes/No” to running an activity style interaction.
“This is a Dangerous Activity, skip or proceed”, an enhancement to the previous style of interaction that shows one strategy for meeting the need specified in the issue mentioned above.

All of the code for this article is available in the Chaos Toolkit Community Playground project.

Introducing the three new Chaos Toolkit Extensions

It’s a good practice to create a separate Python virtual environment when working on a new Chaos Toolkit extension. Execute the following to create a new virtual environment called `chaosinteract`:

Now you can activate your new virtual environment by executing the following:

You know things are working if you see `(chaosinteract)` as a prefix to your command prompt.
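
If you would rather check from Python than rely on the prompt prefix, the following is a minimal sketch of one way to do it. It assumes the environment was created with the standard `venv` module (or a recent virtualenv); the function name `in_virtualenv` is only illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("environment location:", sys.prefix)
```

When the `chaosinteract` environment is active, the reported location should be the directory in which you created it.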

NOTE: You might also consider installing the virtualenvwrapper, which comes with the convenient `workon` script for switching between Python virtual environments.

Now install the Chaos Toolkit according to the instructions available in the project documentation.

With your Python virtual environment set up and your `chaos` command working, you’re now ready to explore and install each of the three different user interaction extensions to the Chaos Toolkit.

The Simple User Interaction Control

The first new Chaos Toolkit Control shows how to provide a simple “Press Enter to Continue…” interaction style. The code for this extension is shown below:

In the above code you are simply providing an implementation of the `before_activity_control` Chaos Toolkit Control callback function. The function will be called before the execution of any experimental activity that the control has been applied to.
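
As a rough illustration of such a callback, here is a minimal sketch rather than the playground's actual code. It assumes the toolkit passes the activity about to run as a keyword argument named `context`, and it uses the standard library's `input`, so it waits for Enter rather than any key. The prompt wording is made up for this example.

```python
from typing import Any, Dict


def before_activity_control(context: Dict[str, Any], **kwargs: Any) -> None:
    """Pause before an activity runs until the user presses Enter.

    Assumption: `context` is the activity (a dictionary) about to be
    executed. Any other keyword arguments are accepted and ignored.
    """
    name = context.get("name", "<unnamed activity>")
    # input() blocks until the user presses Enter, which pauses the experiment.
    input(f"About to run '{name}'. Press Enter to continue...")
```

Because the function only blocks and then returns, the activity runs as normal once the user responds.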

Installing your new Simple user interaction Control

To test your new simple user interaction control you first need to install it in your Python virtual environment so that it can be available to your installation of the Chaos Toolkit. You can do this by running the following commands in the `sources/simple` project directory:

You can now see that your new extension is installed as a Python module by executing the `pip freeze` command:

NOTE: Your exact path and Git hash will be different, but the important thing to note is that you are using your new local chaostoolkit_chaossimpleinteract module.

Adding the “Simple” user interaction Control to an experiment

Once a Chaos Toolkit Control has been installed it can be explicitly used by your experiments. The simple-interactive-experiment.json experiment shown below is already set to use your new simple user interaction control:

This experiment contains three activities (all actions): one in the steady-state-hypothesis, one in the method, and one in the rollbacks sections.

This means that you should be prompted four times when the experiment is executed: once when the steady-state hypothesis is checked at the beginning, once when the single step in the method is executed, once more when the steady-state hypothesis is checked again after the method has run, and finally when the single step in the rollbacks is executed.
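
The prompt count can be worked out mechanically. The sketch below builds an abridged stand-in for the experiment as a Python dictionary and counts the expected prompts. The activity names and the module path `chaosinteract.control` are hypothetical, and each activity's type, provider and tolerance are left out for brevity, so this is not a runnable experiment file.

```python
import json
from typing import Any, Dict, List

# Hypothetical module path; use the module your installed control exposes.
PROMPT_CONTROL: Dict[str, Any] = {
    "name": "prompt",
    "provider": {"type": "python", "module": "chaosinteract.control"},
}


def activity(name: str) -> Dict[str, Any]:
    """Build an abridged activity that declares the control on itself."""
    return {"name": name, "controls": [PROMPT_CONTROL]}


experiment: Dict[str, Any] = {
    "title": "Simple interactive experiment (abridged sketch)",
    "steady-state-hypothesis": {
        "title": "System is steady",
        "probes": [activity("check-steady-state")],
    },
    "method": [activity("inject-turbulence")],
    "rollbacks": [activity("undo-turbulence")],
}


def controlled(activities: List[Dict[str, Any]]) -> int:
    """Count the activities that declare at least one control."""
    return sum(1 for a in activities if a.get("controls"))


def expected_prompts(exp: Dict[str, Any]) -> int:
    """Count prompts for a full run; the hypothesis is checked twice."""
    hypothesis = exp.get("steady-state-hypothesis", {}).get("probes", [])
    return (
        2 * controlled(hypothesis)
        + controlled(exp.get("method", []))
        + controlled(exp.get("rollbacks", []))
    )


if __name__ == "__main__":
    print(json.dumps(experiment["method"], indent=2))
    print("expected prompts:", expected_prompts(experiment))
```

Removing the `controls` entry from any one activity lowers the count accordingly, which is the lever the article describes for choosing where prompts appear.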

Seeing the “Simple” user interaction Control in action

Making sure that your chaosinteract Python virtual environment is still activated (you should see (chaosinteract) before your command prompt), you can now execute the chaos run command and see how your new simple user interaction control is applied:

Press any key each time you’re prompted to complete your experiment and … perfect! You are being prompted to continue for each of the activities in your experiment that the simple user interaction control has been applied to. You can now remove the control from any activities where you do not want to be prompted.

NOTE: If you wanted the same control applied to all activities, as was the case in this experiment, then you could simply add a controls block to the top level of your experiment and that control would be applied to all activities. The simple-interactive-experiment.json experiment is showing how you can apply the control specifically to any activity at the activity level.
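
To make the difference between the two placements concrete, the following sketch moves a control from the individual activities to the top level of an experiment held as a Python dictionary. The helper name `apply_control_globally` is hypothetical and not part of the Chaos Toolkit; it only rearranges the `controls` blocks.

```python
from copy import deepcopy
from typing import Any, Dict


def apply_control_globally(
    experiment: Dict[str, Any], control: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of the experiment with `control` declared once at the top level.

    Per-activity declarations of the same control (matched by name) are
    removed so that the control is not declared twice.
    """
    result = deepcopy(experiment)
    hypothesis = result.get("steady-state-hypothesis", {}).get("probes", [])
    for act in hypothesis + result.get("method", []) + result.get("rollbacks", []):
        remaining = [
            c for c in act.get("controls", []) if c.get("name") != control["name"]
        ]
        if remaining:
            act["controls"] = remaining
        else:
            act.pop("controls", None)
    # Declare the control once at the top level if it is not already there.
    result.setdefault("controls", [])
    if all(c.get("name") != control["name"] for c in result["controls"]):
        result["controls"].append(deepcopy(control))
    return result
```

The original dictionary is left untouched, so you can keep both variants of the experiment side by side.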

Extending with a “Yes or No” style user interaction

Let’s go one step further. This time you’ll explore and apply another Chaos Toolkit control, one that asks the user to confirm that they would like to proceed. If they choose not to proceed, the experiment is aborted and only the rollbacks are run after that confirmation.

The control’s code is provided in the sources/yesorno Python module. Once again, the code is contained in a control.py file:

This time your control is using the click library’s confirm feature to prompt the user as to whether they would like to continue and execute the activity, or abort the whole experiment. The experiment is aborted by raising the Chaos Toolkit’s InterruptExecution exception.
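
The shape of such a control can be sketched as follows. This is not the playground's code: to keep the example runnable without extra packages it swaps click's confirm for a small `input`-based helper (`ask_yes_no`, a hypothetical name), and it falls back to a local placeholder exception when chaoslib, which provides `InterruptExecution` in `chaoslib.exceptions`, is not installed.

```python
from typing import Any, Dict

try:
    from chaoslib.exceptions import InterruptExecution
except ImportError:
    # Stand-in so the sketch runs without the Chaos Toolkit installed.
    class InterruptExecution(Exception):
        """Local placeholder for chaoslib's interruption exception."""


def ask_yes_no(question: str) -> bool:
    """Ask until the user answers yes or no; an empty answer counts as no."""
    while True:
        answer = input(f"{question} [y/N]: ").strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("", "n", "no"):
            return False
        print("Please answer 'y' or 'n'.")


def before_activity_control(context: Dict[str, Any], **kwargs: Any) -> None:
    """Abort the experiment unless the user confirms the upcoming activity."""
    name = context.get("name", "<unnamed activity>")
    if not ask_yes_no(f"Run activity '{name}'?"):
        # Raising the interruption exception asks the toolkit to stop the run.
        raise InterruptExecution(f"Activity '{name}' declined; aborting experiment")
```

Defaulting to "no" on an empty answer is a deliberate choice in this sketch: an accidental Enter then aborts rather than proceeds.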

Install this new Chaos Toolkit extension as before by entering the following command while in the sources/yesorno directory, making sure you are still using the chaosinteract Python virtual environment:

Now you can head over to the samples directory and run the yorn-interactive-experiment.json experiment to use this new control.

Removing a control does not stop the experiment executing

One thing that is good to be aware of is that Chaos Toolkit controls are declared as optional. This means that if a control is not present, then it is simply ignored and the experiment is still considered valid and executable.

This is the opposite of Chaos Toolkit drivers that provide probes and actions. If a probe or an action is declared to use a particular driver module’s functions and it cannot be found when the experiment is executed or validated, then the experiment will not run and will fail validation. Not so with controls.

To see this in action, try removing your chaostoolkit-chaosyorn control’s module with the following command:

Now try running the experiment that uses the chaostoolkit-chaosyorn control and you will see the following:

If a Chaos Toolkit Control is referenced but not available, a warning is issued and added to the chaostoolkit.log. Controls are considered optional, so this is expected. This gives you the power to create experiments that involve human interaction when a human is around, and to turn that off completely when the experiment is executed in an environment where a human is unlikely to be around (e.g. CI/CD).
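
If you want to know ahead of a run whether a control will actually take effect, you can check whether its module is importable. The sketch below is one way to do that with the standard library. It is not how the Chaos Toolkit performs the check internally, and the function names and logger name are hypothetical.

```python
import importlib.util
import logging
from typing import Any, Dict

logger = logging.getLogger("control-check")


def control_is_available(control: Dict[str, Any]) -> bool:
    """Return True if the Python module behind a control can be found."""
    module = control.get("provider", {}).get("module", "")
    if not module:
        return False
    try:
        return importlib.util.find_spec(module) is not None
    except (ImportError, ValueError):
        # ImportError covers a missing parent package of a dotted path.
        return False


def warn_if_missing(control: Dict[str, Any]) -> bool:
    """Log a warning, rather than failing, when a control is unavailable."""
    available = control_is_available(control)
    if not available:
        logger.warning(
            "Control '%s' is not installed; prompts will be skipped",
            control.get("name", "<unnamed>"),
        )
    return available
```

In a CI/CD job you would expect this to report the interaction control as missing, which matches the intent of leaving it uninstalled there.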

Summary

In this article you saw how to implement two Control extensions to the Chaos Toolkit. The Control extension is a powerful way of seizing control (literally) of the execution of your experiments. The examples here showed how you can harness that power to incorporate human interaction styles into your automated chaos experiments.