The application of Chaos

Mick Roper
The Startup

--

Chaos engineering — the practice of introducing turbulence into a running system to ensure it can handle that turbulence — is a great thing. Done properly, it allows engineers to release systems to users with a high degree of confidence that those systems will perform well under stress.

A question I get asked quite a lot is ‘how do we do chaos engineering?’ This is a great question! Chaos engineering leans on the scientific method to produce experiments with measurable deltas from a known baseline. It also shares one other principle with the scientific method that I would like to state explicitly:

We aren’t trying to break things just to see them burn! We are trying to prove scientifically whether a hypothesis is correct, and if it isn’t, to understand why not.

Imagine you are a scientist: you strongly suspect that a certain chemical reaction will release noxious gas. You perform the experiment but prepare appropriately: you wear a gas mask and gloves, and ensure that no-one else will walk in and be affected. Chaos engineering is no different! If you suspect that doing a thing will cause pain and suffering, you should prepare for it! If you suspect that ‘dropping this instance will cause a massive outage’ then manage the blast radius; if you can’t manage it, don’t run the experiment, because it can only confirm a negative outcome you already suspect.

The goal is to learn something positive. Try to form a theory that will have a nett-positive outcome. For instance: ‘dropping this instance should cause the failover server to kick in within 30 seconds’. If the experiment succeeds, the theory gives you confidence (a positive effect); if it fails, it shows you a point of weakness within your system. That problem always existed, so the ‘negative’ aspect was already there and hasn’t been made any worse, and your confirmed knowledge of it (a positive effect) means the overall result of the experiment is nett-positive.
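
To make that concrete, here is a minimal sketch of how the failover hypothesis could be turned into a measurable check. The health endpoint, the way the instance is dropped and the exact 30-second budget are assumptions for illustration; substitute whatever your own tooling and hypothesis dictate.

```python
# Minimal sketch (not production tooling): check the hypothesis that the
# failover server kicks in within 30 seconds of the primary being dropped.
# SERVICE_URL and drop_instance are hypothetical placeholders for your system.
import time
import urllib.request

SERVICE_URL = "https://service.example.internal/health"  # hypothetical health endpoint
FAILOVER_BUDGET_SECONDS = 30


def is_healthy(url: str) -> bool:
    """Return True if the service answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        return False


def measure_failover(drop_instance) -> float:
    """Drop the primary instance and return the seconds until the service answers again."""
    assert is_healthy(SERVICE_URL), "system is not in its normal state; do not start"
    drop_instance()  # tool-specific: terminate the VM, delete the pod, etc.
    started = time.monotonic()
    while not is_healthy(SERVICE_URL):
        if time.monotonic() - started > FAILOVER_BUDGET_SECONDS:
            raise RuntimeError("hypothesis not supported: failover exceeded 30 seconds")
        time.sleep(1)
    return time.monotonic() - started
```

Either outcome is useful: a pass gives you a measured failover time you can quote, and a failure points at exactly where the resilience work needs to go.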

Many people have heard about chaos engineering from Netflix. They invented a toolkit known as the Simian Army that would test the resiliency of their applications. However, what few people realise is that they did this to prove that they couldn’t break production! The Simian Army came in after significant work to harden the services had already been done. They had prepared effectively, and used the Simian Army as a vehicle to run experiments to test their theories about what would happen during a disastrous scenario.

So how do the scientific method and software engineering come together? Below is a process I find useful. Please feel free to cherry-pick from this as you desire, but I would suggest you be rigorous about the process: changing your method of experimentation between experiments can damage your ability to empirically compare the results of those experiments. Remember that we are trying to develop trust in our systems and, by extension, in ourselves as engineers.

  1. Find a system that you are able to change, that can be measured with an acceptable degree of accuracy, and that you wish to experiment on. Remember: you control the experiment. Don’t be afraid to experiment on ‘big ticket’ systems — simply manage the experiment to control the blast radius of a failure.
  2. Define some arbitrary parameters that you will fix during the experiment. Often, the easiest parameter to fix is ‘time’ — measure the outcome of your experiment over a given window of time. You might also try to fix more ephemeral parameters: the type of user that is accessing your system during the experiment, the load on the service, the amount of resources dedicated to the system, etc.
  3. Measure the ‘normal state’ of your system. This baseline is critical if you want to understand the impact of your experiment. Measure as many of the system’s metrics as are available, adhering to the parameters you defined in step 2. Don’t massage these numbers! If the system has an uptime of 50% during this time window, then that is your normal! If, for whatever reason, you feel that the baseline is ‘wrong’, you should report that and try to find out why it’s wrong before continuing. Perhaps the thing you are using to measure the system is unreliable, or some external factor is introducing noise.
  4. Using the metrics available, formulate an experiment with a measurable outcome. For instance: “I hypothesize that I can add an extra instance of the web component, and this will reduce the average request/response latency of the system”. This is a fairly straightforward experiment, but it is also a perfect entry point for introducing the scientific method into your engineering. The initial and end states are both measurable, the delta is easy to visualise and comprehend, and the number of variables (in this case the change in the number of web components) is kept manageable. The chaotic extension to this experiment is ‘if I remove this web component from the landscape, the user should see no measurable degradation in their experience’. A minimal code sketch of this measure-and-compare loop appears after this list.
  5. Evaluate the blast radius of the experiment. In the example above we need to calculate the impact that running another web component will have. It might be very little, or it might stress the infrastructure, risking several other components. Look at what could be affected in a worst case scenario — if the impact of failure is too high then look into alternative experiments (maybe you could test the resilience of the components that this experiment puts in jeopardy?) or refactor your experiment to bring the risk to acceptable levels. If you do decide to alter your experiment, be aware that you may have to reestablish the measure of ‘normal’ state using these new parameters. For instance, deciding to run your experiment at 3am rather than the original plan of 3pm might result in a significantly different ‘normal state’ for your target component.
  6. Run your experiment. Since we’re trying to avoid adverse effects during the test, you need to actively monitor the system. If the system begins behaving poorly (it drifts away from ‘normal’ in an undesirable way) you need to be willing to pull the plug on your experiment and get the system back to normal as soon as possible. Think of this from the perspective of a science experiment: if your experiment starts creating harmful radiation, don’t wait for it to finish (and the radiation to kill people for miles around) when you can terminate early, still learn something useful, and manage the impact of the undesirable consequences.
  7. Reestablish normal. Depending on your experiment you may need to do some work to reestablish the normal state. Since you’re almost always aiming for a ‘zero or nett-positive’ outcome, this will probably involve rolling back a failed experiment, but it may also be preferable to undo a positive experiment before presenting results, especially if the experiment is thought of as dangerous or contrary to workplace policy. I personally encountered this situation when experimenting with the benefits of gRPC on a microservice stack — while it had no demonstrable negative consequence, it was far enough away from established practice that it was better to remove it from the landscape until I could present my findings. Sometimes it’s better to play politics than run afoul of them.
  8. Learn. Look at the results of your experiment, compare them with the ‘normal’ measure, and weigh the delta against your hypothesis. Using our earlier example, if the average latency was 100ms before the test and 80ms afterwards, the result backs up the hypothesis. If not, you should include theories about why you think a negative result has occurred. Again, try to back this up with data, e.g. ‘adding the extra web component doubled the number of concurrent connections to the database, causing a measured increase in transaction time and network load that exceeded the benefit brought by the extra web throughput’.
  9. Present. Regardless of whether the result was positive or negative, you should always publish your results! There is no such thing as ‘bad facts’. So long as the data included in the publication is accurate and verifiable, something can be learned — even if that thing is ‘don’t do that again’! It’s worth noting that ‘present’ does not need to mean ‘produce a slide deck and get an audience together’. An email, a web page, or a screenshot of a Grafana dashboard with a paragraph about it are all valid ways to share the results of the experiment.
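
As promised in step 4, here is a minimal sketch of the baseline / experiment / comparison loop for the latency hypothesis. The get_average_latency_ms and add_web_instance callables are hypothetical stand-ins for your monitoring stack and your deployment tooling, and the observation window is simply the ‘time’ parameter fixed in step 2.

```python
# Minimal sketch of steps 3, 6 and 8 for the latency hypothesis.
# get_average_latency_ms and add_web_instance are hypothetical stand-ins for
# your monitoring stack (e.g. a dashboard query) and your deployment tooling.
import statistics
import time

OBSERVATION_WINDOW_SECONDS = 15 * 60  # the fixed 'time' parameter from step 2
SAMPLE_INTERVAL_SECONDS = 30


def observe(get_average_latency_ms) -> float:
    """Sample the system over the agreed window and return the mean latency in ms."""
    samples = []
    deadline = time.monotonic() + OBSERVATION_WINDOW_SECONDS
    while time.monotonic() < deadline:
        samples.append(get_average_latency_ms())
        time.sleep(SAMPLE_INTERVAL_SECONDS)
    return statistics.mean(samples)


def run_experiment(get_average_latency_ms, add_web_instance) -> None:
    baseline = observe(get_average_latency_ms)  # step 3: establish 'normal'
    add_web_instance()                          # the single change under test
    result = observe(get_average_latency_ms)    # step 6: measure under the same parameters
    if result < baseline:                       # step 8: compare the delta with the hypothesis
        print(f"hypothesis supported: latency fell from {baseline:.0f}ms to {result:.0f}ms")
    else:
        print(f"hypothesis not supported: latency went from {baseline:.0f}ms to {result:.0f}ms")
```

The important design choice is that ‘normal’ and the experimental run are measured with exactly the same code and the same parameters, so the only remaining variable is the change you deliberately introduced.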

So why do we do this? The simple fact is that running a system reliably is hard to do. In software engineering it is all too easy to follow theories of best practice, but real life rarely cooperates with theories. By introducing disciplined experiments that integrate real-world conditions you can assert with some confidence how valid your theories are.

Modern systems tend to be built quickly — we often sacrifice things like ‘optimisation’ and ‘disaster planning’ to achieve this speed, hoping we can club these problems to death with processing power and RAM — especially in the cloud. These tradeoffs are fine to make, but poorly optimised code has downsides that you need to understand. Until you shine a light on them, these ‘dark downsides’ impact your system in ways you cannot see, and often don’t know about until the system is under stress, at which point you see the effects and not the cause, and are often forced to perform remediation without the full picture. Chaos engineering encourages us to introduce that stress in a managed way, allowing us to understand these dark parts of our codebase before they start misbehaving.

As a wise person once said, ‘an ounce of prevention is worth a pound of cure’.

The overall outcome of chaos engineering is to build systems able to operate in a turbulent world. User numbers fluctuate, exploits are found, systems crash. This turbulence is always going to happen — pretending it won’t is naive. By embracing the turbulence and engineering your applications in such a way that they can absorb that disruption, it is possible to build systems you have confidence in.

--
