Running Chaos Engineering experiments against a Kubernetes cluster

and providing cool outputs via PDFs

Pablo Del Giudice
Globant
Oct 3, 2020

For those who are not familiar with the practice, Chaos Engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.

These practices have been proven and adopted by many companies to learn how their systems work, test reliability, simulate DoS attacks, introduce network failures, and so on.

Tools available

To practice Chaos Engineering you can write your own scripts or software to run your experiments. But let me share some interesting tools that are already available:

Pumba, Gremlin, Chaos Monkey, PowerfulSeal, kube-monkey, Litmus, Gloo Shot, Chaos Toolkit

For this post, we are going to use Chaos Toolkit because it is open source and multi-platform.

Chaos Toolkit under the hood

Chaos Engineering Process

When we talk about chaos experiments, it is very important to keep in mind the “observability” of the whole ecosystem. Good observability lets you understand the experiment you are running (cause and effect) and “learn” from it.

Having the right tooling in place to monitor the chaos experiment is a key condition before you start running your experiments.

Setting up the Environment

Getting inside the environment:

Installing Chaos Toolkit + Chaos Toolkit Kubernetes plugin:

Once the chaostoolkit-kubernetes plugin is installed, it provides many functions that you can use to build your experiments.
In my case I’m going to use all_microservices_healthy to validate the hypothesis, and terminate_pods with the argument rand: true to kill a random pod within a namespace, in my case ns: go-demo-8.

Below you can see what my experiment looks like:

NOTE: You can find many other functions at https://docs.chaostoolkit.org/
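Before writing an experiment, it can help to confirm that the tooling is actually reachable from your environment. The snippet below is a minimal sketch, not part of Chaos Toolkit itself: it assumes the pip packages expose the import names chaoslib and chaosk8s and that the CLI is called chaos, and the helper check_environment is a hypothetical name of my own.

```python
import importlib.util
import shutil

# Assumption: chaostoolkit pulls in the "chaoslib" module and
# chaostoolkit-kubernetes installs the "chaosk8s" module.
REQUIRED_MODULES = ("chaoslib", "chaosk8s")


def check_environment(modules=REQUIRED_MODULES, cli="chaos"):
    """Return a dict mapping each requirement to True (found) or False."""
    status = {
        name: importlib.util.find_spec(name) is not None for name in modules
    }
    # The CLI is looked up on PATH, the same way the shell would find it.
    status[cli + " (CLI)"] = shutil.which(cli) is not None
    return status


for requirement, ok in check_environment().items():
    print(f"{requirement}: {'found' if ok else 'MISSING'}")
```

If anything is reported as missing, the virtual environment is probably not activated or the install step did not finish.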

Experiments

Key stages:

1- Version, title, tags (a descriptive part to identify the experiment)
2- Hypothesis (where you define a probe to validate your steady state)
3- Experiment (under method you describe what you want to test, in my case killing a random pod)
4- Rollback (if any; in my case I haven’t defined one because the cluster is self-healing)

This experiment is designed to:

1- Run on a K8s cluster
2- Use the function “all_microservices_healthy” to validate our hypothesis
3- Kill a random pod under the namespace: go-demo-8
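The stages and goals listed above can be sketched as a small Python script that prints an experiment definition in JSON. Treat it as an approximation of the idea rather than the exact file I used: the title, tags, and activity names are illustrative assumptions, while the function names, the rand argument, and the go-demo-8 namespace come from the text above. The module paths chaosk8s.probes and chaosk8s.pod.actions are where I understand the plugin to expose these functions.

```python
import json

NAMESPACE = "go-demo-8"

experiment = {
    # 1- Descriptive part (illustrative values)
    "version": "1.0.0",
    "title": "The application stays healthy when a random pod is killed",
    "description": "Terminate one random pod and check the namespace recovers",
    "tags": ["k8s", "pod"],
    # 2- Hypothesis: a probe that validates the steady state
    "steady-state-hypothesis": {
        "title": "All microservices are healthy",
        "probes": [
            {
                "type": "probe",
                "name": "all-apps-are-healthy",
                "tolerance": True,
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.probes",
                    "func": "all_microservices_healthy",
                    "arguments": {"ns": NAMESPACE},
                },
            }
        ],
    },
    # 3- Experiment: kill a random pod in the namespace
    "method": [
        {
            "type": "action",
            "name": "terminate-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {"rand": True, "ns": NAMESPACE},
            },
        }
    ],
    # 4- Rollback: none, the cluster is expected to self-heal
    "rollbacks": [],
}

print(json.dumps(experiment, indent=2))
```

Redirecting the output to a file gives you a JSON experiment you can adapt to your own namespace and hypothesis.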

How to run your experiment?
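A run boils down to calling the chaos CLI with the experiment file and, optionally, a path for the journal. The sketch below wraps that call in Python; it assumes the chaos command is on your PATH and that the experiment is saved as experiment.json (a hypothetical file name). The helpers build_run_command and run_experiment are names I made up for illustration.

```python
import subprocess


def build_run_command(experiment_path, journal_path="test.json"):
    """Build the argument list for `chaos run` with a custom journal path."""
    return ["chaos", "run", "--journal-path", journal_path, experiment_path]


def run_experiment(experiment_path, journal_path="test.json"):
    """Run the experiment and return the CLI exit code.

    A non-zero exit code is assumed to mean the experiment failed or
    the steady state deviated.
    """
    command = build_run_command(experiment_path, journal_path)
    completed = subprocess.run(command, check=False)
    return completed.returncode
```

Calling run_experiment("experiment.json") from a CI job, for example, would let the pipeline react to the exit code while the journal is kept in test.json for later analysis.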

Where should we run experiments?

Application layer
Your code has features, behaviors, and flows. Try them.

Cache layer
Modern applications rely more and more on caches. What if a cache isn’t available? What will happen if a transaction / operation is affected due to a cache layer issue?

Database layer
Shut down the database, relax, and see what happens.

Cloud layer
It is no news that in 2020 many digital platforms rely on cloud providers, so you may be affected by many issues: unstable EC2 instances, AZ degradation, network delays, etc. It’s interesting to see how the whole platform behaves under our experiments.

After we run our experiments, what should we do?

Basically, some answers are “generate reports” or “automate corrective actions”, but how do we do that?

Well, Chaos Toolkit has the answer: you have to generate a “journal”.

So test.json should be the output of the experiment, and it should look like:

As you can see, the JSON output is human-readable, but it is not a friendly one.

If you want a machine to take an action, that’s OK. But what if you need to produce a nice report for your manager, so that another team can start fixing the issue you found by running your chaos experiments?
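For the "machine takes an action" case, a few lines of Python are enough to pull the essentials out of a journal. This is a rough sketch under assumptions: it relies on the journal fields status, steady_states (with before and after) and run as I understand the journal format, and the sample dictionary is a trimmed, hypothetical journal, not real output from my experiment.

```python
import json


def load_journal(path="test.json"):
    """Load a journal file written by a chaos run."""
    with open(path) as handle:
        return json.load(handle)


def summarize_journal(journal):
    """Return a short list of human-readable lines describing the run."""
    lines = [f"status: {journal.get('status', 'unknown')}"]
    steady = journal.get("steady_states") or {}
    for phase in ("before", "after"):
        # A phase may be missing or null if the run stopped early.
        state = steady.get(phase) or {}
        lines.append(f"steady state {phase}: {state.get('steady_state_met')}")
    for step in journal.get("run", []):
        name = step.get("activity", {}).get("name", "unnamed")
        lines.append(f"activity {name}: {step.get('status')}")
    return lines


# Hypothetical, heavily trimmed journal used only to exercise the function.
sample_journal = {
    "status": "completed",
    "steady_states": {
        "before": {"steady_state_met": True},
        "after": {"steady_state_met": True},
    },
    "run": [{"activity": {"name": "terminate-pod"}, "status": "succeeded"}],
}

for line in summarize_journal(sample_journal):
    print(line)
```

With a real run you would call summarize_journal(load_journal("test.json")) and, for instance, open a ticket whenever the steady state after the experiment is not met.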

Generating Chaos Engineering reports on PDF

Using the Chaos Toolkit reporting tool you can create pretty nice PDF reports. The tool ships as a Docker image, so if you are familiar with Docker you can use:

After the Docker container ingests your journal (test.json), it will generate report.pdf with the output of your experiment, as follows:

Here you can see the actual environment running the experiment, the experiment being monitored through kubectl, and the Chaos Toolkit CLI output.

Remember that the final idea of Chaos Engineering is to run experiments in production to improve reliability, to learn about our applications in order to generate playbooks and runbooks, and to develop the IT fitness needed, for example, to handle Cyber Mondays and Black Fridays or to face unknown disruptive issues on our platforms.

If you are looking for more chaos examples, such as testing Istio or adding network latency to the cluster, leave me a message and I will send them to you separately.

Hope this helps you start improving your apps.
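If you want to script the report step, the Docker invocation can be assembled in Python. Take this as a sketch under assumptions rather than the exact command: the image name chaostoolkit/reporting, the /tmp/result mount point, and the "-- report --export-format=pdf" arguments reflect my understanding of the reporting tool's documentation and should be checked against the current docs. The helper build_report_command is a hypothetical name.

```python
import os
import subprocess


def build_report_command(workdir, journal="test.json", report="report.pdf", uid=None):
    """Build a `docker run` command that turns a journal into a PDF report."""
    if uid is None:
        # Run as the current user so the PDF is not owned by root (Unix only).
        uid = os.getuid()
    return [
        "docker", "run", "--rm",
        "--user", str(uid),
        # Assumption: the image reads and writes files under /tmp/result.
        "-v", f"{workdir}:/tmp/result",
        "chaostoolkit/reporting",
        "--", "report", "--export-format=pdf", journal, report,
    ]


def generate_report(workdir, journal="test.json", report="report.pdf"):
    """Run the container and return its exit code."""
    command = build_report_command(workdir, journal, report)
    return subprocess.run(command, check=False).returncode
```

Calling generate_report(os.getcwd()) from the directory that holds test.json should leave report.pdf next to it, provided Docker is available and the assumptions above hold.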

Pablo!
