Running Chaos Engineering experiments against a Kubernetes cluster
and producing PDF reports of the results
For those who are not familiar with the practice, Chaos Engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.
These practices have been proven and adopted by many companies in order to learn how their systems work: testing reliability, simulating DoS, introducing network failures, etc.
Tools available
To perform Chaos Engineering you can develop your own scripts or software to run your experiments. But let me share some interesting tools that are already available:
Pumba, Gremlin, Chaos Monkey, PowerfulSeal, kube-monkey, Litmus, Gloo Shot, Chaos Toolkit
For this post, we are going to use Chaos Toolkit because it is open source and multi-platform.
Chaos Engineering Process
When we talk about chaos experiments, it is very important to keep in mind the “observability” of the whole ecosystem. With good observability you can understand the experiment you are running (cause / effect) and actually “learn” from it.
Having the right tooling in place to monitor the chaos experiment is a key condition before you start running your experiments.
Setting up the Environment
$ python3 -m venv ~/.venvs/chaostk
Getting inside the environment:
$ source ~/.venvs/chaostk/bin/activate
Installing Chaos Toolkit + Chaos Toolkit Kubernetes plugin:
$ pip install -U chaostoolkit
$ pip install -U chaostoolkit-kubernetes
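A quick way to confirm that both packages landed in the active virtual environment is to ask Python whether the modules can be found. This is only a minimal sketch: `is_importable` is a hypothetical helper name, and the module names checked are the ones this post uses later in the experiment (`chaosk8s.probes` and `chaosk8s.pod.actions`), plus `chaostoolkit` for the CLI package.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the module in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "chaosk8s") is missing or broken.
        return False


# Module names taken from the experiment used in this post.
for name in ("chaostoolkit", "chaosk8s.probes", "chaosk8s.pod.actions"):
    print(f"{name}: {'ok' if is_importable(name) else 'missing'}")
```

If any line prints "missing", the virtual environment is probably not activated or the install step did not complete.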
Once installed, the chaostoolkit-kubernetes plugin provides many functions that you can use to create your experiments.
In my case I’m going to use all_microservices_healthy to validate the hypothesis, and terminate_pods with the argument rand: true to kill a random pod within a namespace, in my case ns: go-demo-8.
Below you will see what my experiment looks like.
NOTE: Under https://docs.chaostoolkit.org/ you can find many other functions.
Experiments
Key stages:
1- Version, title, tags (a descriptive part to identify the experiment)
2- Hypothesis (where you define a probe to validate your steady state)
3- Experiment (under method you describe what you want to test, in my case killing a random pod)
4- Rollback (if any; in my case I haven’t defined one because the cluster heals itself)
version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, a new instance should be created
tags:
- k8s
- pod
- deployment
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      # label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
  pauses:
    after: 10
This experiment is designed to:
1- Run on a K8s cluster
2- Use the function “all_microservices_healthy” to validate our hypothesis
3- Kill a random pod in the namespace go-demo-8
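If you prefer to generate experiments from code, the same experiment can be written as a plain Python dictionary and dumped to JSON, which Chaos Toolkit also accepts as an experiment format. The sketch below mirrors the YAML field for field; `list_activities` is a hypothetical helper added here only to show how the probe and the action map to the plugin functions.

```python
import json

# Same experiment as the YAML in this post, expressed as a Python dict.
experiment = {
    "version": "1.0.0",
    "title": "What happens if we terminate an instance of the application?",
    "description": "If an instance of the application is terminated, "
                   "a new instance should be created",
    "tags": ["k8s", "pod", "deployment"],
    "steady-state-hypothesis": {
        "title": "The app is healthy",
        "probes": [{
            "name": "all-apps-are-healthy",
            "type": "probe",
            "tolerance": True,
            "provider": {
                "type": "python",
                "func": "all_microservices_healthy",
                "module": "chaosk8s.probes",
                "arguments": {"ns": "go-demo-8"},
            },
        }],
    },
    "method": [{
        "type": "action",
        "name": "terminate-app-pod",
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",
            "func": "terminate_pods",
            "arguments": {"rand": True, "ns": "go-demo-8"},
        },
        "pauses": {"after": 10},
    }],
}


def list_activities(exp: dict) -> list:
    """Return (type, name, 'module.func') for every probe and action."""
    steps = exp.get("steady-state-hypothesis", {}).get("probes", [])
    steps = steps + exp.get("method", [])
    return [
        (s["type"], s["name"], f'{s["provider"]["module"]}.{s["provider"]["func"]}')
        for s in steps
    ]


for kind, name, target in list_activities(experiment):
    print(f"{kind:<7} {name:<22} {target}")

# JSON text that could be saved to a file and passed to `chaos run`.
experiment_json = json.dumps(experiment, indent=2)
```

Before running a generated experiment, it is worth checking it with the toolkit's own `chaos validate` command rather than relying on a hand-written check like this one.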
How to run your experiment?
$ chaos run terminate-pod-rand.yaml
Where should we run experiments?
Application layer
Your code has features, behaviors, and flows. Try them.
Cache layer
Modern applications rely more and more on caches. What if a cache isn’t available? What happens to a transaction or operation when the cache layer has an issue?
Database layer
Shut down the database, relax, and see what happens.
Cloud layer
It is nothing new that, in 2020, many digital platforms rely on cloud providers, so you may be affected by many issues: unstable EC2 instances, AZ degradation, network delays, etc. It is therefore interesting to see how the whole platform behaves under our experiments.
After we run our experiments, what should we do?
Some of the answers are “generate reports” or “automate corrective actions”, but how do you do that?
Chaos Toolkit has the answer: you generate a “journal”.
$ chaos run terminate-pod.yaml --journal-path test.json
test.json is then the output of the experiment.
The journal is JSON, so it is human-readable, but it is not a friendly read.
If a machine is going to take an action based on it, that’s OK. But what if you need to produce a nice report for your manager, so that another team can start fixing the issue you found by running your chaos experiments?
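For the “machine takes an action” case, a few lines of Python are enough to pull the headline facts out of the journal. This is a sketch under assumptions: the field names used here (`experiment`, `status`, `deviated`, `run`, and `activity` inside each run entry) reflect my understanding of the journal layout, so check them against your own test.json. `summarize_journal` and `load_journal` are hypothetical helper names, and every lookup falls back to a placeholder instead of failing.

```python
import json


def load_journal(path: str) -> dict:
    """Read a journal file written by `chaos run --journal-path`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_journal(journal: dict) -> str:
    """Build a short text summary; missing fields become '<unknown>'."""
    title = journal.get("experiment", {}).get("title", "<unknown>")
    lines = [
        f"experiment: {title}",
        f"status:     {journal.get('status', '<unknown>')}",
        f"deviated:   {journal.get('deviated', '<unknown>')}",
    ]
    # One line per activity that was executed (assumed to live under "run").
    for item in journal.get("run", []):
        name = item.get("activity", {}).get("name", "<unnamed>")
        lines.append(f"  - {name}: {item.get('status', '<unknown>')}")
    return "\n".join(lines)
```

Calling `print(summarize_journal(load_journal("test.json")))` after a run gives a one-screen overview that a script could also use to decide whether to raise an alert.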
Generating Chaos Engineering reports on PDF
With the Chaos Toolkit reporting tool you can create pretty nice PDF reports. The tool ships as a Docker image, so if you are familiar with Docker you can use:
$ docker container run \
--user $(id -u) \
--volume $PWD:/tmp/result \
-it \
chaostoolkit/reporting \
-- report \
--export-format=pdf \
test.json \
report.pdf
After the container ingests your journal (test.json), it generates report.pdf with the output of your experiment, as follows:
Here you can see the actual environment running the experiment, with kubectl monitoring the experiment, and the Chaos Toolkit CLI output.
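If the report should be produced by a pipeline rather than typed by hand, the docker command above can be assembled in Python. The sketch below only builds the argument list; `build_report_command` is a hypothetical helper, and it assumes a Unix-like host because it uses `os.getuid()` in place of `$(id -u)`.

```python
import os
import shlex


def build_report_command(journal, report, export_format="pdf",
                         uid=None, workdir=None):
    """Return the docker argv that turns a journal into a report."""
    uid = os.getuid() if uid is None else uid              # same as $(id -u)
    workdir = os.getcwd() if workdir is None else workdir  # same as $PWD
    return [
        "docker", "container", "run",
        "--user", str(uid),
        "--volume", f"{workdir}:/tmp/result",
        # Kept to match the command in this post; without a terminal
        # (for example in CI) docker may reject -it, so drop it there.
        "-it",
        "chaostoolkit/reporting",
        "--", "report",
        f"--export-format={export_format}",
        journal, report,
    ]


# Print the command as a copy-pasteable shell line.
print(shlex.join(build_report_command("test.json", "report.pdf")))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once Docker is available on the machine.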
Remember that the final idea of Chaos Engineering is to run experiments in production to improve reliability, to learn enough about our application to generate playbooks and runbooks, and to develop the IT fitness needed, for example, to handle Cyber Mondays and Black Fridays or to face unknown disruptive issues on our platforms.
If you are looking for more chaos examples, such as testing Istio or adding network latency to the cluster, leave me a message and I will send them to you separately.
Hope this helps you start improving your apps.
Pablo!