Running Chaos-Engineering experiments against a Kubernetes cluster

and providing cool outputs via PDFs

Published in

Globant

5 min read · Oct 3, 2020

For those who are not familiar with the practice, Chaos Engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.

These practices have been proven and adopted by many companies to learn how their systems work: testing reliability, simulating DoS attacks, introducing network failures, etc.

Tools available

To practice Chaos Engineering you can develop your own scripts or software to run your experiments. But let me share with you some interesting tools that are already available:

Pumba, Gremlin, Chaos Monkey, PowerfulSeal, kube-monkey, Litmus, Gloo Shot, Chaos Toolkit

For this post, we are going to use Chaos Toolkit because it is open source and multi-platform.

Chaos Engineering Process

When we talk about chaos experiments, it is very important to keep in mind the “observability” of the whole ecosystem. Observability lets you understand the experiment you are running (cause and effect) and allows you to “learn” from it.

Having the right tooling in place to monitor the chaos experiment is a key precondition before you start running your experiments.

Setting up the Environment

Creating a Python virtual environment:

$ python3 -m venv ~/.venvs/chaostk

Getting inside the environment:

$ source ~/.venvs/chaostk/bin/activate

Installing Chaos Toolkit + Chaos Toolkit Kubernetes plugin:

$ pip install -U chaostoolkit
$ pip install -U chaostoolkit-kubernetes
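If you want a quick sanity check that both packages landed in the active virtual environment, a minimal sketch like the one below can help. The import names are assumptions on my side: chaosk8s is the module used later in the experiment, and chaoslib is the library I expect chaostoolkit to pull in. The helper names are purely illustrative.

```python
import importlib.util

# Assumed import names for the two pip packages installed above:
# chaostoolkit pulls in "chaoslib", chaostoolkit-kubernetes provides "chaosk8s".
REQUIRED_MODULES = ["chaoslib", "chaosk8s"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names=REQUIRED_MODULES):
    """Build a one-line message describing what is missing, if anything."""
    missing = missing_modules(names)
    if missing:
        return "Missing modules: " + ", ".join(missing)
    return "All modules found"
```

Calling print(report()) inside the activated environment tells you whether you are ready to write experiments.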

Once the chaostoolkit-kubernetes plugin is installed, it provides many functions that you can use to create your experiments.
In my case I’m going to use all_microservices_healthy to validate the hypothesis, and terminate_pods with the argument rand: true to kill a random pod within a namespace (in my case ns: go-demo-8).

Below you will see what my experiment looks like.

NOTE: You can find many other functions at https://docs.chaostoolkit.org/

Experiments

Key stages:

1- Version, title, tags (a descriptive part to identify the experiment)
2- Hypothesis (where you define a probe to validate your steady state)
3- Experiment (under method you describe what you want to test, in my case killing a random pod)
4- Rollback (if any; in my case I haven’t defined one because the cluster is self-healing)

version: 1.0.0
title: What happens if we terminate an instance of the application?
description: If an instance of the application is terminated, a new instance should be created
tags:
- k8s
- pod
- deployment
steady-state-hypothesis:
  title: The app is healthy
  probes:
  - name: all-apps-are-healthy
    type: probe
    tolerance: true
    provider:
      type: python
      func: all_microservices_healthy
      module: chaosk8s.probes
      arguments:
        ns: go-demo-8
method:
- type: action
  name: terminate-app-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      # label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
  pauses:
    after: 10

This experiment is designed to:

1- Run on a K8s cluster
2- Use the function “all_microservices_healthy” to validate our hypothesis
3- Kill a random pod in the namespace go-demo-8
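Chaos Toolkit also accepts experiments written as JSON, so if you prefer to generate them from code, the experiment above could be sketched as a Python dict. The missing_stages helper and the output file name are my own illustrations, not part of Chaos Toolkit; for the real check, use the chaos validate command.

```python
import json

# The experiment from this post, mirrored as a Python dict.
experiment = {
    "version": "1.0.0",
    "title": "What happens if we terminate an instance of the application?",
    "description": "If an instance of the application is terminated, "
                   "a new instance should be created",
    "tags": ["k8s", "pod", "deployment"],
    "steady-state-hypothesis": {
        "title": "The app is healthy",
        "probes": [{
            "name": "all-apps-are-healthy",
            "type": "probe",
            "tolerance": True,
            "provider": {
                "type": "python",
                "func": "all_microservices_healthy",
                "module": "chaosk8s.probes",
                "arguments": {"ns": "go-demo-8"},
            },
        }],
    },
    "method": [{
        "type": "action",
        "name": "terminate-app-pod",
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",
            "func": "terminate_pods",
            "arguments": {"rand": True, "ns": "go-demo-8"},
        },
        "pauses": {"after": 10},
    }],
}

# Key stages listed in this post (rollbacks are optional, so not checked here).
KEY_STAGES = ["version", "title", "steady-state-hypothesis", "method"]


def missing_stages(exp):
    """Hypothetical helper: list the key stages absent from an experiment dict."""
    return [stage for stage in KEY_STAGES if stage not in exp]


def write_experiment(exp, path="terminate-pod-rand.json"):
    """Write the experiment as JSON (the file name is only an example)."""
    with open(path, "w") as fh:
        json.dump(exp, fh, indent=2)
```

This only checks that the top-level sections exist; it says nothing about whether the probe or action arguments are valid.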

How to run your experiment?

$ chaos run terminate-pod-rand.yaml

Where should we run experiments?

Application layer
Your code has features, behaviors, and flows. Try them.

Cache layer
Modern applications rely more and more on caches. What if a cache isn’t available? What happens to a transaction or operation when the cache layer has an issue?

Database layer
Shut down the database, relax, and see what happens.

Cloud layer
It is no news that in 2020 many digital platforms rely on cloud providers, so you may be affected by many issues: unstable EC2 instances, AZ degradation, network delays, etc. It’s interesting to see how the whole platform behaves under our experiments.

After we run our experiments, what should we do?

Basically, some answers are “generate reports” or “automate corrective actions”, but how do we do that?

Well, Chaos Toolkit has the answer: you have to generate a “journal”.

$ chaos run terminate-pod.yaml --journal-path test.json

So test.json will be the output of the experiment, and it should look like:

As you can see, the JSON is human-readable, but it is not a friendly output.

If you want a machine to take an action based on it, that’s OK. But what if you need to produce a nice report for your manager, so that another team can start fixing the issue you found while running your chaos experiments?
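For the “machine takes an action” case, a small script can read the journal and decide whether someone needs to look at it. This is only a sketch: the field names (status, deviated, steady_states, run) reflect the journal layout as I understand it, so treat them as assumptions and check them against your own test.json. The function names are illustrative.

```python
import json


def load_journal(path="test.json"):
    """Read a journal file produced with --journal-path."""
    with open(path) as fh:
        return json.load(fh)


def summarize_journal(journal):
    """Pull out the fields a script would typically act on.

    Every key is read defensively because the layout is assumed, not guaranteed.
    """
    steady = journal.get("steady_states") or {}
    before = steady.get("before") or {}
    after = steady.get("after") or {}
    return {
        "status": journal.get("status"),
        "deviated": journal.get("deviated"),
        "steady_before": before.get("steady_state_met"),
        "steady_after": after.get("steady_state_met"),
        "activities": [
            (run.get("activity", {}).get("name"), run.get("status"))
            for run in journal.get("run", [])
        ],
    }


def needs_attention(summary):
    """Flag runs that did not complete or whose steady state deviated."""
    return summary["status"] != "completed" or bool(summary["deviated"])
```

A corrective action (opening a ticket, paging someone, triggering a rollback job) could then hang off needs_attention(summarize_journal(load_journal())).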

Generating Chaos Engineering reports on PDF

By using the Chaos Toolkit reporting tool you can create pretty nice PDF reports. The tool is distributed as a Docker image, so if you are familiar with Docker you can use:

$ docker container run \
    --user $(id -u) \
    --volume $PWD:/tmp/result \
    -it \
    chaostoolkit/reporting \
    -- report \
    --export-format=pdf \
    test.json \
    report.pdf

After the container ingests your journal (test.json), it will generate report.pdf with the output of your experiment, as follows:

Here you can see the actual environment running the experiment, the experiment being monitored through kubectl, and the Chaos Toolkit CLI output.
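If you run experiments often, the two commands used in this post (chaos run with a journal, then the reporting container) can be chained from a script. Here is a minimal sketch, assuming chaos and docker are on your PATH and a Unix-like system (os.getuid). The function names are hypothetical, and I dropped the -it flag on the assumption that the script runs non-interactively.

```python
import os
import subprocess


def chaos_run_command(experiment, journal):
    """argv for running an experiment and saving its journal."""
    return ["chaos", "run", experiment, "--journal-path", journal]


def report_command(journal, report, workdir, uid):
    """argv for the reporting container, mirroring the docker command in this post."""
    return [
        "docker", "container", "run",
        "--user", str(uid),
        "--volume", f"{workdir}:/tmp/result",
        "chaostoolkit/reporting",
        "--", "report", "--export-format=pdf", journal, report,
    ]


def run_and_report(experiment, journal="test.json", report="report.pdf"):
    """Run the experiment, then build the PDF from its journal."""
    # check=False: the report is still wanted when the experiment itself
    # ends badly (assumption: chaos run may exit non-zero in that case).
    subprocess.run(chaos_run_command(experiment, journal), check=False)
    subprocess.run(
        report_command(journal, report, os.getcwd(), os.getuid()),
        check=True,
    )
```

Keeping the command builders separate from run_and_report makes it easy to print or log the exact commands before executing them.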

Remember that the final idea of Chaos Engineering is to run experiments in production to improve reliability, to learn about our applications so we can write playbooks and runbooks, and to develop the IT fitness needed, for example, to handle Cyber Mondays and Black Fridays or to face unknown disruptive issues on our platforms.

If you are looking for more chaos examples, such as testing Istio or adding network latency to the cluster, leave me a message and I will send them to you separately.

Hope this helps to start improving your apps.

Pablo!

Links

Chaos Toolkit

The simplest and easiest way to explore building your own Chaos Engineering Experiments.

chaostoolkit.org

Chaos Toolkit

github.com

Written by Pablo Del Giudice