Kubernetes Chaos Monkey on IBM Cloud Private

Learn how awesome jq is when writing simple bash scripts to wreak havoc on your deployments

Todd Kaplinger
IBM Cloud
Feb 1, 2018

Overview

Skill Level: Beginner
Some basic understanding of shell scripting and Kubernetes is assumed.
In this recipe, I demonstrate how to connect to IBM Cloud Private and configure a Chaos Monkey script. This script randomly kills pods in your cluster so you can introduce concepts such as chaos testing early in your DevOps process.

Ingredients

To get started, you should have an elementary understanding of Kubernetes and have installed IBM Cloud Private, Docker, and jq locally. In this tutorial, I will demonstrate some basic concepts around security by authenticating with IBM Cloud Private, interacting with secured Kubernetes APIs, and running this script in your development environment. To wrap up the article, I will also show how to deploy this script as a Kubernetes resource.

IBM Cloud Private Installation Guide: https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0/installing/install_containers_CE.html
Docker Install: https://docs.docker.com/install/
jq Install: https://stedolan.github.io/jq/download/

Obtaining a security token

IBM Cloud Private’s security model is based on OpenID Connect (OIDC). IBM Cloud Private provides a REST API that issues security tokens once a user has authenticated with a valid user ID and password, which makes identity tokens easy to consume. Client apps receive the user’s identity encoded in a secure JSON Web Token (JWT), called an ID token. JWTs are compact and portable, and they support a wide range of signature and encryption algorithms, which makes them well suited to the ID token job. The script authenticates against this API and receives a JSON payload as the result. Using jq makes parsing the response really simple because it uses dot notation to traverse the JSON response body. In this example, id_token is one of the root elements and is accessed directly with the .id_token filter.
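A minimal sketch of the token request, not the original script. The port (8443) and the path (/idprovider/v1/auth/identitytoken) are assumptions based on IBM Cloud Private 2.1 conventions, and get_token is a hypothetical helper name, so check both against your installation.

```shell
#!/usr/bin/env bash
# Sketch: request an ID token from IBM Cloud Private and extract it with jq.
# Assumptions: port 8443 and the /idprovider/v1/auth/identitytoken path.
# get_token is a hypothetical helper name, not part of any IBM tooling.
get_token() {
  local host="$1" user="$2" password="$3"
  # -k skips TLS verification (self-signed development certificates only).
  curl -k -s \
    -H "Content-Type: application/x-www-form-urlencoded;charset=UTF-8" \
    -d "grant_type=password" \
    -d "scope=openid" \
    --data-urlencode "username=${user}" \
    --data-urlencode "password=${password}" \
    "https://${host}:8443/idprovider/v1/auth/identitytoken" \
    | jq -r '.id_token'   # id_token is a root element of the response
}

# Usage, with HOST, USER, and PASSWORD set as in the article:
#   TOKEN="$(get_token "$HOST" "$USER" "$PASSWORD")"
#   echo "$TOKEN"
```

The -r flag makes jq print the raw string rather than a quoted JSON value, so the result can be dropped straight into an Authorization header.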

The script sets the variable TOKEN to the token issued by the IBM Cloud Private deployment defined by the $HOST parameter, using the credentials of the user $USER and the password $PASSWORD. To verify that the call completed successfully, I echo the token to the console.

Retrieving a list of pods for a namespace

Now that I have obtained the token, I can query Kubernetes to retrieve various types of resources, including pods. IBM Cloud Private requires API calls to Kubernetes resources to be protected by Role-Based Access Control (RBAC), so the token accompanies each request. The script retrieves the list of pods for the default namespace, where I have deployed a set of test resources.
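A rough sketch of the pod query, under the assumption that the Kubernetes API server is reachable on port 8001 of $HOST. The helper names fetch_pods, pod_names, and pod_count are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list pod names in a namespace and count them.
# Assumption: the Kubernetes API server listens on port 8001 of $HOST.
# fetch_pods, pod_names, and pod_count are hypothetical helper names.

# Fetch the raw pod list (JSON) for a namespace, sending the bearer token.
fetch_pods() {
  local host="$1" token="$2" namespace="$3"
  curl -k -s \
    -H "Authorization: Bearer ${token}" \
    "https://${host}:8001/api/v1/namespaces/${namespace}/pods"
}

# Read a pod list on stdin and print one pod name per line.
pod_names() {
  jq -r '.items[].metadata.name'
}

# Read a pod list on stdin and print the number of pods.
pod_count() {
  jq -r '.items | length'
}

# Usage:
#   pods_json="$(fetch_pods "$HOST" "$TOKEN" default)"
#   podNames="$(printf '%s' "$pods_json" | pod_names)"
#   podCount="$(printf '%s' "$pods_json" | pod_count)"
```

Splitting the fetch from the jq filters keeps the parsing testable against a saved response without touching the cluster.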

The API returns a JSON payload that contains a list of items. Using jq, I convert the list of items into a set of entries and, again via dot notation, drill down to the name defined in each item’s metadata. Once I have the list of podNames, I calculate the length of the result set, which is the number of pods in the namespace. This count is used in the next section.

Deleting random pods chaos style

With the size and list of pods stored in my script variables, we can start to play a little bit with the pods by randomly deleting them. In this chaos-style testing, we generate a random number smaller than the number of available pods and target the pod stored at that index for deletion. Using the same security token that retrieved the pods, we can now make an HTTP DELETE REST API call to delete the selected pod in the given namespace.
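The selection and deletion step might look like the sketch below. It assumes the API server on port 8001 and uses bash's built-in RANDOM variable; random_index and delete_pod are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: pick a random pod and delete it through the Kubernetes REST API.
# Assumption: API server on port 8001. Helper names are hypothetical.

# Print a random index in the range [0, count). Fails if count is not positive.
random_index() {
  local count="$1"
  [ "$count" -gt 0 ] || return 1
  echo $(( RANDOM % count ))   # RANDOM is a bash built-in (0..32767)
}

# Issue an HTTP DELETE for one pod and print the name the API echoes back.
delete_pod() {
  local host="$1" token="$2" namespace="$3" pod="$4"
  curl -k -s -X DELETE \
    -H "Authorization: Bearer ${token}" \
    "https://${host}:8001/api/v1/namespaces/${namespace}/pods/${pod}" \
    | jq -r '.metadata.name'
}

# Usage, with podNames holding one name per line and podCount their number:
#   index="$(random_index "$podCount")"
#   victim="$(printf '%s\n' "$podNames" | sed -n "$((index + 1))p")"
#   delete_pod "$HOST" "$TOKEN" default "$victim"
```

The guard in random_index matters because an empty namespace would otherwise cause a division by zero in the modulo.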

Once this code finishes running, the API returns the pod that is marked for deletion, and the script logs the pod to the console. If you are monitoring the IBM Cloud Private dashboard during this process, you will see the pod get marked for deletion and a new pod created to replace it.

Deploying Chaos Monkey on IBM Cloud Private

Now that I have the basics in place, we can modify this script to run it on IBM Cloud Private. For the first step, I want to move from a one-time execution of this script to a forever-style script that constantly runs chaos against a specified namespace. To keep the script simple, I wrapped a while loop around the various calls and named the script chaos.sh.
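One possible shape for chaos.sh is sketched below; it is not the original script. It assumes HOST, USER, and PASSWORD are set in the environment and reuses the assumed ports and token path from IBM Cloud Private 2.1. The pause between rounds and the optional round limit are my own additions for illustration, and chaos_once and run_chaos are hypothetical names.

```shell
#!/usr/bin/env bash
# chaos.sh sketch: repeat the token, list, pick, delete cycle on one namespace.
# Assumptions: HOST, USER, and PASSWORD are set in the environment; ports 8443
# and 8001 and the identitytoken path follow IBM Cloud Private 2.1 conventions.
# chaos_once and run_chaos are hypothetical names.

# One round: authenticate, list the pods, delete one at random.
chaos_once() {
  local namespace="$1" token names count index victim
  local api="https://${HOST}:8001/api/v1/namespaces/${namespace}/pods"

  token="$(curl -k -s \
    -d "grant_type=password" -d "scope=openid" \
    --data-urlencode "username=${USER}" \
    --data-urlencode "password=${PASSWORD}" \
    "https://${HOST}:8443/idprovider/v1/auth/identitytoken" | jq -r '.id_token')"

  names="$(curl -k -s -H "Authorization: Bearer ${token}" "$api" \
    | jq -r '.items[].metadata.name')"
  if [ -z "$names" ]; then
    echo "no pods found in ${namespace}" >&2
    return 1
  fi

  count="$(printf '%s\n' "$names" | grep -c .)"
  index=$(( RANDOM % count ))
  victim="$(printf '%s\n' "$names" | sed -n "$((index + 1))p")"
  echo "deleting pod ${victim}"
  curl -k -s -X DELETE -H "Authorization: Bearer ${token}" "${api}/${victim}" \
    | jq -r '.metadata.name'
}

# Loop forever by default; a positive max_rounds stops after that many rounds.
run_chaos() {
  local namespace="$1" interval="${2:-30}" max_rounds="${3:-0}" round=0
  while true; do
    chaos_once "$namespace" || echo "round failed, continuing" >&2
    round=$((round + 1))
    if [ "$max_rounds" -gt 0 ] && [ "$round" -ge "$max_rounds" ]; then
      break
    fi
    sleep "$interval"
  done
}

# chaos.sh would end with a call such as:
#   run_chaos default 30
```

Requesting a fresh token each round is a deliberate simplification here: it avoids handling token expiry in a script that is meant to run indefinitely.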

I then created a Dockerfile to pull in the various script dependencies and the script into the Docker image.

For testing purposes, I had a simple deployAppDocker.sh script. You will want to update the references in the next two sections to point to your Docker registry.
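A deployAppDocker.sh along these lines would do the job. This is a sketch under the assumptions that the Dockerfile sits in the current directory and that you are already logged in to your registry; build_and_push is a hypothetical helper name and the registry address in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# deployAppDocker.sh sketch: build the chaos image and push it to a registry.
# Assumptions: a Dockerfile sits in the current directory, and you are already
# logged in to the registry. build_and_push is a hypothetical helper name.
build_and_push() {
  local registry="$1" name="$2" tag="${3:-latest}"
  local image="${registry}/${name}:${tag}"
  # Stop before pushing if the build fails.
  docker build -t "$image" . || return 1
  docker push "$image"
}

# Usage (the registry address is a placeholder for your own):
#   build_and_push my-registry.example.com:8500/default chaos 1.0
```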

Once I had built the image and pushed it to my local Docker registry, I could refer to it from my deployment.yaml. Note that the userid and password are hardcoded in the deployment.yaml. In a real-world application, these would be defined as Kubernetes Secrets instead. However, to keep the article simple, I put default values in the deployment resource to stay compatible with the original script.
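For readers who want to take the Secrets route, a minimal sketch follows. The secret name chaos-credentials and the key names username and password are placeholders I chose for illustration, and create_chaos_secret is a hypothetical helper name; none of them come from the original deployment.

```shell
#!/usr/bin/env bash
# Sketch: keep the credentials in a Kubernetes Secret instead of deployment.yaml.
# Assumptions: the secret name chaos-credentials and the key names are
# placeholders; create_chaos_secret is a hypothetical helper name.
create_chaos_secret() {
  local namespace="$1" user="$2" password="$3"
  kubectl create secret generic chaos-credentials \
    --namespace "$namespace" \
    --from-literal=username="$user" \
    --from-literal=password="$password"
}

# Usage:
#   create_chaos_secret chaos "$USER" "$PASSWORD"
```

The deployment could then expose those keys to the container as environment variables through secretKeyRef entries, so the script itself would not need to change.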

In the deployment.yaml, I specified that I want to randomly kill pods deployed in the default namespace. To avoid killing the chaos deployment itself, I created a new namespace named chaos and deploy the chaos monkey code into that namespace. This provides the requisite isolation between the code doing the chaos testing and the resources being tested.

Once I set the default namespace for my command line, I deployed my resource by running the kubectl apply -f deployment.yaml command. We now have the same script that was running locally running inside of IBM Cloud Private!
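The namespace and deployment steps can be sketched as follows. The function name deploy_chaos is hypothetical, and the sketch assumes kubectl is already configured to talk to your IBM Cloud Private cluster.

```shell
#!/usr/bin/env bash
# Sketch: create the chaos namespace, make it the default for the current
# kubectl context, and apply the deployment.
# Assumption: kubectl is already configured for the cluster.
# deploy_chaos is a hypothetical helper name.
deploy_chaos() {
  local namespace="${1:-chaos}" manifest="${2:-deployment.yaml}" context

  # Create the namespace only if it does not exist yet.
  kubectl get namespace "$namespace" >/dev/null 2>&1 \
    || kubectl create namespace "$namespace"

  # Make the namespace the default for subsequent kubectl commands.
  context="$(kubectl config current-context)"
  kubectl config set-context "$context" --namespace="$namespace"

  kubectl apply -f "$manifest"
}

# Usage:
#   deploy_chaos chaos deployment.yaml
```

Afterwards, kubectl get pods (now scoped to the chaos namespace) should show the chaos monkey pod, while the pods it targets remain in default.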

Conclusion

To verify that a container in a pod is healthy and ready to serve traffic, Kubernetes provides a range of health-checking mechanisms. Health checks, or probes as they are called in Kubernetes, are carried out by the kubelet to determine when to restart a container (livenessProbe) and are used by services to determine whether a pod should receive traffic (readinessProbe). While these are great for checking the overall health and responsiveness of resources, what happens to the running system when one of the resources randomly fails? Chaos testing like this is a simple way to find out.

Originally published at developer.ibm.com.

Todd Kaplinger
IBM Cloud

Vice President SW Engineering — Chief Architect, Retail Solutions@NCR Voyix. The opinions expressed here are my own. Follow me on Twitter @todkap