Chaos Engineering in Kubernetes using Chaos Mesh

Published in Nerd For Tech

7 min read, May 22, 2021

As applications migrate to the cloud, their architecture has become really complex, and with that complexity failures become difficult to predict. Any such failure can cause an expensive outage for your company. Repeated, unpredictable outages damage the company's reputation and can cost it customers too. Companies therefore need a way to anticipate such outages rather than only fixing them after they happen.

This is where Chaos Engineering comes to our rescue. Chaos Engineering is a disciplined approach to identifying failures before they cause an outage. With this approach, we deliberately break parts of the application to find out how it reacts to failures, which helps us build resilient systems.

Well, this is a really good approach. Is there something like this in the Kubernetes ecosystem? Yes, Chaos Mesh is here to help. Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos in Kubernetes environments. It can be deployed directly to a Kubernetes cluster and doesn't require any special dependencies. Chaos Mesh is a CNCF Sandbox project and has gained huge popularity in recent times.

What is the entire story all about? (TLDR)

Understand the architecture and concepts of Chaos Mesh.
Running Chaos experiments on a Kubernetes Cluster.

Prerequisites

A Kubernetes cluster (on-prem, AKS, EKS, GKE, or Kind).
Helm and kubectl installed.

Story Resources

GitHub Link: https://github.com/pavan-kumar-99/medium-manifests
GitHub Branch: chaos-mesh

Components of Chaos Mesh

Chaos Mesh comprises the following components:

a) Chaos Operator: This is the core component of Chaos Mesh.

b) Chaos Dashboard: This is a Web UI for designing, monitoring, and managing chaos experiments.

The architecture of Chaos Mesh

Chaos Mesh Architecture. Credits chaos-mesh

Chaos Mesh runs inside the Kubernetes cluster itself. The controller manager runs as a Deployment and manages the chaos objects, while the chaos-daemon runs as a DaemonSet, so one chaos-daemon pod runs on each of the worker nodes and injects the faults there. Chaos Mesh uses Kubernetes CRDs (Custom Resource Definitions) to define chaos objects.

Alright, let us get into action. We will create a Kubernetes cluster on GCE using kOps.

$ export KOPS_STATE_STORE=gs://thanos-prod-medium
$ export KOPS_FEATURE_FLAGS=AlphaAllowGCE
$ kops create cluster \
    --node-count 1 \
    --zones us-central1-a,us-central1-b,us-central1-c \
    --master-zones us-central1-a \
    --container-runtime docker \
    medium.k8s.local
$ kops update cluster --name medium.k8s.local --yes --admin
$ kops validate cluster --wait=10m

Let us now install the chaos-mesh helm chart in our Kubernetes cluster.

$ helm repo add chaos-mesh https://charts.chaos-mesh.org
$ helm repo update
$ helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --create-namespace --set dashboard.create=true

Chaos Mesh

You should now find all the Chaos Mesh components created, and also a chaos daemon running on each of the worker nodes. Chaos Mesh also creates a dashboard to visualize and manage all the chaos experiments. The dashboard would look something like this.

Ahh, yes, I get it! The dashboard looks really cool, right? Without further ado, let us jump into creating the chaos experiments. Well, there are various types of chaos experiments for Kubernetes. Let us check them all out.

PodChaos Experiments

a) pod-failure: It injects errors into the selected pods so that they fail for a while. The selected pods will be unavailable for the specified period.

b) pod-kill: It kills the selected pods. Their controller (for example, a Deployment's ReplicaSet) then recreates them, so the pods are respawned after every kill.

c) container-kill: It kills a specified container in the target pods.

NetworkChaos Experiments

a) Network Partition: This blocks the communication between two pods.

b) Network Emulation: Actions cover regular network faults, such as network delay, duplication, loss, and corruption.

StressChaos Experiments

a) Memory Stress: This will continuously stress virtual memory out.

b) CPU Stress: This stressor will continuously stress the CPU out.

TimeChaos Experiments

TimeChaos modifies the return value of clock_gettime, so the processes in the target pods see a skewed clock.

IOChaos Experiments

IOChaos allows you to simulate file system faults such as IO delay and read/write errors.

KernelChaos Experiments

KernelChaos injects faults into the kernel of the underlying host. Although it targets a specific pod, it might impact the performance of other pods on the same host. It is highly recommended not to run this in production.

DNSChaos Experiments

a) DNSChaos: This allows you to simulate faulty DNS responses, such as a DNS error or a random IP address returned after a request is sent.

AWSChaos Experiments

This helps you inject faults into AWS resources, with actions such as ec2-stop, ec2-restart, and detach-volume.

Taking the length of this article into consideration, we will pick a few chaos experiments and cover them here. But feel free to comment if you want me to cover any additional experiments.

Create the resources for testing.

We will create a couple of sample deployments to test this live.

$ kubectl create ns chaos-k8s
$ kubectl config set-context --current --namespace=chaos-k8s
$ kubectl create deploy httpd --image=httpd --replicas=2
$ kubectl expose deploy httpd --port=80
$ kubectl create deploy nginx --image=nginx --replicas=2

PodChaos: Pod-Kill Experiment

Let us understand the yaml file before we apply this.

spec: The spec of the PodChaos.
action: The kind of PodChaos to be applied. It can be pod-kill, pod-failure, or container-kill.
mode: Defines how many of the matching pods are affected. one selects a single matching pod at a time, all selects every matching pod, and fixed, fixed-percent, or random-max-percent select a number or percentage of pods given by the value field.
duration: The duration of each chaos experiment.
selector.labelSelectors: Specifies the target pods for chaos injection.
scheduler: Defines the scheduler rules for the running time of the chaos experiment.
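To make these fields concrete, here is a minimal sketch of what such a manifest might look like. It is not copied from the repository: the metadata name, the target label app: nginx, and the file name pod-kill-sketch.yaml are assumptions for illustration. It follows the v1alpha1 API of the Chaos Mesh 1.x releases, where the scheduler field sits inside the experiment spec. The heredoc only writes the file; applying it is left as a comment.

```shell
# Write a hypothetical pod-kill manifest (Chaos Mesh 1.x, v1alpha1 API).
# Names and labels below are assumptions, not the repository's manifest.
cat > pod-kill-sketch.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-sketch        # hypothetical name
  namespace: chaos-k8s
spec:
  action: pod-kill             # pod-kill | pod-failure | container-kill
  mode: one                    # affect one matching pod per run
  selector:
    namespaces:
      - chaos-k8s
    labelSelectors:
      app: nginx               # assumed target label
  scheduler:
    cron: "@every 2m"          # run the experiment every 2 minutes
EOF

# Once the cluster and Chaos Mesh are up, it could be applied with:
# kubectl apply -f pod-kill-sketch.yaml
```

With mode: one and a two-minute cron rule, a manifest like this would kill a single nginx pod on every run rather than the whole deployment at once.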

$ git clone https://github.com/pavan-kumar-99/medium-manifests \
    -b chaos-mesh
$ cd medium-manifests
$ kubectl apply -f pod-kill.yaml

As soon as you apply the pod-kill manifest, the experiment should be visible in the Experiments tab of the chaos dashboard, along with its last execution time.

We now see that the pods are being killed by Chaos Mesh. These pods will be killed one at a time, every 2 minutes.

NetworkChaos: Network Partition Example

So far, we have created two deployments:

httpd deployment.
nginx deployment.

We have exposed the httpd deployment over a Service of type ClusterIP. We will now try to access the httpd service from the nginx deployment. Let us check if this works before we apply the network chaos.

$ pod=$(kubectl get po -l app=nginx -o \
    jsonpath='{.items[0].metadata.name}')
$ kubectl exec $pod -it -- /bin/sh -c "curl httpd"
<html><body><h1>It works!</h1></body></html>

We should now see that the communication happens without any interruptions. Let us now apply the network chaos experiment.

This chaos experiment will block all communication from the pods with the label “app: nginx” to the pods with the label “app: httpd” for 50 seconds.
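A partition like this could be described with a manifest along the following lines. Treat it as a sketch under assumptions rather than the file from the repository: the metadata name, the file name, the direction value, and the cron interval are all made up for illustration. It again follows the Chaos Mesh 1.x v1alpha1 API, in which, as far as I know, duration and scheduler are expected to be set together.

```shell
# Write a hypothetical network partition manifest (Chaos Mesh 1.x, v1alpha1 API).
# Names, direction, and the cron interval are assumptions for illustration.
cat > network-partition-sketch.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-sketch   # hypothetical name
  namespace: chaos-k8s
spec:
  action: partition
  mode: all                        # every pod matched by the source selector
  selector:
    namespaces:
      - chaos-k8s
    labelSelectors:
      app: nginx                   # source pods
  direction: to                    # traffic going from source to target
  target:
    mode: all
    selector:
      namespaces:
        - chaos-k8s
      labelSelectors:
        app: httpd                 # target pods
  duration: "50s"                  # each partition lasts 50 seconds
  scheduler:
    cron: "@every 2m"              # assumed interval between partitions
EOF

# Once the cluster and Chaos Mesh are up, it could be applied with:
# kubectl apply -f network-partition-sketch.yaml
```

The selector picks the source pods and target.selector picks the pods they are cut off from, so swapping the two labels would reverse the direction of the partition.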

$ kubectl apply -f network_chaos_partition.yaml
$ kubectl exec $pod -it -- /bin/sh -c "curl httpd"
curl: (7) Failed to connect to httpd port 80: Connection timed out

You should now see that the requests time out.

DNSChaos Experiment

Let us now take a sneak peek at this manifest.

spec: The spec of the DNSChaos.
action: The kind of DNSChaos to be applied. It can be either error or random. error returns an error for the DNS request, and random returns a random IP address for the DNS request.
patterns: The domain names that the DNSChaos applies to.
selector.labelSelectors: Specifies the target pods for chaos injection.
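Put together, these fields might look like the sketch below. The metadata name, the file name, and the choice of google.com as the pattern are assumptions for illustration, so the manifest in the repository may differ. To my knowledge, DNSChaos also depends on the Chaos Mesh DNS server, which is enabled through the chart value dnsServer.create=true; it is worth checking this against the chart version you install.

```shell
# Write a hypothetical DNSChaos manifest (v1alpha1 API).
# Names and the pattern are assumptions, not the repository's manifest.
cat > dns-chaos-sketch.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-chaos-sketch     # hypothetical name
  namespace: chaos-k8s
spec:
  action: error              # or "random" to return a random IP address
  mode: all                  # every pod matched by the selector
  patterns:
    - google.com             # domain name affected by the fault
  selector:
    namespaces:
      - chaos-k8s
    labelSelectors:
      app: nginx             # pods whose DNS lookups are disturbed
EOF

# Once the cluster and Chaos Mesh are up, it could be applied with:
# kubectl apply -f dns-chaos-sketch.yaml
```

With action: error, a lookup of a matching domain from an nginx pod would be expected to fail, while switching to random would return an unrelated IP address instead.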

$ pod=$(kubectl get po -l app=nginx -o \
    jsonpath='{.items[0].metadata.name}')
$ kubectl exec -it $pod -- /bin/sh -c "apt-get update -y && \
    apt-get install dnsutils -y"
$ kubectl apply -f dns_chaos.yaml
$ kubectl exec -it $pod -- /bin/sh -c "nslookup google.com"
Server:  100.64.144.180
Address: 100.64.144.180#53

** server can't find google.com: SERVFAIL
command terminated with exit code 1

Clean Up

Well, let’s delete the whole cluster now.

$ kops delete cluster medium.k8s.local --yes

Conclusion

With this, we have seen how Chaos Engineering can be used to build resilient systems. Chaos Mesh helps us follow this disciplined approach by letting us implement chaos engineering in Kubernetes. Feel free to reach out to me with any new ideas or questions, and share your thoughts in the comment section.

Until next time………..

Written by Pavan Kumar