Agent of Chaos: Introduce a little anarchy to make your application on Kubernetes more resilient

Ali Mukadam · Oracle Developers · May 16, 2023

I was discussing the previous series of articles on Coherence with my colleague Shaun Levey and his response was: “It’s great you can measure and monitor across different regions but I want to know how it responds when faced with failures. Can we inject latency and see what happens?” His reasoning is that if you can identify as many failure scenarios as possible in advance, you can pre-create your dashboards and alerts ahead of time to warn you, instead of having to figure out what PromQL to use after a “Code Red” has been issued. You can also come up with pre-tested workarounds to minimize the impact or, even better, address systemic weaknesses (better infrastructure, a more resilient application, processes, people) so you can respond better.

Of course, there’s no way you can predict every possible failure scenario. However, you can at least prepare for some of them. The more you can prepare for, the greater the confidence you have in your system’s ability to handle failures.

Shaun’s comment (without the theatrics and the make-up) reminded me of The Joker’s infamous line in The Dark Knight (“Introduce a little anarchy. Upset the established order, and everything becomes chaos.”), and subsequently of chaos engineering. So what is chaos engineering?

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

There’s a formal methodology behind it but the gist of it is this:

  1. You think of a possible cause of failure.
  2. You deliberately cause the failure to happen.
  3. If your system withstands the failure, you are good, though you may still find other things to improve.
  4. If your system does not withstand the failure, you think of ways to minimize its impact or prevent the failure.
  5. Then you test again.

Since the entire Coherence cluster is running on Kubernetes (OKE), we can start by causing failures in Kubernetes and see what happens. There are a number of chaos engineering tools, some purpose-built for Kubernetes and others for much wider use cases. For this experiment, I decided to pick Chaos Mesh. Why Chaos Mesh? Well, it’s a CNCF project, it’s reasonably well documented, and it has a nice UI.

Let’s take it for a spin. But before chaos-testing the entire 3-region Coherence cluster, let’s take baby steps by creating a new OKE cluster and deploying Coherence along with the usual snazz (Prometheus, Grafana, etc.). We’ll use this approach to get familiar with the tool first and to figure out how to cause some chaos in a Coherence cluster running on Kubernetes.
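If you are following along, one way to set this up is with the Coherence Operator and kube-prometheus-stack Helm charts. This is a minimal sketch assuming default chart values; the release names and namespaces are my own choices:

# Coherence Operator (deploys and manages Coherence clusters)
helm repo add coherence https://oracle.github.io/coherence-operator/charts
helm install coherence-operator coherence/coherence-operator -n coherence --create-namespace

# Prometheus and Grafana via kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace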

Installing Chaos Mesh

As OKE now uses CRI-O as its container runtime, we need to let Chaos Mesh know during installation:

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-mesh \
  --set chaosDaemon.runtime=crio \
  --set chaosDaemon.socketPath=/var/run/crio/crio.sock \
  --version 2.5.2 --create-namespace

Wait for the pods to start:

kubectl get pods --namespace chaos-mesh -l app.kubernetes.io/instance=chaos-mesh

And access the dashboard:

kubectl -n chaos-mesh port-forward svc/chaos-dashboard 2333:2333

We can then access it in the browser at http://localhost:2333, where you’ll be prompted to authenticate:

Click on the link to generate the token and follow the instructions:

Set the scope to cluster and the role to manager, save the generated manifest to rbac.yaml, and apply it:

kubectl apply -f rbac.yaml
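For reference, the manifest generated by the dashboard looks roughly like the sketch below; the random suffix in the names (here puuty, matching the service account used next) is generated for you and will differ in your environment:

# Sketch of the dashboard-generated RBAC for cluster scope / manager role; names are illustrative
kind: ServiceAccount
apiVersion: v1
metadata:
  namespace: chaos-mesh
  name: account-cluster-manager-puuty
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: role-cluster-manager-puuty
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["chaos-mesh.org"]
  resources: ["*"]
  verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: bind-cluster-manager-puuty
subjects:
- kind: ServiceAccount
  name: account-cluster-manager-puuty
  namespace: chaos-mesh
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: role-cluster-manager-puuty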

Since our version of OKE is also greater than 1.24, we need to create the token manually. The instructions are given to you in the popup, e.g.

kubectl create token account-cluster-manager-puuty

Copy the token name above, then log in with it and the generated token:

Once logged in, you’ll be able to access the dashboard:

Testing Chaos Mesh with a simple pod kill

Let’s try a simple pod kill:

Here, we’ll destroy only 1 pod:

kind: PodChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: chaos-mesh
  name: simple-pod-kill
spec:
  selector:
    namespaces:
      - coherence-chaos
    labelSelectors:
      statefulset.kubernetes.io/pod-name: my-deployment-0
  mode: all
  action: pod-kill
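If you prefer the CLI to the dashboard, the same manifest can be applied directly (the file name here is my own):

# apply the experiment and check its status
kubectl apply -f simple-pod-kill.yaml
kubectl -n chaos-mesh get podchaos simple-pod-kill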

We can see the immediate effect in Grafana:

1 member recently departed, there was a temporary dip in the member count, then a new member joined. All is well.

Simulating a network attack

We now wish to know what kind of latency will cause Coherence to misbehave and lose members or data. Let’s generate some load by running a performance test. Then, we’ll use Chaos Mesh to create a network attack:

We want to cause a delay of 1s with a jitter of 200ms and 100% correlation:

We also want to select our pods into which to inject the delay:

Finally, we fill in the experiment info and specify that we want to run the experiment for 2 minutes:
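For reference, these UI settings correspond roughly to the following fragment of a NetworkChaos spec (the full manifest format is shown further below for the 20-second case):

# delay settings used for this first run
action: delay
delay:
  latency: 1s
  jitter: 200ms
  correlation: '100'
duration: 2m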

In Grafana, on the Coherence Member page, we start to see some effect of the network delay:

There’s a small dip in publisher success rate:

But this delay is not significant enough to cause Coherence members to be kicked out. We iteratively increase the parameters until we land on the following:

  • delay: 20s
  • jitter: 3s
  • duration: 5m

We can now see Coherence members start getting ejected from the cluster.

Coherence members getting ejected

On the Members Summary dashboard, we also see there’s only 1 pod reporting for duty:

Likewise, we can see gaps starting to form on the publisher and receiver success rates.

When the experiment finishes, we can see Coherence gradually nursing itself back to health. Chaos Mesh experiments can also be defined as plain old Kubernetes objects (YAML manifests):

kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: coherence-chaos
  name: simple-coherence-delay-20s
spec:
  selector:
    namespaces:
      - coherence-chaos
    labelSelectors:
      coherenceCluster: my-deployment
    pods:
      coherence-chaos:
        - my-deployment-0
        - my-deployment-1
  mode: all
  action: delay
  duration: 5m
  delay:
    latency: 20s
    correlation: '100'
    jitter: 3s
  direction: to
  target:
    selector:
      namespaces:
        - coherence-chaos
      labelSelectors:
        coherenceCluster: my-deployment
      pods:
        coherence-chaos:
          - my-deployment-2
    mode: all

This means you can check it into your Git repo and automate it with your preferred CD tool, e.g. Argo CD, Flux, etc.
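As a minimal sketch of what that might look like with Argo CD (the repository URL, path and Application name are hypothetical), an Application pointing at a directory of experiment manifests could be defined like this:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments        # hypothetical Application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/chaos-experiments.git   # hypothetical repo
    targetRevision: main
    path: coherence              # hypothetical path containing the chaos manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: coherence-chaos
  syncPolicy:
    automated:
      prune: true
      selfHeal: true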

Chaos Testing the global Coherence cluster

Recall that we have the following deployment for the global Coherence cluster:

We now want to simulate losing an entire region, e.g. Amsterdam, and see how the global Coherence cluster responds. My hypothesis is the following:

  • When a failure happens in Amsterdam, its Coherence members will get ejected from the global Coherence cluster.
  • We expect to see only member departures after starting the experiment.
  • Only Coherence Members in Paris and Frankfurt should be present in the cluster.
  • When the experiment concludes, we expect the 4 Amsterdam members to rejoin the global cluster.

We’ll simulate the failure in Amsterdam using pod failures:
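A rough sketch of the experiment, run against the Amsterdam OKE cluster, is below; the namespace and label selector are placeholders and depend on how your Amsterdam Coherence deployment is actually labelled:

kind: PodChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: coherence-chaos
  name: amsterdam-pod-failure
spec:
  selector:
    namespaces:
      - coherence-chaos                 # placeholder: namespace of the Amsterdam members
    labelSelectors:
      coherenceCluster: my-deployment   # placeholder: label identifying the Amsterdam deployment
  mode: all
  action: pod-failure
  duration: 5m                          # placeholder: how long to keep the pods failed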

In Grafana, on the Coherence Main dashboard, we can see the effect of the pod failures as members depart the cluster. When the test completes, we see them rejoin and the member count go up again.

On the Members Summary, while the test is still running, we see only Coherence members from Paris and Frankfurt.

Of course, this is not a full resilience test. But it shows that, when faced with this particular type of failure, the global cluster can keep functioning, and that once the affected region (Amsterdam) has recovered, its members can rejoin the global Coherence cluster.

Conclusion

In this article, we used chaos engineering to test the resilience of Coherence within a Kubernetes (OKE) cluster when faced with significant latencies. We then simulated losing an entire region to test the resilience of the global cluster. Finally, we used the usual suspects (Prometheus and Grafana) to monitor how Coherence fares during failure and recovery.

I hope you find this article useful. I would like to thank my colleagues Shaun Levey, Tim Middleton and Avi Miller for their insights and contributions to this article.


If you’re curious about the goings-on of Oracle Developers in their natural habitat, come join us on our public Slack channel!
