Observing the Impact of Swapping Nodes in GKE with Chaos Engineering

Trust in the Face of Uncertainty with the Chaos Toolkit

Kubernetes is a fantastic “platform” to rely upon for your system’s safety and availability. Yet, Kubernetes clusters don’t live in the ether: someone still needs to provision and care for the underlying infrastructure, for instance the virtual or physical nodes the cluster runs on.

Why would you need to change an infrastructure node in the first place?

  • To benefit from a newer Kubernetes release
  • To apply a security update to the node’s operating system
  • To refresh or change the service account associated with the node

Confidence and Trust

If your system lives on Google Cloud Platform and you manage your Kubernetes clusters with Google Kubernetes Engine (GKE), you can enable the automatic upgrading of nodes or upgrade them manually. Either way, these operations can make anyone nervous: even if you have carried them out before, the only confidence you can really claim is an awareness of the potential risks involved.
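
For reference, both paths go through gcloud. With the cluster and node pool names used later in this article, the calls would look roughly as follows (an illustrative sketch; check the gcloud documentation for the exact flags of your version):

# Enable automatic node upgrades on an existing node pool...
$ gcloud container node-pools update other-pool --cluster demos-cluster --enable-autoupgrade

# ...or trigger a manual upgrade of that node pool's nodes.
$ gcloud container clusters upgrade demos-cluster --node-pool other-pool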

GKE does simplify the operation, however. It creates a new node pool, deploys Kubernetes nodes into it and joins them to the cluster before draining the old nodes, so that everything moves onto the new nodes. In other words, GKE sensibly creates new resources before removing old ones, to decrease the impact on your system’s availability.

With that said, while automation is there to support you, you still need to fully understand the impact of such an operation.

While automation brings confidence, your system needs to earn your trust.

This is why Chaos Engineering is a fantastic asset in your toolbox. The whole point of the practice is to help you make sense of your system by guiding you through its complexity and fluidity.

In this article we’ll review how we can use Chaos Engineering with the Chaos Toolkit, the Open Source interface for declarative and controlled experiments.

The Chaos Toolkit is Open Source and drives many platforms already

Our System

We assume we are running a simple application inside the Kubernetes cluster, nothing fancy. Our users simply talk to that application over HTTP (or HTTPS) by POSTing a JSON payload and getting some JSON content in return. No database or external service is involved. In other words, our system is not particularly complex from an application standpoint.

We have one node pool made of three nodes:

$ gcloud container node-pools list --cluster demos-cluster
NAME        MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
other-pool  n1-standard-1  100           1.9.6-gke.0
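
If you prefer the Kubernetes view of things, you can list the nodes belonging to that pool through its GKE label (the node names below are made up for illustration):

$ kubectl get nodes -l cloud.google.com/gke-nodepool=other-pool
NAME                                         STATUS  ROLES   AGE  VERSION
gke-demos-cluster-other-pool-1a2b3c4d-aaaa   Ready   <none>  3d   v1.9.6-gke.0
gke-demos-cluster-other-pool-1a2b3c4d-bbbb   Ready   <none>  3d   v1.9.6-gke.0
gke-demos-cluster-other-pool-1a2b3c4d-cccc   Ready   <none>  3d   v1.9.6-gke.0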

Our Hypothesis

We set the following hypothesis: Switching to a new GCE nodepool should not impact our system availability and users should not see any errors under a moderate load.

With the Chaos Toolkit, this is declared as follows:

"steady-state-hypothesis": {
"title": "Function is available",
"probes": [
{
"type": "probe",
"name": "function-must-exist",
"tolerance": 200,
"provider": {
"type": "http",
"timeout": [3, 3],
"secrets": ["faas"],
"url": "http://demo.foo.bar/system/function/astre",
"headers": {
"Authorization": "${auth}"
}
}
},
{
"type": "probe",
"name": "function-must-respond",
"tolerance": 200,
"provider": {
"type": "http",
"timeout": [3, 3],
"secrets": ["global"],
"url": "http://demo.foo.bar/function/astre",
"method": "POST",
"headers": {
"Content-Type": "application/json",
"Authorization": "${auth}"
},
"arguments": {
"city": "Paris"
}
}
}
]
}
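
To make the second probe concrete, here is roughly what it amounts to when run by hand. The URL, headers and payload come straight from the declaration above; ${auth} stands for the token the experiment pulls from its secrets:

# Roughly equivalent curl call for the "function-must-respond" probe.
# The hypothesis holds if the response status is 200.
$ curl -i -X POST "http://demo.foo.bar/function/astre" \
    -H "Content-Type: application/json" \
    -H "Authorization: ${auth}" \
    --max-time 3 \
    -d '{"city": "Paris"}'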

Our Experiment

Using the Chaos Toolkit Google Cloud driver, we will create a new node pool, drain the old one and see if our hypothesis still holds. We will not delete the old node pool in this experiment; instead, we will delete the new node pool during the rollback phase. This is merely a demo.

The whole time, we will use Vegeta in the background to apply a moderate load against our system.

"method": [
{
"type": "action",
"name": "simulate-user-traffic-under-moderate-load",
"background": true,
"provider": {
"type": "process",
"path": "vegeta",
"arguments": "attack -targets=data/scenario.txt -workers=5 -rate=10 -timeout=3s -duration=90s -output=result.bin"
}
},
{
"type": "action",
"name": "create-a-new-nodepool-and-swap-to-it",
"provider": {
"type": "python",
"module": "chaosgce.nodepool.actions",
"func": "swap_nodepool",
"secrets": ["gce"],
"arguments": {
"old_node_pool_id": "other-pool",
"delete_old_node_pool": false,
"new_nodepool_body": {
"nodePool": {
"name": "yet-other-pool",
"management": {},
"initialNodeCount": 1,
"version": "1.9.6-gke.0",
"config": {
"diskSizeGb": 100,
"imageType": "COS",
"machineType": "n1-standard-1",
"oauthScopes": [
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/service.management.readonly",
"https://www.googleapis.com/auth/servicecontrol",
"https://www.googleapis.com/auth/trace.append",
"https://www.googleapis.com/auth/compute"
],
"serviceAccount": "default"
}
}
}
}
}
}
],
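
Note that the Vegeta action reads its targets from data/scenario.txt, a file that is not shown in this article. As a rough sketch of what such a targets file could look like for our function (the payload file name is made up, and you may also need the same Authorization header as in the probes above):

# Hypothetical sketch of the Vegeta targets file used by the action above.
$ echo '{"city": "Paris"}' > data/payload.json
$ cat > data/scenario.txt <<'EOF'
POST http://demo.foo.bar/function/astre
Content-Type: application/json
@data/payload.json
EOF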

As mentioned, we will use the Chaos Toolkit rollback capabilities to make the old node pool schedulable again and delete the new node pool.

"rollbacks": [
{
"type": "action",
"name": "uncordon-old-nodepool",
"provider": {
"type": "python",
"module": "chaosk8s.node.actions",
"func": "uncordon_node",
"arguments": {
"label_selector": "cloud.google.com/gke-nodepool=other-pool"
}
},
"pauses": {
"after": 20
}
},
{
"type": "action",
"name": "delete-new-nodepool",
"provider": {
"type": "python",
"module": "chaosgce.nodepool.actions",
"func": "delete_nodepool",
"secrets": ["gce"],
"arguments": {
"node_pool_id": "yet-other-pool"
}
}
}
]
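
Once the rollbacks have run, a quick way to check that the cluster is back to its initial shape (illustrative commands, not output from the demo):

# The temporary pool should be gone and the old nodes schedulable again.
$ gcloud container node-pools list --cluster demos-cluster
$ kubectl get nodes -l cloud.google.com/gke-nodepool=other-pool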

Running the Experiment

Running the experiment is simple with the Chaos Toolkit: it takes the experiment declaration and… well, that’s it.
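
Assuming the declaration above lives in a file called experiment.json (the file name is ours, pick whatever you like):

# Run the experiment: the CLI verifies the steady-state hypothesis before and
# after the method, applies the method, then executes the rollbacks.
$ chaos run experiment.json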

We are good here: our hypothesis held. Either we assumed the wrong “normal”, or swapping nodes really has no impact on our system. However, looking at the generated report, we do notice a few 502 errors. They could be outliers, or they could reveal an actual issue we need to look into.
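
The load results themselves are captured in the result.bin file written by the Vegeta action; inspecting them is plain Vegeta usage, nothing specific to this demo:

# Summarise the traffic generated during the experiment, including the
# status code distribution and latencies.
$ vegeta report < result.bin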

Observe your System

While the Chaos Toolkit Open API defines probes you can use to query your system while the experiment runs (and build a useful report afterwards), it is also good to observe your system as a whole. In this demo, we use Weave Cloud from Weaveworks, which offers a fantastic near real-time view of the Kubernetes cluster.

You can notice our new node coming up and then going away.
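
If you don’t have such a dashboard at hand, even a simple watch on the nodes gives you a rough, live view of the swap (any equivalent command works):

# Watch the new node join the cluster, the old nodes get cordoned and drained,
# and the temporary node disappear again during the rollbacks.
$ kubectl get nodes --watch -o wide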

Going Further

In a world where systems are highly dynamic and growing in complexity, and where change is the common currency, Chaos Engineering is a powerful asset for teams and organisations to keep earning trust in their systems.

In this context, the Chaos Toolkit supports your effort by driving platform capabilities to induce stressful conditions that you can analyse and learn from.

As part of KubeCon Europe 2018, I will be demoing the Chaos Toolkit in various scenarios.

Please join us on the Chaos Toolkit Slack to continue this discussion and tell us about your own chaos engineering stories.

All the code for this demo (and others along those lines) can be found here.