Whack-a-pod: The Kubernetes cluster whack-a-mole game
Earlier this week, I released an open source version of “Whack a Pod,” a demo that my team at Google Cloud has been using at Google Cloud Next, Google I/O and various regional events. For those who haven’t seen the demo, it turns a Kubernetes cluster into a whack-a-mole game, where Kubernetes pods are the moles, and you try to take down enough pod/moles to disrupt the service they are serving up.
The versions we have used at our events include a physical whack-a-mole machine that was hooked up to our game, so that you could actually physically kill Kubernetes pods by swinging a hammer at them. The work on the physical rig was done by Sparks and is not included in this repo. But you can run this version anywhere with minimal hardware requirements: a screen and an interface, touchscreen or mouse based.
Why Build it?
I wanted an easy and fun way to explain Kubernetes. I wanted something that could be hooked up to a real thing, and allow you to touch, as it were, something in the cloud. I wanted to get across the idea of Kubernetes being resilient.
I also wanted to make a series of carnival-based games, all with the same look and feel as the carnival version of whack-a-pod. They were shot down for being too whimsical. But that’s another story.
How does it work?
The entire application consists of three separate applications that are all hosted on the same Kubernetes cluster. We also create three services with which to expose the applications.
This is the application that is represented by the moles. We launch a deployment that creates a replica set with instructions to keep 12 of them running at all times.
This is the basic service that the pods are keeping up. It is tremendously simple — when polled it returns a random hexadecimal color value.
This is a slightly tweaked version of the color api above for use with the advanced interface. In addition to the color, it returns the unique Kubernetes generated name of the pod that answered the request.
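To make the two color endpoints concrete, here is a minimal Python sketch of what they return. It is not the PHP code from the repo. The JSON field names (`color`, `name`) and the function names are assumptions made for illustration, and the pod name is read from the hostname, which Kubernetes sets to the pod name by default.

```python
import json
import random
import socket


def random_hex_color(rng=random):
    """Return a random color as '#' followed by six hexadecimal digits."""
    return "#{:06x}".format(rng.randrange(0x1000000))


def color_response(rng=random):
    """Payload for the basic color endpoint (field name is hypothetical)."""
    return json.dumps({"color": random_hex_color(rng)})


def color_complete_response(rng=random, hostname=None):
    """Payload for the tweaked endpoint: the color plus the answering pod's name."""
    if hostname is None:
        # Inside a pod, the hostname defaults to the pod's generated name.
        hostname = socket.gethostname()
    return json.dumps({"color": random_hex_color(rng), "name": hostname})
```

Passing `rng` and `hostname` as parameters is only there to make the sketch easy to exercise outside a cluster.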
This is a set of commands that allow our front-end to issue commands against the Kubernetes cluster without need for credentialing. It is basically a proxy to the Kubernetes API with a restricted set of actions possible.
Creates a deployment for running the pods that serve up the API application. Used in all interfaces when you start up.
Deletes all pods for the API deployment.
Deletes the deployment for the API application. Used in all interfaces when the game is finished with the deployment.
Deletes a single pod. Used in all interfaces when you whack a pod.
Cordons a node to prevent any pods being scheduled on it, then it kills all the API pods that are running on the node. Used in the advanced interface.
Gets information about the nodes of the Kubernetes cluster. Used in the advanced interface.
Gets information about all of the pods running the API service. Used in all of the interfaces to populate the list of pod/moles.
Resets a node so that it can start accepting newly scheduled pods. Used in the advanced interface.
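The cordon-then-delete behavior and the matching reset described above can be sketched against a tiny in-memory model of a cluster. This is an illustration only: the `Node` and `Pod` classes and the function names are hypothetical, and the real admin service talks to the Kubernetes API instead. The `unschedulable` flag mirrors the field Kubernetes uses to mark a cordoned node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    unschedulable: bool = False  # True once the node is cordoned


@dataclass
class Pod:
    name: str
    node: str  # name of the node the pod runs on
    app: str   # which application the pod belongs to


def drain_node(nodes, pods, node_name, app="api"):
    """Cordon a node, then delete the app's pods running on it.

    Returns (remaining_pods, deleted_pod_names).
    """
    # Cordon first, so replacement pods get scheduled on other nodes.
    nodes[node_name].unschedulable = True
    on_node = lambda p: p.node == node_name and p.app == app
    deleted = [p.name for p in pods if on_node(p)]
    remaining = [p for p in pods if not on_node(p)]
    return remaining, deleted


def uncordon_node(nodes, node_name):
    """Reset a node so it can accept newly scheduled pods again."""
    nodes[node_name].unschedulable = False
```

The ordering is the point of the sketch: if the pods were deleted before the node was cordoned, their replacements could land right back on the node being "killed."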
The game consists of a few separate HTML/JS/CSS apps joined together. All of them work in the same general way.
The demo starts with no pods running. Each version will prompt you to deploy the pods. Deploying creates a replica set with 12 pods running.
The UI regularly polls the api/color service. If it gets a result, the service is up; if it doesn’t, the service is down. An indicator towards the top gives the player feedback on service status.
The pods are displayed in a grid, and their status is indicated by color differences (or mole position differences). The statuses are: starting, running, terminated. Starting and terminated pods cannot be “whacked.” When you whack a running pod, the UI calls admin/api/k8s/deletepod/. The pods remain for a while after they have been terminated, and are then replaced by pods in the starting state.
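The up/down check is simple enough to sketch in a few lines. The real UI does this in JavaScript; the Python below is only an illustration, and `fetch` is a hypothetical stand-in for whatever call requests the color service.

```python
def check_service(fetch):
    """Return 'up' if fetch() produces a result, 'down' otherwise.

    `fetch` is any zero-argument callable that requests the color service.
    A raised exception or an empty result both count as the service being down.
    """
    try:
        result = fetch()
    except Exception:
        return "down"
    return "up" if result else "down"
```

A caller would run `check_service` on a timer and update the status indicator with whatever it returns.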
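One plausible way to derive the three game statuses from a pod object is sketched below. How the game actually does it is not described here, so treat the mapping as an assumption. It relies on two real fields in the Kubernetes pod JSON: `metadata.deletionTimestamp`, which is set once a delete has been requested, and `status.phase`.

```python
def game_status(pod):
    """Map a pod (a dict parsed from Kubernetes API JSON) to a game status."""
    if pod.get("metadata", {}).get("deletionTimestamp"):
        # A delete was requested; the pod may still be shutting down.
        return "terminated"
    if pod.get("status", {}).get("phase") == "Running":
        return "running"
    # Pending, or anything else that is not yet serving.
    return "starting"


def can_whack(pod):
    """Only running pods can be whacked."""
    return game_status(pod) == "running"
```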
This is the basic game. It has a fun carnival theme, and is designed to be more of a fun, distracting game than a real lesson about Kubernetes.
This is the basic game but without the carnival theme. It’s more in line with the branding for the Google Cloud Next events. We have used this version on a touchscreen at a few of our regional Cloud Next events.
It also has a panel that displays an abridged version of the JSON response from Kubernetes commands. Why abridged? Because most of the single responses from Kubernetes are over 100 lines of formatted JSON. The app instead shows the salient details.
When showing the demo at various events, we found ourselves wanting another view of the information for when conversations started to go deeper. The advanced view is a response to that. It gets rid of the time element, and instead displays the pods as they populate the cluster nodes, and not in a fixed grid. We show the service responses directly. We also show off which pod is actually responding to the service request. The interface includes the ability to drain a cluster node to simulate the node going down. Killing an actual node takes much longer, so this seemed like a reasonable way to simulate node death.
Why three services?
I originally wrote it as one app. When you killed all of those pods, and disrupted the service, you killed the UI too. This, combined with inconsistent caching behavior, caused really odd issues when you ran the demo. It made much more sense to split them up so as to not kill the game while you were playing the game.
Why PHP?
For starters, I do outreach for Google Cloud to the PHP community. But I also started writing this as a quick and dirty prototype. When I want something done quickly, I write it in PHP. It’s very productive for me. I used the Docker image for App Engine flexible environment’s PHP runtime to just get the thing to work.
When the time came to tighten everything up, I considered rewriting all of the APIs in Go instead. But one of the side effects of choosing PHP and the GAE flexible runtime is that there is a little overhead to starting the service, as opposed to a lean Go app that does just one thing. That overhead is only a couple of seconds, but having it allows the demo to illustrate the full lifecycle of a pod.
Why No Public Demo?
I haven’t figured out a way to convert this demo into a multi-tenant demo. It’s one front-end client to one Kubernetes cluster, so multiple players would interfere with each other. Right now the game involves trying to take down a service directly tied to a specific IP. I’m sure there is a way to rewrite it for multiple tenants; I just haven’t had the time to do it.
What I learned
Kubectl is awesome
In a lot of cases, I was just trying to recreate a kubectl command to tie to the front-end. Specific examples are kubectl delete deployment, and kubectl drain. In both cases, these are actually doing a lot of work, but hiding it behind one command. In the case of kubectl delete deployment, the command is deleting the deployment, then the replica set, then each pod and making it all one thing. If you just delete the deployment via the API, all of the children remain — and if you aren’t expecting it, you’ll be confused.
The fact that kubectl can be called with a flag to reveal those underlying calls is very much appreciated. The flag is “-v=8” in case you need it.
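The difference between the two kinds of delete is easier to see with a toy model. The sketch below is not Kubernetes client code: objects live in a plain dict, and a hypothetical `owner` field stands in for the ownership links Kubernetes records between a deployment, its replica set, and the pods.

```python
def delete_only(objects, name):
    """Delete a single object and leave everything it owns in place."""
    del objects[name]


def delete_with_children(objects, name):
    """Delete an object, then everything it transitively owns.

    Returns the names in the order they were deleted.
    """
    deleted = []
    queue = [name]
    while queue:
        current = queue.pop(0)
        if current not in objects:
            continue
        del objects[current]
        deleted.append(current)
        # Queue up every object whose owner was just removed.
        queue.extend(n for n, o in objects.items() if o["owner"] == current)
    return deleted


def example_cluster():
    """A deployment that owns one replica set, which owns two pods."""
    return {
        "api": {"kind": "Deployment", "owner": None},
        "api-rs": {"kind": "ReplicaSet", "owner": "api"},
        "api-rs-1": {"kind": "Pod", "owner": "api-rs"},
        "api-rs-2": {"kind": "Pod", "owner": "api-rs"},
    }
```

Calling `delete_only(cluster, "api")` leaves the replica set and both pods behind, which is the surprise described above. `delete_with_children` walks the same deployment, replica set, pods order that kubectl hides behind one command.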
Kubernetes is a weeble not a fortress
I think this was one of the more surprising things I learned. I didn’t go out of my way to make the system super resilient or super brittle, but under most conditions, if you are able to directly delete all of your pods, you can cause major outages for your services. However, those outages are seldom very long. In fact, at Google I/O we tracked how long people kept the service down, and the most we saw was 50% downtime. And this was two very motivated people hitting moles as soon as they appeared. Most of the time we saw downtime of less than 30%. Again, keep in mind that the point of the game mechanics is to cause as much downtime as possible.
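For a sense of what a figure like 50% downtime means, here is one way it could be computed. How we actually tracked it isn't spelled out here, so this sketch assumes the simplest approach: the share of status polls during a game that found the service down.

```python
def downtime_percent(samples):
    """Percentage of polls that found the service down.

    `samples` is a sequence of booleans, one per poll: True means up.
    """
    if not samples:
        return 0.0
    down = sum(1 for up in samples if not up)
    return 100.0 * down / len(samples)
```

With polls taken at a fixed interval, this is equivalent to the fraction of game time the service spent down, to within one polling interval.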
A dead Pod is only mostly dead
At a few events we ran into issues where none of the visible pods were in a “running” state, but the service was still up. I thought it was due to UI weirdness. Turns out it was due to the fact that running the Kubernetes delete pod command marks the pod as “terminating” and then allows for graceful termination. So if a request gets routed to such pods while they are still responding, they can still return a result, even though they are marked terminated. I would have known it already if I had bothered to read the documentation a little more carefully.
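The mismatch between what the grid showed and what the service did can be captured in a deliberately simplified model. The class and function names are hypothetical, and a real pod usually exits sooner than this, as soon as its process handles the termination signal. The 30-second value is the Kubernetes default termination grace period.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PodModel:
    name: str
    deleted_at: Optional[float] = None  # time the delete was requested, if any
    grace_seconds: float = 30.0         # Kubernetes default grace period

    def shown_as_running(self):
        """What the game grid displays: deleted pods no longer look alive."""
        return self.deleted_at is None

    def can_answer(self, now):
        """Whether the pod could still respond to a request at time `now`."""
        if self.deleted_at is None:
            return True
        return now < self.deleted_at + self.grace_seconds


def service_is_up(pods, now):
    """The service is up if any pod can still answer."""
    return any(p.can_answer(now) for p in pods)
```

In this model, a pod deleted at time 100 is shown as terminated immediately but keeps the service up until time 130, which matches what we saw at events.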
Whack-a-mole machines rest when the moles are down
When the moles are up, the machine is doing work; when they are down, the machine is resting. When Sparks first hooked up the physical whack-a-mole machine to the Kubernetes cluster (virtually, since one is in our data center and the other at the event), the moles were up all the time, because, well, that’s what Kubernetes does. Long story short, our first whack-a-mole machine burnt itself out, because when 21st century cloud technology collides with 1970s arcade technology, the cloud wins.
This was a really fun project. It helped me learn a lot about Kubernetes, while at the same time enabling a great educational experience about Kubernetes. Plus I got people to build a whack-a-mole machine. It was pretty awesome.