Troubleshooting services on GKE

Yuri Grinshteyn
Google Cloud - Community
4 min read · Nov 3, 2020

In my last post, I reviewed the new GKE monitoring dashboard and used it to quickly find a GKE entity of interest. From there, I set up an alert on container restarts using the in-context “create alerting policy” link in the entity details pane. This time, I wanted to have a go at troubleshooting an incident using this setup.

The setup

The app

You can see the full code for the simple demo app I’ve created to test this here. The basic idea is that it exposes two endpoints — a / endpoint, which is just a “hello world”, and a /crashme endpoint, which uses Go’s os.Exit(1) to terminate the process. I then created a container image using Cloud Build and deployed it to GKE. Finally, I exposed the service with a load balancer.
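For reference, here is a minimal sketch of what such a server might look like in Go. The actual code is in the linked repo; this version is illustrative and only assumes port 8080, which is the port I hit later with curl:

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// / is a plain "hello world" endpoint.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Hello, world!")
	})

	// /crashme terminates the process, so the kubelet restarts the container.
	http.HandleFunc("/crashme", func(w http.ResponseWriter, r *http.Request) {
		os.Exit(1)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}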

Once the service was deployed, I checked the running pods:

✗ kubectl get pods
NAME                                     READY   STATUS    RESTARTS   AGE
restarting-deployment-54c8678f79-gjh2v   1/1     Running   0          6m38s
restarting-deployment-54c8678f79-l8tsm   1/1     Running   0          6m38s
restarting-deployment-54c8678f79-qjrcb   1/1     Running   0          6m38s

Notice that RESTARTS is at zero for each pod initially. Once I hit the /crashme endpoint, I saw a restart:

✗ kubectl get pods
NAME                                     READY   STATUS    RESTARTS   AGE
restarting-deployment-54c8678f79-gjh2v   1/1     Running   1          9m28s
restarting-deployment-54c8678f79-l8tsm   1/1     Running   0          9m28s
restarting-deployment-54c8678f79-qjrcb   1/1     Running   0          9m28s

I was able to confirm that each request to the endpoint resulted in a restart. However, I had to be careful not to do this too often; otherwise, the containers would go into CrashLoopBackOff, and it would take time for the service to become available again. I ended up using this simple loop in my shell (zsh) to trigger restarts when I needed them:

while true; do
    curl http://$IP_ADDRESS:8080/crashme
    sleep 45
done

The alert

The next step was to set up the alerting policy. Here is how I configured it:

I used the kubernetes.io/container/restart_count metric, filtered to the specific container name (as specified in the deployment YAML file), and configured the alert to trigger if any time series exceeded zero, meaning that any container restart was observed.
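For context, the condition behind that policy boils down to a Monitoring filter on the restart_count metric with a threshold of zero. It looks roughly like this; the container_name value here is a placeholder, and it should be whatever your deployment YAML specifies:

metric.type = "kubernetes.io/container/restart_count"
resource.type = "k8s_container"
resource.label.container_name = "my-container"

The condition then fires when any matching time series is greater than 0.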

The setup was done, and I was ready to test and see what would happen!

Testing the alert

When I was ready, I started the looped script to hit the /crashme endpoint every 45 seconds. The restart_count metric is sampled every 60 seconds, so it didn’t take very long for an alert to show up on the dashboard:

I moused over the incident to get more information about it:

Already, this is an improvement over the previous version of this UI, where I couldn’t interact with the incident cards.

I then clicked on “View Incident”. This took me to the Incident details screen, where I could see the specific resources that triggered it. In my case, it was pointing to the container:

I then clicked on View Logs to see the logs (in the new Logs Viewer!), and sure enough, it was immediately apparent that the alert was triggered by the containers restarting:

This is all very nicely tied together and makes troubleshooting during an incident much easier!

In summary…

I’m a big fan of the new GKE dashboard. I really like the new alerts timeline, and I like that incidents are clearly marked and that I can actually interact with them to get the full details of exactly what happened, all the way down to the container logs that show me the actual problem.

Thanks for reading, and come back soon for more. As always, please let me know what other SRE or observability topics you’d like to see me take on. And now more than ever — stay healthy out there!
