How We Deal with a Google Kubernetes Engine (GKE) Metadata Server Outage

Why the GKE metadata server failed to work, and how we fixed it

Able Lv
Airwallex Engineering
5 min read · Jun 7, 2022


Overview

One day, after our Google Kubernetes Engine (GKE) cluster was automatically upgraded, the GKE metadata server went down for nearly a day. As a result, the kube-dns (Kubernetes DNS service) Pods kept restarting, and services in the cluster were unavailable.

This post will detail the outage, explain what caused it, and describe how we troubleshot and solved it.

What happened

We have been running our GitLab runners on the GKE cluster for two years, and we normally don’t have to worry about the components that are fully managed by Google Cloud.

Recently, however, we experienced an outage caused by the GKE metadata server failing after a cluster upgrade. The GKE metadata server is a GKE-managed component that provides Compute Engine metadata to workloads running on the nodes.
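
For context, workloads on nodes with Workload Identity reach this server through the usual Compute Engine metadata endpoint. A minimal sketch of such a request (this is the standard metadata path, shown here only for illustration):

    # From inside a Pod: metadata.google.internal resolves to 169.254.169.254,
    # and the request is served by the gke-metadata-server Pod on that node.
    curl -s -H "Metadata-Flavor: Google" \
      "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"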

All our developers were experiencing an issue with our GitLab runners. Due to the GKE metadata server failure, the kube-dns Pods kept restarting with the following error message:

kube-dns error message

The Kubernetes DNS service was down, which caused a lot of CI/CD jobs to get stuck or fail.

Mitigating the issue

We recovered the DNS service by deleting the prometheus-to-sd sidecar container from the kube-dns Deployment. The kube-dns Pods came back up, and the GitLab runners immediately got back to work.
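
For reference, a removal like this can be done with kubectl; a minimal sketch, assuming the standard kube-dns Deployment in kube-system (containers are merged by name, so a strategic merge patch with a "$patch": "delete" directive drops the sidecar):

    # Remove the prometheus-to-sd sidecar from the kube-dns Deployment.
    kubectl -n kube-system patch deployment kube-dns --type=strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"prometheus-to-sd","$patch":"delete"}]}}}}'

    # Or simply edit the Deployment and delete the container block by hand.
    kubectl -n kube-system edit deployment kube-dns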

However, we still encountered the following error when building Docker images:

unauthorized error when building Docker images

Our Docker images are hosted on Google Container Registry (GCR). The docker-credential-gcr helper is used to make authenticated requests to GCR, but it failed to fetch credentials for the VM from the GKE metadata server.
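
To confirm where the failure sat, the helper can be exercised directly; a sketch (passing the registry host on stdin follows the standard Docker credential-helper protocol, and the exact invocation may differ in your setup):

    # Ask the helper for GCR credentials; under the hood it requests a token
    # from the metadata server, which is the call that was failing for us.
    echo "https://gcr.io" | docker-credential-gcr get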

Identifying the root cause

More team members got involved in discussing and investigating the issue.

Firstly, we noticed that the GKE metadata server’s IP address, 169.254.169.252, was assigned to the node’s loopback interface.
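
This can be checked directly on the node (a sketch; it assumes SSH access to the node, for example via gcloud compute ssh):

    # The link-local metadata address shows up on the loopback interface.
    ip addr show dev lo | grep 169.254.169.252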

Secondly, we observed that our gke-metadata-server version was outdated compared with other clusters: ours was 20200626, while the others were on 20220301. Moreover, our gke-metadata-server was listening only on 127.0.0.1:988, whereas in the other clusters it was listening on all interfaces (:::988).
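
Both observations are easy to reproduce; a sketch (the DaemonSet name is what GKE deploys for Workload Identity, and the socket check runs on the node itself):

    # Image tag of the metadata server DaemonSet (compare across clusters).
    kubectl -n kube-system get daemonset gke-metadata-server \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

    # On the node: which address and port is it bound to?
    sudo netstat -tlnp | grep 988    # or: sudo ss -tlnp | grep 988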

At this point we had a likely root cause: Google Cloud had not upgraded our gke-metadata-server, which introduced a compatibility issue.

According to the GKE release notes, the GKE metadata server address changed from 127.0.0.1:988 to 169.254.169.252:988 starting with version 1.21.0-gke.1000.

The iptables rules on the node showed that requests to 169.254.169.254:80 were forwarded (DNAT) to 169.254.169.252:988. Because our gke-metadata-server was listening only on 127.0.0.1:988, requests to 169.254.169.252:988 were inevitably refused. We had found the root cause.
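
The relevant rules can be listed on the node; a sketch (chain names vary between GKE versions, so grep for the addresses instead):

    # NAT rules involving the metadata addresses (run on the node).
    sudo iptables -t nat -S | grep -E '169\.254\.169\.(252|254)'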

Lastly, we reported the above root cause to Google Cloud Support.

Resolving the issue

Two hours had passed since we reported the root cause to Google Cloud Support, and our team members were concerned that Google Cloud Support would not be able to resolve the problem quickly.

Could we try disabling and re-enabling the GKE metadata server to see whether it would be upgraded to the latest version?

It worked. We manually deleted the gke-metadata-server DaemonSet, and GKE re-created it with the latest version. The GKE metadata server was then up and running, and all GitLab runners were back at work.
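
A sketch of the operation, assuming the DaemonSet lives in kube-system (which is where GKE deploys it for Workload Identity):

    # Delete the DaemonSet; GKE re-creates it at the current version.
    kubectl -n kube-system delete daemonset gke-metadata-server

    # Watch the rollout of the re-created DaemonSet.
    kubectl -n kube-system rollout status daemonset gke-metadata-server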

Root cause

From the logs, the issue began when the GKE nodes were upgraded from 1.20 to 1.21. According to the GKE release notes, starting with GKE 1.21, the GKE metadata server address changed from 127.0.0.1 to 169.254.169.252.

However, Google Cloud failed to upgrade our cluster’s gke-metadata-server because it was running a very old version, so it was still listening on the old 127.0.0.1 address. Existing workloads attempted to connect to 169.254.169.252 (as they should from 1.21 onward) but failed with a connection refused error.

Lessons learned

  • Having a backup plan is essential.
  • Set up monitoring and send an alert when many GitLab CI/CD jobs fail.
  • Periodically check the GKE release notes for announcements about known issues, new features, bug fixes, and deprecated functionality.
  • Though the gke-metadata-server is managed by Google Cloud Platform (GCP), we still need to know how it works and how to debug it. Every node in a GKE cluster with Workload Identity enabled exposes its metadata through the GKE metadata server, which runs as a DaemonSet with one Pod on every node in the cluster (see the sketch after this list).
  • Be proactive in problem-solving, even if it’s not something you are responsible for. This can accelerate the whole process and lead to a better result in the end. This time, our team worked together to identify and solve the issue.
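
A quick way to see this layout in a cluster (a sketch; the grep simply filters the Pod list by name):

    # One gke-metadata-server Pod per node.
    kubectl -n kube-system get pods -o wide | grep gke-metadata-server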

This was an interesting troubleshooting and learning experience. Hopefully, you’ll find it useful!

Thanks to the Infrastructure & Productivity Team.

Able Lv is a Senior DevOps Engineer at Airwallex.
