Dealing with terminated pods in GKE clusters using non-standard provisioning models

MediaMarktSaturn Tech Blog
7 min read · Mar 24, 2023

by Rene Schach

Recently, Kubernetes introduced the “graceful node shutdown” feature, which is enabled by default for Google Kubernetes Engine (GKE) managed clusters. In certain scenarios this leads to terminated Pods being kept on the cluster and polluting the overview of deployed workloads. This article describes one of those scenarios that we faced in our setups and explains a little gadget we built and published to tackle the issue.

Background

As described in our Terraform MediaMarktSaturn article, our cloud resources mostly run on the Google Cloud Platform (GCP). A service we use heavily for running our applications in GCP is the Google Kubernetes Engine (GKE), Google’s managed Kubernetes offering.

Kubernetes offers a so-called “graceful node shutdown” feature, which has been in beta since v1.21 (and still was at the time this article was written). In short, whenever a Kubernetes node is shut down, the node’s kubelet makes sure that the pods running on it are terminated “gracefully”. The goal of a graceful termination is to give a pod the chance to finish its running operations and clean up properly rather than being killed on the spot (in the end, a pod simply represents one or more processes running on the cluster node).
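For illustration, this is roughly how the feature is configured on a self-managed kubelet via its configuration file; the duration values below are purely illustrative, and on GKE the kubelet is managed by Google, so this is nothing you set yourself:

# Excerpt of a kubelet configuration file enabling graceful node shutdown
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s              # total time the node waits for pods to terminate
shutdownGracePeriodCriticalPods: 10s  # portion of that time reserved for critical pods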

In GKE, graceful node shutdown has been enabled by default for nodes running version 1.20.5-gke.500 or later, and there is no real way of disabling it, as the Kubernetes control components are not configurable in the managed GKE setup. As of today, versions older than 1.20.5-gke.500 are basically no longer available in GKE (at least not when using the GKE release channels, which is a recommended best practice and widely implemented in our organization). Because of this, the majority of our GKE clusters have graceful node shutdown enabled.
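If you want to check which versions your nodes are running, a quick kubectl query like the following should do:

# List the kubelet/GKE version of every node in the cluster
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion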

Side effects when using preemptible or spot clusters

GKE nodes are always provisioned as Google Compute Engine virtual machine (VM) instances (unless you are using the Autopilot mode, to which this article does not apply). Besides standard nodes, GKE also offers the use of preemptible or spot VMs (both provisioning models are similar; spot VMs are basically the successor of preemptible VMs).

Preemptible or spot nodes do not run 24/7: Google is allowed to terminate the VMs whenever the capacity is needed somewhere else (preemptible VMs do not live longer than 24 hours; for spot VMs this limit does not apply). The main benefit of these provisioning models is the reduced cost. Google estimates the discount of spot and preemptible VMs (both use the same pricing model) at 60% up to 91% off the price of standard VMs/nodes.
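For context, creating such a node pool is a matter of a single flag. Here is a rough gcloud sketch with made-up cluster, pool and region names:

# Add a node pool of spot VMs to an existing GKE cluster (names are illustrative)
gcloud container node-pools create spot-pool \
  --cluster=my-cluster \
  --region=europe-west1 \
  --spot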

Because of this, preemptible or spot clusters are often used for non-production environments in MMS GCP projects. As workloads are not expected to be up and running all the time in non-production environments, the recreation of nodes does not have any negative impact.

Whenever a node is shut down due to preemption, the graceful node shutdown is triggered and gracefully terminates all pods running on that node. A new node will then be provisioned and the replacement pods will be scheduled on it.

During the node shutdown, its kubelet sets the phase of all Pods running on the node to Failed and the reason to Terminated. This complies with the Kubernetes standards, as the Failed phase may be set when the system terminates a pod (in this case, Failed does not necessarily mean that the Pod termination actually failed). The status is also persisted in the YAML manifest of an affected Pod:

...
status:
  message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  podIP: 10.x.y.z
  podIPs:
  - ip: 10.x.y.z
  reason: Terminated
...

The Kubernetes garbage collection takes care of removing terminated pods from the cluster. The threshold number of terminated pods that may exist before the garbage collection kicks in can be set via the --terminated-pod-gc-threshold flag of the kube-controller-manager component (a sketch of how this looks outside of GKE follows after the quote below). However, in the managed GKE environment the control plane components (such as the kube-controller-manager) can neither be accessed nor configured. GKE sets default values for the pod garbage collection, which are documented as follows:

When the number of terminated Pods reaches a threshold of 1000 for clusters with fewer than 100 nodes or 5000 for clusters with 100 nodes or more, garbage collection cleans up the Pods.
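On a self-managed control plane, the flag mentioned above is typically passed to the kube-controller-manager in its static pod manifest; the threshold value below is only an example:

# Sketch of a self-managed kube-controller-manager static pod manifest (not possible on GKE)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.26.0
    command:
    - kube-controller-manager
    # clean up once 100 terminated pods exist (example value)
    - --terminated-pod-gc-threshold=100
    # ... all other required flags (kubeconfig, certificates, leader election, etc.) omitted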

This means that in GKE, up to 1000 terminated pods can pile up by default. In preemptible or spot clusters, nodes are frequently deleted and recreated, which leads to a lot of pods being terminated over time. This can easily be verified by listing the pods of an existing Kubernetes deployment, e.g. with the kubectl command shown below. The resulting list, as displayed in the Google Cloud Console, would make a certain Arnold Schwarzenegger very proud.
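A quick way to see the pile-up across all namespaces of the cluster:

# List all terminated/failed pods across all namespaces
kubectl get pods -A --field-selector=status.phase=Failed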

Implementing and using an automated Kubernetes Pod broom

Now, terminated pods normally do not have any negative impact on your cluster other than being annoying when checking or displaying the running pods. As a side note, resource reservations of terminated pods no longer have any effect on the cluster; they are simply ignored.

However, explicitly removing these pods from the cluster is still beneficial, simply to keep the cluster clean. The deletion can be triggered quite easily using kubectl. As written in the previous paragraph, Pods terminated by the system (GKE/GCP) are marked as Failed, and the phase is persisted in the pods’ YAML manifests. So let’s use the --field-selector flag to select all Failed pods and delete them in one single command:

# Use -A or --all-namespaces to check and delete the pods across all namespaces of the cluster
kubectl delete pods -A --field-selector=status.phase=Failed

While this command works perfectly fine, no one wants to execute it manually in a shell every time. So let’s define a Kubernetes CronJob that runs the command on a regular schedule:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-cleanup
  namespace: default
spec:
  schedule: "0 6 * * *" # Runs each day at 06:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          automountServiceAccountToken: true
          serviceAccountName: pod-cleanup-sa
          restartPolicy: OnFailure # Job pod templates must not use the default "Always"
          containers:
          - name: pod-cleanup-job
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - kubectl delete pods -A --field-selector=status.phase=Failed
...

When applied, this manifest creates a CronJob that periodically spins up a Kubernetes Job and runs our kubectl delete ... command. The only thing to make sure of is that the specified Kubernetes service account (in this case pod-cleanup-sa) is granted the necessary RBAC permissions to delete pods across the cluster. The automountServiceAccountToken property ensures that the kubectl command uses an access token associated with the specified pod-cleanup-sa account.
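For completeness, here is a minimal sketch of RBAC objects that would give such a service account the required permissions; the published Helm chart described below ships its own ClusterRole and ClusterRoleBinding, so the names here are only illustrative:

---
# Service account the CronJob runs as (matches the manifest above)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-cleanup-sa
  namespace: default
---
# Cluster-wide permission to list and delete pods
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-cleanup
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
---
# Bind the role to the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-cleanup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-cleanup
subjects:
- kind: ServiceAccount
  name: pod-cleanup-sa
  namespace: default
...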

How to claim and run your own Pod broom

If you stumbled upon the same problem and are looking for an easy-to-deploy solution, we are happy to share that we have published the approach described in this article as a Helm chart.

The chart is part of our public MediaMarktSaturn/helm-charts repository and can be found here. It deploys a Kubernetes CronJob running with a dedicated service account and the appropriate permissions granted via a ClusterRole and a ClusterRoleBinding. The chart works perfectly fine with its default values, but it can also be configured (e.g. by simply setting a different cron schedule). The CronJob periodically runs the kubectl delete ... command and removes all terminated/failed pods from the cluster it is running on.

The helm-charts repository also includes a small instruction section about how to add the Helm repository to a FluxCD setup or how to add it imperatively by running helm repo add ....
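In a FluxCD setup, that boils down to a HelmRepository resource along the lines of the following sketch; the URL is an assumption based on the usual GitHub Pages layout, so please take the authoritative one from the repository README:

---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: mediamarktsaturn
  namespace: flux-system
spec:
  interval: 10m
  # Assumed GitHub Pages URL of the chart repository - check the README for the exact value
  url: https://mediamarktsaturn.github.io/helm-charts
...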

The chart can be deployed by creating a HelmRelease:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: pod-cleanup
  namespace: default
spec:
  interval: 10m
  chart:
    spec:
      chart: pod-cleanup
      version: "~1"
      sourceRef:
        kind: HelmRepository
        name: mediamarktsaturn
        namespace: flux-system
  values:
    schedule: "0 6 * * *" # Each day at 06:00 AM

The chart can also be installed imperatively by running helm install pod-cleanup mediamarktsaturn/pod-cleanup --set schedule="0 6 * * *"

Summary

While the graceful node shutdown is a nice feature, it unfortunately has a side effect when using GKE clusters with preemptible/spot nodes. As it is not possible to configure the Kubernetes garbage collection in GKE, we had to find a custom solution to tackle this. A small centrally managed Helm chart is a nice and clean solution that teams across our organization can easily install in their clusters.

The approach can also be applied to other Kubernetes environments (managed or unmanaged), even though there might be other options there (as mentioned in this article, one can e.g. adjust the kube-controller-manager garbage collection configuration).

We hope you enjoyed this article. Let us know if you have faced similar problems and how you tackled or solved them. Maybe there are other nice solutions out there!

get to know us 👉 https://mms.tech 👈
