The “Snowball” Effect in Kubernetes
How one bad call can lead to hundreds of unwanted pods
Kubernetes puts the mantra of “fail fast, recover faster” into practice: one thing fails, and another seamlessly pops up to replace it. But what happens when that guiding principle works against you? We’d like to share one such scenario we encountered in GKE, our findings, and how we overcame it.
While we first encountered this issue some two years ago, I have since seen it crop up on StackOverflow, so it’s not a rare occurrence. I raised it in the Kubernetes project, and have seen it referenced in other issues too.
Symptoms
One day we noticed that Jobs and CronJobs in our cluster were acting strangely; multiple Jobs had spawned and sat there for more than an hour without creating any pods.
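For anyone seeing similar behaviour, a quick sanity check is to list the Jobs in the affected namespace (using the same {namespace} placeholder as the commands later in this post):

kubectl get jobs -n {namespace}

Jobs stuck in this state show 0/1 in the COMPLETIONS column, with an AGE far older than the schedule interval would suggest.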
Digging into the Problem
Investigating other jobs, we found that a single CronJob in one of our namespaces was responsible for creating more than 900 pods. These pods had completed but had not been removed.
The CronJob was scheduled to run every minute, and its definition had sensible values set for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit. Even if they hadn’t been set, the defaults would (or should) have been used.
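For illustration, a minimal CronJob manifest with those history limits looks something like this (the name and container here are hypothetical, not our actual definition):

apiVersion: batch/v1beta1   # batch/v1 from Kubernetes 1.21 onwards
kind: CronJob
metadata:
  name: example-cron        # hypothetical name
spec:
  schedule: "* * * * *"             # run every minute
  successfulJobsHistoryLimit: 3     # keep at most 3 completed Jobs (the default)
  failedJobsHistoryLimit: 1         # keep at most 1 failed Job (the default)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo hello"]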
Why did we have over 900 cron pods, and why weren’t they being cleaned up upon completion? Just in case the number of pods was causing problems, we cleared out the completed pods:
kubectl delete pods -n {namespace} $(kubectl get pods -n {namespace} | grep Completed | awk '{print $1}' | xargs)
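(Incidentally, if your kubectl supports it, a field selector achieves the same cleanup without the grep/awk pipeline, since completed pods have a phase of Succeeded:

kubectl delete pods -n {namespace} --field-selector=status.phase=Succeeded
)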
Even after that, new Jobs weren’t spawning pods. In fact, more CronJob pods were appearing in this namespace. So we disabled (suspended) the CronJob:
kubectl patch cronjobs -n {namespace} {cronjob-name} -p '{"spec" : {"suspend" : true }}'
That also didn’t help; pods were still being generated. Which was weird: why was a CronJob still spawning pods, even when it was suspended?
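To rule out the patch simply not having applied, the suspend flag can be checked directly with a standard jsonpath query:

kubectl get cronjob -n {namespace} {cronjob-name} -o jsonpath='{.spec.suspend}'

If this prints true, the CronJob really is suspended, and something else must be creating the pods.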
The Lightbulb Moment
CronJobs generate Job objects when their schedules trigger. On checking, we found more than 3,000 Job objects; a lot more than there should be for something that runs once a minute. So we deleted all the Job objects related to the CronJob:
kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)
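To watch the backlog actually shrink while this runs, a simple count of the remaining Job objects is enough:

kubectl get jobs -n {namespace} --no-headers | wc -l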
Deleting those Jobs reduced the pod count, but did not help us determine why the other Job objects were not spawning pods.
Bringing In Reinforcements
We raised a support ticket with Google, who initially sent us this log snippet (redacted):
2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted
2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node
Spot the problem? The gap between “Job is created” and “Pod is created” is about 80 minutes in the first case and about 12 minutes in the second. It took 80 minutes for a pod to be spawned.
It All Becomes So Clear…
This is where it dawned on me what was possibly going on.
- The CronJob spawned a Job object. That Job tried to spawn a pod, and that took a significant amount of time, far more than the 1 minute between runs.
- On the next cycle, the CronJob checked whether it had a running pod, because of its .spec.concurrencyPolicy value (see the status check after this list).
- The CronJob did not find a running pod, so it generated another Job object, which also got stuck waiting for pod generation.
- And so on, ad infinitum.
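As noted in the second step, what matters here is the controller’s view of “running”. One way to inspect that from the CronJob’s side is its .status.active list, which records the Jobs the controller considers in flight (again using the placeholder names from earlier):

kubectl get cronjob -n {namespace} {cronjob-name} -o jsonpath='{.status.active}'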
Each time, a new Job gets added and gets stuck waiting for pod generation for an abnormally long time, which causes another Job to be added to the namespace, which also gets stuck…
Eventually, the pod will generate, but by then there’s a backlog of Jobs, meaning that even if I suspended the CronJob, it won’t have any effect until the Jobs in the backlog are cleared or deleted.
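The backlog is easy to eyeball once the Jobs are sorted by creation time:

kubectl get jobs -n {namespace} --sort-by=.metadata.creationTimestamp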
The Cause
Google investigated further, and found the culprit:
Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
We had been testing UP9, and this configuration used a webhook, so it looked like a misbehaving webhook was causing the problem. We removed the webhook and its webhook configurations, and everything started working again.
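If you need to do the same, admission webhook configurations are cluster-scoped objects and can be listed and deleted directly (the name below is a placeholder; the error message tells you which one to look for):

kubectl get mutatingwebhookconfigurations
kubectl delete mutatingwebhookconfiguration {webhook-config-name}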
Lessons Learnt
So where does this leave us? Well, a few thoughts:
- A misbehaving or misconfigured webhook can cause a snowball effect in the cluster, triggering multiple runs of a single CronJob without cleanup; the successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored. (A defensive webhook configuration is sketched after this list.)
- This could break systems where the CronJob is supposed to run mutually exclusively, since the delay in pod generation could allow two cron pods to run together, even though the CronJob has a concurrencyPolicy set to Forbid.
- If someone managed (whether intentionally or maliciously) to install a webhook that causes this pod-spawning delay, then added a CronJob that runs once a minute and crafted its job to never finish, this snowball effect would cause the cluster to run out of resources and/or scale up nodes forever, or until it hit the maximum allowed by your configuration. Fortunately, this is now known to the Kubernetes team.
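On that first point: the “failing open” wording in the error above comes from the webhook’s failurePolicy. If you run admission webhooks, the relevant knobs look like this (a sketch with illustrative names and values, not the UP9 configuration):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-injector            # hypothetical
webhooks:
- name: example.injector.invalid    # hypothetical
  failurePolicy: Ignore   # "fail open": admit the pod if the webhook is unreachable
  timeoutSeconds: 5       # give up quickly, rather than the 30s seen in the log above
  clientConfig:
    service:
      name: injector
      namespace: injector-system
      path: /mutate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  sideEffects: None
  admissionReviewVersions: ["v1"]

Note that even with failurePolicy: Ignore, each pod creation still waits out the timeout before being admitted; presumably that per-call delay, compounded by controller retries, is where our much longer delays came from.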
Johnny Ooi is a Senior Site Reliability Engineer at The Telegraph. You can follow him @Blenderfox.