EKS — Kubernetes — Not Ready nodes
Today I’m going to talk about an issue that I encountered a couple of days ago while working on EKS 1.21.
If you like the story, remember to clap and hopefully subscribe; I really appreciate it!
I work with Kubernetes constantly, and yesterday evening I saw a couple of Kubernetes nodes that were flapping between the Ready/NotReady states.
The symptoms telling me that something was not working fine were (a few commands to spot them are sketched right after the list):
- Pods stuck in ContainerCreating state for more than 2 minutes
- Slower gitlab-runner creation times, caused by the point above
- Pods stuck in Terminating for more than 5 minutes with no PDB or strange graceful termination involved
- Alerts that some nodes were flipping between the Ready/NotReady state
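Nothing fancy is needed to spot these symptoms; roughly, generic kubectl commands like these (not the exact ones from my terminal) are enough:
# Watch nodes flipping between Ready and NotReady
kubectl get nodes -w
# List pods stuck in ContainerCreating or Terminating across all namespaces
kubectl get pods -A | grep -E 'ContainerCreating|Terminating'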
So I decided to see what was going on.
kubectl describe node <NODE-NAME>
In the Conditions section, I saw the error:
KubeletNotReady PLEG is not healthy: pleg was last seen active xx
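If you only want the node conditions without the whole describe output, a jsonpath query like this one (generic kubectl, not taken from the original session) does the job:
kubectl get node <NODE-NAME> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'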
This was something I had never seen before, so I started asking Google for help.
The first article was: https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes
This article is very technical and cool to read and understand: it explains what PLEG is and the source code behind it. Thanks to it, you can understand why the issues below lead to PLEG is not healthy.
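Since PLEG lives inside the kubelet, its complaints end up in the kubelet logs. On an Amazon Linux 2 worker node the kubelet runs as a systemd unit, so something like this, run on the affected node (my own suggestion, not a step from the original troubleshooting), shows the relevant messages:
# On the affected worker node (e.g. via SSH or SSM)
journalctl -u kubelet --since "30 min ago" | grep -i pleg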
Searching further, I found more than a couple of issues on GitHub/Reddit (quick checks for some of them are sketched after the list):
- CNI/Network issue (old version, lack of IPs)
- A lot of pods scheduled on the same node being constantly created/terminated due to specific workloads/autoscaling
- A restart of the docker daemon / containerd, version mismatch …
- Issues with mounted volumes
- Other reasons, as always..
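For completeness, these are the kind of quick checks I would map to the causes above; the unit names and paths are assumptions that depend on the runtime and AMI you use:
# Pod density/churn on the suspect node
kubectl get pods -A -o wide --field-selector spec.nodeName=<NODE-NAME> | wc -l
# On the node itself:
systemctl status docker        # or containerd, depending on the runtime
docker ps -a | wc -l           # lots of dead/churning containers slow down the PLEG relist
df -h /var/lib/docker          # disk pressure on the runtime directory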
So I checked if there was a newer version of the CNI plugin for EKS:
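The exact command didn’t survive in this post; the check documented by AWS looks roughly like this:
kubectl describe daemonset aws-node --namespace kube-system | grep amazon-k8s-cni: | cut -d : -f 3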
I was already using exactly the version mentioned there.
In parallel, I checked the worker node logs, the EC2 instance status/monitoring, the VPC configuration, and everything else that came to my mind.
At some point, I decided to try a different approach and compare a Healthy node with a NotHealthy one.
This issue was happening in two completely different clusters, and it was clear that it only affected the new worker nodes requested by the cluster-autoscaler.
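To quickly see which nodes are the freshly created ones, sorting by creation time is enough; the extra label column assumes managed node groups, which is not necessarily what was in use here:
kubectl get nodes --sort-by=.metadata.creationTimestamp -L eks.amazonaws.com/nodegroup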
So I decided to take a closer look at the NodeGroup and the system info of the nodes:
kubectl describe node <NODE_NAME> | grep -A 10 "System Info"
System Info:
  Machine ID:                 xxx
  System UUID:                xxx
  Boot ID:                    xxx
  Kernel Version:             5.4.217-126.408.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.17
  Kubelet Version:            v1.21.14-eks-ba74326
  Kube-Proxy Version:         v1.21.14-eks-ba74326
Everything between a Healthy and a NotHealthy node was the same.
At some point, I started thinking about the AMI used by the worker nodes.
AWS releases new AMIs for EKS worker nodes very often, and the latest one had been released on 27/10/22.
Looking closer, I saw that the worker nodes experiencing this issue were all using the same AMI.
Name: amazon-eks-node-1.21-v20221027
ID: ami-00c00a9bbe9b9301a
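For reference, this is one way to map a node back to its AMI; the instance ID below is a made-up example, and it assumes the AWS CLI is configured for the right account/region:
# Get the EC2 instance ID from the node's provider ID
kubectl get node <NODE_NAME> -o jsonpath='{.spec.providerID}'   # e.g. aws:///eu-west-1a/i-0123456789abcdef0
# Resolve the AMI behind that instance, then its name
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 --query 'Reservations[].Instances[].ImageId' --output text
aws ec2 describe-images --image-ids ami-00c00a9bbe9b9301a --query 'Images[].Name' --output text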
So what to do?
- Rollback to a previous working version (a sketch of this is right below)
- Raise a ticket to AWS
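If you go for the rollback and use managed node groups, pinning a previous AMI release looks roughly like this; the release-version value is a placeholder, so pick the last working release from the amazon-eks-ami changelog for your Kubernetes version:
# <PREVIOUS_WORKING_RELEASE> has the format <k8s-version>-<AMI date>, e.g. 1.21.14-YYYYMMDD
aws eks update-nodegroup-version \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <NODEGROUP_NAME> \
  --release-version <PREVIOUS_WORKING_RELEASE> \
  --force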
After a few hours, I heard back from AWS support: the issue was exactly the AMI.
Reading that response left me with a very bad feeling…
Why does no one send out a notification about an issue like this one?
“If a tree falls in a forest and no one hears it, does it make a noise?
Berkeley argued that objects exist only as perceived. So if a tree falls in a forest and no one hears it, it makes no noise.”
Looking further, I found that the broken version had been marked [RECALLED] AMI Release v20221027
…and there is also an official issue on GitHub: https://github.com/awslabs/amazon-eks-ami/issues/1071
I hope someone can get some help from this blog post. To support me, you can clap & subscribe to get more in the upcoming weeks!