EKS — Kubernetes — Not Ready nodes
Today I’m going to talk about an issue that I encountered a couple of days ago while working on EKS 1.21.
If you like the story, remember to clap and hopefully subscribe; I really appreciate it!
I work with Kubernetes constantly, and yesterday evening I saw a couple of Kubernetes nodes that were flapping between the Ready/NotReady states.
The symptoms telling me that something was not working fine were (a few commands to spot them are sketched right after the list):
- Pods stuck in ContainerCreating state for more than 2 minutes
- Slower gitlab-runner creation times, caused by the point above
- Pods stuck in Terminating for more than 5 minutes with no PDB or strange graceful termination involved
- Alerts that some nodes were flipping between the Ready/NotReady state
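Nothing fancy is needed to spot these symptoms; roughly, generic kubectl commands like these (not the exact ones from my terminal) are enough:
# Watch nodes flipping between Ready and NotReady
kubectl get nodes -w
# List pods stuck in ContainerCreating or Terminating across all namespaces
kubectl get pods -A | grep -E 'ContainerCreating|Terminating'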
So I decided to see what was going on.
kubectl describe node <NODE-NAME>
In the Conditions section, I saw the error:
KubeletNotReady PLEG is not healthy: pleg was last seen active xx
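If you only want the node conditions without the whole describe output, a jsonpath query like this one (generic kubectl, not taken from the original session) does the job:
kubectl get node <NODE-NAME> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'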
This was something I had never seen before, so I started asking Google for help.
The first article was: https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes
This article is very technical and cool to read and understand: it explains what PLEG is and the source code behind it. Thanks to it, you can understand why the issues below lead to PLEG is not healthy.
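Since PLEG lives inside the kubelet, its complaints end up in the kubelet logs. On an Amazon Linux 2 worker node the kubelet runs as a systemd unit, so something like this, run on the affected node (my own suggestion, not a step from the original troubleshooting), shows the relevant messages:
# On the affected worker node (e.g. via SSH or SSM)
journalctl -u kubelet --since "30 min ago" | grep -i pleg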
Searching further, I found more than a couple of issues on GitHub/Reddit (quick checks for some of them are sketched after the list):
- CNI/Network issue (old version, lack of IPs)
- A lot of pods scheduled on the same node being constantly created/terminated due to specific workloads/autoscaling
- A restart of the docker daemon / containerd, version mismatch …
- Issues with mounted volumes
- Other reasons, as always..
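For completeness, these are the kind of quick checks I would map to the causes above; the unit names and paths are assumptions that depend on the runtime and AMI you use:
# Pod density/churn on the suspect node
kubectl get pods -A -o wide --field-selector spec.nodeName=<NODE-NAME> | wc -l
# On the node itself:
systemctl status docker        # or containerd, depending on the runtime
docker ps -a | wc -l           # lots of dead/churning containers slow down the PLEG relist
df -h /var/lib/docker          # disk pressure on the runtime directory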
So I checked if there was a newer version of the CNI plugin for EKS:
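The exact command didn’t survive in this post; the check documented by AWS looks roughly like this:
kubectl describe daemonset aws-node --namespace kube-system | grep amazon-k8s-cni: | cut -d : -f 3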
I was already using exactly the version mentioned there.
In parallel, I checked the worker node logs, the EC2 instance status/monitoring, the VPC configuration, and everything else that came to my mind.
At some point, I decided to try a different approach and compare a Healthy node with a NotHealthy one.
This issue was happening in two completely different clusters, and it was clear that it only affected the new worker nodes requested by the cluster-autoscaler.
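To quickly see which nodes are the freshly created ones, sorting by creation time is enough; the extra label column assumes managed node groups, which is not necessarily what was in use here:
kubectl get nodes --sort-by=.metadata.creationTimestamp -L eks.amazonaws.com/nodegroup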
So I decided to take a closer look at the NodeGroup and the system info of the nodes:
kubectl describe node <NODE_NAME> | grep -A 10 "System Info"
System Info:
  Machine ID:                 xxx
  System UUID:                xxx
  Boot ID:                    xxx
  Kernel Version:             5.4.217-126.408.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.17
  Kubelet Version:            v1.21.14-eks-ba74326
  Kube-Proxy Version:         v1.21.14-eks-ba74326
Everything between a Healthy and a NotHealthy node was the same.
At some point, I started thinking about the AMI used by the worker nodes.
AWS releases new AMIs for EKS worker nodes very often, and the latest one had been released on 27/10/22.
Looking closer, I saw that the worker nodes experiencing this issue were all using the same AMI.
Name: amazon-eks-node-1.21-v20221027
ID: ami-00c00a9bbe9b9301a
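For reference, this is one way to map a node back to its AMI; the instance ID below is a made-up example, and it assumes the AWS CLI is configured for the right account/region:
# Get the EC2 instance ID from the node's provider ID
kubectl get node <NODE_NAME> -o jsonpath='{.spec.providerID}'   # e.g. aws:///eu-west-1a/i-0123456789abcdef0
# Resolve the AMI behind that instance, then its name
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 --query 'Reservations[].Instances[].ImageId' --output text
aws ec2 describe-images --image-ids ami-00c00a9bbe9b9301a --query 'Images[].Name' --output text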
So what to do?
- Rollback to a previous working version (a sketch of this is right below)
- Raise a ticket to AWS
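If you go for the rollback and use managed node groups, pinning a previous AMI release looks roughly like this; the release-version value is a placeholder, so pick the last working release from the amazon-eks-ami changelog for your Kubernetes version:
# <PREVIOUS_WORKING_RELEASE> has the format <k8s-version>-<AMI date>, e.g. 1.21.14-YYYYMMDD
aws eks update-nodegroup-version \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <NODEGROUP_NAME> \
  --release-version <PREVIOUS_WORKING_RELEASE> \
  --force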
After a few hours, I heard back from AWS support: the issue was exactly the AMI.
Reading that response left me with a very bad feeling…
Why does no one send out a notification about an issue like this one?
“If a tree falls in a forest and no one hears it, does it make a noise?
Berkeley argued that objects exist only as perceived. So if a tree falls in a forest and no one hears it, it makes no noise.”
Looking further, I found that the broken version had been marked [RECALLED] AMI Release v20221027
…and there is also an official issue on GitHub: https://github.com/awslabs/amazon-eks-ami/issues/1071
I hope someone can get some help from this blog post. To support me, you can clap & subscribe to get more in the upcoming weeks!