EKS — Kubernetes — Not Ready nodes

Pier
Published in Geek Culture
4 min read · Nov 7, 2022

Today I’m going to talk about an issue that I encountered a couple of days ago while working on EKS 1.21.
If you like the story, remember to clap and hopefully subscribe; I really appreciate that!

I work with Kubernetes constantly, and yesterday evening I saw a couple of Kubernetes nodes flapping between the Ready and NotReady states.

Number of NotReady Events for a specific node

The symptoms that told me something was wrong were:

  • Pods stuck in ContainerCreating for more than 2 minutes
  • Slower gitlab-runner creation time, caused by the bullet point above
  • Pods stuck in Terminating for more than 5 minutes, with no PodDisruptionBudget or unusual graceful termination period involved
  • Alerts that some nodes were flipping between the Ready and NotReady states
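
The first and third symptoms can be spotted quickly from the command line. The snippet below is a minimal sketch, not something from my original troubleshooting session: `stuck_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -A`, where STATUS is the fourth column.

```shell
# Filter the table printed by `kubectl get pods -A` (read from stdin)
# down to pods that are in ContainerCreating or Terminating.
# Assumption: STATUS is the 4th column, as in the default output.
stuck_pods() {
  # NR == 1 keeps the header row so the result stays readable.
  awk 'NR == 1 || $4 == "ContainerCreating" || $4 == "Terminating"'
}

# Usage (assumes kubectl is pointed at the affected cluster):
#   kubectl get pods -A | stuck_pods
```

The AGE column in the result shows how long each pod has been in that state, which is what separates a normal start-up from a stuck one.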

So I decided to see what was going on.

kubectl describe node <NODE-NAME>

In the Conditions section, I saw the error:

KubeletNotReady  PLEG is not healthy: pleg was last seen active xx
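
With more than a handful of nodes, describing them one by one gets tedious. As a rough sketch, the Ready condition of every node can be pulled with a JSONPath query and filtered for the PLEG message. The function names `node_ready_report` and `pleg_unhealthy_nodes` are hypothetical helpers introduced here for illustration.

```shell
# Print one line per node: name, Ready status, Ready message (tab-separated).
node_ready_report() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}{end}'
}

# Read that report from stdin and keep only the names of nodes
# whose Ready message mentions an unhealthy PLEG.
pleg_unhealthy_nodes() {
  awk -F '\t' '$3 ~ /PLEG is not healthy/ { print $1 }'
}

# Usage:
#   node_ready_report | pleg_unhealthy_nodes
```

Since the nodes were flapping, a single run only catches the nodes that are NotReady at that moment, so it may need to be repeated a few times.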

I had never seen this error before, so I turned to Google Search for help.

The first article I found was: https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes

This article is very technical and a great read: it explains what PLEG (the Pod Lifecycle Event Generator) is and walks through the source code behind it. Thanks to it, you can understand why the issues below lead to "PLEG is not healthy".

Searching further, I found several GitHub issues and Reddit threads pointing to possible causes:

  • CNI/network issues: an old CNI version or a lack of available IP addresses
  • A lot of pods scheduled on the same node being constantly created/terminated because of a specific workload or autoscaling
  • A restart of the Docker daemon / containerd, or a version mismatch …
  • Issues with mounted volumes
  • Other reasons, as always..

So I checked whether a newer version of the CNI for EKS was available, following the AWS guide:

Ref: https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html
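
One way to read the CNI version currently running in the cluster is to look at the image tag of the `aws-node` DaemonSet. This is a minimal sketch under two assumptions: the VPC CNI runs as the default `aws-node` DaemonSet in `kube-system`, and the CNI container is the first one in the pod spec. `image_tag` and `cni_version` are hypothetical helper names.

```shell
# Extract the tag (the part after the last ':') from a container image reference.
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Print the image tag of the first container in the aws-node DaemonSet.
# Assumption: that container is the VPC CNI one.
cni_version() {
  image_tag "$(kubectl get daemonset aws-node -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[0].image}')"
}

# Usage:
#   cni_version
```

The printed tag can then be compared with the version recommended in the AWS guide for your Kubernetes version.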

I was using exactly the version mentioned here.
In parallel, I looked at the worker node logs, the EC2 instance status and monitoring, the VPC configuration, and everything else that came to mind.


At some point, I decided to try a different approach: comparing a healthy node with an unhealthy one.
The issue was happening in two completely different clusters, and it was clear that it only occurred on new worker nodes requested by the cluster-autoscaler.

So I decided to take a closer look at the NodeGroup and the system info of the nodes:

kubectl describe node <NODE_NAME> | grep -A 10 "System Info"

System Info:
  Machine ID: xxx
  System UUID: xxx
  Boot ID: xxx
  Kernel Version: 5.4.217-126.408.amzn2.x86_64
  OS Image: Amazon Linux 2
  Operating System: linux
  Architecture: amd64
  Container Runtime Version: docker://20.10.17
  Kubelet Version: v1.21.14-eks-ba74326
  Kube-Proxy Version: v1.21.14-eks-ba74326

Everything was identical between the healthy and the unhealthy node.

At some point, I thought about the AMI used by the worker nodes.
AWS releases new AMIs for EKS worker nodes very often, and the latest one had been released on 27/10/22.

Looking closer, I saw that both worker nodes experiencing this issue were using the same AMI:

Name: amazon-eks-node-1.21-v20221027
ID: ami-00c00a9bbe9b9301a
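
The AMI does not show up in the node's System Info, so it has to be looked up on the EC2 side. The sketch below shows one way to do that, assuming both kubectl and the AWS CLI are configured for the same account and region, and that each node's `spec.providerID` has the usual AWS form `aws:///<availability-zone>/<instance-id>`. The function names are hypothetical helpers.

```shell
# Extract the EC2 instance ID from a node's providerID,
# which on AWS looks like aws:///<availability-zone>/<instance-id>.
instance_id_from_provider_id() {
  printf '%s\n' "${1##*/}"
}

# Print "<node> <instance-id> <ami-id>" for every node in the cluster.
# Assumption: kubectl and the AWS CLI target the same account and region.
node_amis() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.providerID}{"\n"}{end}' |
  while read -r node provider_id; do
    instance_id=$(instance_id_from_provider_id "$provider_id")
    ami=$(aws ec2 describe-instances --instance-ids "$instance_id" \
      --query 'Reservations[].Instances[].ImageId' --output text)
    printf '%s %s %s\n' "$node" "$instance_id" "$ami"
  done
}

# Usage:
#   node_amis
```

Putting this output next to the list of flapping nodes makes it easy to see whether they all share one AMI ID.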

So what to do?

  1. Rollback to a previous working version
  2. Raise a ticket to AWS

After a few hours I heard back from AWS support: the issue was exactly the AMI.

Reading this response left me with a very bad feeling...

Why is no one sending out a notification about an issue like this one?

“If a tree falls in a forest and no one hears it, does it make a noise?
Berkeley argued that objects exist only as perceived. So if a tree falls in a forest and no one hears it, it makes no noise.”

Looking further, I found that the broken version had been marked as recalled:
[RECALLED] AMI Release v20221027

There is also an official issue on GitHub: https://github.com/awslabs/amazon-eks-ami/issues/1071

https://github.com/awslabs/amazon-eks-ami/releases

I hope this blog post helps someone. To support me, you can clap & subscribe to get more in the upcoming weeks!


Pier, DevOps Engineer @Microsoft | Working with Python, C++, Node.js, Kubernetes, Terraform, Docker and more