CoreDNS is going to fail you when you scale k8s
This article covers an incident in the critical EKS cluster that hosts our observability stack, how we fixed it, and our recommendations for guarding against it.
Disclaimer: The title really should be "How the default settings for CoreDNS are going to fail you when you scale k8s". Sorry for the clickbaity title.
Our observability stack, built on VictoriaMetrics, runs on an AWS EKS cluster with 900+ nodes and 15k+ pods. One fine day, all our alerts fired at once: no data was being ingested, and Grafana, the metrics API, everything was down. On checking, we saw that nodes were being scaled down rapidly and most of the pods for various micro-services were stuck in CrashLoopBackOff.
What we observed initially
1) Grafana was down and kept going into CrashLoopBackOff.
2) The stateful micro-services backed by EBS volumes were not even starting.
3) The worker node count in the EKS cluster was rapidly dropping from 900+ to 600+.
4) Pods managed by an HPA were rapidly being scaled down to their minimum replica count.
What we checked in the EKS cluster
- First, we checked the cluster-autoscaler logs to see why it was removing nodes, and found it was simply reducing the node count based on the CPU and memory stats from the metrics-server. So, no issue with the cluster-autoscaler itself. In the meantime we also tried adding nodes manually, but despite that, nodes kept getting downscaled.
- In parallel, we were checking why the Grafana pods were going into CrashLoopBackOff and found that the backend AWS RDS instance was not reachable. We checked the health of the RDS instance and it was working fine; we were able to connect to the database from outside EKS. So, we ruled out issues with AWS RDS.
- One of our application micro-services, which runs as a StatefulSet with EBS-backed volumes, was completely down because the EBS volumes were not reachable.
Our initial thought was that this was some kind of AWS outage, so we reached out to AWS, but there was no issue on their end either.
Unfortunately, we were also running an EKS version that had recently gone out of support, so we weren't able to migrate the problematic pods to a new node group without autoscaling (no new node group creation is allowed on unsupported AWS EKS versions).
One other thing we noticed was that pods which didn't need any external connectivity were running fine.
Just by chance, we checked the CoreDNS pods and found they were hitting OOM and being restarted because of it, over and over. Whenever a CoreDNS pod came back up after a crash, services would partially recover, but that didn't last long.
Hammered by the backlog of requests, memory usage in the CoreDNS pods would climb again and they would crash with OOM again. And again, all connectivity between pods and external connectivity out to RDS and EBS would fail.
HPA scaling is driven by the resource usage data reported by the metrics-server component running in the kube-system namespace. Since metrics ingestion had stopped and the pods were no longer doing any real work, their resource usage dropped rapidly, the HPAs scaled most of the workloads down to their minimum replica counts, and the shrinking pod count in turn caused the nodes to be scaled down. A domino effect had been set off. That was our theory at the time, and we started working towards fixing it.
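For context, the HPA objects on our micro-services looked roughly like the sketch below; the service name, namespace, and numbers here are hypothetical, not our actual manifests. The HPA controller computes desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), so once metrics-server reports near-zero usage, every workload collapses to its minReplicas.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingestion-service        # hypothetical micro-service name
  namespace: observability       # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingestion-service
  minReplicas: 2                 # where every workload ended up during the incident
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # near-zero reported usage drives replicas down to the minimum
```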
We could have increased the memory of the CoreDNS pods, but changing the resource limits would restart them, and we would again lose all connectivity between pods and to the outside world for some more time. So we decided to scale CoreDNS out instead, and increased it from the existing 2 replicas to 3. It was still struggling to handle the load; things were recovering, but too slowly. So, to fast-track the recovery, we increased the CoreDNS replica count to 5.
That eased the load on the individual replicas, and the metrics-server was able to scrape details about all the pods again. The StatefulSet pods with EBS volumes came up one by one. Grafana was able to connect to RDS, since it could resolve the DNS name by then. The other micro-service pods scaled back up as resource usage returned to normal thresholds, node downscaling stopped, and the cluster started adding the nodes required to handle the growing pod count. It took almost 3 to 4 hours for the EKS cluster to become stable at the required scale.
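For reference, the remediation amounts to something like the following excerpt of the coredns Deployment in kube-system; the replica count and memory values are illustrative, not the exact numbers from our cluster.

```yaml
# Excerpt of the coredns Deployment in kube-system; all values are illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 5                    # scaled out from the default 2
  template:
    spec:
      containers:
        - name: coredns
          resources:
            requests:
              cpu: 100m
              memory: 170Mi
            limits:
              memory: 512Mi      # note: changing limits restarts the pods, so we scaled out first
```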
Lessons learnt
1) The default CoreDNS add-on that ships with AWS EKS has resource settings that are not enough for a cluster running 500+ nodes and 10k+ pods, and 2 replicas are definitely not going to handle the traffic if you scale above that point. At the very least, our recommendation is to have an HPA configured for CoreDNS (a sketch follows this list); read the article linked in the further reads section to learn more about CoreDNS scaling strategies.
2) Migrate to supported versions of AWS EKS sooner rather than later. When we considered the alternative of creating a fixed-size node group (no autoscaling) that would not be downscaled, we had no option to create one. AWS support personnel told us that even they don't have an API that would let them do it for unsupported versions. So, make sure this is taken care of.
3) Set up alerts on resource usage for critical components like CoreDNS and metrics-server, and, if needed, put a proper scaling strategy in place for them (an example alert rule also follows this list). At this scale, you have to pay a lot of attention to these cluster-critical components as well, even though you are using a managed Kubernetes service.
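For lesson 1, here is a minimal sketch of what an HPA for the EKS coredns Deployment could look like; the thresholds are illustrative and assume metrics-server is healthy. Scaling DNS proportionally to cluster size with cluster-proportional-autoscaler is another common strategy worth evaluating.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 3                   # never fall back to the 2-replica default
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 60   # scale out well before the limit / OOM
```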
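For lesson 3, this is a rough sketch of the kind of alert rules we keep on CoreDNS now, written as a Prometheus-style rule group (vmalert can load the same format). It assumes cAdvisor and kube-state-metrics metrics are being scraped, and the thresholds are illustrative.

```yaml
groups:
  - name: kube-system-critical-addons
    rules:
      - alert: CoreDNSHighMemory
        expr: |
          max by (pod) (container_memory_working_set_bytes{namespace="kube-system", container="coredns"})
          /
          max by (pod) (kube_pod_container_resource_limits{namespace="kube-system", container="coredns", resource="memory"})
          > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CoreDNS pod {{ $labels.pod }} is above 80% of its memory limit"
      - alert: CoreDNSRestarting
        expr: increase(kube_pod_container_status_restarts_total{namespace="kube-system", container="coredns"}[15m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "CoreDNS pod {{ $labels.pod }} restarted in the last 15 minutes"
```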
Further Reads: