Rapid Auto-Scaling on EKS — Part 1

Eytan Avisror
keikoproj

--

The ability to autoscale quickly is crucial for any application, but it's also a critical requirement for any platform team that needs to rotate nodes on large clusters.

We've been building the Kubernetes Platform @ Intuit for the last few years, more recently on top of Amazon's EKS, and started noticing how painfully long it takes to rotate 500-node clusters, especially when you have hundreds of clusters. Our platform tenants were getting frustrated by the long upgrade windows, so we needed to figure out why things were taking so long and come up with a way to speed them up.

Node Startup

An EC2 instance goes through several stages on its way to becoming a functional node in the cluster. First, the instance is Provisioned and boots up; then it is Bootstrapped by running a user-data script; next, it joins the Kubernetes cluster as a node and becomes Ready; and finally, pods are scheduled on the new node.

We examined how long each stage took for Amazon Linux 2, with a basic user-data script, and our current cluster configuration for EKS.
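
If you want to take the same measurement on your own clusters, a rough sketch (assuming you have the instance ID and kubectl access; the IDs below are placeholders) is to compare the EC2 launch time with the node's Ready transition time:

# Placeholders: substitute your own instance ID and node name
INSTANCE_ID=i-0123456789abcdef0
NODE_NAME=ip-10-105-232-116.us-west-2.compute.internal

# When the instance was provisioned
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].LaunchTime' --output text

# When the node's Ready condition last flipped
kubectl get node "$NODE_NAME" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'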

Next, we wanted to understand why these stages take as long as they do, and come up with ways to speed them up.

Fast & Functional Readiness

First, we wanted to understand why our nodes were taking anywhere between 3 and 4 minutes to enter a Ready state. If this is the case in your cluster, keep reading: you are about to improve it dramatically.

While improving the time it takes a node to become Ready, we also wanted Functional Readiness, meaning a node should not just report Ready but actually be fully usable, with certain daemonsets fully up, such as node-local-dns, kiam-agent, etc.
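
As a quick way to see whether a node is functionally ready in this sense (a sketch, not part of our tooling; the node name is a placeholder), you can list the daemonset pods scheduled on it and check that they are all Running:

# Show kube-system pods scheduled on a specific node
kubectl get pods -n kube-system \
  --field-selector spec.nodeName=ip-10-105-232-116.us-west-2.compute.internal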

Always Blame the DNS

Our first finding was quite shocking: kube-proxy was spending approximately 40 seconds doing absolutely nothing but trying to resolve a bad node name for itself before loading up and creating iptables rules for service endpoints. This in turn delayed other daemonsets such as node-local-dns, kiam-agent, and even aws-node (the EKS CNI), which acts as the final gate for a node's readiness.

E0420 06:48:19.413329       1 node.go:125] Failed to retrieve node info: nodes "ip-10-105-232-116.vpc.internal" not found
E0420 06:48:20.584384       1 node.go:125] Failed to retrieve node info: nodes "ip-10-105-232-116.vpc.internal" not found
E0420 06:48:22.884535       1 node.go:125] Failed to retrieve node info: nodes "ip-10-105-232-116.vpc.internal" not found
E0420 06:48:27.161128       1 node.go:125] Failed to retrieve node info: nodes "ip-10-105-232-116.vpc.internal" not found
E0420 06:48:36.482608       1 node.go:125] Failed to retrieve node info: nodes "ip-10-105-232-116.vpc.internal" not found
E0420 06:48:55.017438       1 node.go:125] Failed to retrieve node info: nodes "ip-10-105-232-116.vpc.internal" not found
I0420 06:48:55.017464       1 server_others.go:178] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag

Rather strangely, kube-proxy was trying to retrieve its own node information using the hostname of the EC2 instance, ip-10-105-232-116.vpc.internal; however, the node name is actually ip-10-105-232-116.us-west-2.compute.internal. This was happening because the domain name in our DHCP options is set to .vpc.internal. It meant that by explicitly giving kube-proxy the node name to use, we could avoid this delay entirely, along with any follow-on delay caused by dependent pods restarting.

We were able to work around this by adding the following snippet to kube-proxy’s daemonset to override the hostname with the node name via the downward API.

- command:
  - kube-proxy
  - --hostname-override=$(NODE_NAME)
  - --v=2
  - --config=/var/lib/kube-proxy-config/config
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
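
How you apply this depends on how you manage the kube-proxy manifest. As a rough sketch (the patch file name below is hypothetical, and the full command list is repeated because a strategic merge replaces it wholesale), the snippet can be wrapped in a strategic-merge patch:

# kube-proxy-hostname-override.yaml (hypothetical name)
spec:
  template:
    spec:
      containers:
      - name: kube-proxy
        command:
        - kube-proxy
        - --hostname-override=$(NODE_NAME)
        - --v=2
        - --config=/var/lib/kube-proxy-config/config
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName

and applied with:

kubectl -n kube-system patch daemonset kube-proxy \
  --type strategic -p "$(cat kube-proxy-hostname-override.yaml)"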

This tiny change took our readiness time down to ~2m!

Before

NAME                    READY   STATUS              RESTARTS   AGE
test-6fd7585cbf-4rmzs   0/1     ContainerCreating   0          3m32s

After

NAME                    READY   STATUS              RESTARTS   AGE
test-6fd7585cbf-kj5dp   0/1     ContainerCreating   0          110s

The Race Condition

The next thing we noticed when looking at readiness time was that some daemonset pods randomly restart right after they get scheduled. It looked somewhat like this:

NAME             READY   STATUS    RESTARTS   AGE
aws-node-j6jqm   1/1     Running   1          6h6m
aws-node-ktmvp   1/1     Running   0          5h36m
aws-node-kxpph   1/1     Running   0          5h57m
aws-node-ml7hz   1/1     Running   0          26m
aws-node-mlrcx   1/1     Running   1          26m
aws-node-sdzb2   1/1     Running   0          6h7m
aws-node-skcjn   1/1     Running   0          9s
aws-node-sz76g   1/1     Running   0          6h16m
aws-node-tm8gn   1/1     Running   1          11d
aws-node-wqs8l   1/1     Running   1          11d
aws-node-wxgbc   1/1     Running   1          6h16m

Since aws-node was occasionally restarting when a node joined, and it's the component responsible for marking the node Ready, this was certainly causing some delays.

It turns out this was happening because of a scheduling and start-up race condition. When nodes join the cluster, daemonset pods are immediately scheduled on them; if those pods depend on each other, as kube-proxy and aws-node do, and they start up in the wrong order, one of them will restart, costing precious seconds of delay.

In the case of aws-node, the log clearly shows it could not reach the Kubernetes service endpoint by the time it started:

ERROR: logging before flag.Parse: E0428 04:35:27.000103      10 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)
time="2021-04-28T04:35:27Z" level=error msg="failed to initialize service object for operator metrics: OPERATOR_NAME must be set"
time="2021-04-28T04:35:27Z" level=error msg="failed to get resource client for (apiVersion:crd.k8s.amazonaws.com/v1alpha1, kind:ENIConfig, ns:): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(crd.k8s.amazonaws.com/v1alpha1, Kind=ENIConfig): the cache has not been filled yet"
panic: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(crd.k8s.amazonaws.com/v1alpha1, Kind=ENIConfig): the cache has not been filled yet
goroutine 55 [running]:
github.com/operator-framework/operator-sdk/pkg/sdk.Watch(0x5603650891f6, 0x1e, 0x5603650680c6, 0x9, 0x0, 0x0, 0x12a05f200, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/operator-framework/operator-sdk@v0.0.7/pkg/sdk/api.go:49 +0x46e
github.com/aws/amazon-vpc-cni-k8s/pkg/eniconfig.(*ENIConfigController).Start(0xc00049b1a0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/eniconfig/eniconfig.go:164 +0x196
created by main._main
	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:49 +0x54d

This means kube-proxy is losing the race: aws-node starts up and tries to use the iptables rules to reach the Kubernetes API before kube-proxy has finished setting them up.

By using an initContainer, we were able to work around this problem with a simple script that waits until the Kubernetes API is reachable, preventing a costly restart of the container.

initContainers:
- name: init-kubernetes-api
  image: busybox:1.28
  command:
  - sh
  - -c
  - |
    until nc ${KUBERNETES_SERVICE_HOST} 443 -vz -w 1; do
      echo waiting for kubernetes service endpoint; sleep 1;
    done

Adding this snippet to the aws-node daemonset means the CNI will only start once kube-proxy has loaded its iptables rules and a connection to the Kubernetes API can be established.

In our case, we chose to add further gating for other daemonsets we run, such as kiam-agent, to make sure we have functional readiness. If you are in the same situation, you can add a similar wait condition that checks for the existence of the kiam iptables rule, or that tries to connect to any other service endpoint your nodes and workloads depend on, and waits until those conditions are met as well, as shown in the sketch below.
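
For example, a sketch mirroring the snippet above (the kiam-server service name and port here are assumptions; substitute whatever endpoint your setup depends on, and note that hostNetwork pods may need a dnsPolicy of ClusterFirstWithHostNet, or an IP, to resolve cluster service names):

- name: init-kiam-server
  image: busybox:1.28
  command:
  - sh
  - -c
  - |
    # Placeholder endpoint: replace with a service your workloads depend on
    until nc kiam-server.kube-system.svc.cluster.local 443 -vz -w 1; do
      echo waiting for kiam-server endpoint; sleep 1;
    done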

Update 5/7/21:
Apparently there is an issue where a pod may sometimes come up without `KUBERNETES_SERVICE_HOST` being exported by kubelet. To avoid this condition, you can either hard-code the Kubernetes API address (172.20.0.1 for EKS) instead of relying on that variable, or, for a more generic solution, simply restart the initContainer whenever the variable is empty; here is why this happens.
I've noticed this only seems to happen on nodes that come up in a cluster that was created moments before.

# Restart initContainer if env variable is not set
if [[ -z "${KUBERNETES_SERVICE_HOST}" ]]; then
  echo environment variables not loaded yet, restarting
  sleep 1
  exit 1
fi
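
Putting the two together, a combined initContainer script could look like this (a sketch of one way to do it; the POSIX [ ] test is used so it also runs under busybox sh):

initContainers:
- name: init-kubernetes-api
  image: busybox:1.28
  command:
  - sh
  - -c
  - |
    # Fail fast (and let kubelet restart the initContainer) if the
    # variable has not been injected yet
    if [ -z "${KUBERNETES_SERVICE_HOST}" ]; then
      echo environment variables not loaded yet, restarting
      sleep 1
      exit 1
    fi
    # Then wait for the Kubernetes service endpoint to be reachable
    until nc ${KUBERNETES_SERVICE_HOST} 443 -vz -w 1; do
      echo waiting for kubernetes service endpoint; sleep 1;
    done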

Results

Nodes are functionally ready within 31 seconds and no more pod restarts are seen!

NAME                 STATUS   ROLES   AGE   VERSION
ip-10-198-49-19...   Ready    node    31s   v1.18.8-eks-7c9bda

As a result, fixing these two minor misconfigurations shaved roughly 3 minutes off the time it takes nodes to become ready.
When you consider the impact on a one-by-one rolling upgrade of a 100-node scaling group, that's about 5 hours saved from the process (3 minutes × 100 nodes = 300 minutes)!

Read Part 2 to find out how we used AWS Warm Pools with instance-manager in order to speed up the provisioning & bootstrapping stages.
