Breaking Kubernetes: How We Broke and Fixed our K8s Cluster
By Salil Gupta
Over the last few years Kubernetes has become a prominent part of the Civis Data Science Platform. We first utilized Kubernetes for Jupyter Notebooks in the cloud. Deploying containerized Jupyter Notebooks onto Kubernetes, our clients were able to do their exploratory analysis in a single place, seamlessly connect to their data assets, and request the compute power needed to run their computations. Soon after, we rolled out Services. Services allowed our clients to deploy ad hoc, containerized applications without having to worry about the intricacies of managing application infrastructure. Last fall, we released our largest move to Kubernetes, our Bring-Your-Own-Code infrastructure.
One of our most used products, Python, R and Container scripts allow clients to run Docker Containers in the cloud without having to worry about the underlying infrastructure. Clients have the ability to schedule these scripts, string them together into a Platform Workflow, request the necessary compute needed to run their container, and connect to the necessary data assets. We originally built this product on top of a home built infrastructure called Bocce. It worked well while it was around. However, with Kubernetes becoming a core component of our infrastructure we felt it was best to leverage the capabilities of Kubernetes and migrate off of Bocce.
We first internally released our migration (we called it Croquet; we like lawn games??) in September of 2018. After releasing Croquet, we began to see stability issues with our Kubernetes cluster. Developers reported high latency for kubectl commands while users intermittently reported having difficulty starting notebooks and services. We discovered serious issues when our cluster was overloaded with a large number of scripts. The large number of scripts pushed the cluster to a total size of about 290 EC2 instances; our previous usage never pushed us past 80 EC2 instances. This manifested itself in high CPU on our master nodes. Further investigation found that the three Kubernetes API containers were pegged on CPU. The API servers were given unbounded CPU, so they chewed through the node’s CPU. This caused resource contention among other control plane components, rendering the cluster useless. Furthermore, the K8s API server would chew through all the memory on the node before restarting. Once we moved internal users back to the old infrastructure, the cluster began to stabilize.
The biggest mistake we made with this release was assuming that Kubernetes does everything for free. Kubernetes says that it can support clusters with up to 5,000 nodes and 100,000 pods running. However, there are many levers that need to be pulled in order to achieve that scale.
Changes We Made
Before we began making adjustments to our cluster we set our goal to support a cluster of 500 nodes with a stretch goal of 1,000 nodes. We estimated these numbers based on our anticipated usage and future growth plans.
We utilized our staging cluster to run these load tests. We increased load on the cluster by incrementally increasing the desired size of the AWS autoscaling group that backed the cluster. This not only added more nodes to the cluster, but also added more pods to our cluster. This was because we deploy multiple Kubernetes DaemonSets on our cluster.
After setting goals, we went about improving visibility into our cluster’s health. We had CloudWatch metrics on the CPU usage for the master nodes and Datadog Agents running on the nodes that gave us CPU and memory metrics for each container. However, we didn’t have access to API request throughput, 5xx error rates, request latency, or other critical metrics. Prometheus was the obvious choice and we went about deploying it onto our cluster with the help of the Prometheus operator by CoreOS.
Resizing our Master Nodes
After the outage, our initial reaction was to increase the size of our master nodes. We were running three m4.large instances (2 cores of CPU, 8 GiB of Memory). Kubernetes recommends for a cluster sized between 251–500 nodes, on AWS, the master nodes should be c4.4xlarge (16 CPU, 30 GiB Memory). Below is a table of the recommendations:
The only issue with Kubernetes’ recommendation was that it did not define the number of replicas, so we assumed they meant a single master node. However, we run three master nodes spread across separate Availability Zones. We fiddled with different configurations but ultimately settled on using m4.4xlarge instances (16 CPU, 64 GiB Memory). Below are the results from a load test with three m4.4xlarge instances:
In summary, while increasing the master nodes to m4.4xlarge instances did help us scale up, we still weren’t satisfied because we could not comfortably cross 500 nodes. Although we thought about using even larger instances for the master nodes, it felt like we were treating a symptom and not the root cause of the problem. We also thought about increasing the number of replicas from three to five. However, etcd rarely recommends running more than three replicas and with our cluster management tool, Kops, we were bound to an etcd container per master replica. Furthermore, we weren’t convinced more replicas would solve the root problem.
Investigating our DaemonSets
Next, we investigated what objects in our cluster could be causing the increased load. We run a number of DaemonSets in our cluster: Datadog Agents for monitoring; Sumologic collectors for logs; kube2iam to bork container access to underlying EC2 metadata. Open AI wrote an excellent blog post on how they scaled their Kubernetes cluster to 2,500 nodes and one of their findings was their Datadog and fluentd DaemonSets aggressively polling the API server causing increased load.
In our next load test we removed our Sumologic collector, which is built on fluentd, and Datadog DaemonSets. Below are the results:
From this load test, it was obvious that Datadog and/or Sumologic were the culprit for the load. We ran a follow-up test where we added back Sumologic and the results were similar, our cluster scaled pretty seamlessly to 700+ nodes. This made us sure that Datadog was the problem.
Other Minor Adjustments
We made two other adjustments to our cluster configuration. The first was we increased the rate limit for non-mutating requests from 400 to 1200 at any given time. Combing through the access logs we saw multiple 429 status codes. We made this adjustment before we discovered the issue with our Datadog Agents. In subsequent load tests, we saw 429 responses drop but no significant change in the memory and cpu profiles for the API servers.
While rate limiting is critical for protecting API servers from load, we kept the change in place because we didn’t see any detrimental changes to the cluster. You can read up on the API server configurations here. We changed the max-requests-inflight option.
Another change we made was increasing the amout of disk IOPS for our etcd containers. In the Open AI blog, they saw a significant drop in disk write latency when they moved their master nodes to use local disk instead of network backed disk. However, Kops was having an issue with generating a correctly sized root partition on local disk backed nodes like the m5d series. Instead, we increased IOPS by increasing the underlying EBS volumes from 20 to 100 GB thus going from 100 to 300 IOPS. However, we didn’t see much improvement. Ultimately, we kept the adjustment since the cost was negligible and it didn’t hurt performance.
Setting Up Datadog For Scale
After discussing with our Datadog reps and combing through our API access logs, we learned that Datadog agents were polling the Kubernetes API for metadata about K8s services. Specifically, the agent would collect the names of all services in the cluster and any pod running on the agent’s node that was fronted by a service would have its metrics tagged by the service name. The offending configuration is DD_KUBERNETES_COLLECT_METADATA_TAGS and is listed in the README. With hundreds of Datadog Agents, one per node, hitting the K8s API every x seconds looking for new services, it was causing unnecessary load on the API. Furthermore, the tags being collected were not useful for our monitoring purposes.
Datadog has two solutions. The first is you can turn off the configuration. The second is the Datadog cluster agent, a new product they released to address scale. The cluster agent acts as a buffer between the API server and the Datadog pods so only the Datadog Cluster agent can talk to the K8s API. Now, instead of n Datadog pods, where n is the number of nodes in the cluster, hitting the API, you only have a single Cluster Agent communicating with the API. Furthermore, we were able to leverage Datadog to scrape Prometheus metrics from our control plane components and remove the overhead of maintaining our own Prometheus set up.
With the Datadog adjustments in place, we reran load tests and saw the cluster easily breeze towards 700+ EC2 instances.
Word of Advice on Auto-Discovery
A core paradigm of Kubernetes is auto-discovery. Kubernetes’ watch stream allow services to watch for changes in the cluster state allowing services to respond to changes on demand without manual intervention from system admins. However, this can come at a cost. With poor design, you can quickly introduce scaling issues with greedy services watching for changes. This is a common pattern we’ve seen with a variety of our DaemonSets.
The Datadog agents were an extreme case, but a couple of our other DaemonSets had similar issues. Kube2iam, a service for access control to EC2 metadata, watched for changes to all pods in the cluster so it could make adjustments to access controls. However, as more pods were added to the cluster, the Kube2iam containers would run out of memory because they were storing all the pods’ metadata in memory. The Fluentd Kubernetes Metadata Filter is a popular plugin that also has the same issue. Linked below is a Github pull request and issue to highlight the problems.
In general, we’ve learned to be extra cautious with the third party solutions we include in our cluster, especially DaemonSets. Often times, the fix is a minor configuration change to reduce the “greediness” of the service, but as we saw with load testing, it isn’t always immediately evident.
We still have a lot more work to do to improve our cluster’s scalability. The biggest change we would like to implement is upgrading etcd from 2.2.1 to 3.x. The 3.x versions of etcd come with a slew of scalability upgrades that I won’t go into but are discussed here. We are waiting on Kops to come up to speed on making the migration from etcd 2.x to 3.x safe.
After re-releasing internally, users started to complain about scripts failing because of DNS timeout errors. DNS in Kubernetes is unique in that it needs to resolve canonical names within the cluster first before going out to the rest of the internet. While it was hard for us to pinpoint the exact cause of the issue since we could not reliably reproduce the issue, it almost certainly had to do with scaling our cluster. We put in a mitigating solution which was to change the DNS policy from ClusterFirst to Default for scripts. With ClusterFirst, DNS resolution is attempted internally before going out to the internet. With Default, DNS resolution circumvents internal resolution and goes directly out to the internet significantly reducing latency. This was fine for our use case since we do not allow client scripts to communicate with other workloads running within the cluster. However, in the future we would like to allow client scripts to communicate internally so we’ll have to put in work to solve this issue.
Since these adjustments, we have not had issues with our cluster. In the end, we concluded that our cluster could support 700 nodes. Our biggest takeaway from this experience is to be mindful of the third party resources you utilize in your Kubernetes cluster — many tools are built for standard usage but have not been tested at scale.