“In this world, nothing is certain except death, taxes and Kubernetes” — Anonymous
Background and scope
This is the second blog in our two-part series on the adoption of Kubernetes for Intuit’s TurboTax tax filing software. In the first blog, we explained the design of the Kubernetes compute infrastructure, including the slicing of Kubernetes namespaces, AWS account topology, and the design of access to the data platform. We also provided insight into how a compute cluster looks, as shown below in Figure 1.
This blog delves into the technical roadblocks faced due to the scale of TurboTax, resolutions made, and lessons learned. It is intended for anyone interested in building an AWS-based Kubernetes platform and learning about using Kubernetes at scale and handling peak loads. For a clearer understanding, it’s best to read Part 1 of the series first.
Figure 1: Compute cluster topology
Oct 2019 — Dec 2019: Readiness testing
To gain confidence ahead of peak tax season, the Infrastructure team and application teams set up a robust testing infrastructure focused on scale and performance. This effort included weekly scale tests on the production environment with simulated workloads, run at three times the anticipated load. Each week covered different scenarios, including multiple east/west regional traffic combinations (50/50, 100/0, 0/100, and many in between), as well as failure testing involving traffic spikes and regional and availability zone failures.
In addition to validating functionality and the end-to-end setup, these tests also ensured that service teams were tuning monitoring thresholds, collecting required metrics, preparing queries for logs, and keeping run-books up-to-date in case of incidents.
Unsurprisingly, this testing ran into several roadblocks. The following sections highlight the top scale-related issues that were encountered and how they were resolved, including:
- Scale problems identified with kube-dns
- Autoscaling slowness during node provisioning
- Issues with external dependencies
- Scale problems identified with KIAM
- ALB ingress controller problems
- Log loss/delays at high throughput and volume
Scale issues with kube-dns
“It’s always DNS!”
As with most large-scale technology infrastructure, multiple DNS-related issues were observed and eventually resolved. Intuit’s Kubernetes clusters used kube-dns as the DNS service provider. Figure 2 shows the kube-dns architecture.
Figure 2: kube-dns architecture
Problem: The default DNS cache size was 1000. For several domains, kube-dns was configured to simply forward DNS queries down to the AWS EC2 instance; for others, it was configured to send requests to OpenDNS. Without adequate caching, the volume of DNS requests forwarded to the EC2 instance increased delays, exceeded API rate limits, and caused more errors.
Resolution: In this scenario, both the apps and the cluster stood to benefit from the maximum possible DNS caching. Hence, the cache size was raised to 10000, which is the upper limit in dnsmasq.
Problem: Under heavy load, especially when a large number of pods came up together, dnsmasq was not able to sustain the number of DNS requests. As a result, pods failed to resolve DNS queries.
Resolution: The limit on concurrent forwarded queries in dnsmasq was found to be set to 150, which was insufficient. Based on the application’s maximum expected workload, the dns-forward-max parameter was raised to 500.
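The two dnsmasq settings above can be sketched as a small helper that assembles the container arguments. This is a minimal sketch and not Intuit’s actual configuration: the function name and the validation are hypothetical, while `--cache-size` and `--dns-forward-max` are standard dnsmasq options and the values 10000 and 500 come from the text.

```python
from typing import List

# Upper limit on dnsmasq cache entries, as stated in the text above.
DNSMASQ_MAX_CACHE_SIZE = 10000


def dnsmasq_args(cache_size: int = 10000, dns_forward_max: int = 500) -> List[str]:
    """Return dnsmasq arguments for the cache size and concurrent-forward limit.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cache_size <= DNSMASQ_MAX_CACHE_SIZE:
        raise ValueError(f"cache_size must be between 0 and {DNSMASQ_MAX_CACHE_SIZE}")
    if dns_forward_max < 1:
        raise ValueError("dns_forward_max must be positive")
    return [f"--cache-size={cache_size}", f"--dns-forward-max={dns_forward_max}"]


if __name__ == "__main__":
    # Prints the argument string that would be passed to the dnsmasq container.
    print(" ".join(dnsmasq_args()))
```

Validating the cache size up front keeps a typo from silently producing a value that dnsmasq would reject or clamp.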
Problem: kube-dns pods were limited to running on the three Kubernetes master nodes. At times of massive scale-up, when there were boot storms of pod startup, the DNS resolution rate exceeded the per-node API rate limit imposed by AWS. As a result, kube-dns failed to resolve DNS queries.
Resolution: The solution in this case was to extend kube-dns beyond the three master nodes onto six additional reserved nodes. Testing showed that adding twice the original kube-dns capacity was sufficient for this workload.
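The reasoning behind spreading kube-dns over more nodes comes down to simple arithmetic: if upstream queries are spread evenly, each node must stay under the per-node limit. The sketch below assumes an even spread, and the numbers in the usage example are illustrative placeholders, since the post does not state the actual query rate or limit.

```python
import math


def min_dns_nodes(peak_queries_per_sec: float, per_node_limit: float) -> int:
    """Smallest node count that keeps every node at or below the per-node limit.

    Assumes upstream DNS queries are distributed evenly across kube-dns nodes.
    """
    if per_node_limit <= 0:
        raise ValueError("per_node_limit must be positive")
    if peak_queries_per_sec < 0:
        raise ValueError("peak_queries_per_sec must not be negative")
    return max(1, math.ceil(peak_queries_per_sec / per_node_limit))


if __name__ == "__main__":
    # Illustrative values only: 8000 upstream queries/sec against a
    # hypothetical limit of 1024 queries/sec per node.
    print(min_dns_nodes(8000, 1024))
```

In practice the spread is never perfectly even, so some headroom above the computed minimum is a reasonable precaution.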
Autoscaling slowness during node provisioning
Problem: On a few occasions during the tax season, TurboTax expects huge spikes in traffic, including late evenings for a few days before tax day and certainly on tax day itself. To handle these spikes, the services running on Kubernetes were configured with the Horizontal Pod Autoscaler (HPA). However, HPA alone wasn’t sufficient. Even when HPA created a large number of pods, they often could not run because there were not enough nodes in the cluster to schedule them on.
Resolution: The immediate resolution was to add new nodes to the cluster so that the pods could run.
To prevent node shortages going forward, a few spare nodes were always kept running in the cluster, especially around the tax peak season. The number of extra nodes and the timing for running them were configurable and could be adjusted dynamically depending on needs.
This was made possible by using Pod PriorityClass objects. The solution was to deploy very low-priority pause pods that took up buffer space on the nodes. When the capacity was needed by a real application, the pause pods were preempted and real pods took their place.
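The pause-pod approach described above can be sketched as two Kubernetes objects: a PriorityClass with a negative value and a Deployment of pause containers that reserve capacity. This is a minimal sketch under stated assumptions and not Intuit’s actual manifests: the object names, namespace, priority value of -1, image tag, and resource requests are all hypothetical.

```python
import json
from typing import Any, Dict, List


def overprovisioning_manifests(replicas: int, cpu: str, memory: str) -> List[Dict[str, Any]]:
    """Build a low-priority PriorityClass and a Deployment of pause pods.

    All names and values are illustrative assumptions.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},
        "value": -1,  # lower than the default priority of 0, so real pods preempt these
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning", "namespace": "kube-system"},
        "spec": {
            "replicas": replicas,  # how much buffer capacity to hold
            "selector": {"matchLabels": {"app": "overprovisioning"}},
            "template": {
                "metadata": {"labels": {"app": "overprovisioning"}},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",  # assumed image tag
                        # The requests define how much room each pause pod reserves.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": overprovisioning_manifests(replicas=5, cpu="1", memory="1Gi")}
    print(json.dumps(bundle, indent=2))
```

Because `replicas` is a parameter, the buffer can be raised ahead of an expected peak and lowered afterward, which matches the configurable behavior described above.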
Node bootup failures due to missing external dependencies
Problem: All Kubernetes clusters were configured such that the cluster nodes would download some packages over the internet as part of node bootup if they didn’t exist in the Amazon Machine Image (AMI). This introduced an external dependency in the system.
On one occasion, one of the dependency rpm packages that was being downloaded from the network went missing from the distribution servers. As a result, when the cluster autoscaler brought up new nodes, they failed to come up and were terminated. This broke the cluster autoscaler.
Resolution: The fix was to find all external dependencies and bake them into a custom AMI. This included external packages as well as Docker images required for the control plane (kube-proxy, calico, etc.).
Scale issues with KIAM
Problem: kiam-server was running as a daemonset. During a cluster upgrade, daemonset pods were not drained and terminated gracefully. As a result, if user pods were in the middle of fetching their IAM credentials while the nodes running kiam-server were being upgraded, those calls failed.
Resolution: This was fixed by running kiam-server as a deployment (with pod anti-affinity to keep the pods running on different nodes).
Other daemonsets were also evaluated for the same problem. The aws-iam-authenticator daemonset was also converted to a deployment.
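The deployment-with-anti-affinity pattern mentioned above can be sketched as follows. This is a minimal sketch and not the actual kiam-server manifest: the helper name, label key, replica count, and image reference are hypothetical placeholders. The `podAntiAffinity` stanza with the `kubernetes.io/hostname` topology key is the standard Kubernetes way to keep replicas on different nodes.

```python
from typing import Any, Dict


def anti_affinity_deployment(name: str, image: str, replicas: int) -> Dict[str, Any]:
    """Build a Deployment whose replicas must land on different nodes.

    Hypothetical helper; only the fields relevant to scheduling are included.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # Hard rule: never co-locate two pods carrying the same label
                            # on one node (one pod per hostname).
                            "requiredDuringSchedulingIgnoredDuringExecution": [{
                                "labelSelector": {"matchLabels": dict(labels)},
                                "topologyKey": "kubernetes.io/hostname",
                            }]
                        }
                    },
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # The image reference is a placeholder, not a real registry path.
    manifest = anti_affinity_deployment("kiam-server", "example.com/kiam:placeholder", 3)
    print(manifest["spec"]["template"]["spec"]["affinity"])
```

Unlike daemonset pods, deployment pods are evicted gracefully when a node is drained, so the remaining replicas on other nodes can keep serving credential requests during an upgrade.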
ALB ingress controller problems
Problem: The alb-ingress-controller configured ALBs such that every ALB had all nodes in the cluster in its target groups. This caused health check storms: whenever AWS performed a health check for an ALB, the requests were sent to every node in the cluster.
Resolution: Multiple changes were done to work around this problem.
- Restrict the number of ingress objects on a cluster.
- Change the health check interval on the ALBs to reduce frequency.
- Restrict nodes in the ALB target group to only nodes with specific labels.
- Configure the ALB health check separately from the Kubernetes readiness probe, so that the ALB health check stays lightweight.
- Set future state requirements to move to a flat network that is pod aware, given that the shared networking model does not scale.
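The effect of these changes on health check volume can be estimated with rough arithmetic. The sketch below uses a simplified model in which each ALB sends one check per target per interval. Real ALBs may send more, so the result is best read as a lower bound. The ALB counts, node counts, and intervals in the usage example are illustrative and do not come from the TurboTax clusters.

```python
def health_checks_per_second(num_albs: int, targets_per_alb: int, interval_seconds: float) -> float:
    """Estimate cluster-wide health check requests per second.

    Simplified model: one check per ALB per target per interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_albs * targets_per_alb / interval_seconds


if __name__ == "__main__":
    # Illustrative numbers: every node is a target of every ALB.
    before = health_checks_per_second(num_albs=50, targets_per_alb=700, interval_seconds=15)
    # Same ALBs, but only labeled nodes are targets and the interval is longer.
    after = health_checks_per_second(num_albs=50, targets_per_alb=20, interval_seconds=30)
    print(f"before: {before:.0f}/s, after: {after:.0f}/s")
```

The model shows why each workaround helps: fewer ingress objects, fewer targets per ALB, and a longer interval each reduce one factor of the product.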
Problem: Too many ALBs in the cluster caused alb-ingress-controller to get throttled for exceeding AWS API rate limits.
Resolution: AWS has hard API rate limits. Therefore, the only option here was to restrict the number of ALBs (ingress objects) allowed in a cluster.
Log loss/delays at high throughput and volume
Problem: There were many instances of lost and/or delayed logs in our log aggregator.
Resolution: After extensive performance testing, it was determined that fluentd (log collector), which was running as a daemonset in the cluster, wasn’t able to keep up with the high logging throughput of several applications. This became increasingly apparent as the node size was increased to allow for bin packing multiple pods on a single node.
The solution was to run a log collector as a sidecar in each pod, so that the log collecting and processing scaled with the number of pods, rather than the number of nodes. This approach was initially resisted, but further testing revealed that the CPU utilization of the sidecar logging container was negligible and didn’t affect pod performance.
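The scaling difference between the two collector placements can be sketched with a toy model. It assumes every pod on a node logs at the same rate, which is a simplification, and the numbers in the usage example are illustrative only.

```python
def per_collector_load(pods_per_node: int, logs_per_pod_per_sec: float, mode: str) -> float:
    """Log lines per second that a single collector instance must handle.

    mode is "daemonset" (one collector per node) or "sidecar" (one per pod).
    """
    if mode == "daemonset":
        # One collector absorbs the output of every pod packed onto the node.
        return pods_per_node * logs_per_pod_per_sec
    if mode == "sidecar":
        # Each collector only sees its own pod, regardless of bin packing.
        return logs_per_pod_per_sec
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("daemonset", "sidecar"):
        print(mode, per_collector_load(pods_per_node=40, logs_per_pod_per_sec=500.0, mode=mode))
```

In this model, the per-collector load under the daemonset placement grows with the number of pods packed onto a node, while the sidecar load stays constant. That is consistent with the problem becoming more visible as node sizes increased.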
All these tests, the associated root cause analyses, and the resolution of the problems gave the teams a much deeper understanding of running Kubernetes on AWS. The processes involved in fixing the problems were streamlined as well. By the end of this phase, the teams had far more confidence that the entire setup would be able to sustain the load for the tax peak.
Jan — April 2020: Tax season
In most years (2020 excluded), the main tax season runs from January until April 15, when tax returns are due. There are two peaks of traffic, which we refer to internally as first and second peak:
- First peak comes in late January to early February, when the majority of US employees receive their W-2s. If a TurboTax customer is owed money, they tend to file immediately.
- Second peak comes in the week leading up to April 15, when everyone who hasn’t yet filed rushes to finish and submit their tax returns by the deadline.
The following two charts provide a glimpse of the tax season from January 1, 2020 — April 15, 2020.
Chart 1 shows that the activity picked up around January 22 with load reaching ~40K requests/sec, rising to ~60K requests/sec in the first week of February (first peak). There was sustained load from mid-February through April 10. The last few days of the tax season show bigger spikes (second peak). The periodic huge spikes from mid-February to April 10 represent the additional testing done by the team in the production environment to ensure that it is able to sustain the load.
Chart 2 shows that the Kubernetes clusters typically used between 600 and 700 nodes running ~1,100 pods. There were ~100 additional nodes provisioned (overprovisioned) for the last few days of the tax peak.
Chart 1: Total number of transactions/HTTP requests
Chart 2: Total number of Pods and Nodes
The first generation of Intuit’s Kubernetes platform was built on Kubernetes primitives and a custom control plane. Running the most critical services for TurboTax on this platform was a highly rewarding experience. Moving forward, the goal is to continue moving the remaining TurboTax services, as well as other Intuit offerings, to the Kubernetes-based platform. With learnings from this experience, Intuit is now building the next generation of its Kubernetes platform using widely adopted open source projects such as Keiko, Argo, and Admiral. We are hiring, so come join us!
Intuit will be participating at the next KubeCon North America conference, held virtually from November 17–20, 2020, where we’d be happy to discuss this project further.
This monumental accomplishment is a tribute to the hard work and dedication of incredibly talented individuals from across the company. Throughout the journey, Intuit TurboTax and Developer Platform engineers applied continuous testing, data-driven decisions, and focused automation to successfully tackle this audacious challenge.
About the Authors
Anusha Ragunathan is a software engineer at Intuit, where she works on building and maintaining the company’s Kubernetes Infrastructure. Anusha is passionate about solving complex problems in systems and infrastructure engineering, and is an OSS maintainer of the Moby (Docker) project. Prior to Intuit, she worked on building distributed systems at Docker and VMware. Her interests include containers, virtualization, and cloud-native technologies.
Shrinand Javadekar is a software engineer in the Modern SaaS Team at Intuit, whose mission is to make Kubernetes the de facto standard for developing, deploying, and running apps at Intuit. The open source project Keiko was born from this work. In the past, Shrinand has been part of large-scale filesystem and virtualization projects at EMC and VMware. However, his most fun gigs have been working on cloud-native platforms and services at startups such as Maginatics and Applatix, and now at Intuit.
Corey Caverly is an architect in the Consumer Tax Group at Intuit working on Site Reliability Engineering. This team is focused on building tools, processes, and patterns that help produce reliable and performant customer experiences. Corey has worked everywhere from universities to biotech; his prior gig was leading a team that developed tools and services to deliver software infrastructure for robots that build DNA based on customer specifications.
Jonathan Nevelson is a software engineer in the Consumer Tax Group at Intuit focusing on Site Reliability Engineering. His primary focus is building a common platform for services to run on Kubernetes and ensuring their performance, security, and reliability. Jonathan’s prior experience includes working with and leading teams across the development stack, from frontend early in his career, to backend and distributed systems work, before finally making his way into SRE and infrastructure.
Rene Martin is a software engineer in the Consumer Tax Group at Intuit. He and the team he leads are focused on consistency, reliability, security, scalability, and performance at scale. Rene has developed his career around the Site Reliability space with a product development mindset. In his previous role, he led the team that supported highly dynamic and global infrastructure for an Internet advertisement company.