“In this world, nothing is certain except death, taxes and Kubernetes” — Anonymous
Background and scope
TurboTax is the US market leader in tax filing software. Each year, TurboTax processes taxes for over 40 million taxpayers, including over 5 million desktop software users, over 10 million mobile users, and over 20 million online users. Over the course of the tax season, TurboTax handles millions of documents involving hundreds of billions of dollars, and performs 1.5 trillion service transactions.
Each year, TurboTax sees peak usage in January, as many taxpayers receive refunds; in April, as they file their taxes; and in October, as extension filings come due. The scale, seasonality, and strict deadlines of tax filing present interesting infrastructure challenges in scaling and performance. As of 2020, the majority of TurboTax’s supporting services run on Intuit’s Kubernetes platform.
This is a two-part blog structured as a timeline of events detailing TurboTax’s Kubernetes journey. Part one discusses the planning and design of the infrastructure related to the migration. Part two delves into the technical roadblocks faced due to scale, resolutions made, and lessons learned. It is intended for anyone interested in building an AWS-based Kubernetes platform and learning about using Kubernetes at scale and handling peak loads.
This blog assumes an understanding of core Kubernetes concepts:
- Networking constructs such as Ingress, Service, Endpoints, and kube-dns
- Cluster level features such as namespaces
- Scale components such as Horizontal Pod Autoscaler (HPA)
- Logging components such as fluentd
Understanding of AWS-specific Kubernetes add-ons such as:
- KIAM to integrate AWS IAM with Kubernetes
Understanding of AWS concepts:
- Networking such as ALB, VPC peering, and PrivateLink
- Permission model such as IAM
- Node provisioning such as autoscaling groups
May 2019: Training and feasibility
For the purposes of this blog, the teams that developed and managed the Kubernetes platform infrastructure at Intuit will be referred to as infrastructure teams/engineers, and the TurboTax and associated product teams will be referred to as service teams/developers.
TurboTax Online (TTO) ran its services on AWS EC2 for several years. In 2019, Intuit’s platform infrastructure teams decided to adopt containers and Kubernetes in order to achieve:
- Fast, iterative product development and rollout
- Consolidation under a single platform for all development teams due to native multi-tenancy support in Kubernetes
- Efficient resource utilization at high scale for cost reduction
- Strong ecosystem support
- Unified distribution mechanism for service artifacts
TurboTax developers had expertise in running their non-containerized services efficiently on EC2 instances. To gain confidence in migrating TTO services to Kubernetes, service developers had to be trained on containers, Kubernetes, and associated technologies.
For several weeks, training programs were conducted on fundamental technologies such as Docker and Kubernetes, as well as advanced concepts relating to multi-tenancy, cluster add-ons, load balancers, and autoscalers. Service teams were also provided with playground clusters that had fully functioning Kubernetes namespaces in order to allow them to tinker around.
August 2019: Requirements and capacity planning
TurboTax and related components comprise a total of 400 microservices. About 40 of these services were going to run on Kubernetes. The following requirements had to be met by the infrastructure:
- 26 clusters spread across two AWS regions, with each cluster using three AWS availability zones. These clusters are spread across multiple teams and business units.
- ~1,000 Kubernetes nodes across all clusters.
- Support for load growing from 5K TPS to a maximum of 300K TPS within a two-hour window.
- Support for both DR models in use: some services operated in one region and used the other region for DR, while others operated with an active-active architecture across both regions, with the ability to scale up either region for DR purposes.
- Availability greater than 99.99% for all microservices.
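To put the load and availability targets in perspective, the arithmetic can be sketched as follows. The function names are hypothetical, and the doubling-time figure assumes smooth exponential growth over the ramp, which is an illustrative simplification and not a description of the actual TurboTax traffic curve.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def growth_factor(start_tps, peak_tps):
    """How many times larger the peak load is than the starting load."""
    return peak_tps / start_tps


def doubling_time_minutes(start_tps, peak_tps, ramp_minutes):
    """Minutes for the load to double, ASSUMING exponential growth over the ramp."""
    return ramp_minutes / math.log2(growth_factor(start_tps, peak_tps))


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime allowed in a period at the given availability target."""
    return period_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    print(f"growth factor: {growth_factor(5_000, 300_000):.0f}x")
    print(f"doubling time: {doubling_time_minutes(5_000, 300_000, 120):.1f} min")
    print(f"yearly downtime budget: {downtime_budget_minutes(99.99):.1f} min")
```

Under these assumptions, the ramp is a 60x increase with load doubling roughly every 20 minutes, and a 99.99% target leaves each service a little under an hour of downtime per year.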
September 2019: Onboarding services to Kubernetes
The Intuit Kubernetes platform infrastructure consists of multiple Kubernetes clusters per business segment to support the compute infrastructure needs of all applications in that segment. Clusters are grouped by business segment primarily due to the homogeneity of applications within the same segment. Applications across business segments tend to differ in their workload characteristics, operational needs, development velocity, regulatory compliance, and delivery speeds.
Upon onboarding a service to the Kubernetes platform, the service team gets:
- Preproduction clusters for all pre-production services in the business segment.
- Production clusters in two regions for High Availability and Disaster Recovery for all production services in the business segment.
- Out-of-the-box support for logging, monitoring, identity & access management, and multi-tenancy.
- Access to data resources (databases and object stores) established using VPC peering or AWS PrivateLink, if necessary.
A service onboarded to the platform is provided with a set of environments as seen in Figure 1. Environments are Kubernetes namespaces. Depending on the cluster type, a set of environments is provisioned:
- Pre-production clusters are sliced by namespaces for QA (typically used for unit tests, build pipelines, etc.) and E2E (typically used for end-to-end product tests).
- Production clusters are sliced by namespaces for staging and production.
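The environment layout above could be modeled as a small lookup that generates one Namespace manifest per environment. This is a minimal sketch: the `<service>-<environment>` naming convention, the environment identifiers, and the label keys are assumptions made for illustration, not the names used on the Intuit platform.

```python
import json

# Environments provisioned per cluster type (identifiers are illustrative).
ENVIRONMENTS_BY_CLUSTER_TYPE = {
    "preprod": ["qa", "e2e"],
    "prod": ["staging", "production"],
}


def namespace_manifests(service, cluster_type):
    """Return one Kubernetes Namespace manifest per environment of the cluster type."""
    manifests = []
    for env in ENVIRONMENTS_BY_CLUSTER_TYPE[cluster_type]:
        manifests.append({
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {
                # Hypothetical naming convention: <service>-<environment>
                "name": f"{service}-{env}",
                "labels": {"service": service, "environment": env},
            },
        })
    return manifests


if __name__ == "__main__":
    # "tax-calc" is a made-up service name.
    print(json.dumps(namespace_manifests("tax-calc", "preprod"), indent=2))
```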
AWS Account topology
A critical component of the migration of TTO services to Kubernetes was to ensure access to existing data. Fortunately, services had already been designed to be stateless and entirely API-driven. These prior design decisions made our migration significantly easier. Much of the data layer was already being accessed via API through NAT Gateway, though a few services did have additional resource dependencies on other AWS services for datastores, memory queues, and cache. Rather than migrate these AWS services to the AWS account that housed the Kubernetes cluster, we enabled cross-AWS account access via IAM access controls, and set up VPC peering where necessary.
For the purposes of this blog, the AWS accounts running Kubernetes clusters will be referred to as Kubernetes accounts, and the AWS accounts running data services will be referred to as data service accounts.
Figure 2 shows cross-AWS account communication between compute clusters and data service clusters. It’s worth mentioning that:
- Kubernetes clusters were created and run in the Kubernetes cluster AWS account. Application teams continued to keep their data in their existing accounts (called “Data Service AWS Account” above).
- VPC Peering was established to allow applications in Kubernetes clusters to communicate with resources like DB, Message Queues, etc. that are set up and managed in the Data Service AWS accounts.
- Application teams continued to manage access to their Data Service AWS Account and use their favorite tools for managing the data.
- Access to Data Service AWS Account was controlled through security groups and IAM roles. This is how the Kubernetes cluster AWS account gains access to resources in the Data Service AWS Account.
- This architecture also enabled service teams to keep their non-Kubernetes based setup around for DR purposes (especially in the initial days when they were tiptoeing their way into Kubernetes).
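One common way to grant this kind of cross-account access with IAM roles is a trust policy on a role in the data service account that allows a role in the Kubernetes account to assume it. The sketch below builds such a policy document. The account ID and role name are placeholders, and the post does not confirm that this exact pattern was the one used.

```python
import json


def cross_account_trust_policy(kubernetes_account_id, pod_role_name):
    """Trust policy for a role in the data service account.

    It lets the named role in the Kubernetes account call sts:AssumeRole
    on the role this policy is attached to.
    """
    principal_arn = f"arn:aws:iam::{kubernetes_account_id}:role/{pod_role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and hypothetical role name, for illustration only.
    policy = cross_account_trust_policy("111111111111", "example-service-pod-role")
    print(json.dumps(policy, indent=2))
```

A trust policy only controls who may assume the role. The permissions attached to that role, together with the security groups on the data resources, still determine what the Kubernetes workloads can actually reach.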
Application Topology in Compute Clusters
Each application is typically deployed into a single Kubernetes namespace. The namespace provides isolation of a service within the cluster, and each cluster can have multiple namespaces. Each service team has full access to its namespaces, and can deploy to and configure its application within that namespace. An application/service deployed into a namespace has the following four components, which are represented in Figure 3.
- A Kubernetes deployment consisting of pods running its containers
- A Kubernetes service object that routes traffic to the pods
- An Ingress object which can be configured for external access to services in the cluster
- An AWS ALB created and configured based on the Ingress object above; this ALB is created/configured by the alb-ingress-controller
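The first three components can be sketched as plain manifests, with the ALB being created by the controller once it sees the Ingress. This is an illustrative sketch, not the platform's actual configuration: the application name, image, and ports are made up, the Ingress uses the current `networking.k8s.io/v1` schema (clusters in 2019 would have used an earlier beta API version), and the `NodePort` Service type assumes the controller's instance target mode.

```python
import json


def build_manifests(app, namespace, image, port=8080, replicas=3):
    """Return Deployment, Service, and Ingress manifests for one application."""
    labels = {"app": app}
    metadata = {"name": app, "namespace": namespace}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": dict(metadata),
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {"name": app, "image": image,
                         "ports": [{"containerPort": port}]}
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": dict(metadata),
        "spec": {
            # Assumption: instance target mode, which routes via node ports.
            "type": "NodePort",
            "selector": dict(labels),  # must match the pod labels above
            "ports": [{"port": 80, "targetPort": port}],
        },
    }
    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            **metadata,
            # Annotation that marks this Ingress for the ALB ingress controller.
            "annotations": {"kubernetes.io/ingress.class": "alb"},
        },
        "spec": {
            "rules": [{"http": {"paths": [{
                "path": "/",
                "pathType": "Prefix",
                "backend": {"service": {"name": app, "port": {"number": 80}}},
            }]}}]
        },
    }
    return [deployment, service, ingress]


if __name__ == "__main__":
    # Hypothetical application name, namespace, and image.
    for manifest in build_manifests("tax-calc", "tax-calc-qa", "example/tax-calc:1.0"):
        print(json.dumps(manifest, indent=2))
```

The chain of references is what ties the pieces together: the Ingress backend names the Service, and the Service selector matches the labels on the Deployment's pods.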
To be continued …
The Infrastructure team had designed and built a robust infrastructure well-suited for TurboTax’s needs. But how well did the infrastructure scale? Did TurboTax push its limits? How did the team prepare for Tax D-Day? Find the answers to these questions and more in part 2 of this blog series.
This monumental accomplishment is a tribute to the hard work and dedication of incredibly talented individuals from across the company. Throughout the journey, Intuit TurboTax and Developer Platform engineers applied continuous testing, data-driven decisions, and focused automation to successfully tackle this audacious challenge.
About the Authors
Anusha Ragunathan is a software engineer at Intuit, where she works on building and maintaining the company’s Kubernetes Infrastructure. Anusha is passionate about solving complex problems in systems and infrastructure engineering, and is an OSS maintainer of the Moby (Docker) project. Prior to Intuit, she worked on building distributed systems at Docker and VMware. Her interests include containers, virtualization, and cloud-native technologies.
Shrinand Javadekar is a software engineer in the Modern SaaS Team at Intuit, whose mission is to make Kubernetes the de facto standard for developing, deploying, and running apps at Intuit. The open source project Keiko was born from this work. In the past, Shrinand has been part of large-scale filesystem and virtualization projects at EMC and VMware. However, his most fun gigs have been working on cloud-native platforms and services at startups such as Maginatics and Applatix, and now at Intuit.
Corey Caverly is an architect in the Consumer Tax Group at Intuit working on Site Reliability Engineering. This team is focused on building tools, processes, and patterns that help produce reliable and performant customer experiences. Corey has worked everywhere from universities to biotech; his prior gig was leading a team that developed tools and services to deliver software infrastructure for robots that build DNA based on customer specifications.
Jonathan Nevelson is a software engineer in the Consumer Tax Group at Intuit focusing on Site Reliability Engineering. His primary focus is building a common platform for services to run on Kubernetes and ensuring their performance, security, and reliability. Jonathan’s prior experience includes working with and leading teams across the development stack, from frontend early in his career, to backend and distributed systems work, before finally making his way into SRE and infrastructure.
Rene Martin is a software engineer in the Consumer Tax Group at Intuit. He and the team he leads are focused on consistency, reliability, security, scalability, and performance at scale. Rene has developed his career around the Site Reliability space with a product development mindset. In his previous role, he led the team that supported highly dynamic and global infrastructure for an Internet advertisement company.