TurboTax moves to Kubernetes: An Intuit journey — Part 1

Anusha Ragunathan
Oct 28 · 8 min read

This blog post is co-authored by Anusha Ragunathan, Shrinand Javadekar, Corey Caverly, Jonathan Nevelson, and Rene Martin.

“In this world, nothing is certain except death, taxes and Kubernetes” — Anonymous


Background and scope

Each year, TurboTax sees peak usage in January, as many taxpayers file early to receive refunds; in April, as the regular filing deadline arrives; and in October, as extension filings come due. The scale, seasonality, and strict deadlines of tax filing presented interesting infrastructure challenges around performance at scale. As of 2020, the majority of TurboTax’s supporting services run on Intuit’s Kubernetes platform.

This is a two-part blog series structured as a timeline of events in TurboTax’s Kubernetes journey. Part one discusses the planning and design of the infrastructure for the migration. Part two delves into the technical roadblocks encountered at scale, how they were resolved, and the lessons learned. The series is intended for anyone interested in building an AWS-based Kubernetes platform, running Kubernetes at scale, and handling peak loads.

Prerequisites

Understanding of Kubernetes concepts such as:

  • Networking constructs such as Ingress, Service, Endpoints, and kube-dns
  • Cluster level features such as namespaces
  • Scale components such as Horizontal Pod Autoscaler (HPA)
  • Logging components such as fluentd

Understanding of AWS-specific Kubernetes add-ons such as:

  • alb-ingress-controller
  • KIAM to integrate AWS IAM with Kubernetes

Understanding of AWS concepts such as:

  • Networking such as ALB, VPC peering, and PrivateLink
  • Permission model such as IAM
  • Node provisioning such as autoscaling groups

May 2019: Training and feasibility

TurboTax Online (TTO) ran its services on AWS EC2 for several years. In 2019, Intuit’s platform infrastructure teams decided to adopt containers and Kubernetes in order to achieve:

  • Fast, iterative product development and rollout
  • Consolidation under a single platform for all development teams due to native multi-tenancy support in Kubernetes
  • Efficient resource utilization at high scale for cost reduction
  • Strong ecosystem support
  • Unified distribution mechanism for service artifacts

TurboTax developers had expertise in running their non-containerized services efficiently on EC2 instances. To gain confidence in migrating TTO services to Kubernetes, service developers had to be trained on containers, Kubernetes, and associated technologies.

For several weeks, training programs were conducted on fundamental technologies such as Docker and Kubernetes, as well as advanced concepts relating to multi-tenancy, cluster add-ons, load balancers, and autoscalers. Service teams were also given playground clusters with fully functioning Kubernetes namespaces so they could experiment.

August 2019: Requirements and capacity planning

Compute requirements:

  • 26 clusters spread across two AWS regions, with each cluster using three AWS availability zones. These clusters are spread across multiple teams and business units.
  • ~1,000 Kubernetes nodes across all clusters.

Scale requirements:

The infrastructure was required to support growth from a baseline of 5K TPS to a peak load of 300K TPS within a two-hour window.
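A ramp of this shape is typically absorbed with the Horizontal Pod Autoscaler mentioned in the prerequisites. The sketch below is illustrative only — the service name, replica bounds, and CPU target are assumptions, not TurboTax’s actual settings:

```yaml
# Hypothetical HPA for a TTO-style service; names and numbers are
# illustrative, not Intuit's actual configuration.
apiVersion: autoscaling/v2beta2   # HPA API version current in 2019
kind: HorizontalPodAutoscaler
metadata:
  name: tax-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tax-service
  minReplicas: 10        # baseline capacity
  maxReplicas: 400       # headroom for the traffic peak
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # scale out well before saturation
```

Note that an HPA only adds pods; node capacity has to grow with it, via the autoscaling groups listed in the prerequisites (or a cluster autoscaler).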

DR requirements:

Some services operated in one region and used the other region for DR. Some others operated with an active-active architecture across both regions, with the ability to scale up any one region for DR purposes.

Reliability requirements:

All microservices were required to have an availability greater than 99.99%.
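One common Kubernetes building block toward such a target is a PodDisruptionBudget, which keeps voluntary disruptions (node drains, cluster upgrades) from taking down too many replicas at once. A minimal sketch, with hypothetical names:

```yaml
# Hypothetical PDB; the app label and threshold are illustrative.
apiVersion: policy/v1beta1   # policy/v1 in newer clusters
kind: PodDisruptionBudget
metadata:
  name: tax-service-pdb
spec:
  minAvailable: 90%          # never drain below 90% of desired replicas
  selector:
    matchLabels:
      app: tax-service
```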

Sept 2019: Onboarding services to Kubernetes

Figure 1: Kubernetes namespaces for production and pre-production clusters

Upon onboarding a service to the Kubernetes platform, the service team gets:

  • Preproduction clusters for all pre-production services in the business segment.
  • Production clusters in two regions for High Availability and Disaster Recovery for all production services in the business segment.
  • Out-of-the-box support for logging, monitoring, identity & access management, and multi-tenancy.
  • Access to data resources (databases and object stores) established using VPC peering or AWS PrivateLink, if necessary.

A service onboarded to the platform is provided with a set of environments, as seen in Figure 1. Environments are Kubernetes namespaces. Depending on the cluster type, a set of environments is provisioned:

  • Pre-production clusters are sliced by namespaces for QA (typically used for unit tests, build pipelines, etc.) and E2E (typically used for end-to-end product tests).
  • Production clusters are sliced by namespaces for staging and production.
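Provisioning an environment then amounts to creating a labeled namespace per slice. For instance (all names and labels below are made up for illustration):

```yaml
# One namespace per environment slice; labels let cluster add-ons
# (logging, monitoring) tell environments apart. Names are illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: tto-qa
  labels:
    environment: qa
    team: turbotax
---
apiVersion: v1
kind: Namespace
metadata:
  name: tto-e2e
  labels:
    environment: e2e
    team: turbotax
```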

AWS Account topology

For the purposes of this blog, the AWS accounts running Kubernetes clusters will be referred to as Kubernetes accounts, and the AWS accounts running data services will be referred to as data service accounts.

Figure 2: Cross-AWS account communication for data access

Figure 2 shows cross-AWS account communication between compute clusters and data service clusters. It’s worth mentioning that:

  • Kubernetes clusters were created and run in the Kubernetes cluster AWS account. Application teams continued to keep their data in their existing accounts (called “Data Service AWS Account” above).
  • VPC Peering was established to allow applications in Kubernetes clusters to communicate with resources like DB, Message Queues, etc. that are set up and managed in the Data Service AWS accounts.
  • Application teams continued to manage access to their Data Service AWS Account and use their favorite tools for managing the data.
  • Access to Data Service AWS Account was controlled through security groups and IAM roles. This is how the Kubernetes cluster AWS account gains access to resources in the Data Service AWS Account.
  • This architecture also enabled service teams to keep their non-Kubernetes based setup around for DR purposes (especially in the initial days when they were tiptoeing their way into Kubernetes).
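With KIAM (from the prerequisites), a pod assumes a cross-account IAM role via annotations: the namespace whitelists a role pattern, and the pod requests a specific role. The role names below are hypothetical:

```yaml
# The namespace must permit the roles its pods may assume (a regex).
apiVersion: v1
kind: Namespace
metadata:
  name: prd
  annotations:
    iam.amazonaws.com/permitted: "data-access-.*"
---
# The pod requests a role; KIAM intercepts metadata-API calls and
# serves STS credentials for it. Names and image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: tax-service
  namespace: prd
  annotations:
    iam.amazonaws.com/role: data-access-readwrite
spec:
  containers:
  - name: app
    image: example.com/tax-service:latest
```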

Application Topology in Compute Clusters

Each application is typically deployed into a single Kubernetes namespace. The namespace isolates the service within the cluster, and each cluster can have multiple namespaces. Each service team has full access to its namespaces and can deploy and configure its application within them. An application/service deployed into a namespace has the following four components, represented in Figure 3:

  • A Kubernetes deployment consisting of pods running its containers
  • A Kubernetes service object that routes traffic to the pods
  • An Ingress object which can be configured for external access to services in the cluster
  • An AWS ALB created and configured based on the Ingress object above; this ALB is created/configured by the alb-ingress-controller
Figure 3: Compute cluster topology
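The first three components of the topology in Figure 3 can be sketched as the manifests below (the fourth, the ALB, is created by the alb-ingress-controller from the Ingress). All names, ports, and images are hypothetical; the annotations shown are the controller’s standard ones for an internet-facing ALB:

```yaml
# 1. Deployment: pods running the service's containers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tax-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tax-service
  template:
    metadata:
      labels:
        app: tax-service
    spec:
      containers:
      - name: app
        image: example.com/tax-service:latest
        ports:
        - containerPort: 8080
---
# 2. Service: routes traffic to the pods.
apiVersion: v1
kind: Service
metadata:
  name: tax-service
spec:
  type: NodePort            # needed for ALB instance-mode targets
  selector:
    app: tax-service
  ports:
  - port: 80
    targetPort: 8080
---
# 3. Ingress: the alb-ingress-controller watches this object and
# creates/configures the AWS ALB (component 4).
apiVersion: extensions/v1beta1   # networking.k8s.io/v1 in newer clusters
kind: Ingress
metadata:
  name: tax-service
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  rules:
  - http:
      paths:
      - path: /*
        backend:
          serviceName: tax-service
          servicePort: 80
```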

To be continued …

About the Authors

Shrinand Javadekar is a software engineer in the Modern SaaS Team at Intuit, whose mission is to make Kubernetes the de facto standard for developing, deploying, and running apps at Intuit. The open source project Keiko was born from this work. In the past, Shrinand has been part of large-scale filesystem and virtualization projects at EMC and VMware. However, his most fun gigs have been working on cloud-native platforms and services at startups such as Maginatics and Applatix, and now at Intuit.

Corey Caverly is an architect in the Consumer Tax Group at Intuit working on Site Reliability Engineering. This team is focused on building tools, processes, and patterns that help produce reliable and performant customer experiences. Corey has worked everywhere from universities to biotech; his prior gig was leading a team that developed tools and services to deliver software infrastructure for robots that build DNA based on customer specifications.

Jonathan Nevelson is a software engineer in the Consumer Tax Group at Intuit focusing on Site Reliability Engineering. His primary focus is building a common platform for services to run on Kubernetes and ensuring their performance, security, and reliability. Jonathan’s prior experience includes working with and leading teams across the development stack, from frontend early in his career, to backend and distributed systems work, before finally making his way into SRE and infrastructure.

Rene Martin is a software engineer in the Consumer Tax Group at Intuit. He and the team he leads are focused on consistency, reliability, security, scalability, and performance at scale. Rene has developed his career around the Site Reliability space with a product development mindset. In his previous role, he led the team that supported highly dynamic and global infrastructure for an Internet advertisement company.

Intuit Engineering
