Genius of Admiral
Authors: Jason Webb, Anil Attuluri, Vijay Iyer
Intuit is the proud maker of TurboTax, QuickBooks, and Mint. We are a mission-driven, global financial platform company that gives everyone the opportunity to prosper.
The Case for Service Mesh
As a global software company, we care deeply about how our services communicate with each other; this communication is the lifeblood that powers our products. We constantly strive to reduce overhead, provide scalable security, and observe the traffic between services.
With this in mind, we found it important to bring service mesh to our enterprise for several reasons, including the ability to:
- Abstract system-level security into the proxy
- Have common infrastructure to manage credential rotation
- Eliminate this burden from the development teams
- Give better visibility and control to our security teams
- Move cross-cutting concerns such as circuit breaking and fault injection into client proxies and away from application logic
- Obtain client-side observability metrics (our existing Gateway infrastructure prevented visibility into client metrics unless development teams instrumented the application)
Started Off with a Simple Cluster, But Things Quickly Got Complicated
With promising new functionality in mind, we set off to see how we could bring service mesh to our enterprise and started with a simple single-cluster deployment. The ingress used an API Gateway, while the east-west communication utilized Istio.
This simple setup worked well; however, as we looked at production deployments, the requirements were much more complicated than a simple single-cluster install. Intuit was operating 160 Kubernetes clusters across multiple regions with no cross-cluster L3/L4 pod connectivity. Since we used so many clusters to reduce the security blast radius, satisfy compliance requirements, and support HA/DR deployments, it made sense for us to pursue multi-cluster support in Istio.
A more realistic deployment would look like this diagram below:
The services are broken down into several clusters, and service communication crosses cluster boundaries. We evaluated the different options provided by Istio and settled on using an Istio control plane per cluster.
With this deployment (shown above), the proxies connect to the local Istio control plane. Each control plane holds the Istio configuration that enables services to communicate between clusters. Another benefit of this topology is that the Istio config is co-located with the service’s k8s deployment config, making troubleshooting more straightforward.
Off to a Good Start But Multi-Cluster Has Many Requirements
We were off to a good start but quickly realized the configuration for multi-cluster was complicated and would be challenging to maintain over time. There were also several requirements we needed to solve for, including:
- Creation of service DNS decoupled from the namespace
- Service Discovery across many clusters
- Support Active-Active & HA/DR deployments with services being deployed in globally unique namespaces in clusters in different regions
Supporting Active/Active and HA/DR deployments quickly became the most complicated of the above requirements. The namespace naming schema used at Intuit gives every namespace a globally unique name across 160+ clusters. This meant the same service binary deployed in different regions would run in namespaces with different names. We needed a globally unique service DNS that could resolve services in multiple clusters, each with a unique k8s FQDN.
Services would also need several DNS names for the same service, each with different resolution and global routing properties. For example, default.payments.global would resolve locally first and then route to a remote location (using topology routing). The payments service would also need names such as payments-west.global and payments-east.global that resolve to the respective region but fail over to the other region. Such names could then be used for testing one region during deployments and for troubleshooting.
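Under this scheme, the different routing behaviors would hang off distinct ServiceEntry hosts. A rough, illustrative fragment (host names follow the examples above; the actual resolution behavior is driven by endpoint localities and locality-based failover, configured elsewhere):

```yaml
# Illustrative host list only -- not Admiral's exact generated output.
hosts:
- default.payments.global   # topology routing: local region first, then remote
- payments-west.global      # pins to us-west, with failover to us-east
- payments-east.global      # pins to us-east, with failover to us-west
```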
Next Step: Introduce Contextual Configuration
As we investigated how to solve this, it became apparent that configuration needed to be contextual. By contextual, we mean that the configuration for a service could differ depending on how and where the service and its clients are deployed across clusters.
Here is an example:
- We have a payments service consumed by the orders and reports services
- The payments service has an HA/DR deployment across us-east (cluster 3) and us-west (cluster 2)
- The payments service is deployed in namespaces with different names in each region
- The orders service is deployed in a different cluster from payments, in us-west (cluster 1)
- The reports service is deployed in the same cluster as payments, in us-west (cluster 2)
In the diagram below, the Istio ServiceEntry yaml for the payments service in Cluster 1 and Cluster 2 illustrates the contextual configuration needed for other services to consume the payments service:
Cluster 1 Service Entry
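A sketch of what the Cluster 1 ServiceEntry might look like. The VIP, load balancer hostnames, and metadata name are hypothetical placeholders; port 15443 is Istio's conventional multi-cluster ingress gateway port. From cluster 1, both payments regions are remote, so both endpoints point at ingress load balancers:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: default.payments.global-se
spec:
  hosts:
  - default.payments.global
  addresses:
  - 240.0.0.10            # hypothetical mesh-internal VIP
  location: MESH_INTERNAL
  resolution: DNS
  ports:
  - name: http
    number: 80
    protocol: http
  endpoints:
  # us-west: cluster 2 ingress gateway load balancer (hypothetical hostname)
  - address: abc123-us-west-2.elb.amazonaws.com
    locality: us-west-2
    ports:
      http: 15443
  # us-east: cluster 3 ingress gateway load balancer (hypothetical hostname)
  - address: def456-us-east-2.elb.amazonaws.com
    locality: us-east-2
    ports:
      http: 15443
```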
Cluster 2 Service Entry
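And a corresponding sketch for Cluster 2, where payments runs locally in us-west: the us-west endpoint is the local Kubernetes FQDN (with a hypothetical globally unique namespace name), while us-east still routes through cluster 3's ingress load balancer:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: default.payments.global-se
spec:
  hosts:
  - default.payments.global
  addresses:
  - 240.0.0.10            # hypothetical mesh-internal VIP
  location: MESH_INTERNAL
  resolution: DNS
  ports:
  - name: http
    number: 80
    protocol: http
  endpoints:
  # us-west: local k8s FQDN; "payments-usw2" is a hypothetical globally unique namespace
  - address: payments.payments-usw2.svc.cluster.local
    locality: us-west-2
    ports:
      http: 80
  # us-east: cluster 3 ingress gateway load balancer (hypothetical hostname)
  - address: def456-us-east-2.elb.amazonaws.com
    locality: us-east-2
    ports:
      http: 15443
```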
The payments ServiceEntry from the point of view of the reports service in cluster 2 sets the locality us-west pointing to the local k8s FQDN, and the locality us-east pointing to the ingress load balancer for cluster 3.
The payments ServiceEntry from the point of view of the orders service in cluster 1 sets the locality us-west pointing to the cluster 2 ingress load balancer, and the locality us-east pointing to the ingress load balancer for cluster 3.
But Wait, There’s More…Complexity
If all this sounds confusing, it’s because it is…but there’s more complexity!
What if the payments service wants to move traffic to the us-east region for planned maintenance in us-west?
This would require the payments service to change configuration in all of its clients’ clusters, updating the DestinationRule locality settings. This would be nearly impossible to do without automation.
Admiral to the Rescue: Admiral is that Automation!
Admiral automates the configuration necessary to enable service to service communication by abstracting the cluster and deployment topologies from service discovery and consumption.
Admiral provides automatic configuration for deployments spanning multiple clusters to work as a single mesh. Admiral also provides automatic provisioning and syncing of Istio configuration across clusters. This removes the burden on developers and mesh operators and helps scale beyond a handful of clusters.
Admiral Introduces a New Type to Control Global Traffic Routing
With Admiral’s global traffic policy CRD, the payments service can now update regional traffic weights, and Admiral takes care of updating the configuration in every cluster that has clients consuming the payments service.
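A hedged sketch of such a policy, shifting 90% of traffic to us-east (field names follow Admiral's GlobalTrafficPolicy CRD, but the exact schema and apiVersion may vary by Admiral release):

```yaml
apiVersion: admiral.io/v1alpha1
kind: GlobalTrafficPolicy
metadata:
  name: payments-gtp
spec:
  policy:
  - dns: default.payments.global
    lbType: 1               # weighted traffic distribution across regions
    target:
    - region: us-west-2
      weight: 10
    - region: us-east-2
      weight: 90            # 90% of traffic routed to us-east
```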
In the example above, 90% of the payments service traffic is routed to the us-east region. This global traffic configuration is automatically converted into Istio configuration and contextually mapped into clusters, enabling multi-cluster global routing for the service and its clients within the mesh.
Admiral provides powerful new global routing functionality that was missing from multi-cluster service mesh implementations, removes the need for manual configuration synchronization between clusters, and generates contextual configuration for each cluster. This makes it possible to operate a service mesh composed of many Kubernetes clusters!