The Airtable Engineering Blog

The Airtable Engineering blog shares stories, learnings, best practices, and more from our journey to build a modular software toolkit.

Managing Kubernetes Resources Across Multiple Clusters

8 min read · May 13, 2025


At Airtable, we use Amazon’s Elastic Kubernetes Service (EKS) to manage Kubernetes control planes so we can focus on deploying our workloads. While Kubernetes has added new features and improved scalability since we adopted it in 2022, fault tolerance remains top of mind, especially as enterprise customers rely on us for mission-critical workflows.

Airtable regularly conducts Kubernetes control plane upgrades for security and compliance. These upgrades may introduce unexpected changes or bugs, which can disrupt workload scheduling or execution. This presents a risk because these upgrades cannot be canaried within a single cluster or gradually rolled out over time. We also upgrade operating system versions, cloud instance types, and CPU architectures. While these technically can be canaried within a cluster, doing so requires product engineers to know details about the nodes being used. Furthermore, many daemonsets that are part of our compute infrastructure don’t lend themselves to single-cluster canarying.

To reduce the potential impact of these changes, we shard all stateless workloads across 3 Kubernetes clusters, such that each cluster only hosts a configurable fraction of a given workload. This gives us resilience against single-cluster failures by reducing the blast radius of an infrastructure bug from 100% of user traffic to a smaller fraction and speeding up recovery by having healthy clusters take over user traffic.

Workloads that can be safely sharded are distributed among multiple clusters.

The above example shows how we split 3 different types of workloads, moving from a single-cluster setup to a multi-cluster setup. For each shardable workload, the workload owner chooses the desired percentage or absolute split per cluster. The suggested split for shardable workloads is 10% on the canary cluster, 45% on the first main shard, and 45% on the second. Additionally, for most shardable workloads, we further sub-divide pods on the canary cluster into canary and baseline deployment groups, so that we can use the baseline group as a control to compare against when monitoring for regressions during deployment of new versions. This lets us restart both deployments at the same time and control for variables unrelated to code changes, such as higher latency at startup.
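
As a rough illustration, the sketch below shows how a traffic split like this could be turned into per-cluster replica counts. The cluster names and the rounding choice are made up for the example; the reconciler’s actual logic is more involved.

type ClusterSplit = { clusterName: string; trafficPercentage: number };

function replicasPerCluster(totalReplicas: number, splits: ClusterSplit[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const split of splits) {
    // Round up so a cluster never runs fewer replicas than its traffic share needs.
    result.set(split.clusterName, Math.ceil((totalReplicas * split.trafficPercentage) / 100));
  }
  return result;
}

// With the suggested 10% / 45% / 45% split and 40 total replicas:
// canary -> 4, main-1 -> 18, main-2 -> 18.
replicasPerCluster(40, [
  { clusterName: 'canary', trafficPercentage: 10 },
  { clusterName: 'main-1', trafficPercentage: 45 },
  { clusterName: 'main-2', trafficPercentage: 45 },
]);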

Using this setup, we can perform operations such as cluster upgrades or OS upgrades with higher confidence and a lower blast radius. This strategy has undeniable advantages but also creates a new challenge: how do we coordinate code rollouts when each cluster has only a subset of the overall state?

Desired properties

Airtable uses Spinnaker for continuous delivery. Our original implementation used complex scripts in Spinnaker pipelines for each workload. This exposed us to problems that increased toil and delayed deployment of new code:

  • Timeouts, node rotations, and transient errors require manual action to restart deploy pipelines. However, doing so is risky because scripts may be non-idempotent.
  • Some workloads deploy custom resource definitions (CRDs), and we need to update our scripts every time a new resource type is deployed.
  • Deviations from the intended state persist until the next deploy (up to several weeks for some workloads), and resources orphaned from a workload need to be manually cleaned up.

We wanted any solution for the above problems to have several properties:

  1. Declarative approach: Similar to Kubernetes deployments, we would like to declare a desired state and have the system converge without a long-lived process for each workload issuing commands.
  2. Resilience to disruptions: The solution should gracefully handle node rotations and pod evictions at any time.
  3. Workload agnosticism: We should not have to push a new version of our solution every time a team wants to deploy a new workload.
  4. Automatic drift detection and correction: Any drift between desired and actual states for managed workloads should be automatically corrected.
  5. Resource lifecycle management: All Kubernetes resources related to managed workloads should be reconciled, but others should not be modified.

The multi-cluster reconciler

Our solution is a service we call the multi-cluster reconciler. The reconciler runs 3 replicas per region and uses leader election via DynamoDB to ensure only one is applying changes at a time. The leader scans DynamoDB to discover workloads to manage, then runs one instance of the reconciliation loop for every workload and gracefully exits during pod shutdown. Each leader pod only manages workloads within its region, which makes reconciliation easier compared to imperative scripts managing global state across regions.
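
The leader election itself can be implemented as a conditional write on a lease item in DynamoDB. The sketch below illustrates that general pattern with the AWS SDK; the table name, attribute names, and lease TTL are placeholders, not the reconciler’s actual schema.

import { DynamoDBClient, PutItemCommand, ConditionalCheckFailedException } from '@aws-sdk/client-dynamodb';

// Each replica tries to write a lease item, succeeding only if no one holds the
// lease or the current lease has expired. The winner acts as leader until its
// lease expires or is renewed.
const LEASE_TTL_MS = 30_000;

async function tryAcquireLease(client: DynamoDBClient, podName: string, region: string): Promise<boolean> {
  const now = Date.now();
  try {
    await client.send(new PutItemCommand({
      TableName: 'reconciler-leader-lease', // placeholder table name
      Item: {
        leaseKey: { S: `reconciler-${region}` },
        holder: { S: podName },
        expiresAt: { N: String(now + LEASE_TTL_MS) },
      },
      // Succeed only if the lease is missing or has already expired.
      ConditionExpression: 'attribute_not_exists(leaseKey) OR expiresAt < :now',
      ExpressionAttributeValues: { ':now': { N: String(now) } },
    }));
    return true; // this replica is now the leader
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) return false; // another replica holds the lease
    throw err;
  }
}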

The multi-cluster reconciler reads state from DynamoDB, S3, and cluster k8s APIs to plan next actions.

We decided on a reconciliation loop because it resembles the declarative approach of Kubernetes operators while providing drift correction and resilience to disruptions. To achieve workload agnosticism, we store metadata about the desired state of each <workload name, cluster name> pair in a region in Amazon’s DynamoDB, and the YAML definitions of each workload’s desired resources in S3. To ensure all workload resources are managed, the reconciler tags all resources with their workloads (more on this in the Challenges section). For safety against bugs, resource deletions are forbidden unless explicitly allowed by a human. The per-cluster configuration stored in DynamoDB looks like this:

type WorkloadConfigForCluster = {
  clusterName: string;
  namespace: string;
  workloadName: string;
  workloadType: 'deployment' | 'daemonset' | 'statefulset' | 'cronjob';
  deploymentGroup: 'canary' | 'stable' | 'baseline';

  state: 'active' | 'deleted';
  allowResourceDeletions?: boolean;
  trafficPercentage?: number;
  autoscaling?: {
    scaleDownDisabled?: boolean;
  };
  emergency?: {
    pauseRollouts?: boolean;
    additionalReplicas?: number;
  };
  yamlLocationInS3?: string;
  previousYamlLocationInS3?: string;

  appliedConfigVersion?: number;
  desiredConfigVersion?: number;
  restartedAt?: string;
  restartReason?: string;
};
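
For illustration, a config entry for a canary shard might look like the following (all values are invented for this example):

const exampleConfig: WorkloadConfigForCluster = {
  clusterName: 'us-east-1-canary',       // hypothetical cluster name
  namespace: 'web',
  workloadName: 'api-server',
  workloadType: 'deployment',
  deploymentGroup: 'canary',
  state: 'active',
  trafficPercentage: 10,                 // the suggested canary share
  yamlLocationInS3: 's3://example-bucket/api-server/v42.yaml', // hypothetical key
  desiredConfigVersion: 42,
  appliedConfigVersion: 41,              // drift: version 42 has not yet been applied
};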

Since the multi-cluster reconciler is implemented as a Kubernetes deployment like any other, it can also deploy new versions of itself and migrate itself between clusters after an initial manual bootstrapping deployment.

Inside the reconciliation loop

At the core of the reconciliation loop lies a simple concept: fetch desired state → fetch actual state → compute next state → compute next actions → apply them. Let’s unpack each of these steps:

Fetch desired state:

  • Fetch desired configs from DynamoDB, and keep the ones relevant to the target workload.
  • For each config, use the included S3 key to download the right YAML definition, then adjust replica counts and autoscaling min/max numbers to reflect the desired traffic percentage on each cluster.
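
Sketched with the AWS SDK, this step could look roughly like the code below. The table and bucket names are placeholders, and the actual implementation differs; concerns like pagination, validation, and error handling are omitted here.

import { DynamoDBDocumentClient, ScanCommand } from '@aws-sdk/lib-dynamodb';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Sketch: pull a workload's configs out of DynamoDB, then download the matching
// YAML definition from S3.
async function fetchDesiredState(ddb: DynamoDBDocumentClient, s3: S3Client, workloadName: string) {
  const scan = await ddb.send(new ScanCommand({ TableName: 'workload-configs' })); // placeholder table
  const configs = (scan.Items ?? []).filter((item) => item.workloadName === workloadName);

  return Promise.all(configs.map(async (config) => {
    const object = await s3.send(new GetObjectCommand({
      Bucket: 'workload-definitions', // placeholder bucket
      Key: config.yamlLocationInS3 as string,
    }));
    const yaml = await object.Body!.transformToString();
    // ...here the replica counts and autoscaling min/max would be adjusted
    // to reflect config.trafficPercentage for this cluster...
    return { config, yaml };
  }));
}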

Fetch actual state:

  • For each config, list all managed resources on the named cluster (more about this in the Challenges section), and determine if rollouts are paused on any deployments or daemonsets.
  • We configured Spinnaker to pause workload rollouts and prevent new code from deploying further when a deploy pipeline is canceled, such as when a bug is discovered.

Compute next state:

2 types of deploy strategies are handled here:

  • Single-cluster workloads live in at most 1 cluster at a time for correctness or performance reasons. When migrating between clusters, removal from the old cluster must precede deployment to the new one.
  • Shardable workloads live in 2 or more clusters for fault tolerance. Scaleups must happen before scaledowns to maintain desired replica counts.
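
The ordering constraint can be captured in a small helper, sketched below with hypothetical types and names:

// Hypothetical sketch of the ordering rules described above.
type ClusterAction = { clusterName: string; kind: 'scaleUp' | 'scaleDown' };

function orderActions(strategy: 'single-cluster' | 'shardable', actions: ClusterAction[]): ClusterAction[] {
  // Single-cluster: drain the old cluster before bringing the workload up elsewhere.
  // Shardable: scale up the gaining clusters first, so total capacity never dips
  // below the desired replica count.
  const firstKind = strategy === 'single-cluster' ? 'scaleDown' : 'scaleUp';
  return [
    ...actions.filter((a) => a.kind === firstKind),
    ...actions.filter((a) => a.kind !== firstKind),
  ];
}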

Compute next actions:

  • If the workload is paused but shouldn’t be (or vice versa), then emit a pause/unpause action.
  • Emit an apply operation to correct any drift between actual and desired states.
  • If any resources should be deleted, emit a delete operation for them. Resources are deleted when they are no longer part of a workload’s YAML definition, or when a workload should be removed entirely from a cluster.
    — For example, if a workload’s deployed resources are [Deployment, ConfigMap, Ingress] and the state to be applied only has [Deployment, Ingress], then we detect the ConfigMap is no longer needed and emit a delete operation for it.
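
Put together, one iteration of the loop has roughly the shape below; each declared function is a placeholder standing in for the corresponding step, not actual reconciler code.

type DesiredState = unknown;
type ActualState = unknown;
type Action = { kind: 'pause' | 'unpause' | 'apply' | 'delete'; description: string };

declare function fetchDesiredState(workloadName: string): Promise<DesiredState>; // DynamoDB configs + S3 YAML
declare function fetchActualState(desired: DesiredState): Promise<ActualState>;  // managed resources per cluster
declare function computeNextState(desired: DesiredState, actual: ActualState): DesiredState;
declare function computeNextActions(next: DesiredState, actual: ActualState): Action[];
declare function applyAction(action: Action): Promise<void>;

async function reconcileOnce(workloadName: string): Promise<void> {
  const desired = await fetchDesiredState(workloadName);
  const actual = await fetchActualState(desired);
  const nextState = computeNextState(desired, actual);
  const actions = computeNextActions(nextState, actual);
  for (const action of actions) {
    await applyAction(action); // pause/unpause, apply, or (human-gated) delete
  }
}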

Ensuring safety and correctness

Setting a high bar for correctness and safety is essential, because the multi-cluster reconciler manages Airtable’s mission-critical workloads. We baked several safeguards into the design and implementation:

  • Unit tests: The core planning logic uses well-contained, extensively unit-tested functions. This enhances safety and makes refactoring safer because we have a large bank of expected behaviors to catch regressions.
  • Rate limits when scaling down workloads: When scaling down or removing a workload, the reconciler does not do so instantly. Instead, it scales down by max(3 pods, 5%) at a time and waits for pods to terminate before proceeding further. This gives automated monitoring and human operators time to intervene if needed (a small sketch of this calculation follows the list).
An example of the reconciler slowly scaling down a workload, allowing the workload owner to notice and make adjustments.
  • Disallow workload deletions by default: A human operator must explicitly allow deletion of a workload’s resources. After the deletion has succeeded, the reconciler clears the flag so further deletes are blocked again. This prevents bugs in the planning logic from deleting production workloads en masse.
  • Feature flags: We implemented 2 feature flags to manage the reconciler:
    — A kill switch to stop all reconciler activity. Thankfully we have never needed to use it, but as with all escape hatches, the best time to build one is before it’s needed.
    — A reconciliation mode flag that can be set to “ignore”, “readOnly”, or “reconcile” per workload. This has been used several times to exempt a workload without using the big kill-switch hammer.
  • Automated alerts: We monitor several aspects of the reconciler’s operations:
    — Delete operations present but waiting for manual approval.
    — Reconciliation is stuck: may indicate bugs in the workload, invalid YAML definitions, or a lack of cluster resources such as suitable nodes.
    — The reconciler has not been running for an extended period of time.
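
The rate limit mentioned above can be computed roughly as in the sketch below, which reads the 5% as 5% of the current replica count (an assumption made for this example):

// Sketch of the "max(3 pods, 5%)" scale-down rate limit. The real reconciler also
// waits for the removed pods to terminate before taking the next step.
function nextScaleDownStep(currentReplicas: number, targetReplicas: number): number {
  const maxStep = Math.max(3, Math.ceil(currentReplicas * 0.05));
  return Math.min(maxStep, currentReplicas - targetReplicas);
}

// Example: going from 200 replicas toward 0 removes at most max(3, 10) = 10 pods per step.
nextScaleDownStep(200, 0); // -> 10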

Challenges

Our first challenge was about linking Kubernetes resources with their workloads. The reconciler can’t rely on YAML definitions for a complete list because a resource may have been removed from them at some point. Our solution was to have the reconciler add the labels `airtableWorkloadName=<name>, deploymentGroup=<canary/baseline/stable>` to all applied resources so they can be listed by label in the future. This also lets us exempt unrelated resources: they won’t have these labels, so they are ignored.

Speaking of resources, the multi-cluster reconciler initially only supported built-in Kubernetes resources like deployments, daemonsets, cronjobs, statefulsets, and config maps. But some workloads such as Spinnaker also define custom resources. Banning them is not viable given they are a widely used Kubernetes concept, so we had to support custom resource definitions (CRDs), again in a workload-agnostic manner.

How does the reconciler know what CRDs are installed? Instead of a hardcoded list of resource types, it lists all resource types known to each cluster via the Kubernetes API, then iterates through them to find any resources tagged for a given workload. There are dozens of possible resource types, and this iteration does take time, but it is a simple solution that has served us well.
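
As a rough illustration of this discovery pattern (shown here by shelling out to kubectl to keep the example short, whereas the reconciler talks to the Kubernetes API directly):

import { execFileSync } from 'node:child_process';

// Sketch: enumerate every resource type the cluster knows about, then look for
// resources carrying the workload's label. Error handling and cluster-scoped
// resources are omitted for brevity.
function findManagedResources(workloadName: string, namespace: string): unknown[] {
  const resourceTypes = execFileSync(
    'kubectl',
    ['api-resources', '--verbs=list', '--namespaced', '-o', 'name'],
    { encoding: 'utf8' },
  ).trim().split('\n');

  const managed: unknown[] = [];
  for (const resourceType of resourceTypes) {
    const output = execFileSync(
      'kubectl',
      ['get', resourceType, '-n', namespace, '-l', `airtableWorkloadName=${workloadName}`, '-o', 'json'],
      { encoding: 'utf8' },
    );
    managed.push(...JSON.parse(output).items);
  }
  return managed;
}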

Results

After going live in production in early 2024, the multi-cluster reconciler now manages tens of thousands of pods on thousands of nodes, distributed across dozens of clusters in 3 geographical regions. Most workloads are deployed 4 times a week, with a few more sensitive ones twice a week, and all these changes are coordinated by the reconciler. It is now a vital part of Airtable’s compute infrastructure, and its importance will only grow as Airtable sets up new infrastructure and expands to additional regions.

Acknowledgments

Many thanks to William Ho, Malcolm Mathew, Jason Michalski, Pranesh Pandurangan, Kevin Cole, Anthony Nguyen, and others for their contributions to this project. We would also like to thank Alexander Sorokin, Xue Cai, Rodrigo Menezes, Brian Larson, Pierpaolo Baccichet, and others for their support and editorial input.
