Optimizing Kubernetes scalability and cost-efficiency with Karpenter

Rodrigo Fior Kuntzer
Published in Miro Engineering
13 min read · Sep 26, 2023

In this post, you’ll learn the rationale and approach taken by Miro’s Compute team to enhance Kubernetes cluster scaling and efficiency. This was achieved by adopting groupless node pools with Karpenter, which helped reduce compute costs in non-production clusters by up to 60% while increasing production resource usage efficiency to up to 95%.

Miro’s Kubernetes platform

Miro enables a new way of working that allows distributed teams of any size to create the next big thing, as they collaborate to solve critical problems and accelerate innovation across their organizations. Such an impactful mission can only be fulfilled if, internally, the product teams at Miro are able to deliver, run, and secure features for end users in a timely and scalable manner. To empower this journey, since early 2021, Miro has adopted Kubernetes, powered by Amazon Elastic Kubernetes Service (EKS), as the next generation of Miro’s compute platform.

Building a platform is an iterative process, with certain capabilities evolving over time. The cluster scaling, in particular, has seen significant advancements. Let’s walk through how Miro’s Kubernetes platform has evolved, now offering best-in-class compute scaling across the engineering organization.

EKS managed node groups

During the initial implementation, it was decided to adopt managed node groups and Cluster Autoscaler (CAS) for the data plane. This setup leveraged AWS building blocks, like EC2 Auto Scaling Groups (ASGs), to create group-based pools of instances as described in the image below:

Schematic showing group-based pools of instances.
Source: https://www.eksworkshop.com/docs/fundamentals/managed-node-groups/

EKS managed node groups are responsible for managing the lifecycle of the Auto Scaling Groups and Launch Templates, while the Cluster Autoscaler is responsible for increasing or decreasing the desired number of instances in each Auto Scaling Group based on the current needs of the Kubernetes cluster. This mechanism worked fine during the first year of the platform, when the number of workloads was small and predictable. However, it didn’t take long for the Compute team to realize that this scaling architecture wouldn’t scale and wouldn’t work well long-term.
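
For context, a group-based data plane of this kind has to be declared ahead of time, with fixed instance types and group sizes. Below is a hedged sketch of such a setup, expressed with eksctl for brevity; Miro’s actual clusters are managed with Terraform, and all names, regions, and sizes here are purely illustrative:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster            # illustrative cluster name
  region: eu-central-1
managedNodeGroups:
  - name: general-purpose       # one group per instance shape
    instanceType: m5.xlarge
    minSize: 2
    maxSize: 20
    desiredCapacity: 2
    labels:
      workload-type: general
  - name: memory-optimized      # a different shape requires yet another group
    instanceType: r5.2xlarge
    minSize: 1
    maxSize: 10
    desiredCapacity: 1
    labels:
      workload-type: memory-intensive

Every new workload shape means another group like the ones above, which is exactly the overhead described in the challenges that follow.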

Biggest challenges when using native EKS autoscaling

  • Computational requirements vary from one workload to another. Some workloads require more CPU, others more memory. Trying to accommodate heterogeneous workloads in group-based Auto Scaling Groups is time-consuming and error-prone, particularly because it is not recommended to mix different instance sizes (medium, xlarge, 2xlarge, etc.) in the same group, as documented in the EKS Best Practices Guide. It’s possible to play it safe and choose large instances; however, that can create cost issues where a significant percentage of the instance’s resources sits idle and un-allocatable. Creating multiple groups, each with instances of the same size, is also an option, but that introduces more moving parts to the cluster configuration, increasing overall complexity.
  • Running Spot instances was unstable. At Miro, the goal is to run the non-production clusters almost entirely on Spot instances; however, that is challenging with Auto Scaling Groups. The key to success with Spot is to diversify the instance types and sizes you use, which is hard with ASGs, where all the instance types you want to use must be specified upfront. Spot availability changes over time, and with a static set of types the clusters were constantly running out of capacity whenever the Spot offering changed on the AWS side.
  • Operations and maintenance on Auto Scaling Groups and EKS managed node groups are incredibly slow. ASGs take a significant amount of time to respond to scaling events and to recover from a shortage of a given instance type; this is mainly observed when the ASG allows Spot instances as a purchase option. On top of that, day-two operations on managed node groups, like rotating nodes and updating AMIs, are also quite slow, which increases the work needed for cluster maintenance.
  • Some workloads require special hardware capabilities, like Compute, Memory, or Graphics Acceleration (GPU). Enabling such workloads is not flexible enough in a group-based architecture. Product teams enabling their workloads had to constantly be in contact with the Compute team, requesting new groups with specific taints and node labels. Ultimately, Miro’s compute team aims to give product teams the freedom to choose the perfect instance type for their needs, without the hassle of having to modify the cluster setup every time they make a change.
  • Scaling ASGs from and to zero was quite troublesome and required a lot of configuration. One of the goals of a cost-effective compute platform is to terminate nodes when they are no longer needed. The Cluster Autoscaler supports scaling groups to zero; however, that comes with complex requirements, particularly around aligning AWS tags on the instances and the Auto Scaling Group (see the illustrative tag set after this list).
  • Using Persistent Volumes with the Cluster Autoscaler also comes with its own challenges, since, once again, the AWS tags on the instances and the Auto Scaling Group must be aligned. Additionally, because EBS volumes are bound to a specific Availability Zone (AZ), a separate group must be created in every AZ.
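
To illustrate the tag-alignment problem: when a group sits at zero, the Cluster Autoscaler can only learn which labels, taints, and extra resources a node from that group would have through node-template tags on the ASG itself. A hypothetical tag set might look like the following; the key prefixes follow the documented Cluster Autoscaler scheme, while the cluster name, label, taint, and resource values are purely illustrative:

# Tags that must be kept in sync on the Auto Scaling Group so Cluster Autoscaler
# can scale it up from zero (values are examples, not Miro's configuration):
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/my-cluster: "owned"
k8s.io/cluster-autoscaler/node-template/label/workload-type: "gpu"
k8s.io/cluster-autoscaler/node-template/taint/example.com/special-taint: "true:NoSchedule"
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: "100Gi"

Keeping these tags aligned with the actual Launch Template and node labels for every group is exactly the kind of manual, error-prone work the team wanted to eliminate.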

Karpenter to the rescue

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity. The project is maintained and supported by AWS and has been around since early 2020. At Miro, the Compute team started paying attention to it, but for a while it didn’t become a priority. The push towards a better cluster scaling solution came when it became impossible to reliably run the non-production clusters on Spot instances. Even with about fifteen instance types defined in the ASGs, when AWS ran low on Spot capacity for certain instance types, the ASG was slow to recover and try another type, which severely impacted the availability of the non-production environments. At a certain point, it was necessary to switch those clusters to on-demand instances until a more reliable solution was in place.

After reviewing the Kubernetes Node Autoscaler landscape, it was decided to give Karpenter a try. By the middle of 2022, the project was already two years old, with solid features and adoption, a passionate community, and strong AWS sponsorship, making it a solid choice for enterprise adoption.

Karpenter is a robust open-source cluster autoscaler that efficiently manages the cluster resources. It automatically provisions new nodes when unschedulable pods arise, ensuring a seamless and scalable operation. One of Karpenter’s key strengths lies in its intelligent resource allocation: By evaluating the aggregate resource requirements of pending pods during a configurable batch window, it makes informed decisions to select the most suitable instance types, optimizing resource utilization.

Furthermore, Karpenter showcases responsible resource management by proactively scaling down or terminating instances that host only daemonset pods, thereby reducing resource waste and promoting overall efficiency. Additionally, its “consolidation” capability actively optimizes pod placement and, when required, replaces nodes with more cost-effective alternatives, resulting in cost reduction and improved cluster efficiency.

In summary, Karpenter significantly enhances the cluster management strategy, enabling a well-balanced and cost-effective infrastructure. Its intelligent resource handling and waste-reduction mechanisms make it an invaluable tool in the cluster management toolkit. Here’s a summary of Karpenter’s core functionality:

  • Watching for pods that the Kubernetes scheduler has marked as unschedulable;
  • Evaluating scheduling constraints (resource requests, node selectors, affinities, tolerations, and topology spread constraints) requested by the pods (see the example pod spec after this list);
  • Provisioning nodes that meet the requirements of the pods;
  • Removing nodes when they are no longer needed.
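
For example, a pending pod like the sketch below carries everything Karpenter needs to decide which node to launch: resource requests plus node selectors for capacity type and CPU architecture. The labels used are the well-known Karpenter/Kubernetes labels; the pod name, image, and request sizes are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                      # illustrative name
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot      # run on Spot capacity
    kubernetes.io/arch: arm64             # prefer Graviton instances
  containers:
    - name: worker
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "2"
          memory: 4Gi

Karpenter batches pods like this over a short window and then launches the cheapest instance type that satisfies their aggregate requirements.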

Introducing the Provisioner concept

Karpenter introduces a powerful concept called “Provisioner”, a Custom Resource (CRD) designed to specify provisioning configurations. Each Provisioner manages a distinct set of nodes, while pods have the flexibility to be scheduled to any Provisioner that aligns with their scheduling constraints. Within a Provisioner, various constraints come into play, impacting which nodes can be provisioned and defining specific attributes of those nodes, such as a node’s time to live (TTL), its taints and labels, and the Provisioner’s weight (its priority relative to other Provisioners).

Here is a practical example of the most common Provisioner fields. For a comprehensive understanding of the provisioning process and to explore practical examples of Provisioners in action, you can refer to the Provisioning documentation:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Enables consolidation which attempts to reduce cluster cost by both removing un-needed nodes and down-sizing those
  # that can't be removed. Mutually exclusive with the ttlSecondsAfterEmpty parameter.
  consolidation:
    enabled: true

  # Karpenter provides the ability to specify a few additional Kubelet args.
  # These are all optional and provide support for additional customization
  # and use cases.
  kubeletConfiguration:
    containerRuntime: containerd
    kubeReserved:
      pid: 1000
    systemReserved:
      pid: 1000

  limits:
    resources:
      cpu: 2000

  # References cloud provider-specific custom resource,
  # see your cloud provider specific documentation
  providerRef:
    name: default

  # Requirements that constrain the parameters of provisioned nodes.
  # Operators { In, NotIn } are supported to enable including or excluding values
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: [spot, on-demand]
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values: [nano, micro, small, medium, large]
    - key: karpenter.k8s.aws/instance-category
      operator: NotIn
      values: [inf, g]
    - key: kubernetes.io/arch
      operator: In
      values: [arm64, amd64]
    - key: kubernetes.io/os
      operator: In
      values: [linux]

  # Configures the maximum time to live for the node, after that time, the node will be drained and terminated
  ttlSecondsUntilExpired: 2628000

  # Provisioned nodes will have these taints
  # Taints may prevent pods from scheduling if they are not tolerated by the pod.
  taints:
    - key: example.com/special-taint
      effect: NoSchedule

  # Labels are arbitrary key-values that are applied to all nodes
  labels:
    billing-team: my-team

  # Annotations are arbitrary key-values that are applied to all nodes
  annotations:
    example.com/owner: "my-team"

  # Priority given to the provisioner when the scheduler considers which provisioner
  # to select. Higher weights indicate higher priority when comparing provisioners.
  # Specifying no weight is equivalent to specifying a weight of 0.
  weight: 10

Node Templates and AWS related settings

The AWSNodeTemplate is another custom resource introduced by Karpenter; it is part of the provider configuration and enables AWS-specific settings. Each Provisioner must reference an AWSNodeTemplate via “spec.providerRef”, and multiple Provisioners may point to the same AWSNodeTemplate. Below is the skeleton of that resource; more information can be found in the documentation:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector: { ... }         # required, discovers tagged subnets to attach to instances
  securityGroupSelector: { ... }  # required, discovers tagged security groups to attach to instances
  instanceProfile: "..."          # optional, overrides the node's identity from global settings
  amiFamily: "..."                # optional, resolves a default ami and userdata
  amiSelector: { ... }            # optional, discovers tagged amis to override the amiFamily's default
  userData: "..."                 # optional, overrides autogenerated userdata with a merge semantic
  tags: { ... }                   # optional, propagates tags to underlying EC2 resources
  metadataOptions: { ... }        # optional, configures IMDS for the instance
  blockDeviceMappings: [ ... ]    # optional, configures storage devices for the instance
  detailedMonitoring: "..."       # optional, configures detailed monitoring for the instance
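
As a concrete, hedged example: assuming the cluster’s subnets and security groups carry a discovery tag such as karpenter.sh/discovery: <cluster-name>, a filled-in template could look like the following. The cluster name, tags, and volume settings below are illustrative, not Miro’s actual configuration:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  # Attach instances to the subnets and security groups carrying the discovery tag
  subnetSelector:
    karpenter.sh/discovery: my-cluster    # illustrative cluster name
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  amiFamily: AL2                          # EKS-optimized Amazon Linux 2 AMI family
  tags:
    team: compute                         # propagated to the underlying EC2 resources
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true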

Cluster consolidation

Consolidation can be enabled per Provisioner, and when enabled, Karpenter employs a pair of methods to reduce the waste of resources:

  • The “Node Deletion” approach evaluates whether all pods on a given node can run on the spare capacity of other nodes in the cluster. If they can, the node becomes a candidate for deletion.
  • The “Node Replacement” technique comes into play when a node’s pods can run on a combination of the spare capacity of other nodes and a single, cheaper replacement node; in that case, the node is replaced.

Consolidation efforts unfold in a sequence of actions:
  1. Starting with “Empty Node Consolidation”, nodes that are entirely vacant are removed in parallel.
  2. The next step is “Multi-Node Consolidation”, in which the simultaneous removal of two or more nodes is considered, possibly along with a single replacement node that is cheaper than all of the removed nodes combined.
  3. Finally, “Single-Node Consolidation” is considered. Here, the focus is on removing an individual node, potentially replacing it with a single, cheaper node.

Given the complexities of multi-node consolidation, Karpenter doesn’t analyze every conceivable option due to impracticality. Instead, it relies on a heuristic strategy to identify nodes that are likely suitable for consolidation. Conversely, for single-node consolidation, each node’s potential for removal is assessed in isolation.
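
Consolidation is turned on per Provisioner. As a reference, using the same v1alpha5 Provisioner API shown earlier, the two mutually exclusive de-provisioning modes look like this (the 30-second value is illustrative):

# Continuous consolidation: actively repack pods and replace nodes with cheaper ones
spec:
  consolidation:
    enabled: true

# Alternative: only delete nodes after they have been empty for 30 seconds
# (cannot be combined with consolidation)
spec:
  ttlSecondsAfterEmpty: 30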

In specific scenarios, node consolidation can become an aggressive and potentially disruptive process: it involves repeatedly relocating workload pods across nodes, which can introduce short periods of downtime. As a result, it becomes very important to stick to the following best practices for Kubernetes workloads:

  • Replica Redundancy: To ensure continuity and resilience, it’s recommended to deploy multiple replicas of the workload. This strategy guarantees high availability even when one of the replicas is subjected to node repositioning.
  • Spread Constraints: It’s important to ensure that replicas of the workload are spread evenly across nodes and failure zones, which can be achieved via Pod Topology Spread Constraints or inter-pod anti-affinity, so that Karpenter won’t consolidate all replicas onto the same node, creating a potential availability disruption if that node goes down (a combined example follows below).
  • PodDisruptionBudget Implementation: Defining a precise PodDisruptionBudget for the workload is advised. This mechanism serves to prevent the simultaneous removal of all replicas from a service, mitigating the risk of service interruption.
  • Graceful Termination Handling: It is crucial to ascertain that the application is equipped to gracefully manage termination signals. This capability ensures that the application can shut down without adversely affecting ongoing in-flight tasks.

Adhering to these practices fosters a more controlled and dependable environment for Kubernetes workloads, minimizing the potential for operational disruptions resulting from aggressive consolidation strategies.
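
Here is a minimal sketch of these practices combined; the workload name, image, and replica counts are illustrative. Three replicas are spread across zones and hosts, and a PodDisruptionBudget keeps at least two replicas available while Karpenter consolidates nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                             # illustrative workload name
spec:
  replicas: 3                           # replica redundancy
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Spread replicas over zones and hosts so consolidation never co-locates them all
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      terminationGracePeriodSeconds: 30 # give the app time to handle SIGTERM gracefully
      containers:
        - name: api
          image: public.ecr.aws/docker/library/nginx:1.25   # illustrative image
          ports:
            - containerPort: 80
---
# Never allow more than one replica to be evicted at a time
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api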

How Miro implements Karpenter

As shown in the diagram below, the architecture of the cluster’s Customer VPC changes when adopting Karpenter. It was decided to keep one Managed Node Group, dedicated to Karpenter via taints and node labels, where the Karpenter pods run. This node group is statically configured to always have two instances running; from there, Karpenter launches ad-hoc EC2 instances with the right instance type and size in the desired AZs and subnets. It looks like this:

Schematic showing Karpenter pods and EC2 instances as implemented by Miro.

At Miro, Terraform is used to manage the Infrastructure as Code needed for creating and managing the baseline configuration of the EKS clusters. The Karpenter installation and configuration is done using Terraform as well, via a combination of the AWS, Helm, and Kubernetes providers.

The code snippet below highlights the key points in the Karpenter installation:

locals {
  service_account_name = "karpenter"
}

provider "aws" {
  region = var.aws_region
}

provider "aws" {
  alias  = "ecr-public"
  region = "us-east-1"
}

data "aws_ecrpublic_authorization_token" "token" {
  provider = aws.ecr-public
}

data "aws_caller_identity" "current" {}

resource "helm_release" "karpenter" {
name = "karpenter"
namespace = var.namespace
wait = true
skip_crds = false
repository = "oci://public.ecr.aws/karpenter"
chart = "karpenter"
version = "v0.27.0"
timeout = 900

repository_username = data.aws_ecrpublic_authorization_token.token.user_name
repository_password = data.aws_ecrpublic_authorization_token.token.password

values = [<<EOF

nodeSelector:
function: karpenter
kubernetes.io/arch: ${var.karpenter_arch}

tolerations:
- key: karpenter
operator: Exists

priorityClassName: system-cluster-critical

topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule

serviceAccount:
name: "${local.service_account_name}"
annotations:
eks.amazonaws.com/role-arn: ${module.karpenter_irsa.iam_role_arn}

settings:
aws.interruptionQueueName: ${module.sqs.sqs_queue_name}
aws:
defaultInstanceProfile: ${aws_iam_instance_profile.karpenter.name}
clusterName: ${var.cluster_id}
clusterEndpoint: ${var.cluster_endpoint}

controller:
env:
- name: AWS_REGION
value: ${var.aws_region}
resources:
requests:
cpu: 1
memory: 256Mi
EOF
]
}

module "karpenter_irsa" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "5.11.2"

role_name = "karpenter-controller-${var.cluster_id}"
attach_karpenter_controller_policy = true

karpenter_tag_key = "karpenter.sh/discovery/${var.cluster_id}"
karpenter_controller_cluster_id = var.cluster_id
karpenter_controller_node_iam_role_arns = [var.node_role_arn]
karpenter_sqs_queue_arn = module.sqs.sqs_queue_arn

oidc_providers = {
ex = {
provider_arn = var.oidc_provider_arn
namespace_service_accounts = ["${var.namespace}:${local.service_account_name}"]
}
}
}

resource "aws_iam_instance_profile" "karpenter" {
name = "KarpenterNodeInstanceProfile-${var.cluster_id}"
role = var.node_role_name
}

Some considerations about the Karpenter installation:

  • IAM Roles for Service Accounts (IRSA) are used to grant the Karpenter pods fine-grained, secure access to the AWS EC2 APIs.
  • For Spot interruptions, Karpenter will start a new node as soon as it sees the Spot interruption warning. Karpenter enables this feature by watching an SQS queue that receives critical events from AWS services that may affect the nodes. For this, Karpenter requires that an SQS queue be provisioned and that EventBridge rules and targets be added to forward interruption events from AWS services to the SQS queue.

Conclusion

Karpenter is a fantastic open-source tool that takes care of things behind the scenes for our cluster. When there are pods that can’t find a place to run, Karpenter swoops in and automatically creates new nodes to handle them. It’s like a resource magician!

One of the coolest things about Karpenter is how smart it is with resource management. It looks at all the pending pods and figures out the best instance type to handle them efficiently. This way, we can make the most of our resources and keep things running smoothly.

But that’s not all! Karpenter is also very conscious about not wasting anything. If it sees any instances with only daemonset pods running on them, it scales them down or gracefully terminates them. No wasting precious resources here!

And wait, there’s more! Karpenter even has a neat feature called “consolidation.” It actively rearranges pods and, if needed, replaces nodes with more cost-effective versions. It’s like having an intelligent cost-saver on our team!

Overall, the addition of Karpenter to our cluster management toolkit is proving to be highly impactful. It plays a significant role in maintaining a harmonious, streamlined, and cost-efficient operation. Let’s extend our appreciation to Karpenter and delve into its key features and the advantages it has brought to Miro’s Kubernetes Platform.

Major benefits of introducing Karpenter

  • Enhanced Cluster Efficiency: Karpenter has elevated our Kubernetes cluster efficiency to an impressive 95%, effectively curbing resource wastage and optimizing compute expenses.
  • Stable Spot Instance Utilization: Particularly in non-production environments, Karpenter has ensured remarkable stability in our usage of Spot instances. These clusters predominantly rely on Spot instances without encountering disruptions or noticeable disturbances. Karpenter uses the price-capacity-optimized allocation strategy, which tells the EC2 Fleet to pick the instance pools for which EC2 has the most spare capacity while also considering price. This strategy balances cost and reduces the probability of a Spot interruption in the near term.
  • Tailored Flexibility for Teams: Karpenter has introduced remarkable flexibility for teams with unique computational needs and preferences for instance types. These teams now have the freedom to incorporate well-known node labels into their workloads, enabling them to specify their exact resource requirements (see the snippet after this list).
  • Automated Node Refreshment: By setting a maximum lifespan of 30 days for nodes, Karpenter has instituted automatic node rotation. This ensures that nodes consistently adopt the latest AMI for the OS family, guaranteeing the prompt adoption of crucial security patches at the OS level.
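
For instance, a team can pin a workload to a particular instance shape purely through node selectors, with no change to the cluster configuration. A hedged sketch of a pod template fragment, using well-known Karpenter/Kubernetes labels with illustrative values:

# Pod template fragment: request a memory-optimized 2xlarge Graviton node
nodeSelector:
  karpenter.k8s.aws/instance-category: r     # memory-optimized families (r6g, r7g, ...)
  karpenter.k8s.aws/instance-size: 2xlarge
  kubernetes.io/arch: arm64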

We’ll continue to feature Miro Engineering stories on our blog, so you can get a glimpse behind the scenes and the impact Miro engineers have on the product. Be sure to follow us to get reminders in your inbox when we post about engineering culture, technology issues, and product developments.

Are you interested in joining our team? Then check out our open positions. Finally, don’t forget to try Miro or the Miro Developer Platform, where you can build apps on top of Miro.
