On Amazon EKS and Karpenter

Dirk Michel
14 min read · Jul 26, 2022


Introducing a mechanism that dynamically scales the Amazon EKS data plane in response to changing application requirements is a long-standing and important concern. Despite the rise of serverless computing options such as AWS Fargate, an Amazon EC2-based data plane remains imperative for many applications, for example those requiring large node sizes or specialist compute resources. And those nodes need automatic scaling.

Various Kubernetes node scaler projects have emerged over time, each with its own history, implementation paradigm, general-purpose and differentiating features, and configuration options. Some may consider the problem space of node scaling already solved and make an “en-passant” choice of a mature, well-respected, and popular project such as Cluster Autoscaler.

But node scaling may still surprise you… The concept of application-centric node scaling opens up a way to rethink the topic and re-evaluate your current choice.

The Karpenter open source project is a proponent of application-centric node scaling, and this blog highlights the features that help realise a set of use cases and create an “application developer experience” for node scaling.

For those on a tight time budget: The TL;DR of the next sections is to show how platform teams can introduce Karpenter as an application-centric node scaler that enables Kubernetes application developers to directly define high availability and compute specifics with their workload manifests.

Application-centric node scaling with Karpenter

We begin by articulating use cases, based on which we derive the required Karpenter configuration. Karpenter integrates well with a GitOps way of deploying and operating Kubernetes clusters and it can flexibly accommodate a wide range of goals. We’ve come to value the following:

High Availability: We want to provide developers with the possibility of defining scheduling constraints for their applications and rely on the node auto-scaler to provision node capacity in a way that can accommodate them.

Compute guard rails: We want to provide a safe working environment for developers and protect against accidental over-provisioning or use of unwanted instances and capacity types. This kind of protection shall not overly degrade our instance use flexibility or cost-optimisation potential.

Specialist compute: We want to enable development teams with specialist compute, without creating toil for platform teams. Some machine-learning application workloads benefit from or may require GPU-based instances, for example.

Node rotation: We want to help enable a secure cluster environment by automatically replacing nodes with the latest Amazon Machine Image (AMI) and security patches. Using container-optimised AMIs such as Bottlerocket improves the node defence layer.

Node consolidation: We want to ensure that our node fleet capacity is optimised and runs with a small amount of unused resources or slack. Application workloads on clusters change over time and we want to regularly bin-pack them across a tightly-fitted fleet of worker nodes.

We can now define the appropriate configuration that we need based on our use-cases. As a sidebar, review this post to see how Karpenter contributes to cost optimisation objectives.

Application-centric objectives and use cases

But let’s do a quick recap before we get started: Karpenter’s primary job is to add nodes to handle unschedulable workloads or pods. To do so, Karpenter must first solve for the constraint requirements defined by the unschedulable workloads to identify an optimised set of nodes that satisfy those constraints. Once identified, Karpenter goes ahead and provisions the right capacity in the right place before scheduling the pods on those very nodes.

By default, Karpenter chooses from the full set of compute capacity options available from the cloud provider, but it offers a range of options for constraining its node provisioning behaviour. These are categorised into what is often referred to as the “layers of constraints”:

  • The first layer of available constraint parameters is defined by the choice of cloud provider, in our case AWS. These are constraints that only apply when using a specific provider.
  • Next, we have Kubernetes administration layer constraints that we can apply via Karpenter so that they apply to everyone. We declare constraints in specific places, depending on the constraint parameter: the karpenter-global-settings ConfigMap; environment variables, aka controller args, which can be passed directly to the Karpenter controller container via its deployment manifest; and the NodePool and EC2NodeClass custom resource definitions, or CRDs for short.
  • The third layer would then be the constraints that the Kubernetes application developer defines and has control over. The developer can flexibly use Kubernetes scheduling constraints and rely on Karpenter to decide which nodes should be provisioned where. All within the guardrails provided by the second layer.

We need to be aware of these concepts and where the “configuration knobs” are that we can adjust. We’ll leverage all layers to realise our stated use-cases.

Equally, note that we assume the Karpenter beta version is already deployed on an Amazon EKS cluster: the beta version matters, as prior releases use different CRDs. This means that you’ve already decided on a range of things, such as the cluster add-on deployment pattern and the method by which to install the Karpenter helm chart. The Karpenter controller pods are already up, and the AWS IAM resources that allow the controller to interact with the AWS APIs and provision Amazon EC2 instances on our behalf are created and in place. You can refer to the Karpenter installation docs and the EKS Best Practices for Karpenter for further details.
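
For orientation only, a minimal sketch of Helm values for a beta-era Karpenter chart installation could look like the following. The cluster name, queue name, and IAM role ARN are placeholders, and the exact value keys can differ between chart versions, so verify them against the chart you actually deploy.

# Illustrative Helm values for the Karpenter chart (beta releases); placeholders throughout.
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterControllerRole # hypothetical IRSA role
settings:
  clusterName: my-eks-cluster # placeholder cluster name
  interruptionQueue: my-karpenter-queue # optional SQS queue for interruption handling
controller:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi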

Finally, we also assume that the Kubernetes platform team would generally define constraints at the NodePool level, which is clean, doesn’t require controller pod restarts, and lends itself to popular developer-centric operating models such as GitOps. We define a “default NodePool” to declare our constraints and then, when we have a good reason to do so, we can define other NodePools as well.
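
As a starting point, a minimal default NodePool might look like the following sketch. The nodeClassRef name, capacity-type requirement, and CPU limit are illustrative placeholders that we refine in the sections below.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default # references an EC2NodeClass named "default" (placeholder)
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # illustrative: restrict the default pool to on-demand capacity
  limits:
    cpu: "1000" # illustrative fleet-wide ceiling
  disruption:
    consolidationPolicy: WhenUnderutilized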

Let’s do it!

High Availability

Currently, the configuration for high availability characteristics resides mostly in the hands of application developers.

At the platform level, we can’t currently configure the Amazon EKS kube-scheduler or the Karpenter controller to achieve a “custom default for high availability” directly. Kubernetes, for example, does allow a cluster-level default topology spread profile to be configured through a KubeSchedulerConfiguration resource, but this option is not exposed as part of the Amazon EKS managed control plane.
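
For illustration, such a cluster-level default would look roughly like the following KubeSchedulerConfiguration sketch, which could be applied to a self-managed control plane but not to Amazon EKS. The values shown are assumptions rather than a recommendation.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway # illustrative: soft default spread across AZs
          defaultingType: List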

The application developer could request that their workload replicas be spread across AWS Availability Zones (AZs) by defining topology spread constraints. This is achieved with a topologyKey of topology.kubernetes.io/zone; Karpenter expects this well-known key and won’t comply if the value is something else, for example, zone. Kubernetes itself spreads replicas onto different available nodes by default, so nothing needs to be set specifically to achieve that, but spreading Deployment replica pods across AZs does require an explicit statement, as shown in the snippet below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-across-az-1
  labels:
    app: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-server
      containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: nginx

Then we rely on Karpenter’s node launch decisions to provision new nodes in the applicable AZs to accomplish our desired AZ spread. Remember: The Kubernetes scheduler marks workloads as unschedulable when it cannot satisfy the constraints set by the developer. Then Karpenter comes in and picks up a batch of unschedulable pods according to its batch parameter settings, interprets their constraints, provisions optimised nodes that satisfy those constraints and then schedules the pods.

In fact, Karpenter understands many Kubernetes scheduling constraint definitions that developers can use, including resource requests, node selection, node affinity, topology spread, and pod affinity/anti-affinity.

The developer can achieve a similar behaviour to topologySpreadConstraints by using pod anti-affinity definitions with a topologyKey. The next snippet shows how.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-across-az-2
  labels:
    app: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web-server
              topologyKey: topology.kubernetes.io/zone
      containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: nginx
          resources: {}

With this manifest, Karpenter won’t place two pods with the label app: web-server into the same AZ, as this would not comply with the podAntiAffinity requirement.

From a platform team and Karpenter configuration perspective we can define a NodePool that explicitly states the AZs that nodes can be provisioned in, but this is more about constraining which AZs Karpenter can use and less about enforcing desired pod spreads.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

The platform team can, however, help increase application high availability characteristics by provisioning persistent volumes (PV) in a way that enables Karpenter’s built-in persistent volume topology awareness. Karpenter can detect and consider AZ locations of Amazon EBS-backed PVs when calculating its node launch decisions. This ensures that node capacity is provisioned in an AZ that contains the specific PV that the workload requires.

For example, Karpenter can resolve and trace the volumeClaimTemplates from a StatefulSet all the way down to the actual PV, determine the PV’s AZ, and include this result when provisioning node capacity. Karpenter reads the spec.nodeAffinity field of existing PV objects that a given StatefulSet replica pod will need to access. This is the case for both statically and dynamically provisioned EBS CSI PVs. An example of a statically provisioned PV with nodeAffinity is shown below.

apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    failure-domain.beta.kubernetes.io/region: eu-west-1
    failure-domain.beta.kubernetes.io/zone: eu-west-1a
  name: my-pv-name
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 100Gi
  csi:
    driver: ebs.csi.aws.com
    fsType: xfs
    volumeHandle: vol-02e7e7a6c9604101c
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - eu-west-1a
  persistentVolumeReclaimPolicy: Retain
  storageClassName: my-storage-class
  volumeMode: Filesystem

With this manifest, platform teams can rely on Karpenter to provision nodes specifically in AZ eu-west-1a for pods that are accessing this PV. Another example based on dynamic PV provisioning is described here.
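
For the dynamically provisioned case, a StorageClass with delayed volume binding lets the EBS CSI driver create the volume in the AZ that the scheduler and Karpenter ultimately select. A minimal sketch, assuming the EBS CSI driver is installed and reusing the storage class name from the example above, could look as follows.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-storage-class # placeholder name, matching the PV example above
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer # delay binding until a pod, and therefore an AZ, is chosen
reclaimPolicy: Delete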

Compute guard rails

With Karpenter, we enable application developers to request specific compute directly. A developer might declare a nodeSelector to ask for a node that matches a specific AWS instance type, or use nodeAffinity in conjunction with a requiredDuringSchedulingIgnoredDuringExecution statement to request a node of a specific kind for their workload. In both cases, Karpenter would readily oblige and provision such nodes if they don’t already exist with the requisite amount of spare capacity.

The example pod spec snippets illustrate this:

---
[...]
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: r5.24xlarge
[...]
---
[...]
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - r5.24xlarge
[...]

Developers can use many Well-Known Labels, Annotations and Taints that Karpenter supports to further influence Karpenter’s node provisioning behaviour.
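
For example, a developer could use the well-known karpenter.sh/capacity-type label to request Spot capacity for a fault-tolerant workload. The following pod spec sketch assumes the NodePool allows the spot capacity type.

[...]
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot # well-known Karpenter label; requires a NodePool that allows spot
    kubernetes.io/arch: amd64 # standard Kubernetes well-known label
[...]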

With that flexibility in mind, we typically want platform teams to safeguard the developer environments by narrowing down what Karpenter can provision. We have two ways of achieving that, both of which can be configured with the NodePool CRD, and we expand on our previous example with additional requirement statements.

One option is to define a specific list of EC2 instances that Karpenter must choose from. With the following NodePool definition, Karpenter would decline scheduling requests that fall outside of the “EC2 instance type allow-list”.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      [...]
      requirements:
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [m5, r5]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [nano, micro, small, large]
  [...]

An alternative and perhaps more flexible option is attribute-based instance selection, in which case we define the attributes of the kind of worker node we want, such as architecture type and greater-than (Gt) / less-than (Lt) CPU and memory quantities, and let Karpenter identify the optimal node based on the pending pods that need scheduling.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      [...]
      requirements:
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
            - "7"
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
            - "65"
        - key: karpenter.k8s.aws/instance-memory
          operator: Gt
          values:
            - "130048"
        - key: karpenter.k8s.aws/instance-memory
          operator: Lt
          values:
            - "1049600"
  [...]

Specialist compute

The Karpenter docs refer to the provisioning of specialist compute instance types as hardware isolation. Most use cases are well served with a single default NodePool but multiple NodePools can be useful when isolating nodes with different node constraints.

We can define a separate NodePool per specialist compute type. When doing so we want to ensure that NodePools are mutually exclusive, as otherwise, Karpenter may find multiple suitable NodePools and randomly select one of them. We can also attach a .spec.weight value to NodePools to help with NodePool selection: The scheduler considers the weight as a priority indicator when selecting across multiple possible NodePools.

Offering specialist compute to developers will typically entail demonstrating a concern for cost-optimisation. A good practice would be to define a ceiling on the quantity of CPU and memory the NodePool can consume, to stop runaway scaling. Defining taints can also be effective in guarding against workloads being scheduled onto expensive hardware they may not require.

This is shown here.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: specialist-compute
spec:
  template:
    spec:
      [...]
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p3.8xlarge", "p3.16xlarge"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  weight: 10
  limits:
    cpu: "900"
    memory: 9000Gi
  [...]

Interestingly, Karpenter has a built-in safeguard, as it will only provision specialist compute such as p3 GPU instances if a NodePool specifically enables it.
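
To land on this NodePool, a GPU workload would then tolerate the taint and request the GPU resource. The following pod spec sketch assumes the NVIDIA device plugin is installed to expose nvidia.com/gpu; the container name and image are placeholders.

[...]
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: trainer # hypothetical container name
      image: my-registry/ml-training:latest # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1 # extended resource exposed by the NVIDIA device plugin
[...]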

Node rotation

Equipping an Amazon EKS data plane with a node-update mechanism can be useful. Many are familiar with updating Managed Node Groups or updating nodes when replacing a fleet as part of an EC2 retirement event. Regular node rotations are essential to mature security practices and to keeping Kubernetes versions up to date.

A supplementary benefit of node rotations is that developers receive a clear signal that data plane nodes are immutable and ephemeral, which encourages building applications that cater for this gracefully. Systematising node rotations can play an essential role in the software-development-lifecycle domain, as the various release channels would help identify opportunities for improving application reliability.

With Karpenter we’ve stripped away AWS abstractions such as node groups, but we can, for example, leverage its node expiry timer and the drift detection feature to force node rotations without creating unnecessary toil.

The below snippet shows a NodePool that attaches a 30-day lifespan to any nodes it provisions.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  [...]
  disruption:
    expireAfter: 720h # 30 days
  [...]

Karpenter attempts to gracefully drain expired nodes and replaces them with nodes on the latest AMI, assuming that we requested the $LATEST version rather than a specific version.

Using node expiry to trigger the rotation method may not result in updated AMIs if none are available at the time of expiry. Still, I’d consider this a reasonable trade-off in exchange for avoiding toil.

The somewhat newer Karpenter drift feature can be an alternative to the timer-based approach and gives us more control over when the fleet is rotated. This is where the EC2NodeClass definition comes in… Drift is enabled by default and automatically identifies changes we make to the EC2NodeClass definitions and any resulting differences to the NodeClaims of the running nodes. Once drift is detected, caused, for example, by updated spec.amiSelectorTerms in the EC2NodeClass, the nodes are rotated and replaced gracefully whilst respecting Kubernetes objects such as pod disruption budgets.

There are some subtleties to observe, though, to ensure we achieve our desired AMI updates:

EC2NodeClass: Within the NodePool definition, we specify a spec.template.spec.nodeClassRef that references an EC2NodeClass custom resource, which contains our required amiFamily and other details we want Karpenter to implement when launching nodes. The following snippet shows an example.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - name: bottlerocket-aws-k8s-1.27-x86_64-v1.16.0-d2d9cf87
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: arn:aws:kms:xxx
        volumeSize: 2Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: arn:aws:kms:xxx
        volumeSize: 200Gi
        volumeType: gp3
  tags:
    MyTag: "1234"
    MyBackupTag: "yes"
  userData: |
    [settings.kubernetes]
    kube-api-qps = 30
    [settings.kubernetes.eviction-hard]
    "memory.available" = "2.5%"
    "nodefs.available" = "15%"
    "nodefs.inodesFree" = "10%"
    "imagefs.available" = "20%"
    [settings.kubernetes.eviction-soft]
    "memory.available" = "5%"
    [settings.kubernetes.eviction-soft-grace-period]
    "memory.available" = "5m30s"

A note on the data disk: the second deviceName is set to /dev/xvdb so that Bottlerocket mounts it as its data volume, which holds container images and storage. If you set it to /dev/xvda instead, the volume is created, but Bottlerocket won’t be able to use it.

With this EC2NodeClass, Karpenter launches nodes with our defined Bottlerocket OS AMI, a larger data volume, a set of additional tags, and some custom userData settings, and the drift feature replaces any nodes whose NodeClaims show differences. The following snippet shows an example NodeClaim.

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "xxx"
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/managed-by: tfge-prod
    karpenter.sh/nodepool-hash: "xxx"
  creationTimestamp: "xxx"
  finalizers:
    - karpenter.sh/termination
  generateName: default-
  generation: 1
  labels:
    karpenter.k8s.aws/instance-category: x
    karpenter.k8s.aws/instance-cpu: "32"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: x1e
    karpenter.k8s.aws/instance-generation: "1"
    karpenter.k8s.aws/instance-hypervisor: xen
    karpenter.k8s.aws/instance-memory: "999424"
    karpenter.k8s.aws/instance-network-bandwidth: "5000"
    karpenter.k8s.aws/instance-size: 8xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/nodepool: default
    kubernetes.io/arch: amd64

Notice that not all fields of the EC2NodeClass are represented in the NodeClaim, which means that some EC2NodeClass details can change without leading to a detected drift.

Consolidation

Maintaining a tight fit between the fleet of worker nodes and application workloads has preoccupied many platform teams that seek to “defragment” clusters, especially when workloads change frequently over time. Adjusting the fleet to changing workloads is necessary when optimising for aggregate fleet-level resource utilisation.

Karpenter has long been able to remove “empty” nodes from the fleet with its consolidationPolicy set to WhenEmpty. Still, the Karpenter beta version improves on that with its new WhenUnderutilized policy option that uses two methods for consolidation: Deletion and Replacement.

The Deletion method bin-packs workloads by removing a node when all of its pods can run on the available capacity of the other remaining nodes in the cluster. The Replacement method “shrinks” a node when a smaller, cheaper node can replace it and the excess pods fit onto the other remaining nodes. The following snippet illustrates the use of spec.disruption.consolidationPolicy.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
    budgets:
      - nodes: "10%"
      - nodes: "0"
        duration: 165h
        schedule: "0 8 * * sun"

Notice the spec.disruption.budgets field. With Karpenter disruption budgets, we can define a time window during which consolidation is active and allowed to optimise the fleet. For example, defining a window during which consolidation can disrupt, rotate, and consolidate nodes is helpful when application workloads are not resilient to worker node changes.

Hosting application workloads that are not resilient to worker node changes forces a tradeoff: On the one hand, we want to optimise the fleet as often as possible and reduce slack; on the other hand, we aim not to disrupt applications. Disruption budgets help us navigate this tradeoff by “enabling” consolidation, for example, only during 5–8 am UTC on a Sunday, as shown in the above snippet.

Side note: The disruption budgets field takes an array of one or more budgets, with the most restrictive one taking precedence. A budget window has two components: the schedule, which follows cron syntax and defines the start of the window during which disruptions will not occur, and the duration field, which defines the length of that window.

Additionally, we can use Karpenter’s pod-level controls for workloads that should never be voluntarily disrupted. Nodes that host pods carrying the karpenter.sh/do-not-disrupt: “true” annotation are excluded from consolidation. Application developers can add this annotation to their non-resilient but critical workloads.
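
A minimal sketch of how a developer might set this on a Deployment’s pod template:

[...]
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true" # excludes the node hosting this pod from voluntary disruption
[...]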

Conclusions

With Karpenter, we can introduce a node scaler that implements the paradigm of doing things the “cloud-native way” that aligns with our objective of extending the “developer experience” approach to running and operating Amazon EKS clusters. Karpenter is an excellent option for platform teams that look to enable application developers to define compute specifics as part of the application manifests directly. This empowers the development community and incorporates many of the principles of DevOps and GitOps that we’ve come to value. Platform teams can also achieve higher cluster resource utilisation and lower slack without incurring the penalty of additional toil.
