Karpenter Chronicles: A Hilarious Journey to Cost-Effective High Availability

Raghavendra Tandon
Published in OneFootball Tech
Dec 19, 2023

In the ever-evolving landscape of cloud infrastructure management, the tale of OneFootball’s transition from the conventional Cluster Autoscaler to the cutting-edge Karpenter is nothing short of an adventure.

Buckle up as we delve into the nitty-gritty details of our implementation, complete with funny anecdotes that highlight the quirks and triumphs of adopting this groundbreaking technology.

The Prologue: Cluster Autoscaler Blues

Our odyssey commenced with a singular mission: to master the delicate dance between cost-effectiveness and high availability.

Cluster Autoscaler had served us well, but the time had come to embrace the enchanting powers of Karpenter, a tool promising automation and humor in equal measure.

One Fine Day (Lab Day): The Story Unfolds

Alright, buckle up for a wild Lab Day tale, folks! Picture this: a bunch of tech nerds, fueled by caffeine and the audacity to innovate, decided it was high time to spice up their teamwork. Enter the hero of our story — Karpenter, the cluster autoscaling wizard!

So, here we were, donning our lab coats and brainstorming ways to level up our collaborative game. Someone shouts, “Mob programming, anyone?” and the room goes silent for a moment, as if we’d just suggested switching from keyboards to typewriters. But hey, Lab Day is all about pushing boundaries!

Now, why Karpenter, you ask? Well, our existing Cluster Autoscaler setup with spot.io was like that sluggish friend who takes forever to decide where to eat. It did the job but lacked the pizzazz. Plus, the clash of two controllers was like a bad sitcom plot with constant split-brain drama. We wanted something more seamless, with less drama, and definitely without spot.io as a separate line item on the budget.

So, armed with the spirit of experimentation and a dash of “why not?”, we embarked on this mob programming journey. Learning how to collaborate became our team’s new hobby. Spoiler alert: it involved fewer heated debates and more laughs than an episode of a sitcom about techies trying to get along.

And thus, our Lab Day escapade unfolded, with Karpenter leading the charge into the uncharted territory of mob programming. Because who said tech talks can’t have a sprinkle of humor and a pinch of chaos? Cheers to embracing the unknown, one witty line of code at a time!

But enough about how it all started as a mob-programming experiment. Let’s dig deeper into how Karpenter is actually implemented at OneFootball.

Spot Instances: The Bargain Hunters

One of Karpenter’s standout features is its ability to harness the power of spot instances, allowing us to slash costs without compromising performance.

The first time we implemented spot instances, it felt like assembling a team of fearless adventurers ready to take on the perilous journey of unpredictable availability.

Picture this: our CTO wearing a cape, dramatically announcing, “Deploy the Bargain Hunters!” every time a pod sat pending in the queue.

The spot instances, our unsung heroes, would swoop in, execute their tasks, and vanish into thin air like mythical creatures, leaving only a trail of laughter in their wake.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
# ...
spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
  # ...
```
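Because this provisioner allows both capacity types, Karpenter can favor spot and fall back to on-demand when spot capacity is unavailable. A workload that cannot tolerate spot interruptions can still opt out by selecting on Karpenter's well-known capacity-type label. The fragment below is an illustrative sketch for a hypothetical interruption-sensitive workload, not something taken from our manifests.

```yaml
# Pod template fragment for a hypothetical interruption-sensitive workload.
spec:
  nodeSelector:
    # Karpenter labels each node it launches with its capacity type,
    # so this keeps the pod off spot nodes.
    karpenter.sh/capacity-type: on-demand
```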

Consolidation: Karpenter’s Marie Kondo Moment

Karpenter brought a breath of fresh air to our cluster management strategy with its consolidation capabilities. Our Kubernetes clusters underwent a Marie Kondo moment (“Does this node spark joy?”) as Karpenter packed workloads onto fewer nodes and removed or replaced the underutilized ones, optimizing resource usage.

```yaml
spec:
  consolidation:
    enabled: true
```

We fondly recall one of our engineers muttering “Thank you, Karpenter,” while deleting unused pods, imagining the tool as a digital tidying expert. Little did we know that tidying up our clusters would not only spark joy but also significantly reduce our cloud bills.
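Consolidation works by draining nodes, so pods get rescheduled elsewhere. For the rare pod that should run start to finish undisturbed, v1alpha5-era Karpenter releases supported a pod annotation that keeps Karpenter from voluntarily consolidating away the node the pod runs on (later releases renamed it `karpenter.sh/do-not-disrupt`). This is an illustrative fragment for a hypothetical workload rather than something from our setup.

```yaml
# Pod template fragment for a hypothetical run-to-completion workload.
metadata:
  annotations:
    # v1alpha5-era annotation (renamed karpenter.sh/do-not-disrupt later):
    # Karpenter won't voluntarily consolidate the node running this pod.
    karpenter.sh/do-not-evict: "true"
```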

The ttlSecondsUntilExpired Time Travel

Implementing ttlSecondsUntilExpired for nodes was like giving them an expiration date. It was as if our nodes were now equipped with a built-in time machine: they could travel to the future and expire. Once a node outlives the TTL, Karpenter drains it, and its pods trigger fresh capacity if they still need a home, ensuring that no node, however comfortable, could overstay its welcome.

```yaml
spec:
  ttlSecondsUntilExpired: 172800 # 2 days = 60 * 60 * 24 * 2 seconds
```

This feature not only improved resource turnover but also added a touch of sci-fi to our otherwise mundane infrastructure management.
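Put together, the pieces from the spot, consolidation, and TTL sections fit into a single Provisioner. The sketch below assumes the same v1alpha5 API; the name `default` (for both the provisioner and its node template) is hypothetical, and a real provisioner would usually carry more requirements. One design note: in v1alpha5, `consolidation.enabled` cannot be combined with `ttlSecondsAfterEmpty`, since consolidation already removes empty nodes.

```yaml
# Sketch: spot requirements, consolidation, and expiry in one Provisioner.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default            # hypothetical name
spec:
  providerRef:
    name: default          # hypothetical AWSNodeTemplate
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
  # Repack pods onto fewer or cheaper nodes and remove underutilized ones.
  # Mutually exclusive with ttlSecondsAfterEmpty.
  consolidation:
    enabled: true
  # Drain and replace any node older than 2 days (60 * 60 * 24 * 2).
  ttlSecondsUntilExpired: 172800
```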

Scaling Up with Lightning Speed

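Karpenter also shines when it comes to scaling up nodes. Rather than resizing predefined node groups, it watches for unschedulable pods and launches right-sized EC2 instances for them directly, which makes it remarkably quick to react.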

Picture this: a sudden surge in traffic hits your application, and instead of breaking into a cold sweat, you witness Karpenter swiftly adding nodes to your cluster.

It’s like having a superhero for your infrastructure, responding to threats with unparalleled speed.
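To watch this scale-up behavior yourself, a common trick, similar to the one in Karpenter's getting-started guide, is a deployment of pause containers that does nothing except request CPU. The deployment name `inflate` and the image tag are illustrative choices, not part of our setup.

```yaml
# Hypothetical test workload: idle pause containers that only request CPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0              # nothing runs until you scale it up
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"     # Karpenter bin-packs pending pods by their requests
```

Running `kubectl scale deployment inflate --replicas 10` turns the replicas into pending pods, and `kubectl get nodes --watch` should show new nodes joining shortly after. Scaling back to 0 leaves those nodes empty, which consolidation (if enabled) then removes.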

Manifests that Understand Karpenter’s Language

Our application manifests underwent a transformation, learning to speak Karpenter’s comedic code. NodeAffinity became the secret handshake, ensuring pods found their ideal hosts with the finesse of a matchmaking reality show.

To dance the dance of availability, we decided to go beyond the basic steps. Our application manifests now come baked with NodeAffinity preferences that steer pods toward nodes provisioned by Karpenter.

Example: provisioner.yaml

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: core
spec:
  providerRef:
    name: core             # AWSNodeTemplate holding the EC2 settings
  labels:
    nodegroup.onefootball.com/class: core
```

Example: pod.yaml

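```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100  # illustrative; every preferred term needs a weight (1-100)
        preference:
          matchExpressions:
            - key: nodegroup.onefootball.com/class  # label set by the provisioner
              operator: In
              values:
                - core
```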
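
The `providerRef` in the provisioner example points at an AWSNodeTemplate named `core`, which carries the EC2-specific settings. Ours isn't shown here; the sketch below is an assumed minimal version that discovers subnets and security groups by tag, with `my-cluster` as a hypothetical tag value.

```yaml
# Assumed minimal AWSNodeTemplate for the "core" provisioner.
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: core
spec:
  amiFamily: AL2
  # Use subnets and security groups tagged with this (hypothetical) value.
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
```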

PodAntiAffinity and TopologySpreadConstraints were introduced for a touch of sitcom drama, turning our pods into strategic players in a game of high availability chess. It’s like adding a pinch of spice to an already delicious dish.

In the midst of this technical symphony, our pods became characters in a cosmic sitcom, with each manifest tweak delivering punchlines and plot twists. The backstage banter among pods planning their optimal placement added a layer of levity to our infrastructure management saga.

With these configurations, we’ve ensured that pods from the same application are spread across different nodes and zones, minimizing the impact of failures and maximizing availability.
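A minimal pod-template sketch of the two mechanisms, assuming the application's pods carry a hypothetical `app: my-app` label: the spread constraint balances replicas across availability zones, and the anti-affinity term discourages stacking two replicas on one node.

```yaml
# Pod template fragment; "my-app" is a hypothetical label value.
spec:
  topologySpreadConstraints:
    # Keep replicas balanced across availability zones (soft rule).
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
  affinity:
    podAntiAffinity:
      # Prefer not to place two replicas on the same node.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: my-app
```

Both rules are soft here. Switching to `DoNotSchedule` or `requiredDuringSchedulingIgnoredDuringExecution` enforces them strictly, at the risk of pods staying pending when a zone runs short of capacity.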

The cost impact: Because in the end, the wallet needs to smile!

Let’s talk money, because, let’s face it, even our digital endeavors need a dose of fiscal humor!

Spot instances will always be the killer feature of cost optimization. Relying on them is not easy, though: availability of the desired instance types fluctuates, so you have to keep watching the spot market and keep track of when the instances you hold are about to be reclaimed.

Karpenter handles this natively thanks to the price-capacity-optimized allocation strategy, which lets it consider every suitable instance type from a pool of more than 600 AWS EC2 instance types. By utilizing Karpenter’s capabilities, we could identify and select the most appropriate instances based on specific attributes, balancing utilization, cost, and resource allocation across our infrastructure.
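The more instance types a provisioner allows, the more spot pools the price-capacity-optimized strategy has to choose from. The requirements below are an illustrative sketch using Karpenter's well-known AWS labels, not our production values: any compute-optimized, general-purpose, or memory-optimized family newer than generation 2, on amd64, as spot or on-demand.

```yaml
# Illustrative (hypothetical) requirements for a broad instance pool.
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"]
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ["2"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
```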

Instance cost trend over 10 months

Conclusion: A Comedy of Cloud Errors Turned Triumph

In the grand comedy of cloud errors and triumphs, Karpenter emerged as our hero, wielding efficiency and hilarity in equal measure. From spot instances racing against the clock to pods speaking the language of availability, our journey with Karpenter was nothing short of a tech sitcom.

As we bid farewell to the old ways of Cluster Autoscaler, we can’t help but marvel at the transformative power of Karpenter. Our clusters are leaner, our costs are lower, and our infrastructure is more resilient than ever — all thanks to a tool that not only understands the language of Kubernetes but also knows how to tell a good joke along the way.

Cheers to Karpenter, the unsung hero of our cloud infrastructure saga!
