KARGO (pt 1) — Moving up the stack

Felix Rothballer
ProSiebenSat.1 Tech Blog
5 min read · Feb 13, 2024

We have reached an inflection point with aging hardware and the end of life for key components of our Kubernetes clusters. These factors, coupled with the need to accelerate our pace, have led us to reevaluate our platform strategy.

ProSiebenSat.1 Tech & Services GmbH (PTS) decided to build all systems in the cloud by default. As the internal IT provider of ProSiebenSat.1 Media SE, we are also facing the need for faster time-to-market, increased demand for flexibility, and the desire to leverage cost transparency as a catalyst for architectural improvements and resilience.

The following article explores the genesis of our current platform at PTS, the challenges it faces, and the strategic decision to transition to the cloud.

Current Platform

But first things first: let’s take a look at where we are now and how we got there.

For the past six years, the Platform Engineering & Operations team has successfully maintained a fleet of custom-built vanilla Kubernetes clusters comprising approximately 100 bare-metal nodes hosted in our data centers. This infrastructure serves as the backbone of an internal developer platform catering to around a dozen development teams.

How it all started

Our current platform is built around Kubernetes and started in early 2017 with the goal of providing dynamic infrastructure for a large software development project.

Back then, the CNCF landscape looked meager in comparison to today, and the industry was still debating the merits of Mesos vs. Docker Swarm vs. Kubernetes. In those early days, building your own platform from scratch was a viable thing to do, as off-the-shelf solutions were limited and the few early offerings from public cloud providers were out of reach for us.

So we basically built our own Kubernetes distribution and installer, added a few dozen hand-picked open-source components, and PKE (ProSieben Kubernetes Engine) was born.

Over the next couple of years, it evolved into its current state, a well-rounded platform supporting the entire software development life-cycle of internal product teams working on core systems for our media processing chain.

Building blocks of our current platform

Today’s challenges

Fast-forward to the present, and we are facing several challenges:

  • The storage sub-system was EOL’ed by the vendor after a costly hardware refresh. This puts the tedious task of migrating more than 500 block storage volumes onto our roadmap within the next couple of years.
  • Soon the underlying host OS will no longer receive security updates, leaving the system increasingly vulnerable to security threats, which is of course unacceptable for a production system.
  • Half of our server fleet is due for a hardware refresh in the next 18 months. That confronts us with the decision of whether to make another significant up-front investment and commit to staying on-prem on bare metal for another depreciation cycle.

In addition to those woes, the product teams using our platform are increasingly interested in composing their applications from cloud-based products and services. With most of our capacity locked up in keeping the platform up to date, we can’t devote as much of it to fulfilling our users’ needs as we would like.

Mapping our Strategy with Wardley Maps

In an industry where accomplishing more with limited resources is the norm, we must carefully consider where we allocate our constrained engineering capacity.

The time-consuming task of maintaining our own Kubernetes distribution and installer no longer adds significant value, especially given the wide range of readily available products and services from public cloud providers.

This conclusion became quite obvious when we mapped out a prototypical internal developer platform on a Wardley Map. There is more for us to gain higher up the value chain, where we are closer to our users: the developers in the product teams.

A Wardley map showing the parts of an internal developer platform

Introducing KARGO

We decided to build KARGO, our new developer platform, as much as possible from readily available products and services, and to focus our efforts on making it more than the sum of its parts. Still, this is anything but a green-field project.

We need to consider not only our users’ needs but also the constraints of our current reality. Below are two of the factors that most influenced the design, and how we plan to address them:

  • The new platform needs to have a significantly lower maintenance effort than the current one.
  • The migration has to be easy and can only require minimal initial effort from the product teams.

Maintenance Effort

Our future platform will transition from a handful of large, shared clusters to a more fine-grained setup.

The Oprah meme where she screams “You get a cluster. You get a cluster. Everyone gets some clusters!”

In the realm of architectural trade-offs, finding the perfect balance between cost-efficiency, ease of management, resilience, and security has always been a delicate dance. Traditionally, opting for big, shared clusters seemed to offer the best compromise, delivering both ease of management and improved cost-efficiency. However, this choice comes at the expense of resilience and security considerations.

By drawing upon our experience with “everything as code”, fully embracing GitOps principles, and using Cluster API to manage the lifecycle of our AWS EKS clusters, we break free from the limitations of the past. It’s like shifting the trade-off from an “either-or” scenario to a more inclusive “and” proposition.
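
To make this a bit more concrete, below is a minimal, purely illustrative sketch of what a Cluster API definition for an EKS-based cluster could look like, assuming the Cluster API AWS provider (CAPA) with EKS support. All names, the namespace, region, and version are made up for this example and are not KARGO’s actual configuration; the exact kinds and apiVersions depend on the provider release in use. In a GitOps setup, a manifest like this lives in a Git repository and is continuously reconciled into a running cluster.

# Illustrative sketch only, not KARGO's actual configuration.
# Assumes Cluster API with the AWS provider (CAPA) and EKS support installed.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: team-a-prod                 # hypothetical cluster name
  namespace: clusters               # hypothetical namespace
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta2
    kind: AWSManagedControlPlane
    name: team-a-prod-control-plane
  # infrastructureRef omitted for brevity; it points at the provider's
  # infrastructure object, whose kind depends on the CAPA release.
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: team-a-prod-control-plane
  namespace: clusters
spec:
  region: eu-central-1              # example region
  version: v1.28                    # example Kubernetes version
# Worker nodes would be added via MachinePool / AWSManagedMachinePool
# resources, likewise versioned in Git and reconciled by the GitOps tooling.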

Migration Effort

Staying with Kubernetes for now allows us to maintain a familiar abstraction for the development teams using our platform, easing the learning curve associated with cloud adoption.

Looking ahead, this transition marks only the very first step in our organization’s journey toward optimizing applications to fully leverage the benefits of the cloud. Our goal is to provide a path that initially lets us re-platform the majority of workloads without extensive re-architecting.

In the coming weeks and months, we will publish more articles on this topic, sharing our insights, experiences, and lessons learned as we navigate the path to our platform’s future. Stay tuned for the next installment, in which we uncover the technical design choices behind KARGO.

Thanks to the entire team for your input on the article and the great work.
