KARGO (pt 2) — A container-focused developer platform on AWS

Fabian Mueller
ProSiebenSat.1 Tech Blog
8 min read · Mar 19, 2024

In this blog post, we would like to give a rough overview of the development of a new product called “KARGO”, which is intended to replace an existing Kubernetes-based platform in our ProSiebenSat.1 datacenters. We will also highlight the challenges and weaknesses of the current platform, including issues with cluster structure, capacity planning, and cost transparency, and explain how we hope to improve on them with the new approach.

Starting point

Let us start with some background information. We are not starting with a greenfield approach: since 2016/2017, we have developed and maintained a Kubernetes-based platform in our datacenters, built on bare-metal servers and a number of other components.

Figure 1 — PKE Overview (Picture by the author, drawn with Miro)

From the start, we tried to apply product management to our internal platforms, which helped us find out what our customers (internal product, development, and operations teams) really need to build, deploy, maintain, and observe their workloads. The On-Prem approach works fine for most cases, but it also has its limitations.

A new idea was born

Therefore, we launched a new product called “KARGO”, which should feel like the old platform but is built completely new on top of AWS. The challenge for the engineering team is to make the migration into the cloud as smooth as possible within a short timeframe.

Figure 2 — Migration Effort (Picture by the author, drawn with Miro)

This picture shows two different approaches. The first focuses on using as many cloud-native services and managed offerings as possible. The second focuses on using Kubernetes (here EKS) and other tooling that we already use with our current On-Prem approach. The yellow bar can be interpreted as ramp-up time, and the blue box marks the state where a customer or a workload reaches a level of experienced cloud usage. With that in mind, and given that the On-Prem platform hosts a lot of very business-critical services, we decided to go with the second approach.

What we learned from the past

With this migration, we also have the chance to improve and let our experience with the On-Prem stack influence the new design, especially when it comes to cluster design and the separation and isolation of workloads. Currently, we have very big multi-tenant clusters organized by business domain (content, data, general infrastructure services, etc.). The only isolation is provided by Kubernetes namespaces and RBAC policies.

In addition, there are no development clusters per domain, which means that non-production and production workloads share the same infrastructure. This can cause “noisy neighbor” effects and potentially lead to problems in production. Just for the record, we do have a development system, but it is only used by the platform team to test updates to Kubernetes and other services before they are pushed to the workload clusters. Another problem with big clusters is the blast radius: one mistake by the infrastructure team can take down a whole cluster, which might end in a big outage affecting many workloads.

We did not only have technical challenges. Capacity planning was really difficult with a traditional “plan your budget one year in advance” approach. Most project teams that want to launch new applications cannot tell us how many CPU cores and how much memory they will need in the end. This makes it very hard to predict how many and what type of bare-metal servers we will need in the future. It also takes a lot of effort to buy the servers, go through the procurement process, place them in the datacenters, get them cabled, and so on. This process can take three to six months before the compute power can actually be used.

Another non-technical problem is cost transparency. Currently, we cannot really say what a workload costs on a monthly or even yearly basis. This is extremely difficult to calculate, especially with shared infrastructure components and multiple teams involved.

New is always better?

Now let us look at how the new approach changes many of the topics mentioned above.

Figure 3 — Cluster Factory (Picture by the author, drawn with Draw.io)

The basic idea here is that we can create and maintain a fleet of clusters in an easy and automated fashion. For this, we built a so-called “cluster factory”, which resides in a central AWS account and can create clusters in other (workload) AWS accounts. To achieve this, we use Kubernetes to create and maintain other Kubernetes clusters, relying on an open-source project called Cluster API, aka “CAPI”. A cluster can be defined with Kubernetes manifest files, which can be parameterized and packaged with Helm. In the end, we can deploy and maintain a cluster with a single command. More about CAPI, how it works, and how we integrated and use it will follow in a separate blog post.
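To give a rough idea of what such a cluster definition looks like, here is a minimal sketch of the CAPI resources for an EKS cluster using the AWS provider (CAPA). The names, namespaces, and exact API versions are illustrative assumptions, not our actual setup; in practice these manifests would be templated inside a Helm chart.

```yaml
# Illustrative Cluster API definition for an EKS cluster (AWS provider / CAPA).
# All names are examples; API versions may differ depending on the provider release.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: team-a-prod
  namespace: clusters
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta2
    kind: AWSManagedControlPlane
    name: team-a-prod-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSManagedCluster
    name: team-a-prod
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: team-a-prod-control-plane
  namespace: clusters
spec:
  region: eu-central-1   # example region
  version: "v1.28"       # example Kubernetes version
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedCluster
metadata:
  name: team-a-prod
  namespace: clusters
```

Packaged as a Helm chart, the “one command” could then look like `helm upgrade --install team-a-prod ./cluster-chart -f team-a-prod.yaml` (chart and values file names are, again, hypothetical).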

With the cluster factory in place, we can “crush” the big clusters. Every team now has the option to get one or more AWS accounts with a cluster pre-installed; as a best practice, we recommend at least two accounts. With this approach, it is possible to split production and non-production workloads. We gain cost transparency that can be sliced by whatever is of interest, e.g., the cost of a product, a workload, or the non-production environments compared to production. And we lower the blast radius, since we now have many clusters spread across many accounts: if a cluster in one AWS account has a problem, the impact is much smaller than with the On-Prem approach.

Another benefit of this approach is that we are now comfortable giving our customer teams AWS / Kubernetes administrator rights, since there is now a one-to-one relation between an AWS account and a customer team. Because the customer teams work with a DevOps mindset, they are also responsible for operating their workloads. With Kubernetes administrator rights, the teams can interact with cluster-scoped objects (before, they could only interact with namespace-scoped objects). This makes it possible for them to use Kubernetes operators, which make it easy to deploy popular open-source software stacks on top of Kubernetes. In addition, they can now use Custom Resource Definitions (CRDs), which extend the Kubernetes API; this pattern is also used by many operators.
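As a minimal illustration of what this unlocks (the group and kind below are invented for this example, not something we actually ship), installing a CRD is a cluster-scoped operation that a namespace-bound user could not perform before, while instances of the new type are then created like any other resource:

```yaml
# Hypothetical CRD — registering it requires cluster-scoped (admin) rights.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.tools.example.com
spec:
  group: tools.example.com
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string
---
# An instance of the new resource type, typically reconciled by an operator.
apiVersion: tools.example.com/v1
kind: Backup
metadata:
  name: nightly-db-backup
  namespace: team-a
spec:
  schedule: "0 3 * * *"
```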

Another aspect of the cluster factory is cluster add-on management. As you might know, Kubernetes can do many things, but at its core it is “only” a container orchestrator and needs additional software to integrate smoothly with the underlying infrastructure: for example, an Ingress controller that can create AWS load balancers (NLB / ALB) to expose workloads outside the cluster, or tooling for secret management. Our current approach was a simple GitLab CI pipeline that deploys the needed software as Helm charts. Since we had a relatively static number of clusters, this approach was quite okay. With the new approach, we will have many more clusters and much more dynamism. Our push-based GitLab CI approach would become unmanageable, because we would need to add or remove every cluster in every pipeline for each piece of software. Therefore, we started to look at the concept of GitOps. The idea behind it is to change from a push-based to a pull-based approach: the desired state of the additional software for every cluster is defined in one Git repository.

Figure 4 — Flux Overview (Picture from unknown author on https://fluxcd.io)

Our GitOps tool of choice is FluxCD, which runs in each cluster. Flux is given a repository as a reference and installs the software based on the parameters defined there. The tool is also capable of reacting to all kinds of Git events, like new commits to a branch, new tags, and so on. This makes it possible to trigger an update to all clusters with a single push to a Git repository. Hence, it is very easy for us to manage the additional software on many clusters.
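As a rough sketch of how this looks in practice (the repository URL and path layout below are illustrative, not our actual structure), a Flux GitRepository source plus a Kustomization is enough to keep a cluster’s add-ons in sync with Git:

```yaml
# Illustrative Flux configuration; URL and path are placeholders.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: cluster-addons
  namespace: flux-system
spec:
  interval: 5m                 # how often Flux checks the repo for changes
  url: https://gitlab.example.com/platform/cluster-addons.git
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-addons
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: cluster-addons
  path: ./flavors/production   # which cluster flavor this cluster pulls
  prune: true                  # remove add-ons that were deleted from Git
```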

If you want to learn more about GitOps and Flux, stay tuned. There will be an additional blog post where we describe our current rollout concept, the GitFlow model we chose, and how we structured our Git repository to reflect the different environments, our so-called cluster flavors.

The challenges

This new approach also comes with some challenges, and it is very important to solve them in a smart way to make the approach a success. At the end of our migration, we will likely end up with a lot of clusters, and our goal should be that maintaining these clusters takes the same amount of our time as the current approach, or less. This means that automation is key. We quickly realized that we must change how we treat a cluster. A good analogy for this is the pets vs. cattle pattern: our current clusters are pets; they have names and come in different kinds (e.g. cat, dog, hamster). For our new approach we need cattle, which means every cluster needs to be the same for the most part. There are still environment-specific settings, however, which need to be injected dynamically during cluster creation and maintenance, as sketched below.
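To illustrate the “cattle with a few knobs” idea, the environment-specific settings could be reduced to a small per-cluster values file that gets injected when the cluster chart is rendered. All keys below are hypothetical examples, not our actual configuration:

```yaml
# Hypothetical per-cluster values file; everything not listed here is identical
# across all clusters and lives in the shared chart defaults.
clusterName: team-a-prod
awsAccountId: "111122223333"
region: eu-central-1
flavor: production          # selects the cluster flavor / add-on set
network:
  vpcCidr: 10.42.0.0/16
nodeGroups:
  default:
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 12
```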

Another thing which can be quite complex and needs attention is the communication between workloads. With the current On-Prem setup, a business domain resides in one big cluster. Each cluster has an overlay network, which provides a soft network isolation: direct communication from outside the cluster is only possible when a workload is explicitly exposed, while communication inside the cluster is easy and very fast. When we crush the big clusters, the communication patterns between workloads will remain the same, and it can and will happen that workloads need to talk across the border of an AWS account. Depending on the amount of data that needs to be transferred between workloads and its sensitivity, we need to decide whether to establish VPC peering between the accounts or to use a (service / network) mesh. We are still researching which solutions can help us here and how they can be integrated.

In the coming weeks and months, you will find even more articles about this topic. We look forward to sharing our insights, experiences, and lessons learned in the following series of blog posts as we navigate the path to our platform’s future. Stay tuned for the next installment and join us as we uncover the technical design choices of KARGO.

Thanks to the entire team for your input on the article and the great work.
