Lessons From Our 8 Years Of Kubernetes In Production — Two Major Cluster Crashes, Ditching Self-Managed, Cutting Cluster Costs, Tooling, And More
Early on at Urb-it, before I joined, we decided to use Kubernetes as a cornerstone for our cloud-native strategy. The thinking behind this choice was our anticipated rapid scaling, coupled with the desire to leverage container orchestration capabilities to get a more dynamic, resilient, and efficient environment for our applications. And with our microservice architecture, Kubernetes fit well.
Early Decision
The decision was made early, which, of course, should be questioned, since it represents a significant dependency and a substantial amount of knowledge to carry for a startup (or any company). Also, did we even face the problems Kubernetes solves at that stage? One might argue that we could have initially gone with a sizable monolith and relied on that until scaling and other issues became painful, and only then made the move to Kubernetes (or something else). Also, Kubernetes was still in early development. But let's dig into that another time.
8 Years In Production
Having run Kubernetes for over eight years in production (separate cluster for each environment), we’ve made a mix of good and not-so-good decisions. Some mistakes were simply a result of “otur när vi tänkte” (bad luck in our decision-making), while others originated from us not entirely (or not even partly) understanding the underlying technology itself. Kubernetes is powerful, but it also has layers of complexity.
We went head-on without any previous experience of running it at scale.
Migrating From Self-Managed On AWS To Managed On Azure (AKS)
For the first years, we ran a self-managed cluster on AWS. If my memory serves me well, we initially didn't have the option to use Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), or Amazon Elastic Kubernetes Service (EKS), since they didn't yet provide official managed solutions. It was on this self-hosted Amazon Web Services (AWS) setup that we had our first and most horrible cluster crash in Urb-it history, but more on that later.
Since we were a small team, it was challenging to keep up with all the new capabilities we needed. At the same time, managing a self-hosted cluster required constant attention and care, which added to our workload.
When managed solutions became generally available, we took some time to evaluate AKS, GKE, and EKS. All of them were multiple times better for us than managing it ourselves, and we could easily see the quick ROI of moving.
Our platform back then was 50% .Net and 50% Python, and we were already using Azure Service Bus, Azure SQL Server, and other Azure services. Therefore, moving our cluster to Azure would not only make it easier to use them in an integrated fashion but also benefit us by utilizing the Azure Backbone Networking Infrastructure, avoiding the costs associated with leaving/entering external networks and VNETs, which we had between our mixed AWS and Azure setup. Also, many of our engineers were familiar with Azure and its ecosystem.
We should also mention that for our initial setup on AKS, we didn't have to pay for the control plane nodes (master nodes), which was an extra bonus (saving money on nodes).
We migrated during the winter of 2018, and even though we have encountered some issues with AKS over the years, we have never regretted the move.
Cluster Crash #1
During our self-managed time on AWS, we experienced a massive cluster crash that resulted in the majority of our systems and products going down. The Root CA certificate, etcd certificate, and API server certificate had expired, which caused the cluster to stop working and prevented us from managing it. Support for resolving this in kube-aws was limited at the time. We brought in an expert, but in the end, we had to rebuild the entire cluster from scratch.
We thought we had all the values and Helm charts in each git repository, but, surprise, surprise, that wasn't the case for all services. On top of this, none of the configurations for creating the cluster were stored. It became a race against time to set up the cluster again and populate it with all the services and products we had. Some of them required recreating the Helm charts to fill in the missing configurations. There were moments like Dev1 to Dev2: "Do you remember how much CPU or RAM this service should have, or what network and port access it should have?" Not to mention all the secrets that were gone with the wind.
It took us days to get it up and running again. Not our proudest moment, to say the least.
Thanks to proactive communication, transparency, and honesty, and by nurturing our relationships, we didn't lose any business or customers.
Cluster Crash #2
And now you might say: the second crash couldn’t have been due to a certificate, since you must have learned your lesson from the first crash, right? Yes and no. When recreating the cluster from crash #1, unfortunately, the specific version of kube-aws that we used had an issue. When it created the new clusters, it didn’t set the expiration of the etcd certificate to the provided expiry date; it defaulted to one year. So exactly one year after the first cluster crash, the certificate expired, and we experienced another cluster crash. However, this one was easier to recover from; we didn’t have to rebuild everything. But it was still a weekend from hell.
Side note 1: Other companies were also affected by this bug the same way we were, not that it helped our customers…
Side note 2: Our plan was to update all the certificates after a year, but to give ourselves some margin, we set the expiration to two years (if I remember it correctly). So we had plans to update the certificates, but the bug beat us to it.
Since 2018, we have not had any more cluster crashes… Jinxing it? Yes.
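Given that both crashes boiled down to not knowing when internal certificates would expire, here is a minimal sketch of how to check expiry dates up front. The file path and host are placeholders (our old kube-aws setup differed), and the kubeadm command only applies to kubeadm-managed clusters.

```
# Check the expiry date of a certificate file on a control-plane node
# (the path below is the kubeadm default; adjust it for your own setup).
openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt

# Check the certificate presented by a running API server endpoint.
echo | openssl s_client -connect <api-server-host>:6443 2>/dev/null \
  | openssl x509 -noout -enddate

# On kubeadm-managed clusters, list the expiration of all managed certificates.
kubeadm certs check-expiration
```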
Learnings
- Kubernetes Is Complex
You need engineers who are interested in and want to work with the infrastructure and operations aspects of Kubernetes. In our case, we needed a couple of engineers who, in addition to their regular duties, would devote their time to Kubernetes as the “go-to” experts whenever necessary. The workload for Kubernetes-specific tasks varied, as you might imagine. Some weeks there was almost nothing to do, while others required more attention, such as during a cluster upgrade.
It was impossible for us to rotate and split the work over the entire team; the technology is too complex to "jump in and out of" every second week. Of course, everyone needs to know how to use it (deploying, debugging, etc.) — but to excel in the more challenging aspects, dedicated time is necessary. Additionally, it's important to have someone who leads with a vision and has a strategy for evolving the cluster.
- Kubernetes Certificates
Having experienced two cluster crashes, both caused by expiring certificates, we learned that it's crucial to be well-versed in the details of internal Kubernetes certificates and their expiration dates.
- Keep Kubernetes & Helm Up To Date
When you fall behind, it becomes expensive and tedious. We always waited a couple of months before jumping on the latest version to ensure that others would face any new version issues first. But even with keeping it up to date, we faced many time-consuming rewrites of configuration files and charts due to new versions of Kubernetes and Helm (Kubernetes APIs going from alpha to beta, beta to 1.0, etc.). I know Simon and Martin loved all the Ingress changes.
- Centralized Helm Charts
When it came to the Helm charts, we grew tired of updating all 70+ charts for each version change, so we adopted a more generic "one chart to rule them all" approach. There are many pros and cons to a centralized Helm chart approach, but in the end, this suited our needs better (a sketch of the idea follows after this list).
- Disaster Recovery Plan
I can’t emphasize this enough: make sure to have ways to recreate the cluster if needed. Yes, you can click around in a UI to create new clusters, but that approach will never work at scale or in a timely manner.
There are different ways to handle this, ranging from simple shell scripts to more advanced methods like using Terraform (or similar). Crossplane can also be used to manage Infrastructure as Code (IaC) and more.
For us, due to limited team bandwidth, we settled on storing and using shell scripts (a minimal sketch follows after this list).
Regardless of the method you select, make sure to test the flow from time to time to ensure you can recreate the cluster if needed.
- Backup Of Secrets
Have a strategy for backing up and storing secrets. If your cluster goes away, all your secrets will be gone. And trust me, we experienced this first-hand; it takes a lot of time to get everything right again when you have multiple different microservices and external dependencies.
- Vendor-Agnostic VS "Go All In"
In the beginning, after moving to AKS, we tried to keep our cluster vendor-agnostic, meaning that we would continue to use other services for container registry, auth, key vaults, etc. The idea was that we could easily move to another managed solution one day. While being vendor-agnostic is a great idea, for us, it came with a high opportunity cost. After a while, we decided to go all-in on AKS-related Azure products, like the container registry, security scanning, auth, etc. For us, this resulted in an improved developer experience, simplified security (centralized access management with Azure Entra ID), and more, which led to faster time-to-market and reduced costs (volume benefits).
- Custom Resource Definitions
Yes, we went all in on the Azure products, but our guiding star was to have as few Custom Resource Definitions as possible, and instead use the built-in Kubernetes resources. However, we had some exceptions, like Traefik, since the Ingress API didn't fulfill all our needs.
- Security
See below.
- Observability
See below.
- Pre-Scaling During Known Peaks
Even with the auto-scaler, we sometimes scaled too slowly. By using traffic data and common knowledge (we are a logistics company and have peaks at holidays), we scaled up the cluster manually (ReplicaSet) a day before the peak arrived, then scaled it down the day after (slowly, to handle any second peak wave that might occur); see the example after this list.
- Drone Inside The Cluster
We kept the Drone build system in the stage cluster; it had some benefits but also some drawbacks. It was easy to scale and use since it was in the same cluster. However, when too many builds ran at the same time, they consumed almost all the resources, forcing Kubernetes to rush to spin up new nodes. The best solution would probably be a pure SaaS offering, so we wouldn't have to worry about hosting and maintaining the product ourselves.
- Pick The Right Node Type
This is very context-specific, but depending on the node type, AKS reserves roughly 10–30% of the available memory (for internal AKS services). So for us, it was beneficial to use fewer but larger node types. Also, since we were running .Net on many of the services, we needed to choose node types with efficient and sizable I/O. (.Net frequently writes to disk for JIT and logging, and if this requires network access, it becomes slow. We also made sure that the node disk/cache was at least as large as the total configured node disk size, again to prevent the need for network jumps.)
- Reserved Instances
You can argue that this approach goes a bit against the flexibility of the cloud, but for us, reserving key instances for a year or two resulted in massive cost savings. In many cases, we would save 50–60% compared to the "pay as you go" approach. And yes, that's plenty of cake for the team.
- k9s
https://k9scli.io/ is a great tool for anyone who wants a higher level of abstraction than pure kubectl.
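To illustrate the centralized Helm chart idea from the list above: each service keeps only a small values file, and one shared chart renders the Deployment, Service, Ingress, and so on. The chart name, registry, and numbers below are hypothetical, not our actual setup.

```
# values-orders.yaml: hypothetical per-service values consumed by a shared chart,
# deployed with e.g. `helm upgrade --install orders ./common-chart -f values-orders.yaml`
name: orders
image:
  repository: myregistry.azurecr.io/orders
  tag: "1.42.0"
replicaCount: 3
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi
ingress:
  enabled: true
  host: orders.example.com
```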
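For the disaster recovery plan, this is roughly what the shell-script approach can look like on AKS. All names and sizes are made up, and a real script would also restore namespaces, secrets, and every Helm release.

```
#!/usr/bin/env bash
set -euo pipefail

RG=my-prod-rg            # hypothetical resource group
CLUSTER=my-prod-aks      # hypothetical cluster name
LOCATION=westeurope

# Recreate the resource group and the AKS cluster itself.
az group create --name "$RG" --location "$LOCATION"
az aks create \
  --resource-group "$RG" \
  --name "$CLUSTER" \
  --node-count 3 \
  --node-vm-size Standard_D8s_v3

# Fetch credentials and re-deploy every service from its chart and values.
az aks get-credentials --resource-group "$RG" --name "$CLUSTER"
for svc in services/*; do
  helm upgrade --install "$(basename "$svc")" ./common-chart -f "$svc/values.yaml"
done
```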
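And for pre-scaling during known peaks, the manual part can be as simple as the commands below, assuming standard Deployments and an AKS node pool; the names and counts are placeholders.

```
# The day before a known peak: scale the workload beyond what the autoscaler
# would reach in time, and optionally pre-scale the node pool so pods can land immediately.
kubectl scale deployment orders-api --replicas=12 -n production
az aks nodepool scale --resource-group my-prod-rg --cluster-name my-prod-aks \
  --name default --node-count 8

# The day after: step back down gradually in case of a second wave.
kubectl scale deployment orders-api --replicas=6 -n production
```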
Observability
Monitoring
Ensure you track the usage of memory, CPU, etc., over time so you can observe how your cluster is performing and determine if new capabilities are improving or worsening its performance. With this, it’s easier to find and set the “correct” limits for different pods (finding the right balance is important, since the pod is killed if it runs out of memory).
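As a small illustration of those limits: the numbers below are made up and should come from your own monitoring data, but the key point is that a container exceeding its memory limit gets OOM-killed.

```
# Per-container resource settings in a Deployment's pod template (illustrative values).
resources:
  requests:
    cpu: 200m        # what the scheduler reserves for the pod
    memory: 256Mi
  limits:
    memory: 512Mi    # exceeding this gets the container OOM-killed
```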
Alerting
Refining our alerting system was a process, but eventually, we directed all alerts to our Slack channels. This approach made it convenient to receive notifications whenever the cluster was not functioning as expected or if any unforeseen issues arose.
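We won't go into our exact alerting stack here, but as one common way to wire "all alerts to Slack", a Prometheus Alertmanager route can look like this; the webhook URL and channel are placeholders.

```
# alertmanager.yml (fragment): route everything to a Slack channel.
route:
  receiver: slack-alerts
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: "#k8s-alerts"
        send_resolved: true
```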
Logging
Having all logs consolidated in one place, along with a robust trace ID strategy (e.g. OpenTelemetry or similar), is crucial for any microservices architecture. It took us 2–3 years to get this right. If we had implemented it earlier, it would have saved us a considerable amount of time.
Security
Security in Kubernetes is a vast topic, and I highly recommend researching it thoroughly to understand all the nuances (e.g., see the NSA and CISA Kubernetes Hardening Guidance). Below are some key points from our experience, but please note, this is by no means a complete picture of the challenges.
Access Control
In brief, Kubernetes isn’t overly restrictive by default. Therefore, we invested considerable time in tightening access and implementing least-privilege principles for pods and containers. Additionally, due to specific vulnerabilities, an unprivileged attacker could potentially escalate their privileges to root, circumventing Linux namespace restrictions, and in some cases even escape the container to gain root access on the host node. Not good, to say the least.
You should set a read-only root filesystem, disable service account token auto-mounting, disable privilege escalation, drop all unnecessary capabilities, and more. In our specific setup, we used Azure Policy and Gatekeeper to make sure we didn't deploy insecure containers.
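As a concrete sketch of those hardening settings on a pod spec (illustrative, not our exact policy set):

```
# Pod spec fragment applying the hardening defaults mentioned above.
spec:
  automountServiceAccountToken: false    # don't mount the service account token unless needed
  securityContext:
    runAsNonRoot: true
  containers:
    - name: app
      image: myregistry.azurecr.io/app:1.0.0
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```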
In our Kubernetes setup within AKS, we leveraged the robustness of Role-Based Access Control (RBAC) to further enhance security and access management.
Container Vulnerability
There are many good tools out there that can scan and validate containers and other parts of Kubernetes. We used Azure Defender and Azure Defender for Containers to target some of our needs.
Note: Instead of getting stuck in “analysis paralysis” trying to find the perfect tool, the one with all the “bells and whistles”, just pick something and let the learning begin.
Our Setup Over The Years
- Deployments
As with many others, we use Helm to manage and streamline the deployment and packaging of our applications on Kubernetes. Since we started using Helm a long time ago and initially had a mix of .Net/Go/Java/Python/PHP, we have rewritten the Helm charts more times than I dare to remember.
- Observability
We started using Loggly together with FluentD for centralized logging, but after a couple of years, we moved over to Elastic and Kibana (ELK stack). It was easier for us to work with Elastic and Kibana since they are more widely used, and also, in our setup, it was cheaper.
- Container Registries
We started with Quay, which was a good product. But with the migration to Azure, it became natural to use Azure Container Registry instead, since it was integrated and thus a more "native" solution for us. (We also then got our containers under the Azure Security Advisor.)
- Pipelines
From the start, we have been using Drone for building our containers. When we first began, there weren’t many CI systems that supported containers and Docker, nor did they offer configurations as code. Drone has served us well over the years. It became a bit messy when Harness acquired it, but after we caved in and moved to the premium version, we had all the features we needed.
Game Changer
In the last few years, Kubernetes has been a game-changer for us. It has unlocked capabilities that enable us to scale more efficiently (handling volatile traffic volumes), optimize our infrastructure costs, improve our developer experience, test new ideas more easily, and thus significantly reduce the time-to-market/time-to-money for new products and services.
We started with Kubernetes a bit too early, before we really had the problems it would solve. But in the long run, and especially in the latest years, it has proven to provide great value for us.
Final Words
Reflecting on eight years of experience, there’s an abundance of stories to share, with many already fading into memory. I hope you enjoyed reading about our setup, the mistakes we made, and the lessons we learned along the way.
Thanks for reading.