The state of Kubernetes platform version management 2024

State of play for versioning, releasing & upgrading internal Kubernetes platforms for Kubernetes end users in 2024.

Nick Gibbon · Pareture · May 17, 2024

Scope

When creating an internal container hosting platform based around Kubernetes you have many strategic options. Even when truly well-informed, choices are context-dependent: what will fit into your organisation and the way it works, and what you wish to target. Few clusters or many? Small clusters or large? Multi or single-tenant? Self-hosted or managed services? Which distributions & technologies to combine? What laws and regulations are you going to be subject to? Can it be managed centrally? Can anything at all be centralised?

Regardless of all of this, the scope of running software will always contain the following:

  • Kubernetes control plane components
  • Kubernetes nodes
  • Platform components

These are the day 1 / day 2 workloads, often called add-ons: the in-cluster components that help run the cluster and provide additional features and integrations. For example DNS, CNIs, CSIs, ingress controllers, policy engines, monitoring integrations, PKI integrations etc.

  • Application components

These are the point of everything else. Usually off-the-shelf products or custom differentiated applications that do something of value for your organisation.

This set of components needs to be actively managed over time so that the platform continues to operate. At all points in time all components must be compatible with each other. This is fundamentally true regardless of the deployment mechanisms involved, whether it’s 100% GitOps or 100% manual. However the components have arrived together, they need to work together and keep doing so.

Objectives

We need to continually patch and upgrade everything in scope for new functionality, access to support and for increased quality in terms of reliability, security, performance etc. through fixes and technical features over time. It’s also a regulatory concern in many industries.

Even if your current platform feels good enough and you think you can avoid this work, you unfortunately still need to do it, so that you don’t get stuck, unable to move, unsupported with no clear path forward, when you eventually find that you do need to take some action. Even if we don’t move, the environment evolves around us. The further we stray from depending on work that is in active maintenance, the greater the risk of a problem occurring that will be difficult to recover from, and that risk compounds when we have many different dependencies in the same poor state.

It’s best to have a forward strategy in this domain that is pragmatic, proactive and predictable. Small and regular doses of discipline can help reliably avoid more chaotic negative outcomes which will ultimately cost a lot more in a variety of ways.

Facts

Kubernetes components

Currently Kubernetes releases a new minor version every ~4 months. Each version of Kubernetes receives ~1 year of patch support. This means that there are always ~3 versions in support: newest in, oldest out with each new release.

The Kubernetes control plane components must be within 1 minor version of each other, so it is only possible to upgrade 1 minor version at a time. The node components can be up to 3 minor versions behind the control plane. For example, for control plane version 1.30, node components at 1.27, 1.28, 1.29 and 1.30 are supported but not 1.31.
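That skew rule is simple enough to express directly. Here is a minimal sketch in Python, assuming the current policy of nodes trailing the control plane by up to 3 minors and never leading it; the function name and versions are only illustrative.

```python
def supported_node_minors(control_plane_minor: int, max_skew: int = 3) -> list[str]:
    """Node minor versions allowed for a given control plane minor:
    they may trail by up to max_skew, but never lead."""
    return [f"1.{m}" for m in range(control_plane_minor - max_skew, control_plane_minor + 1)]

print(supported_node_minors(30))  # ['1.27', '1.28', '1.29', '1.30'] -- 1.31 would not be supported
```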

Big 3 Cloud Providers

AWS EKS supports a Kubernetes version for 14 months after release, followed by a higher-cost extended support period of 12 months for all versions. At the end of this period AWS may forcefully initiate control plane upgrades.

Azure AKS supports a Kubernetes version for 12 months, provides minimal platform support until n+4 has been released, and provides LTS support for specific versions for an additional year.

GCP GKE supports a Kubernetes version for 14 months and at the end of this period may forcefully initiate upgrades.

Platform components

Every single component has a different release cadence and the space is very active.

Application components

Applications have their own dependencies which have their own release cadences and of course the applications themselves are changed over time.

Problems

Continually patching and going through the minor upgrade process for the control plane is not a problem. Correspondingly regularly patching and then upgrading nodes is not a problem. The problem is ensuring that the in-cluster data plane components do not break whilst the Kubernetes system changes. They need to always remain compatible with Kubernetes and each other.

When Kubernetes minor upgrades and related events occur, things can break. That is, something that was running in the cluster before can suddenly fail to do so. Here is a list of reasons I have experienced for this:

  • API migrations (illustrated in the sketch after this list)
  • API deprecation second-order effects

Some API deprecations are not as simple as just using another version. For example, the Pod Security Policy deprecation created a lot of work to implement an alternative so we could move to k8s v1.25 without significant regression in our security posture.

  • Different API Server or Kubelet defaults / settings / features

E.g. built-in admission controllers.

  • Container Runtime migration

Primarily where the underlying Unix socket was being used.

  • Node machine image migration

Although not specific to a k8s version, changing images, whilst completely compatible with the Kubernetes components, caused certain workloads that used host-related features to stop working until changes were made.
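To make the first item above concrete, here is a minimal, illustrative sketch of catching known API removals by scanning rendered manifests before committing to a target version. The removal table and the removed_apis helper are invented for the example and are far from exhaustive; purpose-built tools exist that do this properly.

```python
import sys
import yaml  # PyYAML

# (apiVersion, kind) -> Kubernetes minor in which the API was removed.
# A tiny, hand-picked example table; a real check needs a maintained source.
REMOVED_IN = {
    ("policy/v1beta1", "PodSecurityPolicy"): 25,
    ("batch/v1beta1", "CronJob"): 25,
    ("networking.k8s.io/v1beta1", "Ingress"): 22,
}

def removed_apis(manifest_text: str, target_minor: int) -> list[str]:
    """Return any objects whose apiVersion/kind no longer exists at the target minor."""
    findings = []
    for doc in yaml.safe_load_all(manifest_text):
        if not isinstance(doc, dict):
            continue
        key = (doc.get("apiVersion", ""), doc.get("kind", ""))
        removed_at = REMOVED_IN.get(key)
        if removed_at is not None and target_minor >= removed_at:
            findings.append(f"{key[1]} ({key[0]}) was removed in 1.{removed_at}")
    return findings

if __name__ == "__main__":
    target = int(sys.argv[1])  # e.g. 25 when preparing a 1.25 upgrade
    for finding in removed_apis(sys.stdin.read(), target):
        print(finding)
```

Piping raw manifests or rendered Helm output through something like this gives a quick pre-upgrade signal, but it does not replace the empirical testing described below.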

Some of these events will not occur again, especially as the project continues to become more mature and more stable. But inevitably some similar events will, so that the project can progress where it needs to.

This is all to say that it has never been viable to depend only on understanding the required API migrations. The only reliable way to ensure that everything is compatible is empirically: you need to deploy all of the components at all of the specific versions and run regression tests. To ensure that in-place upgrades are possible you must then actually upgrade, monitor and (you guessed it) test again.

A note for the uninitiated: deploying everything together, testing, upgrading and testing again (and again) takes a long time (hours) and requires constant investment in automation.
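In outline, the loop looks something like the sketch below. The helpers are stubs standing in for whatever provisioning and test tooling you actually use, and the versions and release name are illustrative.

```python
SUPPORTED = ["1.28", "1.29", "1.30"]  # illustrative supported versions

def create_cluster(version: str) -> str:
    print(f"provisioning a test cluster at {version}")
    return f"cluster-{version}"

def deploy_platform(cluster: str, release: str) -> None:
    print(f"deploying platform release {release} to {cluster}")

def run_regression_suite(cluster: str) -> None:
    print(f"running the regression suite against {cluster}")

def upgrade_cluster(cluster: str, to_version: str) -> None:
    print(f"in-place upgrading {cluster} to {to_version}")

def verify_release(release: str) -> None:
    # Fresh install plus regression run on every supported Kubernetes version.
    for version in SUPPORTED:
        cluster = create_cluster(version)
        deploy_platform(cluster, release)
        run_regression_suite(cluster)
    # In-place upgrades are only trusted if exercised: install on n,
    # upgrade to n+1, and test again.
    for old, new in zip(SUPPORTED, SUPPORTED[1:]):
        cluster = create_cluster(old)
        deploy_platform(cluster, release)
        run_regression_suite(cluster)
        upgrade_cluster(cluster, new)
        run_regression_suite(cluster)

verify_release("platform-2024.05")
```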

Choices

Given the nature of the beast: the support periods, the release cadence, the fact that in-place upgrades must progress sequentially and you can't jump, and the requirement for thorough compatibility and upgrade testing on all supported versions, choices must be made. You need to navigate those realities with eyes wide open.

If you have complete control over the clusters and everything in them then you can do whatever you want, whatever works at whatever time. But this post assumes scale involving many teams and many clusters, where we need to provide a consistent path for them with some flexibility.

For starters: don’t support the unsupported. That’s a bad road to go down. It’s hard enough to support the supported. So we’ll rule that out.

Any Way You Want It

Generally, at most, this means that you support each version for around a year and can support up to 3 versions at a time (leaving out cloud provider extended support for now). So every change that you integrate needs to be compatible with, and tested on, 3 specific versions of Kubernetes. Of course not all changes will be compatible across the 3 versions. To handle this in your codebase / configbase you either need a single branch with the flexibility to do various things conditionally based on the version, which can be complex, or a branch per minor k8s version, which is simpler in one way but creates a challenge in effort and comprehension to integrate across all three branches as they diverge. It also means you need to onboard each new version and drop version n-3 every ~4 months.

This option has a good deal of overhead and complexity, but it allows teams to use any version of Kubernetes during its normal support period, for up to 12 months, and to upgrade to the next version at any time during that window. Permutations galore!
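As a sketch of the single-branch, conditional approach, component pins can be keyed on the cluster’s minor version in one place. The component names and version pins below are invented purely for illustration.

```python
# One codebase, with per-minor-version differences made explicit.
PINS_BY_MINOR = {
    28: {"ingress-nginx": "4.9.x", "cert-manager": "v1.13.x"},
    29: {"ingress-nginx": "4.10.x", "cert-manager": "v1.14.x"},
    30: {"ingress-nginx": "4.10.x", "cert-manager": "v1.14.x"},
}

def component_pins(cluster_minor: int) -> dict[str, str]:
    """Return the component versions to deploy for a given Kubernetes minor."""
    try:
        return PINS_BY_MINOR[cluster_minor]
    except KeyError:
        raise ValueError(f"Kubernetes 1.{cluster_minor} is not a supported platform version")

print(component_pins(29))
```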


Golden Path

Alternatively: is all that flexibility a good thing? Who does it serve? In my experience very few teams are up for upgrading minor versions every 4 months across all of their clusters. Most want to leave it as long as possible.

Could we just support a single path through Kubernetes version upgrades? If we made the timeline and communication really clear, couldn’t we enable everyone with a predictable and dependable cadence whilst only needing to develop and maintain one version at a time?

A cadence of 9 months is ideal here: you skip a version (though a transition version is still needed to upgrade in-place) and you release n+2 as you do your final patching release for n.

This means you are not upgrading too often and not needing to go through the process of skipping multiple versions, but you still have a good time cushion before the underlying support window ends. If you supported each version for 12 months, users would have very little time to upgrade to version n+3 before they were out of the Kubernetes support window on version n.

If it took you longer to upgrade, then no matter. It would just mean that you had less time at the new version before the path was extended again. For example, you stay at version n for 12 months instead of 9; when you upgrade to n+2, n+4 will then be ready in ~6 months. But every 9 months, or every n+2, is a really clear target and a reasonable cadence to plan and align around.
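For a rough feel of the cadence, here is a small sketch laying out the golden path timeline of n+2 every ~9 months; the start date and version numbers are purely illustrative.

```python
from datetime import date, timedelta

CADENCE = timedelta(days=270)  # roughly 9 months between platform releases
start_minor, start_date = 28, date(2024, 6, 1)  # illustrative starting point

for step in range(4):
    minor = start_minor + 2 * step           # n, n+2, n+4, ...
    released = start_date + CADENCE * step
    print(f"platform release on Kubernetes 1.{minor}: {released}")
```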


Something Completely Different

There is currently no industry standard for longer support, nor clear coalescence around one. But as we’ve seen, some of the big providers are doing things in this area. Another viable option, where applicable, would be to go down a golden path that uses the 2-year support period. This provides a stable base for a long time. The trouble is that by the end of this period there will have been another ~6 big Kubernetes releases. It may not be feasible to expect 6 in-place upgrades, so a blue / green cluster migration would be needed. As the community and vendors will be developing against the standard supported versions, you risk your platform components becoming difficult to maintain in the second year. There will also be more changes to accommodate at once when you jump ship to the next platform, though for a while you can blissfully forget about big ugly upgrades.


Pick your poison. It’s trade-offs all the way down.

Further watching on the related topic of Kubernetes LTS:
