Stop using one large cluster: Best practices for Kubernetes cluster management

Sunny Yang · Published in Nerd For Tech · 7 min read · Nov 30, 2021

TL;DR:

Organizations still commonly run one large Kubernetes cluster instead of many small ones. However, centralized cluster management can reduce availability, slow down development, and even trap product teams in organizational silos.

Although large clusters benefit us to a certain degree, supplemental practices can mitigate the inconvenience of running many small, distributed clusters: prepare starter kits for teams to speed up setup, and keep a compliance gate with the help of policies. Meanwhile, building communities around cross-functional techniques is also essential, because distributed clusters demand more capabilities from teams than centralized administration does.

Definition

“Large” is a pretty vague way to describe a Kubernetes cluster. You can compare two clusters at several different levels, such as nodes, pods, and containers.[1] There is no absolute criterion that tells a large cluster apart.

“Large” is often an intuition, like the Bloaters code smell, that tells us something deeper is wrong. Rather than large, we had better describe the feeling precisely with the word complicated, because that sense always implies a thing is doing too much and is not simple anymore.

A LARGE cluster (from learnk8s.io)

In practice, we can treat a cluster showing either of the following signs as a complicated one:

  1. it hosts multiple applications belonging to different teams
  2. it mixes non-production and production environments

Note that application is another vague word in the microservices era. To follow the norm of the Kubernetes world, we treat a business-capability service as an application: the service plus the several resources that make it deployable. Sometimes a team is responsible for more than one service, but it should own highly related services to remain cohesive.[2]

A Kubernetes application contains a bunch of components
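
For instance, a minimal sketch of one such application, bundling a Deployment with the Service that exposes it (all names here are illustrative):

```yaml
# One "application": a Deployment plus its Service, grouped by a shared app label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  labels:
    app: payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.0.0  # illustrative image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payments
spec:
  selector:
    app: payments
  ports:
    - port: 80
      targetPort: 8080
```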

Complicated clusters give people a sense of being out of control, and beyond that, they have some apparent weaknesses compared with more, smaller ones.[3][4]

Cons of one large cluster

1. Poorer Availability

One large cluster means your whole system relies on a single point. If that cluster breaks, all your applications are down. Any mistake in network configuration, global plugins, or a Kubernetes upgrade can cause an unexpected cluster-wide outage.

It is not wise to let the blast radius cover all of your assets. A naive but useful idea for improving availability is to divide your resources into separate spaces.

2. Lower Scalability

Unfortunately, limited by the shared control plane, Kubernetes clusters cannot scale infinitely. The upper limit is 5,000 nodes in theory, but as reported, you are likely to run into trouble with even 500 nodes.[5][6]

Cloud resource quotas are also worth considering. Cloud providers do set quotas on many types of resources, such as compute instances, CPUs, storage volumes, and a bunch of network resources.[7] These quotas can block your cluster from claiming more resources for scaling.

3. Fewer Deployments

Another consideration is that all application owners have to follow the same maintenance lifecycle, which is likely to disrupt some teams’ tempo.

Moreover, an overloaded central cluster-management team is likely to bottleneck value delivery. In contrast to self-service cloud providers, a large cluster requires a dedicated team to maintain it and provide services to other groups. This centralized dependency can slow down development and limit teams’ ability to evolve their systems.[8]

4. Worse Flexibility

It is common for an application to require specific customization. Some compute-intensive applications need particular worker nodes with GPUs, some rely on a specific CNI plugin, and some depend on a certain service mesh that must be configured globally.[9] A team with demands like these can suffer if the cluster is too large and shared; steering a workload onto dedicated GPU nodes, for example, looks like the sketch below.
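
A minimal sketch, assuming a GKE-style accelerator label and the NVIDIA device plugin (both provider-specific assumptions; the image name is illustrative):

```yaml
# Pin a pod to a GPU node pool and request one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4  # GKE-style label; other clouds use different keys
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0.0        # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1                            # requires the NVIDIA device plugin on the node
```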

Powerful cloud services are a considerable part of designing a modern application with the help of cloud providers. One large cluster limits the flexibility to integrate desired cloud services, which have the potential to simplify development, boost performance, promote security, and reduce cost.

5. Weaker Security

Last but not least, we must be aware that the Kubernetes namespace is a soft multi-tenancy design. Many components are shared across namespaces.[10] For example, one tenant can see the services created by another tenant because of the shared cluster DNS. Although this sharing shouldn’t be an issue in most cases, it can increase the severity of a system breach.
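
Namespaces don’t isolate network traffic by default either; blocking cross-namespace traffic takes an explicit NetworkPolicy, as in this minimal sketch (the namespace name is illustrative):

```yaml
# Deny ingress from other namespaces: only pods inside team-a
# may reach pods inside team-a.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: team-a
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # matches only pods in this same namespace
```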

Nor do we want all members to have access to the production environment. A compromised account can have severe consequences in a cluster that contains both the prod and non-prod environments of an application. Ideally, operating on the production cluster directly should be rare, so given the soft tenancy, we shouldn’t put prod and non-prod in the same cluster.

How many clusters should you have?

It’s obvious we should divide our large clusters into many small ones, but what’s the guiding principle?

To reduce a team’s dependencies and make the infrastructure self-service, I recommend sparing at least two clusters for each application team — one for non-production environments and the other for production.

Teams responsible for multiple related applications can decide for themselves whether to divide their two clusters into more pieces.

One attractive fact is that some cloud platforms don’t charge you for the master nodes.[11] So we need not worry much about the cost increase of employing more, smaller clusters.

Though this approach avoids the weaknesses that come with large clusters, it also gives up some of their benefits:

  1. Quick start — it’s not easy to start a cluster with perfectly tuned configurations for an application team.
  2. Easy administration — distributed clusters are much more challenging to monitor and govern.
  3. Low skill requirement — working inside Kubernetes is easier than dealing with the infrastructure underneath it.

To ease the pain of losing the strengths above, take a look at these three practices. They were not designed for distributed Kubernetes clusters, but they turn out to be surprisingly suitable supplements.

Infrastructure starter kits

A starter kit is a set of boilerplates that application teams use to provision their infrastructure, including their dedicated clusters. Starter kits save teams from writing piles of configuration to satisfy compliance requirements and help them kick their projects off quickly.

A starter kit is very likely to contain several technology stacks. For example, Terraform can be employed to set up infrastructure, while Tekton, Jenkins, or other pipeline services can be customized to provide standard pipelines as a CI/CD baseline, as in the sketch below. Don’t forget to integrate code-scanning tools, since a starter kit should guide teams toward good practices.
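
Here is a minimal sketch of such a baseline as a Tekton pipeline. The git-clone task exists in the Tekton catalog; code-scan and build-image stand in for team-provided tasks and are hypothetical:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: baseline-pipeline
spec:
  params:
    - name: repo-url
      type: string
  workspaces:
    - name: source
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone            # Tekton catalog task
      params:
        - name: url
          value: $(params.repo-url)
      workspaces:
        - name: output
          workspace: source
    - name: scan-code
      runAfter: ["fetch-source"]
      taskRef:
        name: code-scan            # hypothetical task wrapping a scanner
      workspaces:
        - name: source
          workspace: source
    - name: build-and-push
      runAfter: ["scan-code"]
      taskRef:
        name: build-image          # hypothetical build-and-push task
      workspaces:
        - name: source
          workspace: source
```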

Compliance policies

Another challenge is keeping the clusters compliant throughout the application lifecycle. Beyond the clusters themselves, all other processes must also comply with regulations, including organization-specific internal requirements, industry-wide rules, and governmental mandates.[8] Instead of coupling tightly with application teams and providing dedicated services, the support team should implement policies that ensure all clusters meet the requirements, e.g., being upgraded to a specific version.

With the help of policy tools like OPA and Gatekeeper, we are able to keep tracking the infrastructure status. They are also helpful for auditing and for blocking disallowed operations, as the sketch below shows. These policy-automation tools make the process more of a self-service, which means fewer team dependencies, more internal cohesion, and more decentralized governance.
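
For instance, a minimal Gatekeeper sketch, adapted from the project’s canonical required-labels example, that requires every namespace to carry an owner label (the constraint name and label key are illustrative):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner       # illustrative constraint
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
```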

Technology communities

Devolving clusters to teams also raises the bar for team capabilities. Although starter kits and policies provide a good start and continuous guidance, some hands-on skills still pay off. A team that masters configuring infrastructure for its business scenarios can undoubtedly improve application reliability.

As is typical for autonomous cross-functional teams, professional skills are difficult to spread across team boundaries. It’s helpful to establish communities that accumulate and exchange knowledge between teams. To make a community really efficient, control its size, set aside enough time, and only accept people working on similar tech stacks. Deliberately rotating members within the larger group is another effective way to gain breadth of expertise.[12]

Conclusion

Like a code smell, operating a large cluster gives us a sense that something is wrong. Comparing one large cluster with many smaller ones shows that the large one has disadvantages in availability, scalability, delivery efficiency, flexibility, and security.

To avoid the disadvantages of managing one large central cluster, we should adopt a strategy of distributing applications across different clusters.

At the same time, however, this comes with higher complexity and greater demands on product teams. To mitigate the inconvenience, starter kits, compliance policies, and technology communities are worth a try.

Reference

[1] Kubernetes Components | Kubernetes
https://kubernetes.io/docs/concepts/overview/components/
[2] BusinessCapabilityCentric
https://martinfowler.com/bliki/BusinessCapabilityCentric.html
[3] Kubernetes: One Cluster or Many?
https://tanzu.vmware.com/content/blog/kubernetes-one-cluster-or-many
[4] Architecting Kubernetes clusters — how many should you have?
https://learnk8s.io/how-many-clusters
[5] Considerations for large clusters | Kubernetes
https://kubernetes.io/docs/setup/best-practices/cluster-large/
[6] Not one size fits all, how to size Kubernetes clusters
https://events19.lfasiallc.com/wp-content/uploads/2017/11/BoF_-Not-One-Size-Fits-All-How-to-Size-Kubernetes-Clusters_Guang-Ya-Liu-_-Sahdev-Zala.pdf
[7] Working with quotas | Documentation | Google Cloud
https://cloud.google.com/docs/quota
[8] Compliance in a DevOps Culture
https://martinfowler.com/articles/devops-compliance.html
[9] Installing Linkerd | Linkerd
https://linkerd.io/2.11/tasks/install/index.html
[10] Namespaces | Kubernetes
https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
[11] The Ultimate Kubernetes Cost Guide: AWS vs GCP vs Azure vs Digital Ocean
https://www.replex.io/blog/the-ultimate-kubernetes-cost-guide-aws-vs-gce-vs-azure-vs-digital-ocean
[12] Products Over Projects
https://martinfowler.com/articles/products-over-projects.html


“Those whose conduct rises above the ordinary are bound to be opposed by the world; those whose insight is theirs alone are sure to be scorned by the people.”