Capacity Planning — Kubernetes Cluster Deployment
Most organisations these days are moving towards digitalisation and adopting open-source containerisation platforms to cater to those needs. The current options we have for containerisation platforms include:
Kubernetes, OpenShift, HashiCorp Nomad, Docker Swarm, Rancher, Mesos, Google Kubernetes Engine (GKE), Google Cloud Run, AWS Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), AWS Fargate, Azure AKS Service, Azure Managed OpenShift Service, Azure Container Instances, DigitalOcean Kubernetes Service, Red Hat OpenShift Online, Linode Kubernetes Engine
Refer to the link here for a more in-depth comparison of each. In this article my intention is to focus on Kubernetes cluster capacity planning and the things to consider when deploying a Kubernetes cluster. When we start to deploy a cluster, it is better not to just deploy it and move on; the goal is to make sure the deployment is stable and can handle more load and more deployments as future use cases arrive.
1. Understanding the Kubernetes architecture and the main components interaction
Before starting with anything, it is better to understand the architecture and how each component interacts. Only then, if there is an issue in the deployment, can we fix it or easily find the root cause. The diagram below illustrates a high-level view of the architecture and the flow; this level of understanding is typically enough to start the deployment.
2. Choosing the Deployment Script and key considerations
When choosing the deployment script, there are officially two kinds of options available according to kubernetes.io.
In my opinion, it is better to use kubeadm and prepare Ansible scripts for your own needs, following the steps provided in the Kubernetes documentation. That way you have full control of the deployment and can understand exactly what you are deploying. When preparing the scripts, make sure you handle the scenarios below.
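As a minimal sketch of the kubeadm-plus-Ansible approach, a task like the following could initialise the first control-plane node. The control-plane endpoint, pod network CIDR, and file layout are assumptions you would adapt to your environment:

```yaml
# roles/kubernetes/tasks/init-control-plane.yml (hypothetical layout)
- name: Initialise the first control-plane node with kubeadm
  command: >
    kubeadm init
    --control-plane-endpoint "k8s-api.example.local:6443"
    --upload-certs
    --pod-network-cidr "10.244.0.0/16"
  args:
    # admin.conf only exists after a successful init, so this
    # makes the task safe to re-run (idempotent).
    creates: /etc/kubernetes/admin.conf
```

The `creates` guard is what lets the same playbook double as a re-runnable deployment script rather than a one-shot command.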
- The script should support multiple topology options, such as changing the number of nodes for each component at install time.
- A script to uninstall/reset the cluster.
- The script should renew the certificates: by default kubeadm generates certificates valid for one year only, so your script should renew them whenever you re-run it.
- The script should be able to add and remove worker nodes from the cluster.
- Make sure the Docker data directory points to your /application directory, which is created with a large amount of disk space. Otherwise, by default, the Docker data directory points to /var, and in future you may run into disk space issues or need to grow the /var directory.
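For the data-directory point, one way (assuming Docker is the container runtime, and `/application/docker` is your hypothetical target path) is to set `data-root` in `/etc/docker/daemon.json`:

```json
{
  "data-root": "/application/docker"
}
```

Restart the Docker daemon after changing this. For the certificate point, recent kubeadm versions ship a helper: `kubeadm certs renew all` renews all control-plane certificates, which your script can invoke on each run.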
3. Selecting the Deployment Pattern
Kubernetes provides two options for a highly available cluster deployment:
- With stacked control plane nodes, where etcd nodes are colocated with control plane nodes
- With external etcd nodes, where etcd runs on separate nodes from the control plane
The choice depends on your actual requirements. If you need a highly available environment with HA at every level of the main Kubernetes components, go with the external etcd option. This topology decouples the control plane and the etcd members, so losing a control plane instance or an etcd member has less impact and does not affect cluster redundancy as much as the stacked HA topology.
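With the external etcd topology, kubeadm is pointed at the separate etcd cluster through its configuration file. A sketch of the relevant fragment, where the endpoint addresses and certificate paths are placeholders for your own environment:

```yaml
# kubeadm-config.yaml — external etcd topology (sketch)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "k8s-api.example.local:6443"
etcd:
  external:
    endpoints:
      - https://10.0.0.11:2379
      - https://10.0.0.12:2379
      - https://10.0.0.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

In the stacked topology this `etcd.external` section is simply omitted and kubeadm runs a local etcd member on each control-plane node.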
4. Analysing the Requirements and Planning the cluster hardware requirements
When we are at the start of the implementation phase it is difficult to decide how many microservices will be deployed to the Kubernetes cluster, and this is always a blocker or pain point when providing a hardware specification for infrastructure provisioning. Below is a template I came up with to avoid this blocker at the start of a project.
Based on this, fill in the services that you are going to deploy. Initially, start with the common services needed to support the use cases, then add a line for the use-case-related services with an estimated number of services, keeping provision for the future. That gives a rough estimate of the microservices that may be developed later.
Then, based on the total memory and CPU limits, decide the RAM and CPU of each worker machine. For the control plane and etcd nodes we can go with the defined values, but the worker nodes will change based on the number of use cases you are going to develop, which is why we need to provision for the future.
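The sizing above can be sketched as a small calculation. The service list, per-pod limits, and replica counts below are made-up example figures, and the headroom factor and per-node capacity are assumptions you would replace with your own template values:

```python
# Rough worker-node sizing from estimated per-service resource limits.
# All figures below are illustrative placeholders, not recommendations.
from math import ceil

# (service, cpu_limit_cores, memory_limit_gib, replicas) — example estimates
services = [
    ("api-gateway",     0.5, 1.0, 2),
    ("auth-service",    0.5, 1.0, 2),
    ("order-service",   1.0, 2.0, 3),
    ("future-usecases", 1.0, 2.0, 4),  # provision for future growth
]

total_cpu = sum(cpu * n for _, cpu, _, n in services)
total_mem = sum(mem * n for _, _, mem, n in services)

HEADROOM = 1.3              # ~30% spare for spikes, daemonsets, system pods
NODE_CPU, NODE_MEM = 4, 16  # assumed per-worker capacity: 4 cores, 16 GiB

nodes_by_cpu = ceil(total_cpu * HEADROOM / NODE_CPU)
nodes_by_mem = ceil(total_mem * HEADROOM / NODE_MEM)
workers = max(nodes_by_cpu, nodes_by_mem, 3)  # never fewer than 3 workers

print(f"total: {total_cpu} cores, {total_mem} GiB -> {workers} workers")
```

With these example numbers the totals come to 9 cores and 18 GiB, landing on 3 workers; swapping in your own service list and node sizes turns the template into a repeatable estimate.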
Here, in a typical highly available system, make sure you keep [x] = 3.
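The reason [x] = 3 is the minimum comes down to etcd quorum: an n-member etcd cluster needs a majority (floor(n/2) + 1) of members alive, so it tolerates n minus that quorum in failures. A quick sketch:

```python
# etcd fault tolerance: an n-member cluster needs floor(n/2) + 1
# members alive to keep quorum, so it survives n - quorum failures.
def tolerated_failures(n: int) -> int:
    quorum = n // 2 + 1
    return n - quorum

for n in (1, 2, 3, 5):
    print(f"{n} members -> tolerates {tolerated_failures(n)} failure(s)")
```

One or two members tolerate zero failures, three tolerate one, and five tolerate two, which is why three is the smallest member count that gives any redundancy at all.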
When it comes to the worker nodes, it depends on the analysis we did earlier. The point to note is that it is better to split the load across multiple workers: rather than scaling vertically, it is recommended to scale horizontally, so that even when one server is down, the others can take over and the pods can be scheduled onto them. A typical template for the worker-node hardware specification can look like the one below:
Here, make sure that [X] is at least 3; following the convention of odd numbers, you can go with 3, 5, or 7 worker nodes depending on the stats you gathered earlier.
5. Monitoring the Cluster
Here also, multiple options are available:
- Kubernetes itself has a dashboard that you can use for monitoring the cluster.
- kubectl, the command-line tool.
- There is also an open-source tool called Lens (https://k8slens.dev/), which is useful when managing a Kubernetes cluster.