Capacity Management in Kubernetes

Ewere Diagboya · Published in MyCloudSeries · Nov 12, 2019



Hey guys, it's been a while. I have been around, soaking in new stuff that I am excited to share. It has been three years of using Kubernetes for testing and roughly a year of running Kubernetes on production systems, and I have come away with some salient lessons I would like to share.

As we all know, Kubernetes is the buzz of the internet, and this makes it the hottest topic right now. Sysadmins, DevOps experts, and C-level executives are all excited about the promise of Kubernetes. But as with every tool and system, the saying holds: "When the purpose of a thing is not known, abuse is inevitable." Kubernetes promises a lot of things that all tie back to high availability. In the end, though, no matter how highly a tool is talked about and no matter how many articles you read, the architecture is always the key to bringing out the best in it.

In this piece, I will talk about how to plan capacity in your Kubernetes cluster and how it affects the availability and scalability of your system. In the on-prem days, we were taught to always over-estimate capacity so that during spikes new resources could be easily provisioned and attached to existing ones. Now, in the days of the cloud, principles like the AWS Well-Architected Framework teach us to stop guessing capacity: provision capacity on demand and tear it down when it is not in use.

Both of these schools of thought are correct, depending on the environment and the circumstances. In Kubernetes, there are three major resources that determine the availability of your cluster and application (a quick way to inspect all three per node is sketched after this list):

  1. Node CPU Core
  2. Node Memory
  3. Node Pod Capacity
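To see where a given node stands on these three resources, you can read its allocatable values from the API. Below is a minimal sketch, assuming the official Kubernetes Python client (`kubernetes` on PyPI) and a kubeconfig that already points at your cluster:

```python
# pip install kubernetes
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes kubectl already works).
config.load_kube_config()
v1 = client.CoreV1Api()

# "allocatable" is what the kubelet actually offers to pods,
# after system reservations are subtracted from raw capacity.
for node in v1.list_node().items:
    alloc = node.status.allocatable
    print(f"{node.metadata.name}: "
          f"cpu={alloc['cpu']}, memory={alloc['memory']}, pods={alloc['pods']}")
```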

Node Pod Capacity is our focus here. The number of pods a node can take is determined by different factors; we will focus on the factors around EC2 instances on AWS. There is a formula that can be used to calculate the number of pods a node can take. The formula goes like this:

Max Pods = (maximum network interfaces × (IPv4 addresses per interface − 1)) + 2

This is the formula the AWS VPC CNI plugin uses to generate its per-instance pod limits.

An example: say you have a t3.large instance that you want to use as a node in your cluster, and you need to calculate the maximum number of pods it can accommodate. Parameters such as the maximum supported network interfaces per instance type and the IPv4 addresses per interface can be found in the AWS documentation on elastic network interfaces. Since we will be using that document, we will reference those parameters and rename them.

Let:
Maximum supported network interfaces per instance = e
IPv4 Addresses per Interface = i

From that document, a t3.large supports e = 3 network interfaces and i = 12 IPv4 addresses per interface.

Max Pods = (e × (i − 1)) + 2

Max Pods = (3 × (12 − 1)) + 2 = 35

One IPv4 address on each interface is reserved as the node's own primary address (hence i − 1), and the +2 accounts for the pods that run in host networking mode, such as the VPC CNI plugin and kube-proxy. This means the instance can only take a maximum of 35 pods.
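As a quick sanity check, here is the same arithmetic in code. The e and i values below are copied from the AWS ENI table for a few t3 sizes and should be verified against the current documentation before relying on them:

```python
# Max pods an EC2 node can take under the AWS VPC CNI:
# each interface keeps one IP for itself, and 2 pods run in host networking mode.
def max_pods(enis: int, ips_per_eni: int) -> int:
    return enis * (ips_per_eni - 1) + 2

# (e, i) values taken from the AWS ENI documentation.
instances = {
    "t3.medium": (3, 6),   # -> 17
    "t3.large":  (3, 12),  # -> 35
    "t3.xlarge": (4, 15),  # -> 58
}
for name, (e, i) in instances.items():
    print(f"{name}: max pods = {max_pods(e, i)}")
```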

The danger of not knowing this is that you cannot plan how many pods you should have in your cluster. If you do not plan the number of pods in your cluster, you stand a chance of experiencing scheduling issues when pod capacity fills up and your cluster is not able to schedule more pods. It will look like this:

EKS Cluster Visualized using Rancher

If you look at the figure above, you can see that the number of scheduled pods has filled up the number of available pods. This has made the cluster unstable, hence the nodes showing red. This is made visible with a third-party tool called Rancher, which helps us visualize the cluster.

The strange part of these metrics is that CPU and memory are not even at 30% usage, yet the pods are 100% used up. It means that as you plan CPU and memory capacity, pod capacity is always key.
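When a node's pod slots fill up before CPU or memory does, new pods simply sit in Pending. A quick way to catch this is to list Pending pods along with the scheduler's reason; a minimal sketch, again assuming the Kubernetes Python client:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Pods stuck in Pending are often the first symptom of exhausted pod capacity.
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    # The PodScheduled condition carries the scheduler's explanation,
    # e.g. "Too many pods" when node pod capacity is the bottleneck.
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.status == "False":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {cond.message}")
```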

Recommendations

We have seen the danger involved if we do not plan our cluster properly. If we do not know the number of pods a node can take and plan appropriately, we risk catastrophic failure of pods and services. From the standpoint of high availability and effective failover, these are my recommendations on pod capacity planning:

  1. Always know the number of pods a single node can take at any time, using the formula and steps shared above.
  2. Every microservice should run with at least 2 replicas (via a Deployment or ReplicaSet) for high availability.
  3. Overestimate your pod capacity: provision double what your pods currently consume. This gives pods room to be rescheduled any time there is a failure.
  4. Always audit your pod capacity against scheduled pods; a sketch of such an audit follows the example below.

For instance, if the pod capacity based on the number of nodes in a cluster is 70 pods (pod capacity for a cluster is the number of pods the cluster can accommodate based on the capacity of each node in the cluster), and the number of pods scheduled is 50, then we have a surplus of only 20 pods for services to recover with. A better position would be a surplus of 35, which gives more room for pods to be rescheduled in the event of failure or when pods need to be rescheduled for high availability.
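Such an audit is easy to automate. The sketch below, once more assuming the Kubernetes Python client, sums allocatable pod slots across nodes, counts the pods currently occupying slots, and warns when the surplus drops below 50% of capacity (the 70-versus-35 ratio from the example above):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Total pod slots the cluster offers (sum of each node's allocatable pods).
capacity = sum(int(n.status.allocatable["pods"]) for n in v1.list_node().items)

# Pods that currently occupy slots; Succeeded/Failed pods no longer count.
pods = v1.list_pod_for_all_namespaces(
    field_selector="status.phase!=Succeeded,status.phase!=Failed")
scheduled = len(pods.items)

surplus = capacity - scheduled
print(f"capacity={capacity}, scheduled={scheduled}, surplus={surplus}")
if surplus < capacity * 0.5:
    print("WARNING: less than 50% headroom for rescheduling and failover")
```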

Proper auditing and monitoring of your cluster is crucial to maintaining high availability. Designing your system to recover from failure should be part of your architecture from the start, not an afterthought. Kubernetes comes with failover, yes, but you need to understand it and architect around it to reap the true benefits of that and all the other beautiful features of Kubernetes.

MyCloudSeries is a training and consulting firm with expertise in Cloud Computing and DevOps. We assist organizations in their DevOps strategies, transformation, and implementation. We also provide Cloud Computing support. Contact us at www.mycloudseries.com.
