Make your Kubernetes cluster bulletproof.

Cluster resource management is always a tricky proposition. How can we explain the necessity of resource requests and limits in the workloads to the cluster user, who creates those workloads? Maybe we should automate it instead? Yes.

Dmitrii Evstiukhin

Published in

The Startup

7 min readOct 25, 2019

Problem description

Resource management is a very important topic in the context of Kubernetes cluster administration. But why is it so important if Kubernetes is doing all the heavy lifting for you? Because it’s not. Kubernetes gives you very convenient tools enabling you to solve a lot of problems… if you use those tools. For every single Pod in your cluster, you can specify resources, required by its containers. And Kubernetes will utilize this information in order to assign your application instances to cluster nodes. From my experience it appears that no one is taking resource management in Kubernetes seriously. It’s probably OK for a loosely loaded cluster with a couple of static apps, but not for a large, dynamic cluster .

But what if you have a very dynamic cluster? One where applications come and go all the time, where namespaces are being created and deleted all the time? A cluster with a lot of “end-users”, who can create their own namespaces and application deployments. In this case, instead of stable and predictable execution you’ll end up with a bunch of random malfunctions in your applications and sometimes even in Kubernetes itself!

Here is an example of such cluster:

You can see 3 Pods in the Terminating state. But it’s not because they’re supposed to be Terminating. They are stuck in this state because at some point in the past containerd daemon on their node was struck by something very resource-greedy. It might have been solved by proper out-of-resource handling, but that is not a topic of this article (see article on it here) and it is not a silver bullet either. The main reason for such an issue is improper or absent resource management in cluster.

Normally this kind of issue is probably not really a concern for deployments because they can easily create new, working Pods. Another, if a Pod becomes stuck in a Terminating state in DaemonSets or StatefulSets it could be fatal and require manual intervention to resolve the issue.

If you have a really huge cluster requiring a lot of CPU and memory, and you start heavy workloads on it without proper resource requests, there is a chance that all workloads will be placed on the same node and will struggle for resources even though the rest of the cluster stays almost free and ready to provide all the resources they need.

Also you can often witness less critical cases when some applications are affected by their neighbors. Even if those innocent applications had resource limits and requests specified — a rogue Pod can come and kill them. Here is an example of such scenario:

Your application has requests for 4 Gb of memory, but initially allocates only 1Gb
A rogue Pod without any resource configuration is assigned to the same node
A rogue Pod consumes all available memory
Your app tries to allocate more memory and crashes because there is no more available

Another quite popular case is overestimation of resources. Some developers might put huge requests in their manifests “just in case” and never use it. The result is a waste of computing resources.

Solution theory

Oh, what a horrible picture! Right?

Fortunately, Kubernetes offers a way to put some constraints on rogue Pods by specifying default, minimum and maximum resource limits and default requests. It’s implemented with LimitRange object. This is a very handy tool when you have a limited number of namespaces or full control over the namespace creation process. Even without proper resource configuration, your applications will be restricted in resource usage and innocent, properly configured Pods will be safe and protected from vicious rogue Pods. If somebody deploys a resource-greedy application without a declaration of how much resource it requires, it will get the defaults and probably will fall. But that’s it. This application won’t take anybody else down with it.

So we have a tool to enforce resources configuration for Pods and everything seems a bit more secure now, right? Not exactly. One of the traits of the “dynamic cluster” is that namespaces can be created by users, in which the LimitRange configuration might be omitted, because it has to be created in each Namespace deliberately. Ideally, we should have something not only on the namespace level but also on a Cluster level, but it seems we don’t have anything anything like that yet.

That’s why I decided to create my own solution to this problem. Let me introduce you to the Limit Operator. It’s an operator, built with an Operator SDK framework, which uses ClusterLimit custom resource and helps ensure that all innocent Pods in your cluster will be safe. With this operator, you can manage resource defaults and limits for all namespaces using the minimum amount of configuration. It also offers some granularity of configuration and enables you to choose where exactly to apply your limits.

So with this configuration, the operator will create LimitRange only in namespaces labeled as limit=limited. This could be useful in order to provide stricter restrictions on some particular sets of namespaces. If namespaceSelector is omitted, then the operator will apply provided LimitRange to all namespaces. If you want to configure LimitRange manually for some particular namespace, you can use annotation "limit.myafq.com/unlimited": true to tell the operator to skip this namespace and do not apply any LimitRanges automatically.

An example scenario of operator usage:

Create default ClusterLimit with quite liberal limits without namespace selector, and it will be applied everywhere.
For some set of namespaces with lightweight workloads, and create additional, more restrictive ClusterLimit with namespaceSelector, label all these namespaces accordingly.
On namespaces with very heavy workloads, place “unlimited” annotation and configure LimitRange manually with much wider limits than the default one.

Important note about multiple LimitRanges in one namespace:
When Pod is created in the Namespace with multiple LimitRanges, its Limits and Requests will be set from the widest possible default. But maximum and minimum values will be taken from the strictest LimitRange available.

Practice example.

The operator will track all changes in all Namespaces, ClusterLimits, child LimitRanges and will trigger reconciliation on any changes to any of them. Let’s try this out and see how it works.

To start with, let’s create Pod without any restrictions:

❯(⎈) kubectl run --generator=run-pod/v1 --image=bash bashpod/bash created
 
❯(⎈) kubectl get pod bash -o yamlapiVersion: v1
kind: Pod
metadata:
 labels:
   run: bash
 name: bash
 namespace: default
spec:
 containers:
 - image: bash
   name: bash
   resources: {}

Note: some output was omitted in order to simplify example.

As you can see — the “resources” field is empty, so this Pod can be assigned anywhere, as we discussed earlier.

Now we’ll create default cluster-wide LimitRange with liberal enough values first:

And also more restrictive limit for some subset of namespaces:

Then let’s create Namespaces with Pods in them to see how it works.

Regular namespace with only the default limit to apply:

And a bit more restricted namespace supposedly for lightweight workloads:

If you would check operator logs right after namespace creation, you would see the following:

As you can see each namespace has triggered the creation of new LimitRanges and the more restrictive namespace has two LimitRanges — the default one and the more restrictive one.

Now let’s try to create a couple of Pods in these namespaces and see how it works:

❯(⎈) kubectl run --generator=run-pod/v1 --image=bash bash -n regularpod/bash created
 
❯(⎈) kubectl get pod bash -o yaml -n regularapiVersion: v1
kind: Pod
metadata:
 annotations:
   kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu, memory request for container
     bash; cpu, memory limit for container bash'
 labels:
   run: bash
 name: bash
 namespace: regular
spec:
 containers:
 - image: bash
   name: bash
   resources:
     limits:
       cpu: 700m
       memory: 900Mi
     requests:
       cpu: 500m
       memory: 512Mi

Although we haven’t changed our way of Pod creation, it still filled the resources field. Also you might have noticed that an annotation was automatically created by LimitRanger.

Now let’s create a Pod in a lightweight Namespace:

❯(⎈) kubectl run --generator=run-pod/v1 --image=bash bash -n lightweightpod/bash created
 
❯(⎈) kubectl get pods -n lightweight bash -o yamlapiVersion: v1
kind: Pod
metadata:
 annotations:
   kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu, memory request for container
     bash; cpu, memory limit for container bash'
 labels:
   run: bash
 name: bash
 namespace: lightweight
spec:
 containers:
 - image: bash
   name: bash
   resources:
     limits:
       cpu: 700m
       memory: 900Mi
     requests:
       cpu: 500m
       memory: 512Mi

Notice that the resources in the Pod are still the same as in the previous example. This is because, as mentioned earlier — in the case of multiple LimitRanges, a less restrictive default will take place during Pod creation. But why would we need a more restrictive LimitRange then? We might want it because of “max” limit restriction, which will be used from more restrictive LimitRange. Let’s try to make our restrictive ClusterLimit even more restrictive:

Notice the “max” section in “Container” limit type:

- type: Container
  max:
   cpu: "200m"
   memory: "250Mi"

We now set it to 200m CPU and 250Mi memory. And let’s try to create Pod again:

❯(⎈) kubectl run --generator=run-pod/v1 --image=bash bash -n lightweightError from server (Forbidden): pods "bash" is forbidden: [maximum cpu usage per Container is 200m, but limit is 700m., maximum memory usage per Container is 250Mi, but limit is 900Mi.]

As we can see our Pod has tried to take wide defaults and was refused due to the restrictive Limits.

Phew!.. That was an example of ClusterLimit usage. You can try it out yourself and play with ClusterLimits on your local Kubernetes installation.

Check Limit Operator’s GitHub repo for manifests or for source code. If you think some functionality is missing — Pull Requests and Feature Requests are welcome!

Conclusion.

To summarize everything in a couple of points:

Resource management in Kubernetes is crucial for your applications stability and reliability
Configure your workloads’ resources whenever it’s possible
Enforce it with LimitRange
Automate LimitRange creation with Limit Operator

Follow these tips and your cluster won’t ever get a fatal shot from a rogue Pod.