<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Directeam - Medium]]></title>
        <description><![CDATA[We help you own the cloud - Medium]]></description>
        <link>https://medium.com/directeam?source=rss----eb217c281961---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Directeam - Medium</title>
            <link>https://medium.com/directeam?source=rss----eb217c281961---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 11 May 2026 16:51:25 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/directeam" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[AWS RDS Storage types — Which one should I choose?]]></title>
            <link>https://medium.com/directeam/aws-rds-storage-types-which-one-should-i-choose-943bc89c4f2f?source=rss----eb217c281961---4</link>
            <guid isPermaLink="false">https://medium.com/p/943bc89c4f2f</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[finops]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[cloud-cost-optimization]]></category>
            <category><![CDATA[rds]]></category>
            <dc:creator><![CDATA[Shir Monether]]></dc:creator>
            <pubDate>Mon, 22 May 2023 12:11:21 GMT</pubDate>
            <atom:updated>2023-06-08T12:28:35.649Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TvQzKOkk-KhhRrHahU5tCQ.png" /></figure><h3>AWS RDS Storage types — Which one should I choose?</h3><p>Amazon RDS (Relational Database Service) is a cloud-based web service that makes it easier to set up, operate, and scale a relational database. RDS is designed to provide an efficient and cost-effective solution for managing databases. One of the crucial components of RDS is storage. Amazon RDS provides three recommended storage options for database instances: GP2, GP3 &amp; IO1 (Magnetic is kept only for backwards compatibility).</p><p>In this blog post, we will explore the differences between the storage types and when to use each one. But before we dive in, you’ll need to understand the main differences between GP2 and GP3 volumes.</p><p><a href="https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/">Here</a> is an AWS blog about migrating EBS volumes, with a great comparison table.</p><p>TL;DR: GP2 IOPS and throughput grow with volume size, while GP3 requires you to configure the amount of IOPS and throughput you need.</p><h3>Traditional EBS vs RDS storage</h3><p>RDS storage is backed by AWS EBS, but it has a few tricks that do not come out of the box with traditional EBS; for example, an EBS volume can be up to 16TiB, while RDS storage can max out at 64TiB.</p><p>This also results in different maximums for the performance metrics:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zhUlDP6U0LklGD-ayffJOw.png" /></figure><p><em>Disclaimer: These maximums apply to all database engines other than SQL Server, as it doesn’t support disk striping</em></p><p>This is mainly due to AWS performing some magic behind the scenes that makes use of multiple EBS volumes &amp; volume striping.</p><p>Another big consideration is price. With traditional EBS, GP3 is generally 20% cheaper than GP2. 
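</p><p>The TL;DR above can be sketched numerically. This is a rough, illustrative model using published EBS figures (GP2 earns 3 IOPS per provisioned GiB with a floor of 100 and a cap of 16,000, while GP3 starts at a flat 3,000-IOPS baseline); treat it as a sketch, not a pricing or sizing reference:</p>

```python
# Illustrative EBS baseline model (assumed figures, see note above):
# gp2 scales IOPS with volume size; gp3 has a flat baseline regardless of size.
def gp2_baseline_iops(size_gib: int) -> int:
    return min(max(100, 3 * size_gib), 16_000)

GP3_BASELINE_IOPS = 3_000

for size_gib in (100, 500, 1_000, 2_000, 6_000):
    gp2 = gp2_baseline_iops(size_gib)
    winner = "gp3" if GP3_BASELINE_IOPS > gp2 else "gp2"
    print(f"{size_gib:>5} GiB: gp2 baseline {gp2:>6} IOPS ({winner} wins on baseline)")
```

<p>Below 1TB, GP3’s flat baseline wins; above it, GP2’s size-scaled baseline pulls ahead, which is the trade-off the graphs below illustrate.</p><p>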
However, how does this compare with RDS storage?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*guD3y2y4HdNiDiID1hBAEA.png" /></figure><p>They are the same price!</p><p>What’s more, with GP3 you will need to pay extra for each additional unit of Throughput/IOPS over the baseline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*zt9ffT9cd4oFoJl3.jpg" /></figure><h3>Then when should I use GP3?</h3><p>The obvious answer would be when I require a small amount of storage but a large amount of IOPS/Throughput. But we’re not here for the obvious answers; let’s geek out with some graphs!</p><p>Here are a few graphs depicting the difference between GP2 &amp; GP3 regarding the amount of IOPS, Throughput &amp; price (when equalizing the GP3 IOPS &amp; Throughput configuration to what is provided with GP2).</p><p><em>These charts go up to 25TB of storage since after 21TB everything flat-lines (or, in the case of price, the lines continue in parallel with each other).</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-lCUYRj5a72JCck4go4cgA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a3py_HigEN5R7MqOnm5Fyw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vegjh-DSw8ZkZmrzxCrV9w.png" /></figure><p><em>Disclaimer: Notice the gray areas on the IOPS and Throughput graphs? These are in a bit of a gray area (see what I did there?). They indicate where GP2 is burstable: for the IOPS graph, the burst can be up to 3,000, and for the Throughput graph, between 170–334GB of storage the burst can be up to 250MiB/s, while between 1,000GB and 1,336GB it’s burstable up to 1,000MiB/s.</em></p><p>So in the IOPS and Throughput graphs, whenever GP2 is higher than GP3, we have to pay more to get the same performance! 
And the price delta grows the larger the storage gets!</p><p>Hold on, with all these technicalities between GP2 and GP3 storage I’ve forgotten the third (and most mind-blowing) storage type: IO1.</p><p>IO1 is a storage type for workloads that require huge amounts of IOPS, allowing you to go up to 256,000 IOPS! But how does its pricing compare with GP2 &amp; GP3?</p><figure><img alt="us-east-1 pricing" src="https://cdn-images-1.medium.com/max/946/1*hDx1Q0WGoS2vvl1Z7TJKpg.png" /><figcaption>us-east-1 pricing</figcaption></figure><p>Well, some bad news for you if you’ve been using IO1 at under 64,000 IOPS: it looks like you’re wasting money!</p><p>Comparing the pricing with GP3, you can see that you get the same performance at a much lower price (even without taking into consideration the free baseline IOPS you get from GP3). The main use-case where IO1 makes sense over GP3 is when you’re restricted by GP3’s IOPS limit (64,000).</p><p><em>Disclaimer: There is another metric that slightly differs between GP2/3 and IO1: latency. While all of them provide single-digit latency, for GP2/3 it holds 99% of the time, while for IO1 it’s 99.9% of the time. This may be relevant if you have tight SLO/SLA requirements, but keep in mind the trade-off of potentially thousands of dollars a month. 
If you want to dive deeper at a more granular resolution, Percona wrote </em><a href="https://www.percona.com/blog/performance-of-various-ebs-storage-types-in-aws/"><em>this</em></a><em> great article benchmarking the performance of each storage type.</em></p><h3>So how do I decide which storage type fits my needs?</h3><p>The answer to this question is the good old: it’s complicated.</p><p>There are a few rules of thumb we can go by:</p><ul><li>If you require more IOPS than what GP2 offers at the storage size you require, you’re better off with GP3.</li><li>If you don’t know your IOPS and Throughput requirements, use GP3 when your storage is lower than 4TB, and GP2 when your storage is over 4TB (this makes sure you get larger baselines).</li><li>If you know how much Storage, IOPS and Throughput you require, you can calculate which one would be cheaper, or (in the case of smaller storage sizes) which provides larger IOPS &amp; Throughput headroom at the same price (GP3 baselines).</li><li>If you don’t need over 64,000 IOPS, don’t use IO1.</li></ul><h3>Awesome, so I’m all set?</h3><p>Well, it would be great if this was the end of the story, right?</p><p>But I’m afraid there is another parameter we have to take into consideration when configuring these metrics, one which can very well be a bottleneck: the instance type.</p><p>In AWS, each instance type has its own IOPS and Throughput limits for the instance itself (you can view the whole table <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html">here</a>). This means that we can hit an IOPS or Throughput bottleneck if we aren’t using a large enough instance.</p><p>For example, let’s take the m6g.large instance, which has a baseline Throughput of 78.75 MB/s and baseline IOPS of 3,600 (yes, these are also burstable, but only for 30 minutes in a 24-hour period, so I wouldn’t count on it). 
If I use m6g.large for my RDS instance, I won’t be able to take advantage of any storage configuration that provides more than the instance’s constraints.</p><p>This is quite rare, as you’ll usually need a larger instance to meet CPU and memory requirements at this scale before you hit the IOPS and Throughput limits. But if you find yourself unable to fully utilize your storage configuration, you may want to keep this in mind.</p><p><strong>If you want to learn more about cost optimization strategies and configurations, please do not hesitate to get in touch with us to discover how we can help you own the cloud </strong>— <a href="https://directeam.io/contact/">https://directeam.io/contact/</a></p><hr><p><a href="https://medium.com/directeam/aws-rds-storage-types-which-one-should-i-choose-943bc89c4f2f">AWS RDS Storage types — Which one should I choose?</a> was originally published in <a href="https://medium.com/directeam">Directeam</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kubernetes resources under the hood — Part 3]]></title>
            <link>https://medium.com/directeam/kubernetes-resources-under-the-hood-part-3-6ee7d6015965?source=rss----eb217c281961---4</link>
            <guid isPermaLink="false">https://medium.com/p/6ee7d6015965</guid>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[docker]]></category>
            <dc:creator><![CDATA[Shon Lev-Ran]]></dc:creator>
            <pubDate>Wed, 07 Sep 2022 08:58:02 GMT</pubDate>
            <atom:updated>2022-09-07T08:58:02.518Z</atom:updated>
<content:encoded><![CDATA[<h3>Kubernetes resources under the hood — Part 3</h3><h4>Kubernetes resources, breaking the limits! Understand the biggest Kubernetes misunderstanding and why you should remove your CPU limits and unleash your cluster&#39;s full potential</h4><p>Co-authored by @<a href="https://medium.com/@shirmon">shirmon</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7-WhDU0x_Z6IK39JLzc2Tw.png" /></figure><h4>What have we learned so far?</h4><p>So far I’ve mostly explained things you probably know in one way or another: concepts like allocatable resources &amp; Quality of Service (QoS) in the <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96">first part</a>, and CPU requests and shares in the <a href="https://shonlevran.medium.com/kubernetes-resources-under-the-hood-part-2-6eeb50197c44">second part</a>.</p><p>In this part, I will focus on what happens when you set CPU limits.</p><blockquote>TL;DR: don’t set CPU limits!</blockquote><p>I highly recommend you read the previous parts of this blog to get a solid understanding of what I’m about to explain. Once you’re done, if you did the math correctly, you should start to understand that CPU limits are not the way to achieve a fair division of CPU time between containers.</p><p>CPU time is divided among the containers by their <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-2-6eeb50197c44#:~:text=to%20CPU%20requests%3F-,CPU%20Shares,-When%20you%20configure">CPU shares</a> (or requests).</p><h3>The CPU Limits myth</h3><p>As I’ve said, when getting into Kubernetes we are advised to set CPU limits to make sure we aren’t being ‘noisy neighbors’, meaning that if our workloads start to be CPU hungry they won’t eat up all of the CPU that our other workloads need. 
But I’m afraid this is a complete myth since, as I’ve explained in part 2 — CPU requests <a href="https://shonlevran.medium.com/kubernetes-resources-under-the-hood-part-2-6eeb50197c44#:~:text=Don%E2%80%99t%20worry%20when%20setting%20high%20CPU%20requests%2C%20the%20node%E2%80%99s%20components%20are%20higher%20priority%20out%20of%20the%20box.">guarantee your workload to receive at least this amount of CPU from the CFS</a>.</p><p>What about overloading the node?</p><p>CPU is a <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96#:~:text=There%20is%20no%20eviction%20for%20compressible%20resources!">compressible</a> resource, and this is why <strong>there is </strong><a href="https://medium.com/p/6eeb50197c44#:~:text=Eviction%20is%20a%20process%20running%20on%20the%20node%20that%20chooses%20and%20kills%20pods%20when%20the%20node%20is%20low%20on%20resources.%20Eviction%20only%20happens%20for%20in%2Dcompressible%20resources%20like%20memory%2C%20disk%20space%2C%20etc.%20more%20on%20that%20in%20the%20fourth%20part."><strong>no eviction</strong></a><strong> for CPU stress</strong>, only throttling (delay).</p><h3>What do CPU limits actually do?</h3><p>As I’ve explained in the first part, Kubernetes will <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96#:~:text=A%20compressible%20resource%20means%20that%20if%20the%20usage%20of%20this%20resource%20reaches%20its%20maximum%2C%20the%20processes%20that%20require%20this%20resource%20will%20have%20to%20wait%20until%20the%20resource%20becomes%20free.%20In%20other%20words%2C%20throttling%20the%20processes.">throttle</a><strong> </strong>the CPU usage of containers that reach their CPU limit.</p><p>This is done by configuring the following CGroup parameters:</p><ul><li>“<strong>cpu.cfs_period_us</strong>” — This configures what a “CPU period” is in microseconds, or if we are using the <a 
href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-2-6eeb50197c44#:~:text=Let%E2%80%99s%20think%20of,the%20oven.%20Yummy!%20%F0%9F%8D%95">same example</a> as in part 2, the time interval for every new CPU pizza. Currently, Kubernetes configures this to be 100,000µs [100ms] by default; this can be changed via the <a href="https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#:~:text=limits.%20Default%3A%20true-,cpuCFSQuotaPeriod,-meta/v1.Duration">kubelet configuration</a></li><li>“<strong>cpu.cfs_quota_us</strong>” — The CPU time in µs that the container (CGroup) can consume every period. Every 1 vCore you define as the CPU limit will configure a 100,000µs (100ms) quota — equal to the period</li></ul><p><em>You can learn more about the above CGroups configurations </em><a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs"><em>here</em></a><em>.</em></p><p>So for example, if you have configured 0.5 vCore as the CPU limit (500 milli-cores), 50,000µs (50ms) will be configured as the quota, giving the CGroup a maximum usage of 50ms per 100ms period. If the process in the container (CGroup) asks for more, it will have to wait for the next CPU period, meaning waiting out the remaining 50ms of the period — this is CPU throttling. This is only true if my container runs a single process that runs on one core (one thread can only use one core).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/731/0*20FW6dz52zoO2VN8" /><figcaption>Single-threaded CPU limit and throttling</figcaption></figure><p>Sounds simple? 
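</p><p>The quota arithmetic above can be sketched as a toy model (my own simplification, not kubelet code; it assumes each busy thread is pinned to its own core and burns quota at one core’s rate):</p>

```python
# Toy model of CFS quota throttling (see assumptions in the note above).
PERIOD_MS = 100.0  # cpu.cfs_period_us / 1000, the Kubernetes default

def throttled_ms(limit_vcores: float, busy_threads: int) -> float:
    quota_ms = limit_vcores * PERIOD_MS                # cpu.cfs_quota_us / 1000
    runnable_ms = min(PERIOD_MS, quota_ms / busy_threads)
    return PERIOD_MS - runnable_ms                     # throttled time per period

print(throttled_ms(0.5, busy_threads=1))  # 50.0: the 0.5 vCore example above
print(throttled_ms(0.5, busy_threads=4))  # 87.5: quota burned in 12.5 ms
print(throttled_ms(0.5, busy_threads=8))  # 93.75: quota burned in 6.25 ms
```

<p>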
Let’s make it a little more complicated with multi-process containers / multi-threaded processes!</p><blockquote>Every core you use consumes time out of the quota simultaneously.</blockquote><p>For multi-threaded (I’m looking at you, Java thread-pool) or multi-process tasks, if you run on 4 cores simultaneously, you will consume your quota in 12.5ms (4 milliseconds of quota for every actual millisecond) and will be throttled for the rest of the period — 87.5ms!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/731/0*WVYOn-E0b8yedLPU" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/731/0*NcpU14w0WBK6qn_Y" /><figcaption>Multi-threaded CPU limit and throttling over and over again</figcaption></figure><p>The more cores your node has &amp; your container utilizes, the worse the throttling will get, so if we take the same example and run it on a node with 8 cores, it will consume its quota in only 6.25 milliseconds!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/731/0*8zgEtjsl-sxlrZyF" /><figcaption>Multi-threaded CPU limit and worse throttling with more cores</figcaption></figure><p>I would show an example on a node with 88 cores, but I’m sure you get the point. Throttling hurts your container&#39;s response times drastically!</p><h3>CPU Limits in real life</h3><p>Here I can see the CPU usage of a CPU stress test that is single-threaded (and therefore can only use up to 1 core) with 3 cores as the CPU limit.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k7RVbgfQDE090qz1" /><figcaption>Single-threaded stress test with 3 CPU limit</figcaption></figure><p>As expected, we aren’t getting throttled by the CPU limit and the process is getting all of the CPU it needs (or in our stress test case, all the CPU it can get!).</p><p><strong>Bonus myth:</strong> Another best-practice misunderstanding is to set your CPU request or limit at 1 vCore or below. 
This is only true for containers that are single-threaded, and yes, it’s better to use multiple containers or pods for parallel jobs than to replicate processes for the same task in the same container. But languages such as Java and GoLang are highly concurrent by design, so when utilizing multiple threads with concurrency, you absolutely need to set more than 1 vCore as your CPU request if your app requires it.</p><p>Here is an example of a multi-threaded CPU stress that also has a 3 vCPU limit, and as you can see, it uses all of its available CPU and is suffering from a huge amount of throttling.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gvhbWbAo6lUXj1eJ" /><figcaption>Multi-threaded stress test with 3 CPU limit</figcaption></figure><blockquote>The measurement of CPU throttling is a very interesting topic, maybe it will get its own blog post in the future.</blockquote><p>Note that I didn’t show you the total CPU usage of the node, and in our stress pod example, there isn’t any other CPU-intensive activity on the node. But the stress pod is throttled regardless of this fact due to the CPU limit, causing idle CPU to be wasted.</p><p>I have already explained why the node’s CPU stress is not really a problem: the stressed pod has X CPU shares, and in case other pods require CPU (within their requests), they will get it, and the stressed pod will just gain less idle CPU, being throttled down to its requests.</p><h3>The CPU Limit anti-pattern</h3><p>Let’s see it in action! I have a cluster with one worker node that has 2 vCPUs.</p><p>First, I started 2 pods running a full CPU stress. 
Both of them have no CPU request or limit, falling into the BestEffort QoS class.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I6kbDdSSEruhhd-G" /><figcaption>Start 2 CPU stressed pods — BestEffort</figcaption></figure><blockquote><em>Containers with no CPU Requests will receive 2 CPU shares by default.</em></blockquote><p>Since there are 2 containers running on the node with the same amount of CPU shares, both wanting as much CPU as possible, the CFS will split the CPU time between them.</p><p>Now, let’s start another 2 pods, this time Guaranteed QoS class pods with 0.1 CPU request and limit.</p><p>A 0.1 CPU request is 100m CPU, which equals 102 CPU shares. Do you think a “higher” QoS pod with 50 times the CPU shares will get more CPU than the BestEffort pods? Hopefully, by now you understand that the CPU limit in fact prevents this from happening.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*utkuQPEvjSMXtNkM" /><figcaption>Start 2 CPU stressed pods — Guaranteed</figcaption></figure><p>And voilà! The Guaranteed pods get up to the CPU Limits and never more, no matter if there is idle CPU on the node or not.</p><p>Here is a closer look at the CPU allocation for those pods.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I0pEz0wRkBtIe_cj" /><figcaption>Only the BestEffort pods enjoying the spare CPU on this node</figcaption></figure><p>Notice the small drop in CPU usage? This is from the deployment of the Guaranteed pods, but not because of their QoS class, <strong>only due to the fact that they have more CPU shares</strong>!</p><p>The final step in this experiment is to start another set of 2 pods, this time with the same 0.1 CPU Requests but with <strong>no limit</strong>. Burstable QoS.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6LIJUoGw6DWasknc" /></figure><p>Notice what happened to the BestEffort pods? 
No, I didn’t delete them; they just got run down almost completely, to the point you can’t even see them in the graph! This is because the new Burstable pods have many more CPU shares. The Guaranteed pods are ‘guaranteed’ (see what I did there) to get the same amount of CPU no matter what; they are not guaranteed to be the top-priority pods on the cluster.</p><p>Note that during this time, the node stayed consistently at 100% CPU and didn’t crash. Not only did it not crash, all of the components on the node, like the Kubelet, container runtime, SystemD, and the others, kept working just fine. The hungry pods fought only for the CPU leftovers.</p><p>❗ The bottom line: CPU limits only prevent the use of CPU leftovers; they don’t prevent noisy neighbors or protect your nodes from overallocation. Mic dropped.</p><blockquote>🔥 So go ahead and <strong>remove your CPU limits</strong>!</blockquote><h3>When to use CPU limits?</h3><p>But wait, why do CPU limits exist in the first place?</p><p>You may have heard that Google uses CPU limits in their workloads, and assumed that it must be the best practice. Well, it depends on what you’re trying to achieve; <a href="https://youtu.be/nWGkvrIPqJ4?t=1229">Google prefers consistent workloads over performant workloads</a>. It makes much more sense for organizations with many groups that consume resources from one central cluster (or for cluster operators). They must have reproducible performance every single time at the group level, before thinking about performance at the organization (or cluster) level.</p><p>Are you Google? Probably not. Most of us are trying to achieve the best possible performance on the cheapest infrastructure while minimizing downtime. <strong>Production workloads should be able to utilize idle CPU</strong>. 
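</p><p>The share arithmetic driving the experiment above can be sketched with a toy model (my own simplification of CFS weight division: a flat hierarchy where every CGroup demands as much CPU as it can get, and limits are ignored):</p>

```python
# Toy model: hungry cgroups split CPU in proportion to their cpu.shares.
def split_cpu(shares: dict, total_cores: float) -> dict:
    weight = sum(shares.values())
    return {name: total_cores * s / weight for name, s in shares.items()}

# BestEffort pods get 2 shares by default; a 100m (0.1 vCPU) request
# translates to roughly 102 shares (1024 * 0.1), so on a 2-vCPU node:
alloc = split_cpu({"besteffort-1": 2, "besteffort-2": 2,
                   "burstable-1": 102, "burstable-2": 102}, total_cores=2.0)
for name, cores in alloc.items():
    print(f"{name}: {cores:.3f} cores")
```

<p>Run it and the BestEffort pods end up with roughly 0.02 cores each, which is why they all but vanish from the graph.</p><p>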
Containers won’t “steal” CPU from other containers if you set your CPU requests right; if you didn’t set CPU requests, or set them badly, I’m afraid the CPU limit won’t save you.</p><p>Similar to Google’s use case, <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview">GKE AutoPilot</a> is another great example.</p><p>GKE AutoPilot, in a nutshell, is a managed Kubernetes cluster that manages not only the control plane for you but also the nodes; you just need to apply your pods.</p><p>AutoPilot will always <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests">set CPU requests and limits</a> for you (unlike AWS Fargate), even if you didn’t define them in your pod spec. That guarantees consistent performance over and over again. You can’t enjoy idle resources, but you can be sure that you will always get the same performance no matter what node you’re running on.</p><h3>Real-life use-cases</h3><p>In our day-to-day, we may want to set CPU limits on staging environments to simulate the worst-case scenario (no idle resources to consume) and to be able to run stress tests without using the idle CPU that can’t and shouldn’t be counted on.</p><p>In production, there is no good use for CPU limits, even for low-priority containers. If your goal is to reserve the idle CPU capacity for your important workloads, just adjust the CPU Requests accordingly.</p><h3>CPU resources best practice</h3><p>After months of diving into the Kubernetes resources rabbit hole, the conclusions I’ve come to are:</p><ul><li>Set your CPU Requests as the relative weight you want the container to have. 
No less than the expected CPU usage.</li><li>Concurrency matters: you can’t run on more cores than your task knows how to utilize, so don’t set CPU requests higher than 1 × (the number of concurrent threads / processes) you have.</li><li><a href="https://learnk8s.io/production-best-practices#:~:text=Disable%20CPU%20limits%20%E2%80%94%20unless%20you%20have%20a%20good%20use%20case">Never set CPU limits</a> if performance is what you desire.</li></ul><p>This is what Tim Hockin, one of the first <a href="https://www.youtube.com/watch?v=BE77h7dmoQU">creators &amp; maintainers of Kubernetes</a> at Google, advised in a tweet a few years back:</p><blockquote>This is why I always advise: 1) Always set memory limit == request. 2) Never set CPU limit. (For locally adjusted values of &quot;always&quot; and &quot;never&quot;.)</blockquote><p>Wait, memory request should be equal to the memory limit? We’ll dive into this in the fourth &amp; final part of this blog. Stay tuned!</p><hr><p><a href="https://medium.com/directeam/kubernetes-resources-under-the-hood-part-3-6ee7d6015965">Kubernetes resources under the hood — Part 3</a> was originally published in <a href="https://medium.com/directeam">Directeam</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kubernetes resources under the hood — Part 2]]></title>
            <link>https://medium.com/directeam/kubernetes-resources-under-the-hood-part-2-6eeb50197c44?source=rss----eb217c281961---4</link>
            <guid isPermaLink="false">https://medium.com/p/6eeb50197c44</guid>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[docker]]></category>
            <dc:creator><![CDATA[Shon Lev-Ran]]></dc:creator>
            <pubDate>Sun, 04 Sep 2022 14:20:37 GMT</pubDate>
            <atom:updated>2022-09-15T07:05:37.370Z</atom:updated>
<content:encoded><![CDATA[<h3>Kubernetes resources under the hood — Part 2</h3><h4>Do you think that CPU requests are just used for scheduling? Think again. Introducing CPU Shares, and laying the grounds for removing your limits!</h4><p>Co-authored by @<a href="https://medium.com/@shirmon">shirmon</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xG65UgQP2Y1hf-Dt.jpg" /></figure><h3>Understanding CPU Requests</h3><p>In the <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96">previous post</a>, I talked about the foundation of Kubernetes resource management. In this post, we will dive deeper into what is going on behind the scenes when we configure CPU requests for a pod’s containers.</p><pre>apiVersion: v1<br>kind: Pod<br>metadata:<br>  name: frontend<br>spec:<br>  containers:<br>  - name: app<br>    image: images.my-company.example/app:v4<br>    resources:<br>      requests:<br>        memory: &quot;64Mi&quot;<br>        <strong>cpu: &quot;250m&quot;</strong><br>      limits:<br>        memory: &quot;128Mi&quot;<br>        cpu: &quot;500m&quot; # For the last time!</pre><p>Resource <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96#:~:text=for%20scheduling%20decisions.-,Requests,-When%20scheduling%20pods">requests</a> are first and foremost used for scheduling decisions, but is there anything more to CPU requests?</p><h3>CPU Shares</h3><p>When you configure X vCPUs as a container’s CPU request in your pod’s manifest, Kubernetes configures (1024 * X) CPU shares for your container.</p><blockquote><em>For example, if I configure 250m for my CPU requests, Kubernetes will set 1024 * 250m = 256 CPU shares.</em></blockquote><p>So what are CPU shares and what do they do?</p><p>To understand CPU shares, let’s first talk about the kernel mechanism called CFS (<a href="https://en.wikipedia.org/wiki/Completely_Fair_Scheduler">Completely Fair 
Scheduler</a>).</p><h3>CFS — Completely Fair Scheduler</h3><p>CFS is the default Linux CPU scheduler and is in charge of allocating CPU time between processes fairly.</p><p>The “completely fair” part is not as simple as it sounds; it uses a few parameters to decide the relative weight (priority) of each process. Many of you may be familiar with the “nice” setting that can be set for processes to change their relative weight. But currently, Kubernetes doesn’t use nice to affect the processes’ weight; instead, it configures CPU shares for a CGroup.</p><p>So, CPU shares are a Linux CGroup feature designed to prioritize CGroup processes, telling the CFS to allocate more CPU time <strong>at times of congestion</strong> to the higher-priority processes.</p><h4>Let me explain:</h4><p>Let’s think of a single CPU timeframe (1 second, for example) as a pizza. Every second a new pizza comes out of the oven, processes eat what they need from it, and then it’s gone. If all of my processes are not hungry enough to eat all the pizza in 1 second, they will eat their fill until the time is over and a new CPU-second-pizza will come out of the oven. Yummy! 
🍕</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/0*nKKa1e-qRmq-gBG1.png" /><figcaption>CPU feeding Homer</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*w-sW65hCWqc-91Fs.png" /><figcaption>Sufficient CPU per second</figcaption></figure><p>The complications start when our processes are hungry and 1 pizza every second is not enough to feed them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oKrkwin8mgEUBscZ.png" /><figcaption>Insufficient CPU per second</figcaption></figure><p>When there is not enough CPU time (or pizza) for all of my processes, CFS will look at the shares every CGroup has, will cut the pizza into the sum of all shares, and will split it accordingly.</p><blockquote><strong>In the case that many of the processes in the CGroup want more CPU than available, the slice that each CGroup receives will be evenly distributed between the processes in that CGroup.</strong></blockquote><p>So for example, if processes in 5 CGroups are requesting the maximum amount of CPU possible, and each of the CGroups has an equal amount of CPU shares, then the CPU time will be distributed evenly between the CGroups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mtKOELiW9WrAv3sL.png" /><figcaption>CPU shared between CGroups with the same amount of shares</figcaption></figure><p>Another example is (Staying in the state that all processes are requesting as much CPU as possible); if I have 3 CGroups with 1024 CPU shares each, and one other CGroup with 3072 shares the first 3 CGroups will get 1/6 of the CPU, and the last CGroup will get half (3/6)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*h4mj8B_C66tahmdy.png" /><figcaption>CPU shared between CGroups with different amounts of shares</figcaption></figure><p>Remember, all of this only matters if I’m lacking CPU, if I have 3 CGroups with X CPU shares that need a lot of CPU and the fourth CGroup 
with 1000X CPU shares that is idle, the first 3 will split the CPU equally.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NLM3YVAZBRzvwS0F.png" /><figcaption>CPU shared only between hungry CGroups</figcaption></figure><p><em>Can my container </em>even<em> have 1,048,576 CPU shares on Kubernetes? Only if my node has at least 1024 CPU cores, such as the </em><a href="https://parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor/"><em>Epiphany-V</em></a><em>, but I’m sure most of us don’t have those kinds of nodes.</em></p><h4>How Kubernetes uses these features</h4><p>So as I’ve said, Kubernetes CPU requests configure CPU shares for our containers’ CGroups.</p><p>Shares “over-commitment” is prevented by Kubernetes magic; on the one hand, the scheduler only schedules pods on a node as long as the total CPU requests are lower than or equal to the node’s CPU (allocatable — see <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96#:~:text=Then%2C%20it%20calculates,and%20ephemeral%20storage.">previous part</a>). On the other hand, the CPU shares you provision can be up to 1024 times the number of cores. That sets a cap on the maximum number of shares that can be used by the pods, and the ratio remains.</p><blockquote>The sum of CPU shares your containers can have on Kubernetes is 1024 times the number of allocatable CPUs you have in your cluster.</blockquote><h3>Real-life examples</h3><p>I tried to make the previous examples as simple as possible, so I removed some important parameters, such as:</p><ul><li>The thread and process count in each CGroup</li><li>The CPU consumed by the node (other than your running pods)</li></ul><p>There are also some other parameters that you might expect to take effect here, but don’t. 
Such as:</p><ul><li>Quality of Service (QoS)</li><li>Pod priority</li><li>Evictions</li></ul><p>Let’s have a shallow dive into them:</p><h4>Thread Count</h4><p>When we run just a single process in our container, if that process only creates a single thread, it cannot consume more than one core anyway. When you set CPU requests for your containers, <a href="https://shonlevran.medium.com/kubernetes-resources-under-the-hood-part-3-6ee7d6015965#:~:text=Bonus%20myth%3A,app%20requires%20it.">always bear in mind the number of threads they will run</a>.</p><p>A side note — <strong>threads are not free</strong>; try not to use too many threads, as each thread brings its own overhead, and increase the number of replicas instead.</p><h4>Node Load</h4><p>The bar charts from earlier are for isolated processes, but not all processes are isolated. Not to worry! The CGroups for your containers are pretty low in the CGroups hierarchy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ia_GVVf1qispoJoqLRIiAw.png" /><figcaption>Simple Kubernetes CGroups map</figcaption></figure><p>Maybe while reading you already went to check how many CPU shares your Kubelet has, to make sure it’s not deprived. Don’t worry: your pods and containers are just sharing the CPU time the “kubepods” CGroup is eligible for. 
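</p><p>That “sharing what kubepods is eligible for” is the same proportional split, applied one level up the hierarchy. A toy sketch (the CGroup names and share numbers here are made up for illustration, and real CFS is far more involved):</p>

```python
def split_by_shares(total_cpu, shares):
    # Proportional CFS-style split among sibling CGroups,
    # assuming every sibling demands as much CPU as it can get.
    total_shares = sum(shares.values())
    return {name: total_cpu * weight / total_shares
            for name, weight in shares.items()}

# Hypothetical top level: node services compete with the "kubepods" CGroup...
top = split_by_shares(1.0, {"system.slice": 1024, "kubepods": 4096})

# ...and your pods then split only whatever "kubepods" received.
pods = split_by_shares(top["kubepods"], {"pod-a": 256, "pod-b": 256})
print(top)
print(pods)
```

<p>So a pod’s slice is relative to its siblings at each level of the hierarchy, not to the whole machine directly.</p><p>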
If the Kubelet, the container runtime, or other services on the node need CPU time, they will get it.</p><blockquote>Don’t worry about setting high CPU requests; the node’s components have higher priority out of the box.</blockquote><h4><a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96#:~:text=We%20can%20specify%20resource%20requests%20and%20limits%20for%20the%20containers%20in%20our%20pod%3B%20based%20on%20those%20parameters%20Kubernetes%20also%20assigns%20a%20QoS%20class%20(Quality%20of%20Service)%20to%20our%20pods.">Quality of Service</a></h4><p>Kubernetes configures CGroups per QoS class; currently, they have no real function and exist for future use.</p><blockquote>In terms of CPU time and priority, the CPU Request is the only thing that matters.</blockquote><p>So what will happen if you don’t set CPU Requests? The container will get 2 CPU shares by default and will have a very low priority compared to pods that have CPU requests configured.</p><p>CPU time allocation will be the same both for burstable and best-effort pods. Guaranteed will have another parameter impacting the CPU time. More on that in the <a href="https://shonlevran.medium.com/kubernetes-resources-under-the-hood-part-3-6ee7d6015965">next part</a>.<br>The bottom line is that QoS doesn’t directly affect the CPU time a pod&#39;s containers will receive. 
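</p><p>For reference, the mapping from a CPU request to cpu.shares can be sketched as follows (my approximation of the kubelet’s conversion, where 1000m corresponds to 1024 shares and 2 is the floor):</p>

```python
MIN_SHARES = 2  # the fallback a container gets with no CPU request

def milli_cpu_to_shares(milli_cpu):
    # One full core (1000m) maps to 1024 cpu.shares.
    if milli_cpu == 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * 1024 // 1000)

print(milli_cpu_to_shares(250))  # a 250m request
print(milli_cpu_to_shares(0))    # no request at all: minimum priority
```

<p>This is why a requestless container is nearly starved the moment its neighbors get hungry.</p><p>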
The only thing that matters is CPU shares (and limits, if you still use them).</p><h4>Pod Priority</h4><p>“It’s OK, I set pod priority.” — Sorry but not exactly…</p><p>Pod priority is only used to determine the termination order on node eviction, and as we’ve mentioned, there is <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96#:~:text=There%20is%20no%20eviction%20for%20incompressible%20resources!">no eviction caused by CPU pressure</a>.</p><h4>Evictions</h4><p>Eviction is a process running on the node that chooses and kills pods when the node is low on resources. Eviction only happens for incompressible resources like memory, disk space, etc. More on that in the fourth part.</p><h3>Not just for scheduling</h3><p>We learned that CPU requests are used not only for scheduling purposes but also throughout the lifetime of the container. Memory requests also have their deep layers; more on that in part four.</p><p>Also, we talked only about normal Kubernetes behavior; there are many other options, like <a href="https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy">CPU pinning</a>, which assigns exclusive CPU cores to a container. That’s outside of the scope of this article, but we may get into it in the future 😄</p><h3>To summarize;</h3><p>We learned that CPU requests are not used just for scheduling, but also play a huge part in the whole container lifecycle! We learned the importance of setting the correct requests to configure the right amount of CPU shares for each container and why configurations such as QoS don’t really affect our workloads.</p><blockquote>Remember! CPU requests configure how much CPU will be Guaranteed to your container throughout its lifecycle!</blockquote><p><a href="https://shonlevran.medium.com/kubernetes-resources-under-the-hood-part-3-6ee7d6015965"><strong>Part 3</strong></a><strong> is out! 
It explains why you should remove your CPU limits.</strong></p><hr><p><a href="https://medium.com/directeam/kubernetes-resources-under-the-hood-part-2-6eeb50197c44">Kubernetes resources under the hood — Part 2</a> was originally published in <a href="https://medium.com/directeam">Directeam</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kubernetes resources under the hood — Part 1]]></title>
            <link>https://medium.com/directeam/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96?source=rss----eb217c281961---4</link>
            <guid isPermaLink="false">https://medium.com/p/4f2400b6bb96</guid>
            <category><![CDATA[docker]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[containers]]></category>
            <dc:creator><![CDATA[Shon Lev-Ran]]></dc:creator>
            <pubDate>Sun, 04 Sep 2022 14:17:37 GMT</pubDate>
            <atom:updated>2022-09-04T14:17:36.973Z</atom:updated>
            <content:encoded><![CDATA[<h3>Kubernetes resources under the hood — Part 1</h3><h4>I’m sure we’re all familiar with the ‘resources’ block of containers in a pod. But do we really know what Kubernetes uses them for under the hood?</h4><p>Co-Authored by@<a href="https://medium.com/@shirmon">shirmon</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*KcDfk50gtiskFFLP.jpg" /></figure><p>One of the very first things that we are taught by the community when starting to use Kubernetes is always to set requests and limits for CPU and memory on every container in our pods.</p><blockquote><em>When you specify a </em><a href="https://kubernetes.io/docs/concepts/workloads/pods/"><em>Pod</em></a><em>, you can optionally specify how much of each resource a </em><a href="https://kubernetes.io/docs/concepts/containers/"><em>container</em></a><em> needs. The most common resources you’ll specify are CPU and memory (RAM); there are others. </em><a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#:~:text=When%20you%20specify,there%20are%20others."><em>source</em></a></blockquote><pre>apiVersion: v1<br>kind: Pod<br>metadata:<br>  name: frontend<br>spec:<br>  containers:<br>  - name: app<br>    image: images.my-company.example/app:v4<br>    resources:<br>      requests:<br>        memory: &quot;64Mi&quot;<br>        cpu: &quot;250m&quot;<br>      limits:<br>        memory: &quot;128Mi&quot;<br>        cpu: &quot;500m&quot;</pre><blockquote><em>If a container specifies its own resource limit but does not specify a resource request, then Kubernetes automatically assigns a resource request that matches the specified limit. 
</em><a href="https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#:~:text=If%20a%20Container%20specifies%20its%20own,CPU%20request%20that%20matches%20the%20limit."><em>source</em></a></blockquote><p>However, after years of experience with many use cases and having to investigate many resource-related issues, I have discovered that Kubernetes resource management is a lot more complex than it seems.</p><h3>Let’s start from the beginning</h3><p>Kubernetes is a container orchestrator that deploys workloads (pods) over a pool of resources (nodes). Of course, this is a huge simplification since Kubernetes is a lot more complex and schedules pods using many different parameters, but what I want to dig into in this article (if it’s not already obvious) is how Kubernetes manages container resources.</p><p>So which resources can Kubernetes manage? Containers consume many kinds of resources. The obvious ones are resources like CPU and Memory, but they can also consume other resources such as disk space, disk time (I/O), network bandwidth, process IDs, host ports, IP addresses, GPU, power, and more!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*Jgg_F47WP6NocEzY.png" /><figcaption>Pods requests and nodes resources</figcaption></figure><h3>First, let’s take a deep dive into containers</h3><h4>So, what are containers really?</h4><blockquote><em>In a nutshell, containers are a set of Linux namespaces.</em></blockquote><p>So, what are <a href="https://en.wikipedia.org/wiki/Linux_namespaces">Linux namespaces</a>?</p><p>Linux namespaces are a Linux kernel feature that partitions kernel resources so that a process or set of processes in one namespace sees one set of kernel resources and is isolated from processes in other namespaces. 
Some examples of these namespaces are PID, UID, Cgroups &amp; IPC (see the complete list in the <a href="https://en.wikipedia.org/wiki/Linux_namespaces">wiki</a>).</p><p>Another thing to know about namespaces is that they are nested, meaning namespaces can be inside other namespaces. Child namespaces are isolated from their parent namespaces, but the parent namespaces can see everything within the child namespaces.</p><p>Technically speaking, when running a Linux machine, you are already inside a container (since you are in the first set of namespaces). We utilize the isolation advantages of containers when creating another set of namespaces in the same system.</p><p>So, when you spin up a container, the runtime creates a set of these namespaces and runs your application inside them. This is also why, inside a container, you will usually see the PID of your application set to 1 (or a low number depending on what you’re running), while outside of the container (in the main PID namespace), the PID of your application will be a far larger number. This is the same process, but the PID in the container is mapped to the higher PID in the main namespace and isolated from it and any other sets of namespaces (other containers).</p><p>Namespaces give us the ability to isolate processes from each other, but what about resource consumption? If all of our containers think they are operating in isolation, couldn’t they consume too much of the resources and impact the others? This phenomenon is known as <em>noisy neighbors</em>.</p><p>So how can we deal with noisy neighbors? One approach is to limit the resources each process can consume, and (surprise, surprise) the Linux kernel has another feature up its sleeve that can do just this, called Control groups (Cgroups). These are configured for each process to limit, account for, and isolate the resources they each consume. 
Using this functionality, Kubernetes can limit the resource usage of containers.</p><p>Currently, Kubernetes uses Cgroups v1, but another player has entered the arena (for the last five years), <a href="https://medium.com/some-tldrs/tldr-understanding-the-new-control-groups-api-by-rami-rosen-980df476f633#:~:text=In%20cgroups%20v1%2C%20a%20process,only%20to%20a%20single%20subgroup">Cgroups v2</a>! Its current use is for Memory Quality of Service (QoS), which has been in alpha since 1.22 and opens a whole new world of possibilities. You can read all about it <a href="https://kubernetes.io/blog/2021/11/26/qos-memory-resources/">here</a>.</p><h4>What resources are currently managed by Kubernetes?</h4><p>Kubernetes by itself currently only manages a fraction of the resources present. First of all, it lists the capacity for each resource on its nodes.</p><pre># You can see it using kubectl.<br>kubectl get node -ojson | jq &#39;.items[].status.capacity&#39; <br>{<br>  &quot;cpu&quot;: &quot;2&quot;,<br>  &quot;ephemeral-storage&quot;: &quot;52416492Ki&quot;,<br>  &quot;hugepages-1Gi&quot;: &quot;0&quot;,<br>  &quot;hugepages-2Mi&quot;: &quot;0&quot;,<br>  &quot;memory&quot;: &quot;8003232Ki&quot;,<br>  &quot;pods&quot;: &quot;110&quot;<br>}</pre><p>Then, it calculates the <a href="https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable">allocatable amount</a> used for pod scheduling. The allocatable resources of a node are calculated by subtracting a buffer of reserved resources for the <a href="https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#system-reserved">Linux system</a>, <a href="https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved">kubelet</a>, and the <a href="https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#eviction-thresholds">eviction threshold</a> from the node’s total resources. 
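</p><p>Written out, the calculation is a plain subtraction (the numbers below are illustrative, not real defaults; actual reservations depend on your kubelet flags and cloud provider):</p>

```python
def allocatable(capacity_m, system_reserved_m, kube_reserved_m, eviction_m):
    # allocatable = capacity - system-reserved - kube-reserved - eviction threshold
    return capacity_m - system_reserved_m - kube_reserved_m - eviction_m

# A hypothetical 2-core node, all values in millicores:
print(allocatable(capacity_m=2000, system_reserved_m=20,
                  kube_reserved_m=50, eviction_m=0))
```

<p>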
As of 1.21, the kubelet only calculates the allocatable resources for CPU, memory, huge pages, and ephemeral storage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/580/0*MDJ8AI_Pg0hhG_CW.png" /><figcaption>Nodes capacity components</figcaption></figure><pre>kubectl get node -ojson | jq &#39;.items[].status.allocatable&#39; <br>{<br>  &quot;cpu&quot;: &quot;1930m&quot;,<br>  &quot;ephemeral-storage&quot;: &quot;47233297124&quot;,<br>  &quot;hugepages-1Gi&quot;: &quot;0&quot;,<br>  &quot;hugepages-2Mi&quot;: &quot;0&quot;,<br>  &quot;memory&quot;: &quot;7313056Ki&quot;,<br>  &quot;pods&quot;: &quot;110&quot;<br>}</pre><p>Together, the allocatable resources form a vector that the Kubernetes scheduler uses for scheduling decisions.</p><h3>Requests</h3><p>When scheduling pods, <strong>the scheduler only considers the pod’s container requests against the allocatable resources</strong> (each scheduled pod naturally lowers the node’s remaining allocatable resources, leaving less room for the requests of the next pods). <strong>It does not consider the actual resource usage</strong> on the node (i.e. containers that use resources over or below their requests).</p><p>If the containers in my pod have no requests assigned, Kubernetes can schedule them to any node (if, of course, there are no other scheduling restrictions).<br>By default, Kubernetes can schedule up to 110 pods per node.</p><p>Since Kubernetes 1.21, the <a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-types">main resources you can request from Kubernetes </a>are CPU, Memory, Ephemeral storage, and HugePages. 
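</p><p>The requests-only rule can be sketched as a simple fit check (a simplification of the scheduler’s filtering step; real scheduling weighs many more factors):</p>

```python
def fits(node_allocatable, already_requested, pod_requests):
    # True if the pod's requests fit in the node's remaining allocatable
    # resources. Actual usage never enters the decision, only requests.
    return all(already_requested.get(res, 0) + need <= node_allocatable.get(res, 0)
               for res, need in pod_requests.items())

node = {"cpu": 1930, "memory": 7313056}    # allocatable: millicores / Ki
taken = {"cpu": 1500, "memory": 4000000}   # sum of requests already scheduled
print(fits(node, taken, {"cpu": 400, "memory": 1000000}))   # still fits
print(fits(node, taken, {"cpu": 500, "memory": 1000000}))   # over on CPU
```

<p>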
In addition, you can accomplish scheduling by requesting custom resources using <a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#extended-resources">extended resources</a> (which can also be applied with controllers such as the <a href="https://github.com/NVIDIA/k8s-device-plugin">Nvidia controller</a> for GPU).</p><p>Note that you can also limit <a href="https://kubernetes.io/docs/concepts/policy/pid-limiting/">PID</a> consumption per pod and at the node level.</p><p>So we’ve learned that resource requests are important for scheduling (and not just for scheduling; see the <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-2-6eeb50197c44">next part</a> for more information), where all of the requested resources must be available on the node, including extended resources.<br>We’ll dive deeper into the other effects of CPU requests in the <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-2-6eeb50197c44">second part of this blog post</a>.</p><h3>Limits</h3><p>Resources are considered both at scheduling time and at runtime. To keep our containers from overloading the node and consuming too many resources, Kubernetes utilizes Cgroups. Kubernetes uses container limits to define the Cgroups and limit their resource consumption.</p><h3>Compressible vs. incompressible resources</h3><p>I want to take a step back for a moment to talk about the two different types of resources, <a href="https://en.wikipedia.org/wiki/System_resource#:~:text=One%20can%20also,will%20slow%20significantly">compressible and incompressible</a>.</p><p>A compressible resource means that if the usage of this resource reaches its maximum, the processes that require this resource will have to <strong>wait</strong> until the resource becomes free. 
In other words, <strong>throttling</strong> the processes.</p><p>Think of it as a water dam; when the outlet pipes of the dam are full, and the flowing water arriving at the dam exceeds these pipes’ capacity, the water inside the dam will fill up. Usually, we measure compressible resources by time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IiT83jkJnejMkeFd" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/0*EdYk8Q8GDUfwHkX_" /><figcaption>Dam of compressible CPU throttling</figcaption></figure><p>CPU is a compressible resource, meaning that if CPU usage is at 100%, a process that requires CPU will need to wait until it receives CPU time.</p><blockquote>There is no eviction for compressible resources!</blockquote><p>On the other hand, a resource being incompressible means processes cannot wait for it; either they cannot run, or something else has to stop and release resources for the new process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*1pN4mCRNfkxMQiD3" /><figcaption>Incompressible box shelves</figcaption></figure><p>Think of it like putting boxes on shelves: once the shelves are filled with boxes, you cannot put another box on them. You either have to make room by removing boxes from the shelf, or you don’t place the box on the shelf at all. Memory is an incompressible resource, meaning that if you are out of memory and want to allocate memory for a new or existing process, you either have to kill a process that is taking up memory, or the allocating process will crash.</p><p>For Kubernetes, the only compressible resource it manages is the CPU. The other resources Kubernetes manages (memory, HugePages, Ephemeral storage, and PIDs) are all incompressible.</p><p>When you specify limits for compressible resources like CPU, Kubernetes makes sure to throttle them when they try to consume more than their allowed levels. 
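</p><p>The wait-versus-kill distinction can be encoded in a couple of lines (a toy model of the behavior described above, nothing more):</p>

```python
COMPRESSIBLE = {"cpu"}  # everything else Kubernetes manages is incompressible

def over_capacity_behavior(resource):
    # Compressible: the process is throttled and waits its turn.
    # Incompressible: waiting is impossible; something gets killed or evicted.
    return "throttle" if resource in COMPRESSIBLE else "kill or evict"

print(over_capacity_behavior("cpu"))
print(over_capacity_behavior("memory"))
```

<p>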
On the other hand, Kubernetes has to deal with limits for incompressible resources using eviction. We will dig into this in the upcoming blog posts.</p><h3>Requests vs. Limits</h3><p>So we know we use resource requests as our “manual” guide for the Kubernetes scheduler to make scheduling decisions based on the minimum amount that we need to ensure for our workload.</p><p>We can also use resource limits as instructions to Kubernetes for which Cgroups it should configure for our containers and their thresholds.</p><p>When using extended resources, Kubernetes will use requests for scheduling but will not use the limits to set any Cgroups and limit the usage of those special resources.</p><h3>Quality of Service — Not really the bottom line</h3><p>We can specify resource requests and limits for the containers in our pod; based on those parameters Kubernetes also assigns a <a href="https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#qos-classes">QoS</a> class (Quality of Service) to our pods.</p><pre># Try this command to view your current QoS.<br>kubectl get pods -A -o=jsonpath=&#39;{range .items[*]}{.metadata.namespace}{&quot; : &quot;}{.metadata.name}{&quot; --QoS--&gt; &quot;}{.status.qosClass}{&quot;\n&quot;}{end}&#39;</pre><p>As good as it sounds, quality of service is not the last word in terms of pods’ priorities. This parameter is visible to us, as Kubernetes users, to estimate the probable priority of our pod in case of high resource stress and eviction events. 
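</p><p>Jumping ahead a little, the class assignment itself can be sketched from requests and limits alone (my simplification: it ignores Kubernetes defaulting requests to limits, so treat it as an approximation of the real rules, which are detailed just below):</p>

```python
def qos_class(containers):
    # containers: list of dicts with optional "requests"/"limits" maps;
    # only "cpu" and "memory" count toward the QoS class.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    if all(c.get("requests", {}).get(res) is not None
           and c.get("requests", {}).get(res) == c.get("limits", {}).get(res)
           for c in containers for res in ("cpu", "memory")):
        return "Guaranteed"
    return "Burstable"

print(qos_class([{"requests": {"cpu": "250m", "memory": "64Mi"},
                  "limits":   {"cpu": "250m", "memory": "64Mi"}}]))
print(qos_class([{"requests": {"cpu": "250m"}}]))
print(qos_class([{}]))
```

<p>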
There is a lot more to it: so-called lower-QoS pods might survive eviction events while higher-QoS-class pods may be terminated.</p><p>By the end of this blog series, you will know everything you need to know about the implications of QoS.</p><p>First of all, there are three classes of QoS:</p><ul><li>Guaranteed</li><li>Burstable</li><li>BestEffort</li></ul><p>For a Pod to have a QoS class of Guaranteed, every container in the Pod must have both memory and CPU limits and requests, and they must be equal.</p><p>A Pod has a QoS class of Burstable if the Pod has at least one Container with a memory or CPU request.</p><p>For a Pod to have a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.</p><p>Note that only CPU and memory are used for calculating the QoS class of the pod.</p><p>Regarding the usage of QoS, you should be aware:</p><ul><li>It’s used to set the oom_score_adj parameter — more on that in part 4.</li><li>It’s used to set <a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-2-6eeb50197c44#:~:text=The%20bar%20charts%20from%20earlier%20are%20for%20isolated%20processes%2C%20but%20not%20all%20processes%20are%20isolated.%20Not%20to%20worry!%20The%20CGroups%20for%20your%20containers%20are%20pretty%20low%20on%20the%20CGroups%20hierarchy.">QoS Cgroups</a> — which so far have no effect and are reserved for a future QoS feature.</li></ul><h3>To summarize;</h3><p>So that was a lot of information to go through, and this first part was just getting the basics out of the way.</p><p><strong>The </strong><a href="https://medium.com/@shonlevran/kubernetes-resources-under-the-hood-part-2-6eeb50197c44"><strong>second part</strong></a><strong> of this blog is out</strong>, where we start digging into the whys and whats of these features to understand exactly how Kubernetes uses them and what you should be putting in your resource requests and limits!</p><hr><p><a href="https://medium.com/directeam/kubernetes-resources-under-the-hood-part-1-4f2400b6bb96">Kubernetes resources under the hood — Part 1</a> was originally published in <a href="https://medium.com/directeam">Directeam</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>