Not every m5.4xlarge is created equally

Stijn De Haes · Published in datamindedbe · Oct 24, 2022

You might be surprised to learn that not every instance of the same type has the exact same amount of memory on AWS, and this can have severe consequences.

Said no one ever, but in practice, it can happen

At Conveyor we use AWS EKS to run all kinds of applications/workloads. We run Airflow to schedule batch jobs and also have a lot of experience running Spark on Kubernetes.

Abstracting away instances from users

In the early days of Conveyor, we allowed our users to supply Kubernetes resources directly when scheduling jobs:

apiVersion: v1
kind: Pod
metadata:
  name: my-batch-job
spec:
  containers:
    - name: app
      image: my-image
      resources:
        requests:
          memory: "128Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"

We noticed that this was often difficult for users because:

  • they get confused about the difference between requests (what the scheduler reserves for the pod) and limits (what the container is actually allowed to use).
  • it is difficult to make sure the instances provided by Conveyor are used optimally.

Most instances we use have a 1 CPU to 4GB memory ratio, but we saw a lot of Spark jobs that requested 1 CPU and 12GB of memory, which wastes a lot of CPU. To make things easier for our users, we provide specific instance sizes:

The list of Conveyor instance types and the available CPU and Memory

A complete list can be found in our docs, both for Containers and for Spark jobs.

These Conveyor instance types map to underlying AWS/Azure instances. For example, the mx.4xlarge Conveyor instance type is scheduled on instances that have 16 CPUs and 64GB of memory, such as the m5.4xlarge from the title.

We typically use a mix of AWS instance types, as most jobs are scheduled on spot instances. By allowing a mix, we can guarantee better spot availability for a given Conveyor instance type. We make sure the CPU and memory requests and limits in Kubernetes are set correctly, such that the job can be scheduled on the corresponding AWS instance. This is needed because the resources of a Kubernetes node are divided between the management overhead and the applications running on that node. So when running on an mx.4xlarge node, your application can only use 64GB minus the management overhead.
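As a sketch of what that translation looks like (the numbers below are purely illustrative, not Conveyor's actual settings), a job on an mx.4xlarge instance type ends up with a pod spec along these lines, with requests kept below the node's raw 16 CPU and 64GB so the pod still fits after the overhead is subtracted:

apiVersion: v1
kind: Pod
metadata:
  name: my-batch-job
spec:
  containers:
    - name: app
      image: my-image
      resources:
        requests:
          # illustrative values: slightly below the node's raw 16 CPU / 64GB,
          # leaving room for the management overhead described below
          cpu: "15"
          memory: "57Gi"
        limits:
          memory: "57Gi"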

Not all of the 64GB is available to your applications. You need to subtract some overhead to make sure your application can be scheduled on the Kubernetes node.

Examples of such management overhead (which the kubelet accounts for through reserved resources, sketched right after this list) are:

  • The kubelet running on every node
  • The operating system running on the node
  • Log/metric aggregation agents that ship data to central storage
  • An application that manages networking on the node
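Kubernetes handles this by reserving part of each node's resources for the system and the kubelet; only the remainder, the node's allocatable memory, is available to pods. A minimal sketch of such a kubelet configuration, with purely illustrative values (on EKS the actual reservations are set by the node's bootstrap configuration):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# illustrative values only; the real reservations depend on the node setup
kubeReserved:
  cpu: "250m"
  memory: "1Gi"      # reserved for the kubelet and container runtime
systemReserved:
  cpu: "250m"
  memory: "500Mi"    # reserved for the operating system
evictionHard:
  memory.available: "100Mi"

The scheduler compares pod requests against this allocatable number, not against the raw capacity of the instance.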

Not every instance is created equally

Today I want to draw your attention to some strange behavior we first noticed almost 2 years ago and encountered again last month. An instance with 16 CPUs and 64GB does not always have exactly 64GB of memory. During development, we noticed at least three different memory configurations that Kubernetes reported:

  • 63461008Ki, which is 60.5Gi
  • 64461008Ki, which is 61.47Gi
  • 65149128Ki, which is 62.13Gi

The Ki and Gi suffixes are powers of two: 1Gi == 1024Mi == 1024 * 1024 Ki. See the Kubernetes docs for reference.
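For example, the first value converts as 63461008 Ki / (1024 * 1024) ≈ 60.52 Gi. If you want to check this on your own cluster, the capacity each node reports can be listed with kubectl get nodes -o custom-columns=NAME:.metadata.name,MEMORY:.status.capacity.memory.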

As a consequence, nodes of the same type can have a memory difference of 1669Mi, which is huge. This wouldn't be a problem if we just increased the buffer in our calculations by a couple of gigabytes. However, that would be a pity, as we would love to give our users all the resources available on any given machine. In the end, we monitored the available resources for a while and used the lowest number in our calculations. Everything ran fine for 2 years, but in the last months we again noticed some scheduling issues.

The impact on the Kubernetes cluster autoscaler

The reason for these scheduling issues is that in the last couple of months, instances sometimes popped up with a couple of hundred MiB less memory than we used to see. The result is that the Kubernetes cluster autoscaler thought no new node of that specific AWS instance type could fit our newly scheduled pods, and thus stopped scaling up until it was restarted.

For the first instance, with 64Gi of memory, the overhead (red) and the requested memory (yellow) do not overlap. On an instance with a bit less memory, 63.5Gi, the overhead and the requested memory suddenly overlap (orange), which means the pod cannot be scheduled.

The reason for this can be seen in the picture above: when the remaining node memory (total capacity minus the overhead in red) is a bit smaller than the memory requested by our application, our pod can no longer be scheduled.
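To make this concrete with hypothetical numbers: if the overhead amounts to 3Gi, a pod requesting 61Gi fits on the node reporting 64Gi (61Gi remains available), but not on the node reporting 63.5Gi (only 60.5Gi remains).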

The result is that the autoscaler decides that scaling up will not help anymore, because the pod would not fit on any new node of that type: the autoscaler uses the node with the least memory as the template in future calculations. The only solution is to restart the autoscaler manually, because these values are stored in memory. Afterward, autoscaling can resume, at least once the node with insufficient memory has been removed from the cluster.
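In practice, that manual intervention comes down to restarting the autoscaler Deployment, for example with kubectl rollout restart (the exact Deployment name and namespace depend on how the cluster autoscaler was installed), so that it drops the values it cached in memory.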

How to make sure autoscaling continues

Since we saw this happening multiple times, we made the following changes:

  • When we see a node with less memory than statically configured, we update the memory used in our calculations. That way, the memory specification is corrected for the next application that needs to be scheduled.
  • We send out an alert so we know the statically configured memory is wrong. That way, we can correct it in our code and roll out the new value to all customers.

Doing it this way makes sure autoscaling keeps working without manual intervention.

Want to know more about Conveyor?

Thank you for reading; all of the above is fixed and implemented in Conveyor. If you want to know more, take a look at our website or our YouTube channel.

If you want to try it out, use the following link. From there you can get started for free in your own AWS account with a few easy steps.
