Reducing Cloud Costs of Kubernetes Clusters

Iya Lang · Published in adidoescode · 11 min read · Jun 28, 2024

Introduction

With the global economy facing challenges and environmental concerns growing, optimizing cloud resource usage has become a hot topic. By efficiently managing resources in Kubernetes clusters, companies can not only save money, but also reduce the clusters’ carbon footprint.

There are many great open-source tools that can help achieve that.

In this article, we will explain how we at adidas managed to reduce the costs of running Kubernetes clusters in AWS by up to 50%. We will talk about the tools we used, how we configured them, and the challenges we encountered during the early stages.

The article is intended for readers who have some experience with Kubernetes. While some details are AWS-specific, most of the approaches described here work with other cloud providers as well.

Interior of the adidas Flagship Store in London, from adidas archive

Getting cheaper EC2 instances

To lower the costs of EC2 instances, we started using Karpenter — a cluster autoscaler that provisions and removes nodes based on application demand. At the moment, it supports AWS only, although it may be possible to use it with other cloud providers in the future.

Karpenter allows the use of instances of various types and sizes and selects the most suitable instances to ensure the nodes are utilized as efficiently as possible. It also removes underutilized nodes and moves workloads to smaller, cheaper instance types whenever possible (a process called consolidation).

Another important feature of Karpenter is its ability to use spot instances. Spot instances in AWS are unused compute capacity that is offered at a significantly lower price compared to regular on-demand instances. Karpenter searches for spot instances with the lowest price and the lowest risk of interruption (i.e. the chance of being removed when AWS needs them back).
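To illustrate, a minimal Karpenter NodePool that mixes spot and on-demand capacity and has consolidation enabled could look roughly like the sketch below (the pool name, instance categories, and the referenced EC2NodeClass are placeholders, and the exact API version depends on the Karpenter release you run):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Let Karpenter choose between spot and on-demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Allow a wide range of instance types so the cheapest fit can be picked
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  disruption:
    # Remove underutilized nodes and repack workloads onto cheaper instances
    consolidationPolicy: WhenUnderutilized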

Creating VPAs automatically

Then, we focused on improving resource utilization for applications running in our clusters. This included optimizing container requests and limits, and adjusting the number of replicas when the applications are idle.

To optimize resource usage, we started creating Vertical Pod Autoscalers (VPAs) automatically for all workloads in development and staging clusters.
There are several tools that offer this functionality, but most either require a subscription or lack important customization options. We aimed for a solution that was free, flexible, and easy to maintain.

To generate default VPAs, we decided to use Kyverno. Kyverno is a policy engine that lets you validate, mutate, and generate Kubernetes resources.
Kyverno is often used to improve application security, and we were already using it for that purpose. Using a security tool to create VPAs might seem a bit unconventional, but it worked great in this use case.

For each new Deployment, StatefulSet, or DaemonSet, our Kyverno policy checks the following:

  • Whether the resource already has an HPA or another VPA.
  • If automatic VPA creation is allowed for this resource and its namespace. We control this using a dedicated label.

If these conditions are met, Kyverno generates a new VPA for the resource.

The policy manifest is available on GitHub.
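For a rough idea of the policy's shape, here is a simplified sketch of a Kyverno generate rule (the auto-vpa label name is a made-up example, and the checks for existing HPAs and VPAs are omitted; the actual policy on GitHub covers more cases):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-default-vpa
spec:
  rules:
    - name: create-default-vpa
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - DaemonSet
      preconditions:
        all:
          # Skip resources that opted out via a label ("auto-vpa" is a hypothetical name)
          - key: '{{ request.object.metadata.labels."auto-vpa" || ''enabled'' }}'
            operator: NotEquals
            value: disabled
      generate:
        apiVersion: autoscaling.k8s.io/v1
        kind: VerticalPodAutoscaler
        name: '{{ request.object.metadata.name }}'
        namespace: '{{ request.object.metadata.namespace }}'
        synchronize: true
        data:
          spec:
            targetRef:
              apiVersion: apps/v1
              kind: '{{ request.object.kind }}'
              name: '{{ request.object.metadata.name }}'
            updatePolicy:
              updateMode: Auto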

Auto VPA creation process

Setting default values

However, there is one challenge: How should we configure the VPAs without knowing anything about the applications? While we want to save money, we also don’t want to break the applications.

A VPA can control either resource requests only, or both requests and limits.
Adjusting requests only can help handle occasional resource usage spikes: lower requests reduce costs, while the original limits help prevent CPU throttling and OOM issues during spikes.
However, if user-defined limits are below what the application needs, a VPA that controls requests only cannot raise requests above those limits. In such cases, it might be beneficial to control both requests and limits.

Both approaches are valid and the choice depends on the use case. We decided to control requests only to avoid disruptions during usage spikes.

For the minAllowed values, we selected some very small values, such as 10 millicores for CPU and 32 megabytes for memory. This allows applications to scale down in case the original requests are too high.

For maxAllowed values, there are three main options:

  1. Setting them to the original requests specified in the resource.
    This way, the VPA would only be able to decrease the requests, not increase them. This wouldn't work well if the user-defined requests are too low, because the VPA wouldn't be able to solve the issues caused by them. However, even if the original requests are not high enough, the application would simply keep its original values, just as it would without any VPA. This means that creating a VPA wouldn't make the situation worse.
  2. Setting them to the original limits specified in the resource.
    In this case, the VPA would be able to both decrease and increase requests. This makes it possible to deal with CPU throttling and can also make nodes more stable by preventing them from running out of memory. Such OOM scenarios might occur when there is a big gap between the memory limits and the requests, and the memory usage is closer to the limits than to the requests. By setting requests closer to actual usage, the VPA ensures that all applications on the node have enough memory to run.
    The downside of setting maxAllowed to the limits is that it can increase the costs of some applications. Some application teams intentionally set low CPU requests in non-production clusters because they prioritize cost over performance.
  3. Setting them to some high values.
    This approach would ignore resource requests and limits set by users and consider only resource usage. This might lead to increased costs. If an application has memory leaks or is not well optimized, it would be harder for application developers to notice and address these issues.
    This approach would work only if the VPA controls both requests and limits, or if limits are not defined at all. The selected maxAllowed values should be lower than the capacity of the nodes in the cluster.

The choice depends on the nature of the applications running in the cluster, the environment (production or non-production), and whether you prioritize cost savings over application performance. In our policy, maxAllowed is set to the original requests.

The first two approaches work for a single container, but what should we do if the application has multiple containers? How do we set maxAllowed in such cases?

Unfortunately, there is no easy way to calculate the sum of resource requests and limits or to find the highest requests in a Kyverno policy. We also can’t simply use the values set for the first container, because they might be significantly lower than those of other containers, and this could potentially break the application. Therefore, we decided not to specify maxAllowed at all when the application has multiple containers.
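Putting these choices together, a generated VPA for a hypothetical single-container Deployment might look roughly like this (the name is a placeholder, and the maxAllowed values, which mirror the container's original requests, are illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        # Only requests are adjusted; user-defined limits stay untouched
        controlledValues: RequestsOnly
        minAllowed:
          cpu: 10m
          memory: 32Mi
        # Copied from the container's original requests (illustrative values)
        maxAllowed:
          cpu: 500m
          memory: 512Mi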

Results

Overall, creating default VPAs worked well for most applications, but there are certain limitations to consider.

First, a VPA can’t be combined with an HPA that scales on CPU or memory, because both would try to scale the application based on the same resource metrics. The two can only work together when the HPA uses custom metrics, a topic we address later in the article.

Second, VPAs might not work well with older Java applications, since a VPA can’t adjust the heap size passed as parameters to the container. It’s also difficult to get accurate memory usage metrics of JVM-based workloads, because the reported usage might differ from the actual usage.

Third, there are some applications, such as performance testing tools, that need to run without disruptions and would encounter issues if VPA tries to restart them. While this problem can be mitigated by using the VPA in Initial mode, our default VPAs are configured to use Auto mode.

This is why it’s important to give application teams the option to opt out. They can do so by adding a specific label to the resources they want to exclude from the VPA creation process.
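For example, assuming the exclusion label is called auto-vpa (the real label name may differ), opting out would look like this:

metadata:
  labels:
    auto-vpa: disabled # hypothetical label; excludes this resource from automatic VPA creation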

Creating default VPAs helped us to achieve 30% savings in CPU and memory across all development and staging clusters.

Resource requests before and after VPAs creation

The picture above shows what happened in one of our biggest staging clusters when we created VPAs for most of the workloads.

Scaling in outside of office hours

Another measure that helped us to reduce compute hours (and therefore costs and CO2 footprint) was decreasing the number of replicas for all applications during non-office hours, when nobody is using them.

To achieve this, we use kube-downscaler. It’s a tool designed to scale applications in and out based on a predefined schedule. To start scaling an application, one needs to add a couple of annotations to its manifest or to the entire namespace. Here is an example:

annotations:
  downscaler/downtime-replicas: "1"
  downscaler/uptime: "Mon-Fri 08:00-19:00 Europe/Berlin"

In this example, the application would be scaled down to 1 replica during nights (between 7 pm and 8 am) and on weekends.

By default, we scale applications to 1 replica, but application teams have the option to scale to 0 replicas, adjust the uptime/downtime window, or opt out entirely.

It’s important to note that if an HPA is in place and the application needs to be scaled to 0, the annotation should be added to the Deployment or StatefulSet rather than the HPA.
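For instance, to scale an HPA-managed application all the way down to 0 replicas outside office hours, the annotations could be placed on the Deployment itself (the name and schedule are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  annotations:
    downscaler/downtime-replicas: "0"
    downscaler/uptime: "Mon-Fri 08:00-19:00 Europe/Berlin"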

The following graph shows the number of nodes over a 2-week time frame in one of our staging clusters. One can see that the number drops significantly during the nights and then increases again in the mornings. The two longer periods with a reduced number of nodes are weekends.

The impact of kube-downscaler on the number of nodes

Scaling based on external metrics

Sometimes, resource metrics don’t accurately reflect the actual load that an application is experiencing.

Additionally, HPAs do not allow scaling to 0 replicas based on resource metrics. The reason is simple: an application with 0 pods doesn’t use any resources and therefore produces no resource metrics, so the HPA has no data on which to base a scaling decision.

In such cases it would be helpful to scale based on other types of metrics.

To scale based on external metrics, we are using KEDA. KEDA stands for Kubernetes Event-driven Autoscaling, and it lets you scale applications based on metrics from many different sources, including Prometheus and Apache Kafka.

Furthermore, with the right metric, KEDA allows applications to scale to 0 replicas. In this case, it’s important to use metrics that originate from sources other than the application itself, since, as mentioned earlier, an application with 0 replicas cannot produce metrics. A common use case is scaling Kafka consumers based on consumer group metrics such as consumer lag.
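As a sketch of such a setup, a KEDA ScaledObject that scales a Kafka consumer based on consumer lag could look roughly like this (the deployment name, bootstrap servers, topic, and thresholds are placeholders):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-consumer
spec:
  scaleTargetRef:
    name: example-consumer # the Deployment to scale
  minReplicaCount: 0 # allow scaling down to zero when there is no lag
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092
        consumerGroup: example-consumer-group
        topic: example-topic
        lagThreshold: "50" # target lag per replica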

The use of custom metrics also makes it possible to use both HPA and VPA simultaneously. This allows for vertical scaling based on resource metrics and horizontal scaling based on custom metrics.

Ensuring that nodes actually get removed

After implementing all these measures, we quickly encountered a problem: we had a lot of half-empty nodes that weren’t getting removed. It turned out that something was preventing Karpenter from removing underutilized nodes.

A lot of applications had Pod Disruption Budgets (PDBs) configured in a way that didn’t allow any of their pods to be evicted and rescheduled.

This happened because those applications were using production configuration in development clusters. For instance, if an application only has one pod (a common scenario in non-production environments), but the PDB is configured to ensure that 60% of pods are up at all times, the deletion of this single pod would be blocked. As a result, many nodes remained almost empty but could not be removed because some pods on those nodes had PDBs and therefore couldn’t be rescheduled.

To solve this issue, we implemented a Kyverno policy that checks all new PDBs to ensure the following:

  • minAvailable value is not 100%. Setting minAvailable to 100% would require all pods of the application to be up at all times, preventing any pods from being removed.
  • maxUnavailable value is not 0 or 0%. Similar to the previous case, setting maxUnavailable to 0 would prevent any pods from being rescheduled.
  • The application has more than 1 replica. If only one replica is present, even a PDB with minAvailable set to 1 would block the eviction of that single pod.
  • If an HPA is present, its minReplicas value is higher than 1. Setting it to 1 would result in only a single pod being present during off-peak hours, and the PDB would block its deletion.

You can find the full policy manifest on GitHub.
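As an example, a PDB that satisfies these checks for an application running two or more replicas could look like this (the name and labels are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app
spec:
  maxUnavailable: 1 # an absolute value greater than 0 always leaves room for evictions
  selector:
    matchLabels:
      app: example-app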

At the moment, Kyverno doesn’t offer functionality to calculate the allowed disruptions in PDBs. While our policy blocks most of the problematic PDB configurations, it still misses some of them.

For example, the policy checks if minAvailable is not set to 100%. This rule can be easily circumvented by setting the value to 99%. The PDB would pass the policy check and be created, even though it would still prevent any disruptions (unless the application has at least 100 replicas) and block node deletion.

To address this, we implemented a cleanup policy that takes care of the remaining “wrong” PDBs.

---
apiVersion: kyverno.io/v2alpha1
kind: ClusterCleanupPolicy
metadata:
  name: pdb-cleanup
spec:
  match:
    any:
      - resources:
          kinds:
            - PodDisruptionBudget
  conditions:
    any:
      - key: '{{ target.status.disruptionsAllowed }}'
        operator: Equals
        value: 0
  schedule: "15 10,17 * * *" # run twice a day, at 10:15 and 17:15

This cleanup policy runs twice a day and removes all PDBs that currently allow zero disruptions.

Important note: This cleanup should not run at the same time as cluster upgrades. During an upgrade, even valid PDBs may temporarily allow zero disruptions while pods are being rescheduled, and the cleanup policy might delete these valid PDBs.

Conclusion

The measures described above were automatically implemented in non-production clusters only, except for the PDB policies, which were deployed everywhere. The implementation took us three months, and now we are paying 50% less in monthly costs across all development and staging clusters.

In production clusters, we are using an opt-in model rather than an opt-out approach, meaning that we let application teams decide which of these tools they wish to use and how they want to configure them.

When starting cost optimization, one should keep the following things in mind:

  • Check node capacity
    If all containers request fewer resources, more pods fit on each node. In our case, after implementing these changes, each node was running twice as many pods as before. As a result, there wasn’t enough disk space to store all the container images. It’s important to ensure that nodes have enough capacity to handle the increased number of pods.
  • Find the right settings
    By default, the VPA recommender uses the 90th percentile for both CPU and memory. Initially, we decided to set this value to the 80th percentile to maximize savings. While this was sufficient for most applications, for some, the recommended values turned out to be too low. We ended up reverting to the default 90th percentile. In terms of cost savings, the difference between using the 80th and 90th percentiles in the VPA configuration was not very significant.
  • Inform the users
    Implementing these measures doesn’t require any actions from the application teams. However, all cluster users need to be informed about the changes. Without this information, it might be very difficult to determine the cause of disruptions in case of incidents.
  • Monitor everything
    It’s good to have proper monitoring in place to be able to measure the impact. Examples of useful metrics include monthly, weekly, and daily costs of EC2 instances, the number of nodes in the clusters, the number of pods per node, the total amount of requested resources, and the total node capacity.

Finally, cost optimization is a continuous process that requires constant adjustments.

This concludes the article. We hope our experience inspires you to try out these tools and adapt them to your specific needs, making your cloud infrastructure more cost-efficient and eco-friendly.

The views, thoughts, and opinions expressed in the text belong solely to the author, and do not represent the opinion, strategy or goals of the author’s employer, organization, committee or any other group or individual.
