How to save money fast with Kubernetes — Do FinOps

Xavier Baude
ADEO Tech Blog
Nov 8, 2022
Credit https://www.nasa.gov/ — The Far Side of the Moon

At ADEO, my team is responsible for managing large Kubernetes clusters, mostly in cloud environments. Now that our apps are deployed, it is time to optimize consumption and cost!

Cloud operating costs are not always the major preoccupation of app developers. Many factors come into play, making cost evaluation as opaque as the far side of the moon. We propose a set of tools supporting the FinOps methodology. The key is to monitor each app’s operational requirements and fine-tune the infrastructure to avoid over-provisioning.

In this paper, we describe how to save money on the cloud and share tips on cloud infrastructure optimization.

Step 1: Measure and log your resource consumption

To understand what you gain or lose, you must measure how much the app costs, how many resources it consumes, and who consumes those services.

For each app, the operations team works out the set of metrics to measure the usage and the cost of running the app. On GCP, this amounts to CPU and memory consumption in our current simplified model. For the next year, we will work on more precise metrics.

We created a tracking system that monitors detailed resource consumption hour by hour, using custom scripts deployed as CronJobs inside each cluster. We then aggregate these values into a BigQuery database, enriching the data with details on Business Units, platforms, domains, products, etc. Month after month, our tool logs consumption and proposes recommendations based on real consumption versus reservation.

Internal invoice tool illustration

Of course, everything can be filtered and we now have several years of history.
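A minimal sketch of such a collector, assuming a hypothetical container image and dataset name (the image, namespace, schedule, and arguments below are illustrative, not our actual setup):

```yaml
# Hypothetical hourly consumption logger (illustrative names only)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: consumption-logger
  namespace: finops
spec:
  schedule: "0 * * * *"   # run at the top of every hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: consumption-logger
          restartPolicy: OnFailure
          containers:
            - name: logger
              image: example.registry.dev/finops/consumption-logger:latest
              # Script reads cluster metrics and appends rows to BigQuery
              args: ["--sink=bigquery", "--dataset=finops_consumption"]
```

Running the collector inside each cluster keeps the pipeline simple: every cluster pushes its own rows, and BigQuery becomes the single place where cross-cluster aggregation and enrichment happen.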

Step 2: Optimize pod configuration based on resource requests and limits

The main optimization is related to pod configuration: resource requests and limits.

In addition to the invoice view presented above, we offer an operational dashboard implemented with Datadog, allowing us to centralize all Kubernetes metrics. The same metrics can also be collected with open-source tools like Prometheus and Grafana, although with more setup effort.
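If you collect metrics with Prometheus instead, the usage-versus-reservation ratio behind these dashboards can be sketched as a single query; this assumes the standard cAdvisor and kube-state-metrics metric names are being scraped:

```promql
# Average CPU usage as a fraction of CPU requests, per namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
```

A value well below 0.4 signals over-provisioning, exactly the situation the dashboards below surface.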

Pods are configured with CPU and memory requests. We take the example of a rather classic Java app managed by an ADEO team:

    resources:
      requests:
        cpu: "2"
        memory: 2Gi
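The same block can also declare limits. A sketch of what a requests-plus-limits configuration might look like for this app (the values are illustrative): when requests equal limits for every container, the pod gets the Guaranteed QoS class, which protects it from eviction but reserves even more capacity.

```yaml
resources:
  requests:
    cpu: "2"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 2Gi
```

Whether to set limits at all, and how far above requests, is a trade-off between protection against noisy neighbors and the over-reservation this article is trying to reduce.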

After deploying this pod, the operational dashboard shows how much CPU and memory Kubernetes actually reserves:

The graph on the left shows CPU reservation vs usage. On average, the pods are using only 5.52% of the CPU reserved. The graph on the right shows Memory reservation vs usage. On average, the pods are using 46.08% of the amount of memory reserved.

Clearly, we are over-reserving resources and wasting money on unused capacity in the cloud. We then work with the Product teams to reduce the resource reservation of each app and find a good balance between reservation and effective average usage across different workloads. For example, we change the previous application's values to:

    resources:
      requests:
        cpu: 200m
        memory: 1100Mi

By decreasing the resource reservation in the pod specification (the blue graph) from 1 core to 0.6 on average, the reservation and the effective average usage are now close, while still keeping a reasonable safety margin. As a sanity check: 5.52% of the original 2-core request is about 110m, so a 200m request still leaves roughly 2x headroom, and 46.08% of 2Gi is about 940Mi, which the 1100Mi request covers with margin. Finding the right values requires testing, measuring, and more testing under different loads.

Step 3: Repeat for all apps

We evangelize teams with dashboards and key metrics, so they can immediately see the gains. We also organize meetings, set up notifications, and so on. Our goal is to make users aware of the savings possible by fine-tuning resource parameters.

In addition to the resource graphs, our complete Datadog dashboard includes two tables.

Datadog dashboard, full view

The thresholds we have chosen are:

  • 0 -> 40% = RED: you are over-provisioning resources; pod requests are much higher than actual usage. Consider reducing the requests in your deployment.
  • 40 -> 100% = GREEN: you have an optimized configuration; usage and requests are close enough.
  • 100% and beyond = YELLOW: you consume more resources than you reserved for your pod. Consider increasing the requests; otherwise your pods may be penalized by the Kubernetes Quality of Service rules during scheduling and eviction.
Thresholds chosen for FinOps

We know that the application evolves over time: its consumption changes during the day and at night. The proposed 0–40%, 40–100%, and +100% thresholds are arbitrary and serve as a guide for the teams.

Going further

FinOps doesn’t just stop at fine-tuning pod resources on Kubernetes. To go further, you may consider:

  • A more accurate “internal invoicing” dashboard. This dashboard is sent to the product teams directly, so they know the costs hour by hour.
  • Decrease pod provisioning at night and on weekends, depending on your real business requirements. By keeping the apps online only from 8:00 a.m. to 7:00 p.m., Monday to Friday, instead of 24/7, we save 70% on the total bill.
  • Use Kubernetes autoscaling. The HPA can start with only 1 replica and scale up as the load increases. You might also consider external metrics better suited to your needs. We also started using the WPA (Watermark Pod Autoscaler, from Datadog) as an alternative to the HPA for some use cases.
  • Set up LimitRange and ResourceQuota objects on namespaces to cap the resources and the number of pods within a namespace.
  • Check whether or not the “safe-to-evict” option is activated on the cluster. It may impact your application, so be sure to read the documentation carefully.
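A few of these ideas sketched as manifests; all names and values are illustrative and assume a Deployment called `my-app`:

```yaml
# HPA: start at 1 replica and scale out on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
---
# LimitRange: default requests/limits for containers that set none
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
---
# ResourceQuota: cap total pods and requested resources per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
spec:
  hard:
    pods: "20"
    requests.cpu: "10"
    requests.memory: 20Gi
```

For the eviction point, the behavior is controlled per pod through the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation, which tells the cluster autoscaler whether it may evict the pod when draining a node to scale down.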

I hope these tips will be useful for you! I’d love to talk with you about this: don’t hesitate to contact me on LinkedIn!

Written by Xavier Baude
Tech Lead at ADEO. I've loved Kubernetes since it was minion 😻.