GumGum Tech Blog

How to reduce your Prometheus cost

Prometheus is an excellent tool for monitoring your containerized applications. Still, it can get expensive quickly if you ingest every metric that kube-state-metrics exposes, and you are probably not even using them all. This is especially true when using a service like Amazon Managed Service for Prometheus (AMP), because you are billed for the metrics you ingest and store.

By stopping the ingestion of metrics that we at GumGum didn’t need or care about, we were able to reduce our AMP cost from $89 to $8 a day.

Figure: Chart showing that the cost of Amazon Managed Service for Prometheus became much lower.

In this article, I will show you how we reduced the number of metrics that Prometheus was ingesting. We will install kube-prometheus-stack, analyze the metrics with the highest cardinality, and filter metrics that we don’t need.

Install kube-prometheus-stack

We will be using kube-prometheus-stack to ingest metrics from our Kubernetes cluster and applications. We assume that you already have a Kubernetes cluster created. In my case, I’ll be using Amazon Elastic Kubernetes Service (EKS).

First, add the prometheus-community helm repo and update it. Then create a namespace, and install the chart. I am pinning the version to 33.2.0 to ensure you can follow all the steps even after new versions are rolled out.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create ns prometheus
helm upgrade -i prometheus prometheus-community/kube-prometheus-stack -n prometheus --version 33.2.0

Analyze metrics

The next step is to analyze the metrics and pick a few that we don’t need. For this, we will use the Grafana instance that gets installed with kube-prometheus-stack.

# Access Grafana
kubectl port-forward service/prometheus-grafana 8080:80 -n prometheus

# Default user and password
admin
prom-operator

Grafana is not exposed to the internet, so the first command forwards port 8080 on your local machine to the Grafana service running in Kubernetes.

After that, navigate to localhost:8080 in your browser and log in to Grafana with the default username and password.

Once you are logged in, navigate to “Explore” (localhost:8080/explore), enter the following query, select Instant, and query the last 5 minutes:

topk(20, count by (__name__)({__name__=~".+"}))

You should see the metrics with the highest cardinality. In our example, we are not collecting metrics from our applications; these metrics are only for the Kubernetes control plane and nodes.

The first one is apiserver_request_duration_seconds_bucket. If we search the Kubernetes documentation, we find that the API server is the control plane component that exposes the Kubernetes API. However, because we are using Amazon’s managed Kubernetes service (EKS), we don’t even have access to the control plane, so this metric is a good candidate for deletion.

The same applies to etcd_request_duration_seconds_bucket; we are using a managed service that takes care of etcd, so there isn’t value in monitoring something we don’t have access to.

Stop ingesting metrics

For our use case, we don’t need metrics about kube-apiserver or etcd, so we can disable scraping for both components entirely. The Helm chart’s values.yaml provides an option to do this.
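A minimal sketch of what this could look like in the values file, assuming the kubeApiServer.enabled and kubeEtcd.enabled toggles from the chart's values.yaml (verify them against the chart version you pinned):

```yaml
# prometheus.yaml (values override for kube-prometheus-stack)

# Stop scraping the Kubernetes API server.
kubeApiServer:
  enabled: false

# Stop scraping etcd.
kubeEtcd:
  enabled: false
```

With both toggles off, the chart should no longer set up scraping for either component, so none of their metrics are ingested.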

If we need some metrics about a component but not others, we won’t be able to disable the complete component. In that case, we need to do metric relabeling to add the desired metrics to a blocklist or allowlist.
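As a rough sketch of the allowlist form, the fragment below keeps only two kubelet metrics and discards everything else from that endpoint. The two metric names are illustrative picks rather than a recommendation, and the placement under kubelet.serviceMonitor.metricRelabelings assumes the layout of the chart's values.yaml.

```yaml
kubelet:
  serviceMonitor:
    metricRelabelings:
      # Allowlist: keep only series whose metric name matches the regex;
      # every other series scraped from this endpoint is discarded.
      # Prometheus anchors relabeling regexes, so each alternative must
      # match the whole metric name. The names below are examples only.
      - sourceLabels: [__name__]
        regex: "kubelet_running_pods|kubelet_running_containers"
        action: keep
```

A blocklist has the same shape with action: drop, which discards the matching metrics and keeps everything else. An allowlist keeps costs predictable when new metrics appear, while a blocklist is the safer choice when you are not yet sure what your dashboards depend on.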

Each component has its own metric relabeling section in the chart values, and the labels on a metric tell us which component scraped it and therefore which section to edit. For example, a query for container_tasks_state returns series with labels such as job and metrics_path, which identify the scrape target the metric came from.

And the rule to drop that metric and a couple more would be:
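A hedged sketch of such a rule follows. It assumes container_tasks_state is scraped from the kubelet's cAdvisor endpoint, which the chart configures under kubelet.serviceMonitor.cAdvisorMetricRelabelings. The two additional metric names are illustrative choices, so substitute whatever your own cardinality query surfaces.

```yaml
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      # Blocklist: drop any series whose metric name matches the regex.
      # Only container_tasks_state comes from the text above; the other
      # two names are examples of cAdvisor metrics you might also drop.
      - sourceLabels: [__name__]
        regex: "container_tasks_state|container_memory_failures_total|container_network_tcp_usage_total"
        action: drop
```

Metric relabeling runs after the scrape but before ingestion, so dropped series never reach storage and are not billed.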

Apply the new prometheus.yaml file to modify the helm deployment:

helm upgrade -i prometheus prometheus-community/kube-prometheus-stack -n prometheus --version 33.2.0 --values prometheus.yaml


We installed kube-prometheus-stack, which includes Prometheus and Grafana, and started getting metrics from the control plane, nodes, and a couple of Kubernetes services. Then, we analyzed the metrics with the highest cardinality using Grafana, chose some that we didn’t need, and created relabeling rules to stop ingesting them.

Our final prometheus.yaml file would be:
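Putting the pieces together, a values file along these lines would disable both control plane scrapes and drop the unwanted cAdvisor metrics. This is a sketch under stated assumptions, not a verbatim production file: the value names are taken from the chart's values.yaml, and container_memory_failures_total is an illustrative addition.

```yaml
# prometheus.yaml (values override for kube-prometheus-stack)

# Disable scraping of managed control plane components.
kubeApiServer:
  enabled: false
kubeEtcd:
  enabled: false

# Drop selected cAdvisor metrics before they are ingested.
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        regex: "container_tasks_state|container_memory_failures_total"
        action: drop
```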

After applying the changes, the metrics were not ingested anymore, and we saw cost savings.
