How to reduce your Prometheus cost
Prometheus is an excellent service for monitoring your containerized applications, but it can get expensive quickly if you ingest all of the kube-state-metrics metrics, and you are probably not even using them all. This is especially true when using a service like Amazon Managed Service for Prometheus (AMP), because you are billed by the number of metrics ingested and stored.
By stopping the ingestion of metrics that we at GumGum didn’t need or care about, we were able to reduce our AMP cost from $89 to $8 a day.
In this article, I will show you how we reduced the number of metrics that Prometheus was ingesting. We will install kube-prometheus-stack, analyze the metrics with the highest cardinality, and filter out the metrics that we don't need.
Install kube-prometheus-stack
We will be using kube-prometheus-stack to ingest metrics from our Kubernetes cluster and applications. We assume that you already have a Kubernetes cluster created. In my case, I'll be using Amazon Elastic Kubernetes Service (EKS).
First, add the prometheus-community helm repo and update it. Then create a namespace, and install the chart. I am pinning the version to 33.2.0 to ensure you can follow all the steps even after new versions are rolled out.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create ns prometheus
helm upgrade -i prometheus prometheus-community/kube-prometheus-stack -n prometheus --version 33.2.0
Analyze metrics
The next step is to analyze the metrics and choose a couple that we don't need. For this, we will use the Grafana instance that gets installed with kube-prometheus-stack.
# Access Grafana
kubectl port-forward service/prometheus-grafana 8080:80 -n prometheus

# Default user and password
admin
prom-operator
Grafana is not exposed to the internet; the first command creates a port-forward from your local computer to the Grafana service running in Kubernetes.
After that, you can navigate to localhost:8080 in your browser to access Grafana and use the default username and password.
Once you are logged in, navigate to "Explore" at localhost:8080/explore, enter the query topk(20, count by (__name__)({__name__=~".+"})), select Instant, and query the last 5 minutes.
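If you also want to see which scrape job each of those metrics comes from, a small variation of the same query (purely a convenience, not required for the rest of the steps) groups by job as well:

topk(20, count by (__name__, job)({__name__=~".+"}))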
You should see the metrics with the highest cardinality. In our example, we are not collecting metrics from our applications; these metrics are only for the Kubernetes control plane and nodes.
The first one is apiserver_request_duration_seconds_bucket, and if we search the Kubernetes documentation, we will find that the apiserver is a component of the Kubernetes control plane that exposes the Kubernetes API. However, because we are using the managed Kubernetes service from Amazon (EKS), we don't even have access to the control plane, so this metric is a good candidate for deletion.
The same applies to etcd_request_duration_seconds_bucket; we are using a managed service that takes care of etcd, so there is no value in monitoring something we don't have access to.
Stop ingesting metrics
For our use case, we don't need metrics about kube-apiserver or etcd, so in this case we can disable scraping for both components altogether. The helm chart values.yaml provides an option to do this.
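A minimal sketch of what that could look like in a prometheus.yaml values file is below. The kubeApiServer and kubeEtcd keys are the chart's toggles for those scrape targets; double-check them against the values.yaml of the chart version you are running.

# prometheus.yaml (helm values): stop scraping components we can't access on EKS
kubeApiServer:
  enabled: false
kubeEtcd:
  enabled: false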
If we need some metrics from a component but not others, we won't be able to disable the complete component. In that case, we need to use metric relabeling to drop the metrics we don't want (a blocklist) or keep only the ones we do (an allowlist).
Each component has its own metric_relabelings config, and we can query a metric to find out which component is scraping it and, therefore, which metric_relabelings section the rule belongs in. For example, a query for container_tasks_state will output the following columns:
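The original result table isn't reproduced here, but the key information is in the metric's labels. For a cAdvisor metric such as container_tasks_state they would typically look something like this (illustrative values, not taken from our cluster):

container_tasks_state{job="kubelet", metrics_path="/metrics/cadvisor", namespace="kube-system", node="ip-10-0-0-1.ec2.internal", state="sleeping"}

The job and metrics_path labels tell us the metric is scraped through the kubelet's cAdvisor endpoint, so the drop rule belongs in the kubelet section of the chart values.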
And the rule to drop that metric and a couple more would be:
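The exact rule from our deployment isn't shown here, but a sketch of what it could look like follows, with container_memory_failures_total standing in for the "couple more" (swap in whatever metrics your own analysis surfaced). In the kube-prometheus-stack chart, cAdvisor metrics scraped through the kubelet get their own cAdvisorMetricRelabelings list:

kubelet:
  serviceMonitor:
    # drop high-cardinality cAdvisor metrics we don't use
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        regex: container_tasks_state|container_memory_failures_total
        action: drop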
Apply the new prometheus.yaml file to modify the helm deployment:
helm upgrade -i prometheus prometheus-community/kube-prometheus-stack -n prometheus --version 33.2.0 --values prometheus.yaml
Summary
We installed kube-prometheus-stack, which includes Prometheus and Grafana, and started getting metrics from the control plane, nodes, and a couple of Kubernetes services. Then, we analyzed the metrics with the highest cardinality using Grafana, chose some that we didn't need, and created Prometheus rules to stop ingesting them.
Our final prometheus.yaml file would be:
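As a reference, here is a minimal version of that file, combining the disabled components from earlier with the cAdvisor relabeling sketch (the metric names are the illustrative ones used above; yours will come from your own cardinality analysis):

kubeApiServer:
  enabled: false
kubeEtcd:
  enabled: false
kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        regex: container_tasks_state|container_memory_failures_total
        action: drop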
After applying the changes, the metrics were not ingested anymore, and we saw cost savings.