Kubernetes PromQL (Prometheus Query Language) CPU aggregation walkthrough

Ami Mahloof
7 min read · Apr 23, 2018


Note:
I am pretty much a beginner with PromQL, but I have been using a lot of Graphite and InfluxDB queries. This is how I understand this query language, and if I am wrong on any part, please feel free to comment and I'll correct it.

Prometheus comes with its own query language called PromQL. Understanding PromQL is difficult, let alone its scary syntax, especially if you are supposed to come up with queries on your own.

I'm not going to cover how to install and configure Prometheus here (the easiest way is via a Helm chart). Instead, I'm going to walk you through the parts of a query until we get the desired output.

I like to approach huge and scary tasks by breaking them into small chunks; it helps me understand exactly what I'm doing while building confidence that I'll accomplish the goal.

So let's break it into very small chunks:

What do we have:
Multiple pods running as a ReplicaSet / Deployment, often spread across multiple hosts.

What do we want to monitor:
We want to aggregate the cpu usage by a pod label called app.

Where do we start:
cAdvisor (Container Advisor) monitors and exports resource usage (CPU, memory, etc.) as metrics. Every Kubernetes node runs the kubelet, which has cAdvisor compiled into it.

For the sake of simplicity, all the pods we want to collect metrics on are in the same namespace.

The tools:

  • Prometheus server, port-forwarded to the local computer.
  • A simple cURL with jq; that's all you really need.
  • Grafana for visualization.

Query building:

What we want to end up with is a graph of CPU usage per deployment:

Step 1 — Get the cpu usage per container:

We'll start with the simplest metric, container_cpu_usage_seconds_total.
If we run just this, we'll get all the containers in all namespaces, and the value is a cumulative count of CPU seconds used, which isn't what we want. We want to aggregate these in order to get a "CPU per second" metric.

Note how the metric name container_cpu_usage_seconds_total has a suffix of seconds_total; this indicates that the metric is an accumulator (a counter). If we want to get the usage per second, we need to add a function that will produce that.

The rate(v range-vector) function takes a range vector (a selector with a [time] range appended, e.g. [5m]) and calculates the per-second average rate of increase over that time window.

Now we can filter by a namespace and get the CPU usage per second, averaged over 5 minutes.
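As a sketch, assuming the namespace is called my-namespace (replace it with your own), the query looks like this:

rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])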

You can use the following to run your query from the command line:

  1. Make sure you have a port forward to your Prometheus server
  2. Run the following command:
curl -s http://127.0.0.1:9090/api/v1/query\?query\=container_cpu_usage_seconds_total | jq
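
For step 1, a port forward might look like the following; the service name prometheus-server and port 80 are assumptions based on the stable Prometheus Helm chart, so adjust them to match your installation:

kubectl port-forward svc/prometheus-server 9090:80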

which returns a result like this:
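The exact values will differ, but the shape of the response is roughly the following (trimmed to a single series, with illustrative label values):

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "container_cpu_usage_seconds_total",
          "cpu": "cpu00",
          "namespace": "my-namespace",
          "pod_name": "my-app-5b8d6c4f7d-abcde",
          "container_name": "my-app"
        },
        "value": [1524470000.781, "12345.67"]
      }
    ]
  }
}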

A few things to notice here:

  1. The cpu label value is cpu00, which means the containers might be running on different CPUs too.
  2. Each metric only has the pod_name label but is missing the pod's own labels, which means we don't have a label that can aggregate all the pods of a deployment / ReplicaSet.

If we get the number of pods for that namespace using kubectl get pods, we will get 11 pods, but the metrics above will show more entries, since we are looking at containers, not pods.

To aggregate the results we got by pod_name, we wrap the query in the sum() function with a by (pod_name) clause, as shown below.
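A sketch, with the same placeholder namespace:

sum(rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])) by (pod_name)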

Now the number of results will match the number of pods we saw using kubectl get pods, and we will get the CPU usage by pod.

If you want, feel free to add another label, "instance" (comma separated), to the sum by section, which will break the results down by host as well; see the example below.
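For example (still with the placeholder namespace):

sum(rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])) by (pod_name, instance)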

However, we are not entirely where we want to be.
We want to combine all the pods' cpu_usage per deployment / ReplicaSet, and the common element that every pod belonging to a deployment has is its labels.
Here the pod label we want is app, so now we need to find the labels of the pods in order to further aggregate them.

Step 2 — Get the pod labels:

We can get the pod labels via kube-state-metrics by running the query:
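Something like this, again assuming the placeholder namespace:

kube_pod_labels{namespace="my-namespace"}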

Labels do not produce a metric value, so the metric simply has the value of 1 (it exists).

Notice that here we have another label, "pod", which holds the same value as "pod_name" from the container_cpu_usage_seconds_total query; we can use that as the joining element between both queries.

We don't care about all the other labels right now; all we need is label_app and pod.

If we want to get only a few labels in a result, we need to use the by clause (similar to GROUP BY in SQL). In order to do that, it has to be part of an aggregation function, so we will use the max() function to return these labels; using max() will keep the values at 1.
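So the query becomes something like:

max(kube_pod_labels{namespace="my-namespace"}) by (label_app, pod)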

Step 3 — Prepare the results for a one-to-many match:

Now we have 2 sets of results:

  1. The many side — the cpu usage results containing a list of metrics, each with a label of “pod_name”.
  2. The one side — the pod labels results containing a list of metrics, each with a label of “pod” which is the same value as the “pod_name” from the previous results.

One set of results from kube_pod_labels, as opposed to many results for the cpu_usage.

The problem is that when trying to match (join) these 2 result sets, PromQL needs the matching labels to exist with the same name in both sets, or else the combined result will be empty (no match).

In order to do that, we need to replace the label pod_name with pod.
To do that, we will use the following function:

label_replace(
<vector_expr>, "<desired_label>", "$1", "<existing_label>", "(.+)"
)

So let's plug our values into the placeholders:
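With our CPU query as the vector expression, pod as the desired label, and pod_name as the existing label, the result is roughly this (namespace placeholder as before):

label_replace(
  sum(rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])) by (pod_name),
  "pod", "$1", "pod_name", "(.+)"
)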

The name label_replace is a bit misleading, as it essentially adds a label rather than replacing one.

Step 4 — Join the results (one-to-many match):

First, we need to understand that each result set is a vector: we are looking at a set of time series, not a single one, so in order to join them we need some sort of binary operation.

Prometheus has the following operators:

  • Arithmetic Binary Operators: +, -, *, /, %, ^
  • Comparison Binary Operators: ==, !=, >, >=, <, <=
  • Logical Binary Operators: and, or, unless

So basically, our query will take the pod labels, which have a value of 1, and since all we want to end up with is these pod labels combined with the cpu_usage labels, we simply need to multiply one vector by the other (multiplying by 1 leaves the CPU values unchanged).

The matching syntax is:

<vector expr> <bin-op> ignoring/on(<label list>) group_left/right(<label list>) <vector expr>

Let’s break it up:

  1. Start with the kube pod labels on the left
  2. Multiply (*)
  3. on (pod) — since we want to match only by the pod label and ignore the rest of the labels.
  4. group_right(label_app) — this is the actual join (if a match is found): the labels passed to group_right are explicitly added from the one side (the kube pod labels) to the result.

    The choice of group_left / group_right is determined by which result set has the higher cardinality, meaning the one with more metric label variations; the group modifier points toward the "many" side. Here we use group_right, since the cpu_usage metrics (on the right-hand side) have the higher cardinality. See the sketch after this list.
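
Putting the join together, it looks roughly like this (still with the placeholder namespace):

max(kube_pod_labels{namespace="my-namespace"}) by (label_app, pod)
  * on (pod) group_right(label_app)
label_replace(
  sum(rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])) by (pod_name),
  "pod", "$1", "pod_name", "(.+)"
)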

We end up with what we wanted: the pod label for every pod name, together with the cpu_usage value. However, we're not quite there; we still need to aggregate those results by label_app.

To do that, we wrap the entire query with sum() by (label_app).

The final query:
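Putting all the pieces together, it should look roughly like this (with my-namespace standing in for your own namespace):

sum(
  # the "one" side: pod labels from kube-state-metrics
  max(kube_pod_labels{namespace="my-namespace"}) by (label_app, pod)
  * on (pod) group_right(label_app)
  # the "many" side: per-pod CPU usage, with pod_name copied to pod
  label_replace(
    sum(rate(container_cpu_usage_seconds_total{namespace="my-namespace"}[5m])) by (pod_name),
    "pod", "$1", "pod_name", "(.+)"
  )
) by (label_app)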

The visualization part — Grafana:

Open Grafana, create a new dashboard or edit an existing one, and add a graph panel to it.

Plug in the query we wrote.

As the results contain the key label_app and its value, each series will appear in Grafana as {label_app="redash-celery-scheduler-python-daemon"}.

In the Legend section, add a template so that only the value of label_app is shown:
{{ label_app }}

This will display just the value of label_app for each series.

This is it. There is much more to learn about PromQL, but I do hope that you enjoyed this walkthrough and now understand how to use it with more confidence.
