Reducing cardinality load from node_systemd_unit_state

George Shuklin
OpsOps
Published in
1 min read · Jun 16, 2024

We have a few alerts for the most critical services, which explicitly check that a service is not failed. This requires us to use the systemd collector for node exporter, and it gathers a lot of series.
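The alerts in question are along these lines. This is a hypothetical sketch, not our actual rules: the unit names, the `for` duration, and the severity label are all examples.

```yaml
groups:
  - name: critical-services
    rules:
      - alert: CriticalServiceFailed
        # node_systemd_unit_state is 1 for the unit's current state,
        # so this fires when a watched unit is in the 'failed' state.
        expr: node_systemd_unit_state{state="failed", name=~"(sshd|chronyd)\\.service"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} on {{ $labels.instance }} has failed"
```

Note that the alert only ever looks at state="failed" — which is exactly why the other states are safe to drop.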

Like a lot lot. On one installation I found that it accounts for 50% of all series in that specific Prometheus.

Whilst I can’t just throw it away (we have alerts), I can reduce it 5-fold.

Here’s how: when node exporter is scraped, we can drop all series for the metric node_systemd_unit_state that convey states other than ‘failed’ (like ‘activating’, etc.). There are a lot of them, and dropping all but a few states means a giant reduction in total cardinality.

The best place to drop excessive series is metric_relabel_configs.

metric_relabel_configs:
  - source_labels: [__name__, state]
    separator: '_'
    regex: node_systemd_unit_state_[^f].*
    action: drop

Note this silly trick, which avoids the problems of the ‘keep’ action (which would drop everything else, including unrelated metrics). We join the values of two labels (__name__ and state) with ‘_’, and check if the resulting string starts with node_systemd_unit_state_ but is not node_systemd_unit_state_f… (which would include node_systemd_unit_state_failed); if it matches, we drop it, essentially removing series like

node_systemd_unit_state{
  environment="staging",
  instance="10.10.10.10:9100",
  job="node-exporter",
  name="accounts-daemon.service",
  service="node-exporter",
  state="activating",
  type="dbus"
}

which create crazy cardinality. The solution does not remove the problem entirely, but it reduces it 5x, which is more than enough for most practical purposes.
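The join-then-match logic can be simulated in a few lines of Python, which makes the trick easier to verify before touching a live config. This is my own sketch, not Prometheus code; the one Prometheus behavior it relies on is that relabel regexes are anchored (full match).

```python
import re

# Same regex as in the metric_relabel_configs rule above.
PATTERN = re.compile(r"node_systemd_unit_state_[^f].*")

def keep_series(labels: dict) -> bool:
    # Prometheus joins source_labels values with the separator ('_' here),
    # then applies the (anchored) regex; 'drop' removes matching series.
    joined = "_".join([labels.get("__name__", ""), labels.get("state", "")])
    return not PATTERN.fullmatch(joined)

# state="failed" survives, because 'f' is excluded by [^f]:
assert keep_series({"__name__": "node_systemd_unit_state", "state": "failed"})
# every other state of this metric is dropped:
assert not keep_series({"__name__": "node_systemd_unit_state", "state": "activating"})
# unrelated metrics are untouched (joined string never matches the prefix):
assert keep_series({"__name__": "node_cpu_seconds_total", "mode": "idle"})
```

The asserts pass as written, confirming that only the failed-state series survives the rule while everything else about the scrape is left alone.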

Further reductions may include additional filtering on the name label. Most people do not need thousands of series like name="blockdev@dev-sda15.target", name="getty-pre.target", name="systemd-ask-password-wall.path", etc.
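One possible sketch of such a rule, using the same join trick on the name label. The set of unit suffixes to drop here is an example only; which units are safe to discard depends entirely on what you alert on.

```yaml
metric_relabel_configs:
  # Drop node_systemd_unit_state series for unit types we never alert on.
  # Adjust the suffix list to your own needs before using this.
  - source_labels: [__name__, name]
    separator: '_'
    regex: 'node_systemd_unit_state_.*\.(target|path|slice|scope)'
    action: drop
```

Since the relabel regex is anchored, this matches joined strings like node_systemd_unit_state_getty-pre.target while leaving .service units untouched.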



I work at Servers.com, most of my stories are about Ansible, Ceph, Python, Openstack and Linux. My hobby is Rust.