We have a few alerts for the most critical services that explicitly check that the service has not failed. This requires us to use the systemd collector for node exporter, and it gathers a lot of series.
Tools:
I’ve missed this completely. Linux exposes the oom_kill value in /proc/vmstat. It has been there since 2.6.36.
Node exporter reads it as the node_vmstat_oom_kill metric.
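To see what that counter looks like before node exporter picks it up, here is a minimal Python sketch that pulls a single value out of /proc/vmstat-style text. The function name `read_vmstat_counter` and the sample numbers are made up for illustration; the only thing taken from the note above is the `oom_kill` field name.

```python
from typing import Optional


def read_vmstat_counter(text: str, name: str) -> Optional[int]:
    """Return the value of a `name value` line in /proc/vmstat-style text."""
    for line in text.splitlines():
        parts = line.split()
        # Each /proc/vmstat line is a counter name followed by one integer.
        if len(parts) == 2 and parts[0] == name:
            return int(parts[1])
    # The counter is absent (for example, on a kernel that does not expose it).
    return None


# Hypothetical sample; real hosts have many more lines.
sample = "nr_free_pages 12345\noom_kill 3\npgfault 987654\n"
print(read_vmstat_counter(sample, "oom_kill"))
```

On a Linux host the same function can be fed the real file, e.g. `read_vmstat_counter(open("/proc/vmstat").read(), "oom_kill")`.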
The problem I’ve just solved was how to use $DS_PROMETHEUS as the datasource for a provisioned dashboard.
Today I found that Prometheus keeps a history of all alerts (including the pending state), and it’s right there, inside Prometheus itself.
Up to this point we had used a rather complicated alerts logger, but it turned out to be completely redundant.
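The note does not say where exactly this history lives; I assume it is the synthetic `ALERTS` series, which Prometheus writes with `alertname` and `alertstate` labels. Under that assumption, a rough Python sketch for building a range query against the HTTP API could look like this. The base URL, the timestamps and the helper name `alerts_history_url` are placeholders, not something from the original setup.

```python
from typing import Optional
from urllib.parse import urlencode


def alerts_history_url(base_url: str, start: int, end: int,
                       step: str = "60s", state: Optional[str] = None) -> str:
    """Build a /api/v1/query_range URL for the ALERTS series."""
    # Optionally narrow the selector to "pending" or "firing" alerts.
    query = "ALERTS" if state is None else f'ALERTS{{alertstate="{state}"}}'
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical local Prometheus and an arbitrary one-hour window (Unix seconds).
url = alerts_history_url("http://localhost:9090", 1700000000, 1700003600,
                         state="pending")
print(url)
```

Fetching that URL (with `curl` or `urllib.request`) should return one series per alert that was pending in the window, which is roughly what a separate alerts logger would have recorded.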
I’ve tried to find a way to gather metrics from iSCSI TGT. I found nothing. No metrics, no exporters, no counters. The documentation is silent, the internet is silent. It looked like there are no metrics from tgt-admin, tgtadm, or tgtd at all.
Templating YAML files that contain Go-inspired string interpolation has been an endless pain. The best-known culprits are Prometheus and Kapacitor. Prometheus stands out the most because it uses YAML as its format…