We have a few alerts for the most critical services which explicitly check that the service is not in a failed state. This requires us to use the systemd collector for node exporter, and it gathers a lot of series.
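Such an alert can be sketched as a Prometheus rule; the service name, duration, and labels below are placeholders, and the rule assumes node exporter runs with the systemd collector enabled:

```yaml
groups:
  - name: critical-services
    rules:
      - alert: ServiceFailed
        # node_systemd_unit_state comes from node exporter's systemd collector
        expr: node_systemd_unit_state{name="myservice.service", state="failed"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} is in a failed state"
```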
I’d missed this completely. Linux exposes the oom_kill value in /proc/vmstat. It has been there since kernel 2.6.36.
Node exporter reads it as the node_vmstat_oom_kill metric.
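For reference, here is a minimal sketch of reading the counter directly; the parser is illustrative and not part of node exporter:

```python
def read_oom_kill(vmstat_text: str) -> int:
    """Parse the oom_kill counter out of /proc/vmstat contents."""
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    raise KeyError("oom_kill not found (kernel older than 2.6.36?)")

# Typical usage on a Linux host:
# with open("/proc/vmstat") as f:
#     print(read_oom_kill(f.read()))
```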
The problem I’ve just solved was using $DS_PROMETHEUS as the datasource for a provisioned dashboard.
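One common workaround (a sketch; the field values are assumptions) is to define DS_PROMETHEUS as a hidden dashboard variable of type datasource, so panels referencing ${DS_PROMETHEUS} resolve it when the dashboard loads:

```json
{
  "templating": {
    "list": [
      {
        "name": "DS_PROMETHEUS",
        "type": "datasource",
        "query": "prometheus",
        "hide": 2
      }
    ]
  }
}
```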
Today I found that Prometheus keeps a history of all alerts (including the pending state), and it’s right there, inside Prometheus itself.
Up to this point we had used a rather complicated alerts logger, but, as it turned out, it was completely redundant.
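That history lives in the built-in ALERTS series, which carries an alertstate label; a couple of query sketches (the alert name is a placeholder):

```promql
# Every recorded alert instance, including the pending state:
ALERTS

# Only firing instances of a specific alert:
ALERTS{alertname="ServiceFailed", alertstate="firing"}
```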
Today I’m investigating an odd circular dependency between systemd units in openstack-ansible. One of the units under investigation implements a really interesting way to run database checks, so I decided to dig deeper to better understand that technique.
Sometimes there are sad moments when you realize you’ve chosen the wrong tech. It takes time to internalize a shiny new technology and build some expertise in it. That expertise may later yield only one meaningful insight…
My small development environment suddenly stopped producing performance data. I looked into a few places and quickly…
After this sad bug in Telegraf (some Ceph metrics are missing for newer versions of Ceph) I was delighted to find that Ceph (ceph-mgr, to be precise) can send metrics directly…
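This refers to the ceph-mgr prometheus module: once enabled with `ceph mgr module enable prometheus`, the manager serves metrics over HTTP (port 9283 by default), so Prometheus can scrape it without an intermediary collector. A scrape-config sketch, where the host name is a placeholder:

```yaml
scrape_configs:
  - job_name: ceph
    static_configs:
      # 9283 is the default port of the ceph-mgr prometheus module
      - targets: ["ceph-mgr-host:9283"]
```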
If you suspect that your task does not receive data, check the Kapacitor configuration file.
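In particular, make sure the InfluxDB connection and its subscriptions are configured; a kapacitor.conf sketch, where the URL, database, and retention-policy names are placeholders:

```toml
[[influxdb]]
  enabled = true
  name = "localhost"
  urls = ["http://localhost:8086"]

  # Subscriptions control which databases stream data into Kapacitor;
  # a task sees no points if its database is not subscribed.
  [influxdb.subscriptions]
    mydb = ["autogen"]
```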
Testing as a foundation for stable complexity. The more complex something is, the higher the demand for stability, and the one industry-proven way to achieve stability in large (complex) projects is testing.