We have a few alerts for the most critical services which explicitly check that the service is not in a failed state. This requires us to use the systemd collector for node exporter, and it gathers a lot of series.
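Such an alert can be sketched as a Prometheus rule; the service name, duration, and labels below are placeholders, and the rule assumes node exporter runs with the systemd collector enabled:

```yaml
groups:
  - name: critical-services
    rules:
      - alert: ServiceFailed
        # node_systemd_unit_state comes from node exporter's systemd collector
        expr: node_systemd_unit_state{name="myservice.service", state="failed"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} is in a failed state"
```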
I’d missed this completely. Linux exposes the oom_kill value in /proc/vmstat. It has been there since kernel 2.6.36.
Node exporter reads it as the node_vmstat_oom_kill metric.
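For reference, here is a minimal sketch of reading the counter directly; the parser is illustrative and not part of node exporter:

```python
def read_oom_kill(vmstat_text: str) -> int:
    """Parse the oom_kill counter out of /proc/vmstat contents."""
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    raise KeyError("oom_kill not found (kernel older than 2.6.36?)")

# Typical usage on a Linux host:
# with open("/proc/vmstat") as f:
#     print(read_oom_kill(f.read()))
```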
The problem I’ve just solved was using $DS_PROMETHEUS as the datasource for a provisioned dashboard.
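One common workaround (a sketch; the field values are assumptions) is to define DS_PROMETHEUS as a hidden dashboard variable of type datasource, so panels referencing ${DS_PROMETHEUS} resolve it when the dashboard loads:

```json
{
  "templating": {
    "list": [
      {
        "name": "DS_PROMETHEUS",
        "type": "datasource",
        "query": "prometheus",
        "hide": 2
      }
    ]
  }
}
```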
Today I found that Prometheus keeps a history of all alerts (including the pending state), and it’s right there, inside Prometheus itself.
Up to this point we had used a rather complicated alerts logger, but, as it turned out, it was completely redundant.
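That history lives in the built-in ALERTS series, which carries an alertstate label; a couple of query sketches (the alert name is a placeholder):

```promql
# Every recorded alert instance, including the pending state:
ALERTS

# Only firing instances of a specific alert:
ALERTS{alertname="ServiceFailed", alertstate="firing"}
```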
Today I’m investigating an odd circular dependency between systemd units in openstack-ansible. One of the units under investigation implements a really interesting way to run database checks, so I decided to dig deeper to better understand that technique.
Sometimes there are sad moments when you realize you’ve chosen the wrong tech. It takes time to internalize a shiny new technology and build some expertise in it. That expertise may later yield only one meaningful insight…
My small development environment suddenly stopped producing performance data. I looked into a few places and quickly…
After this sad bug in Telegraf (some Ceph metrics are missing for newer versions of Ceph) I was delighted to find that Ceph (ceph-mgr, to be precise) can send metrics directly…
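This refers to the ceph-mgr prometheus module: once enabled with `ceph mgr module enable prometheus`, the manager serves metrics over HTTP (port 9283 by default), so Prometheus can scrape it without an intermediary collector. A scrape-config sketch, where the host name is a placeholder:

```yaml
scrape_configs:
  - job_name: ceph
    static_configs:
      # 9283 is the default port of the ceph-mgr prometheus module
      - targets: ["ceph-mgr-host:9283"]
```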
If you suspect that your task does not receive data, check the Kapacitor configuration file.
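In particular, make sure the InfluxDB connection and its subscriptions are configured; a kapacitor.conf sketch, where the URL, database, and retention-policy names are placeholders:

```toml
[[influxdb]]
  enabled = true
  name = "localhost"
  urls = ["http://localhost:8086"]

  # Subscriptions control which databases stream data into Kapacitor;
  # a task sees no points if its database is not subscribed.
  [influxdb.subscriptions]
    mydb = ["autogen"]
```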
Testing as a foundation for stable complexity. The more complex something is, the higher the demand for stability, and the one industry-proven way to achieve stability in large (complex) projects is testing.