Monitoring persistent volumes using Prometheus and node exporter in k8s

In this post I share how our disk usage monitoring with Prometheus and node exporter failed and left us with a full disk in production.

In Prometheus, disk usage can be monitored with node exporter. We used the k8s Helm charts to spin up Prometheus and ran node exporter as a DaemonSet on the k8s nodes. Everything worked fine until one day Prometheus stopped working and we weren’t getting any data. After some digging we realized that the Prometheus disk had filled up and that our alert for exactly that condition had not fired.

The rule, written in the Prometheus 1.x rule syntax, was:

ALERT DiskWillFillIn16Hours
IF predict_linear(node_filesystem_free{device=~"/dev/xv.+"}[16h], 16*3600) < 0
FOR 10m
LABELS {
severity="page"
}
ANNOTATIONS {
summary = "Node will run out of disk space soon.",
description = "{{ $labels.node }} will be soon out of disk space.",
}

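One of the coolest things about Prometheus is that it allows prediction-based monitoring of resources; predict_linear is one example. The rule above fits a line through the last 16 hours of free-space samples and fires if that line predicts less than zero free space 16 hours (16*3600 seconds) from now.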
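
To make the arithmetic behind predict_linear concrete, here is a simplified sketch in shell and awk. It is not Prometheus's implementation: the function name predict_free_bytes and the sample values are made up for illustration, and it extrapolates from the last sample rather than from the rule evaluation time. It only shows the idea of fitting a least-squares line and reading a future value off it.

```shell
#!/bin/bash
# Simplified illustration of the idea behind predict_linear.
# Input on stdin: one "timestamp_seconds free_bytes" pair per line.
# Argument: how many seconds past the last sample to extrapolate.
# The function name and input format are hypothetical.
predict_free_bytes() {
  local horizon="$1"
  awk -v horizon="$horizon" '
    NF >= 2 { n++; t[n] = $1; v[n] = $2; sum_t += $1; sum_v += $2 }
    END {
      if (n < 2) { print "need at least two samples" > "/dev/stderr"; exit 1 }
      mean_t = sum_t / n
      mean_v = sum_v / n
      for (i = 1; i <= n; i++) {
        cov += (t[i] - mean_t) * (v[i] - mean_v)
        var += (t[i] - mean_t) * (t[i] - mean_t)
      }
      if (var == 0) { print "timestamps must differ" > "/dev/stderr"; exit 1 }
      # Least-squares slope is cov / var; evaluate the fitted line
      # at (last timestamp + horizon).
      printf "%.0f\n", mean_v + cov * (t[n] + horizon - mean_t) / var
    }'
}
```

For example, printf '%s\n' "0 1000" "3600 900" "7200 800" | predict_free_bytes 57600 prints -800: the disk loses 100 bytes per hour, so 16 hours after the last sample the fitted line is below zero, which is the condition the rule alerts on.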

Our k8s cluster runs in AWS, and all our EBS volumes show up as devices whose names start with /dev/xv, which is what the device matcher in the rule selects.

We tested the rule again by filling up the hard disk, and the rule fired. We were confused as to why it hadn’t fired for the Prometheus volume. That is when we started doing some research and ran into this, from the node_exporter README:

The node_exporter is designed to monitor the host system.
It’s not recommended to deploy it as Docker container because it requires access to the host system.
https://github.com/prometheus/node_exporter/issues/47
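
One way to check which filesystems an exporter actually reports is to look at its node_filesystem_free series. The sketch below is an assumption-laden helper, not something from our setup: the function name list_monitored_filesystems is made up, and it assumes the exporter writes labels in the order device, fstype, mountpoint. If the device behind a volume is missing from the output, a rule on node_filesystem_free can never fire for it.

```shell
#!/bin/bash
# Hypothetical helper: reads node exporter text-format metrics on stdin and
# prints "device mountpoint" for every node_filesystem_free series.
# Assumes the device label appears before the mountpoint label on each line.
list_monitored_filesystems() {
  grep '^node_filesystem_free{' \
    | sed -E 's/.*device="([^"]*)".*mountpoint="([^"]*)".*/\1 \2/' \
    | sort -u
}

# Usage, assuming the exporter listens on localhost:9100:
#   curl -s http://localhost:9100/metrics | list_monitored_filesystems
```

Running this against both the containerized exporter and one running directly on the host should make any difference in visible devices obvious.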

We learnt this the hard way. We use kops to bring up the cluster, so we ended up building a custom AMI with Packer that has node exporter installed on the base image, and then applying that image to the instance groups in kops.

Here is the script we use in Packer to install and start node exporter:

#!/bin/bash
# This script runs node exporter directly on the host instead of in a container.
# More information: https://github.com/prometheus/node_exporter/issues/474
#
# From the nodeexporter repo in the README.md
# The node_exporter is designed to monitor the host system.
# It's not recommended to deploy it as Docker container because it requires access to the host system.
#
set -e

wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz
sudo tar -xf node_exporter-0.14.0.linux-amd64.tar.gz -C /opt
sudo mv /opt/node_exporter-0.14.0.linux-amd64/ /opt/node_exporter
# Write the systemd unit file; the quoted heredoc keeps the regexes literal.
sudo tee /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter

[Service]
ExecStart=/opt/node_exporter/node_exporter -collector.diskstats.ignored-devices="^(ram|loop|fd)\d+$" -collector.filesystem.ignored-mount-points="^/(sys|proc|dev)($|/)"

[Install]
WantedBy=default.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter.service
sudo systemctl start node_exporter.service
curl http://localhost:9100/metrics
rm node_exporter-0.14.0.linux-amd64.tar.gz
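
One caveat with the script above: it queries the metrics endpoint right after starting the service, and because of set -e a failed curl aborts the script. If the exporter has not opened its port yet, the image build may fail. A minimal sketch of a retry wrapper that could guard against this follows; the function name retry and its arguments are hypothetical and not part of our original script.

```shell
#!/bin/bash
# Hypothetical helper: run a command until it succeeds, trying at most
# `attempts` times and sleeping `delay` seconds between tries.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Running the command inside `if` keeps `set -e` from aborting on failure.
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage in place of the bare curl call:
#   retry 10 1 curl -sf -o /dev/null http://localhost:9100/metrics
```

With ten attempts one second apart, the exporter gets roughly ten seconds to come up before the build is declared failed.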