Live monitoring with Telegraf and Grafana

Dmitry Shnayder
8 min read · Apr 2, 2024


I have an automated and connected home. My thermostat talks to the weather forecast. My light switches talk to my Apple Watch. My car keeps a record of its movements. All of that runs on a single fanless computer sitting in my basement. While it doesn’t require any maintenance, I still want to keep an eye on it. Does it have enough memory? How busy is the CPU? Does it overheat?

Since the computer holds some sensitive information, I don’t want to allow any additional access to it, so the usual Prometheus scraping doesn’t excite me. I prefer the box to export metrics itself. This way no ports need to be opened, there is less security to worry about, and the box can sit behind NAT.

There is an open-source solution for collecting and pushing various metrics: Telegraf. It has many plugins for collecting metrics from different sources, processing them, and publishing them to many types of destinations. In particular, Telegraf has plugins for collecting temperature as well as CPU and memory usage.

The obvious solution for displaying the collected metrics is Grafana. Based on my experience with TeslaMate, there are unlimited possibilities to visually represent stats using Grafana. I found that Grafana, since version 8, supports receiving a live stream of events from Telegraf. The flow of messages would be:
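Sketched roughly (the last leg is how Grafana Live delivers updates to open dashboards):

```
Telegraf (home server) --HTTP push--> Grafana Live API --WebSocket--> dashboard in browser
```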

A separate Telegraf process collects the required metrics (CPU and memory usage, temperature) and pushes them to another computer, where the dashboards are displayed. So the server machine doesn’t require any additional access or open ports. The Grafana box, even if it gets hacked, will only give an attacker information about my CPU temperature. Not a big deal.

I start with the Grafana box. The simplest way of getting Grafana up and running is docker compose. On a Linux box, in an empty directory, create a file named docker-compose.yaml with this content:
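A minimal sketch of the compose file; the image tag and the volume name are my choices, adjust to taste:

```yaml
services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"                     # web interface, used below
    volumes:
      - grafana-data:/var/lib/grafana   # dashboards survive image updates
    restart: unless-stopped

volumes:
  grafana-data:
```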

I decided to give the Grafana container a separate Docker volume, so I can update or change the container image without losing my dashboards.

Then start the container: docker-compose up -d and check if it’s running: docker container ls

Let’s open the web interface and log in to Grafana. Open http://ipaddress:3000

The default username is admin, and the default password is also admin.

Grafana asks to change the default password; let’s do it.

If everything went well, then we get the welcome screen.

Let’s leave Grafana alone for a while and install Telegraf. First, find a Linux box with a Golang compiler and clone the Telegraf Git repository:

git clone git@github.com:influxdata/telegraf.git

Then build Telegraf:

cd telegraf
make

It may take a while to build all the goodies provided with Telegraf. When make is done, there will be a binary file called telegraf in the current directory. Copy it to the computer where metrics need to be collected. I copied the binary to /usr/local/bin.

Next, we need to configure Telegraf to collect CPU, memory, and temperature metrics. Create a file /etc/telegraf/telegraf.conf with this content:
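A sketch of such a config; the plugin and option names are standard Telegraf ones, but tune the values to your setup (the authorization header is still missing here and is added a few steps later):

```toml
[agent]
  interval = "1s"        # how often metrics are collected
  flush_interval = "1s"  # how often they are pushed out

# CPU usage, aggregated over all cores
[[inputs.cpu]]
  percpu = false
  totalcpu = true

# Memory usage
[[inputs.mem]]

# Temperature sensors
[[inputs.temp]]

# Push everything to Grafana Live over HTTP
[[outputs.http]]
  url = "http://192.168.8.5:3000/api/live/push/kvm"
  data_format = "influx"
```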

In my case 192.168.8.5 is the IP address of the Grafana box. The path “/api/live/push/kvm” contains two parts:

  • /api/live/push — fixed part, indicating to Grafana that this is live data
  • kvm — identifies the topic. Choose any reasonable topic name here.

The collection and reporting intervals are both set to 1 second. This gives practically live data.

One thing still missing here is an authorization token. We don’t want anybody who knows the IP address of the Grafana box to be able to update the content of our dashboards. To create an authorization token, let’s get back to the Grafana web interface. Navigate to Administration -> Users and Access -> Service Accounts

Click “Add service account”. Let’s name the account “telegraf” and give it a “Viewer” role:

Then we need to add a service account token:

Click “Add service account token”, pick any name for the new token, and click “Generate token”

A long string will be generated.

Click “Copy to clipboard and close”. We need to add this token to telegraf.conf.
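In telegraf.conf the token goes into the headers of the HTTP output, so that every push is authorized (the token below is a placeholder; real service account tokens start with glsa_):

```toml
[[outputs.http]]
  url = "http://192.168.8.5:3000/api/live/push/kvm"
  data_format = "influx"
  [outputs.http.headers]
    # paste the generated token after "Bearer"
    Authorization = "Bearer glsa_XXXXXXXXXXXXXXXXXXXX"
```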

We are done with the Telegraf config. Let’s start Telegraf. I prefer to define Telegraf as a systemd service. To do that, I create a file /etc/systemd/system/telegraf.service with this content:

[Unit]
Description=Telegraf
StartLimitIntervalSec=0
[Service]
ExecStart=/usr/local/bin/telegraf
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

Then enable and start telegraf service:

systemctl daemon-reload
systemctl start telegraf
systemctl enable telegraf

Let’s check if Telegraf is running: systemctl status telegraf

Everything looks good. Let’s see what we get in Grafana. Click “Dashboards” -> “Create dashboard”

We need to add the first visualization (what I used to call a chart).

Now we need to choose a datasource. There is a “-- Grafana --” predefined datasource on the right side of the screen; this is where our live data comes in. Click on it.

Initially it shows some sample data. To see what Telegraf sends us, select “Live Measurements” as the “Query type”

Then select a channel

At the bottom of the list there are three channels. A channel name consists of three parts:

  • stream — indicates that the data comes from live streams
  • kvm — this is what I chose to put in the path in the Telegraf config
  • cpu, mem, temp — the particular metric collected by Telegraf

Let’s start with the memory utilization chart. Select “stream/kvm/mem”. Telegraf collected a lot of memory metrics:

Let’s show only the percentage of used memory. In the “Fields” drop-down select “used_percent”

The chart starts displaying a timeseries of used memory:

With live data a timeseries is not very useful, as it only covers the time the dashboard has been open. Let’s change the visualization to something more live. In the top-right corner open the selection of visualizations:

I like how gauges are displayed in Grafana; let’s use one for memory usage.

Looks better. However, a chart titled “Panel Title” showing 25.7 is not too informative. Let’s change the title. In the right pane find “Panel options” -> “Title” and name the chart “Memory usage”

Then scroll down to “Standard options” and set “Unit” to “Percent”

Looks better:

Let’s apply the new chart to the dashboard. Click “Apply” in the top-right corner. Now we have a dashboard with one chart on it.

Let’s add more. Click “Add” -> “Visualization”

Now select channel “stream/kvm/cpu”

There are a lot of metrics, but no total CPU usage:

We need to calculate it as 100 - usage_idle. Select the “Transform data” tab, then click “Add transformation”

Out of the variety of transformations, we need “Calculation”

The mode will be “Binary operation”, the “Operation” is “100 - usage_idle”, and as the alias (result field name) I chose “usage_total”. Enable “Replace all fields”

Select visualization type “Gauge”. On the “Query” tab select “All fields”, and here we go, we get a CPU usage gauge:

As with memory usage, update the title and unit and apply. Now we have two charts.

The last chart (visualization) that I need to add is temperature. As usual, click “Add” -> “Visualization”, select query type “Live Measurements”, and select channel “stream/kvm/temp”. Looks like I have nine temperature sensors in this fanless box.

I don’t want to monitor nine gauges. Let’s create one that shows the maximum temperature across all sensors. We will add a data transformation, similar to CPU usage, but instead of “Binary operation” we will select the “Reduce row” mode. In the “Operation” field, select all “temp” metrics. Select the “Max” calculation and name the resulting field “max_temp” in the “Alias” field. Select “Replace all fields” to use only the calculated maximum temperature for display.

Change the chart title to “Maximum temperature” and select the unit “Celsius”. I also specified the expected minimum and maximum temperature to keep the relative position of the gauge consistent:

To make the chart more lively, I specified thresholds for cold / normal / hot:

We are done. Let’s save the dashboard.

Charts can be repositioned on the dashboard by dragging on the title.

Now I have the critical utilization metrics of my home automation box at a glance. Looks great. The next step will be adding historical data.
