Monitoring Factom nodes with Grafana — an introduction.

Most of the Factom Authority Node Operators (ANOs for short) monitor their servers and factomd nodes in one way or another.

At The Factoid Authority (TFA) we have mainly used our internally developed open-source “TFA-bot” for this purpose, and it has served us well.

“The Factoid Authority Bot, 
Monitors your nodes, 
Wakes you up when it all goes tits up.”

Its purpose is to alert the ANO (and others who are running factomd) if the node stops responding, and it does this very well, consistently sending alert messages on Discord as well as triggering phone calls to the engineers on duty if the node crashes or becomes unavailable.

However, it does not provide an easily accessible overview of general node health while factomd is running happily, nor of the server itself.

During network incidents and node upgrades we have used simple tools such as htop to monitor relevant performance data. This works, but not very well. For example, htop doesn’t store historical data, you have to SSH into the box in question (and keep the connection alive), and it’s hard to compare data across nodes. In addition, you cannot produce relevant data for Factom Core Committee investigators debugging network issues after the fact.

So what is the solution that alleviates the above issues? Well there are many of course; but the one we will talk about today is using Grafana in conjunction with Prometheus and node_exporter.

Grafana: Dashboard for querying and displaying the relevant metrics.
Prometheus: Database software scraping and storing the information.
Node_exporter: Daemon running on servers exposing statistics to Prometheus.
Simple diagram showing the monitoring system layout. Note: It is common to run Prometheus and Grafana on the same server.

What is not included in the above diagram is the fact that factomd itself, by default, also acts as an endpoint for Prometheus, serving up interesting factomd data such as the number of p2p connections, block height, queues, garbage collection and the number of goroutines. This guide also describes how to include these stats in the monitoring.

The Grafana dashboard includes user authentication and may be exposed to the world (after being properly configured and secured!) to give all team members easy access to the monitoring system. If this is not desired, other solutions include SSH tunneling or running Grafana locally.


What is being monitored?

Node_exporter provides a ton of different statistics and metrics, but we will focus on monitoring the following server-stats for now:
 — CPU usage (%)
 — System load
 — Network traffic
 — Disk space used

This dashboard shows server metrics for TFA’s Factom block explorer over the past 15 minutes.
At the click of a button, nodes can be added to or removed from the dashboard. Here the Factom Testnet explorer and the Mainnet voting daemon for the on-chain voting system have been added to the mix.
The time range displayed is easily accessible via a panel like this. By default Prometheus stores data for 15 days.

For factomd all stats are currently included in the dashboard and available in Grafana. The presentation differs slightly from the server-stats provided by node_exporter in that factomd-stats for all nodes are displayed in the same graph:

Notice the garbage collection in the bottom-graph. A testnet factomd-node was “stuck” after experiencing high load, and had to be restarted to function properly (12 hours of data showing).

In addition to the above monitoring we will set up Discord alerts that are triggered under the following conditions:
- Any monitored server has less than 20% free space left on root-filesystem.
- Any monitored server has CPU utilization above 50% for 20 minutes.
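
The second condition only triggers when the threshold is exceeded continuously for the whole period, so short spikes are ignored. The following is a minimal Python sketch of that idea, not Grafana's actual implementation. It assumes samples arrive as (timestamp in seconds, CPU percent) pairs; the function name and the sample format are illustrative.

```python
from typing import Iterable, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float = 50.0,
                     window_s: float = 20 * 60) -> bool:
    """Return True if the value stays above threshold for at least window_s seconds."""
    run_start = None  # timestamp of the first sample in the current above-threshold run
    for ts, value in sorted(samples):
        if value > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= window_s:
                return True
        else:
            # Any sample at or below the threshold resets the run.
            run_start = None
    return False
```

With one sample per minute, 21 consecutive samples above 50% span a full 20 minutes and return True, while a spike interrupted by a single low sample returns False.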

These alerts are mainly provided as examples; and a host of different alerts can be set up and tweaked as necessary. There are also a ton of different notification channels available as shown below:

Multiple options are available for issuing alerts; from simple email-alerts to full PagerDuty integration.

As TFA is not using Grafana alerts for issues requiring immediate action, we decided to go with Discord for Grafana alerts, as it is slightly less intrusive than an email or a direct message via PagerDuty etc.


TL;DR

Some of you may already know and operate Grafana, and just want to get to the juicy bits. Well here they are:

  • Factomd monitoring dashboard: ID = 10008
  • Server monitoring dashboard: ID = 10007
  • Alerting dashboard: ID = 10009

Prerequisites for setting up monitoring:

  • A server to host Prometheus and Grafana.
  • At least one server running factomd
  • Ports 9100 and 8090 on the factomd machine reachable from the Prometheus/Grafana machine.
  • Access to ports 9090 and 3000 on the server hosting Prometheus/Grafana.
  • Around 30 minutes of time to spare.
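
It can save time to confirm that the ports listed above are reachable before debugging anything else. Below is a minimal Python sketch using only the standard library; the function name port_open is illustrative and not part of any tool used in this guide.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run on the Prometheus server, port_open("IP-OF-BOX", 9100) and port_open("IP-OF-BOX", 8090) should both return True once node_exporter and factomd are running and the firewall allows the connection.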

Installing Prometheus

There are multiple ways of installing Prometheus, and most of them are detailed here in the official Prometheus documentation. We suggest that everyone follow those official installation instructions, but if you just want to run a quick test in a safe and unimportant environment, you can follow the instructions below at your own risk.

  • Create directories to host the Prometheus data and the Prometheus configuration file:
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
  • Give yourself access to those folders:
sudo chown USERNAME:USERNAME /etc/prometheus
sudo chown USERNAME:USERNAME /var/lib/prometheus
  • Download and unpack Prometheus (note: verify that you are downloading the latest version here):
cd ~
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.8.1/prometheus-2.8.1.linux-amd64.tar.gz
tar xvf prometheus-2.8.1.linux-amd64.tar.gz
  • Copy the Prometheus files to their destination:
sudo cp prometheus-2.8.1.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.8.1.linux-amd64/promtool /usr/local/bin/
sudo cp -r prometheus-2.8.1.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-2.8.1.linux-amd64/console_libraries /etc/prometheus
  • Remove the temporary installation files:
rm -rf prometheus-2.8.1.linux-amd64.tar.gz
rm -rf prometheus-2.8.1.linux-amd64
  • Create and open the Prometheus configuration file:
sudo nano /etc/prometheus/prometheus.yml
  • Paste the following text into the configuration file:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
#
#
#
#------LOCALHOST START
#  - job_name: 'HOSTNAME-Exporter'
#    scrape_interval: 5s
#    static_configs:
#      - targets: ['localhost:9100']
#------LOCALHOST END
#
#
#------NODE_EXPORTERS START
  - job_name: 'EXTERNAL-HOSTNAME-Exporter'
    static_configs:
      - targets: ['IP-OF-BOX:9100']
#------NODE_EXPORTERS END
#
#
#------FACTOMD-MONITORING STARTS
  - job_name: 'EXTERNAL-HOSTNAME-Factomd'
    static_configs:
      - targets: ['IP-OF-BOX:8090']
#------FACTOMD-MONITORING ENDS

In the above config file two things need to be amended:

  • Substitute “EXTERNAL-HOSTNAME” with the name of the server you want to monitor.
  • Substitute “IP-OF-BOX” with the IP of the server you want to monitor.
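
When several servers are monitored, the same two job entries are repeated for each of them. As an illustration, this small Python sketch renders those entries from a mapping of server name to IP address. The function name, the host name and the address are hypothetical, and the output is meant to be pasted under scrape_configs by hand.

```python
def scrape_jobs(hosts: dict) -> str:
    """Render a node_exporter job and a factomd job for each name -> IP pair."""
    lines = []
    for name, ip in hosts.items():
        for suffix, port in (("Exporter", 9100), ("Factomd", 8090)):
            # Spaces only: YAML does not accept tabs for indentation.
            lines.append(f"  - job_name: '{name}-{suffix}'")
            lines.append("    static_configs:")
            lines.append(f"      - targets: ['{ip}:{port}']")
    return "\n".join(lines)


# Hypothetical host name and address, for illustration only.
print(scrape_jobs({"factom-node-1": "203.0.113.10"}))
```

Each server produces six lines: one job for port 9100 (node_exporter) and one for port 8090 (factomd).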

Note that the format is .yml and it requires a very specific syntax to operate, with the exact indentation provided in the sample config file. Use of tabs is strictly forbidden!
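
Two pitfalls are easy to introduce when copying a config from a web page: tab characters and typographic (curly) quotes. The Python sketch below flags both. It is only a narrow helper with an illustrative function name; promtool, which was copied to /usr/local/bin earlier, offers a "check config" subcommand for validating the whole file.

```python
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"  # left/right single and double typographic quotes


def lint_yaml_text(text: str) -> list:
    """Return (line_number, problem) tuples for tabs and curly quotes."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "tab character"))
        if any(ch in line for ch in CURLY_QUOTES):
            problems.append((number, "curly quote"))
    return problems


# Example: a tab on line 2 and a curly quote on line 3 are both reported.
sample = "scrape_configs:\n\t- job_name: 'a'\n  - job_name: \u2018b'\n"
print(lint_yaml_text(sample))  # [(2, 'tab character'), (3, 'curly quote')]
```

To check the real file, read /etc/prometheus/prometheus.yml as UTF-8 text and pass its contents to lint_yaml_text.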

Pro-tip: The localhost (the server you are installing Prometheus on) can also be monitored by installing node_exporter locally and uncommenting the lines between LOCALHOST START and LOCALHOST END.

  • Create a Prometheus service-file:
sudo nano /etc/systemd/system/prometheus.service
  • Insert the following text in the service file you created 
    (note: replace YOUR-USERNAME with your actual username):
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=YOUR-USERNAME
Group=YOUR-USERNAME
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
  • Reload the system daemon:
sudo systemctl daemon-reload
  • Start Prometheus:
sudo systemctl start prometheus
  • Verify that Prometheus is running:
sudo systemctl status prometheus

If Prometheus has started correctly you will see the following (hit q to quit):

Look for the text “active (running)”. If you see this all is good.

Most likely you will not see the above message, however, as the .yml configuration file turns out to be quite hard to get right on the first try. The error message presented will give an indication of what’s wrong.

Pro-tip: Check whether there is an empty (blank) line at the very end of the config file. This is a very common issue, and the line should be removed.

  • When you have verified that Prometheus is in fact running properly you should enable the Prometheus service to boot at startup:
sudo systemctl enable prometheus
  • Navigate to localhost:9090 (Prometheus dashboard) and verify that the Prometheus dashboard is available.

If it does not show up, something is wrong and you should figure out what before moving on to the next step :-)

Pro-tip: After making changes to /etc/prometheus/prometheus.yml (adding new servers to monitor, amending current ones) you need to restart the service with the following command: “sudo systemctl restart prometheus”.

A smart way to make this more convenient is to add the following to your .bash_aliases file and then use the shortcuts to amend the .yml file as well as restart Prometheus:

alias prometheus-config='sudo nano /etc/prometheus/prometheus.yml'
alias prometheus-restart='sudo systemctl restart prometheus'

Installing Node_exporter

Node_exporter needs to be installed on the servers to be monitored. It is a rather simple affair, much like the Prometheus installation but with less need for configuration. The official installation instructions are available here, but a quick-start guide is provided below, also at your own risk.

  • Download node_exporter:
curl -LO https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
  • Unpack it:
tar xvf node_exporter-0.17.0.linux-amd64.tar.gz
  • Copy the binary to /usr/local/bin:
sudo cp node_exporter-0.17.0.linux-amd64/node_exporter /usr/local/bin
  • Create a service configuration file:
sudo nano /etc/systemd/system/node_exporter.service
  • Insert the following text in the configuration file
    (note: replace YOUR-USERNAME with your actual username):
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=YOUR-USERNAME
Group=YOUR-USERNAME
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
  • Reload the system daemon:
sudo systemctl daemon-reload
  • Start Node_exporter:
sudo systemctl start node_exporter
  • Verify that node_exporter is running:
sudo systemctl status node_exporter

The output should be similar to the output when checking the status of Prometheus (keyword: active (running)).

  • Enable node_exporter at boot:
sudo systemctl enable node_exporter

Installing Grafana

Grafana’s official installation instructions are available here. A quick install guide for Ubuntu/Debian follows (still no warranty; perform at your own risk).

  • Download the package:
wget https://dl.grafana.com/oss/release/grafana_6.1.1_amd64.deb
  • Install the package:
sudo dpkg -i grafana_6.1.1_amd64.deb

If you run into an issue with lacking dependencies during install, you may run the following command to try to fix the issue:

sudo apt --fix-broken install
  • Start the Grafana server:
sudo systemctl start grafana-server
  • Verify that Grafana is running:
sudo systemctl status grafana-server

Once again, look for “active (running)” displayed in green text. If it is there, move on to enable Grafana at boot:

sudo systemctl enable grafana-server

Verifying that Prometheus/node_exporter is A-ok

To verify that Prometheus is running and scraping data from the node_exporter instances running on the remote servers, navigate to the Prometheus dashboard at localhost:9090, and then click on “Status” -> “Targets”.

This Prometheus instance is monitoring one external server (both node_exporter and factomd), as well as a local node_exporter installed on localhost (optional).
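
The same target information is available from the Prometheus HTTP API at /api/v1/targets, which is convenient for scripted checks. The following Python sketch lists every active target that is not reported as up. The function names are illustrative, and the default base URL assumes Prometheus is listening on localhost:9090 as configured in this guide.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list:
    """Return (job, scrape URL, last error) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"),
         t.get("scrapeUrl", "?"),
         t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the current target list from the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)
```

Calling unhealthy_targets(fetch_targets()) on the Prometheus server returns an empty list when all configured targets are being scraped successfully.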

The “state” of the nodes should be UP. If it is not, something is wrong. Troubleshoot by verifying that node_exporter (or factomd) is serving up the data. Two useful commands to run on the factomd server while troubleshooting are:

curl localhost:9100/metrics
curl localhost:8090/metrics

If the data shows up with curl, you can be certain that node_exporter/factomd is doing its job, and the issue is most likely that Prometheus cannot access the remote host.
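
The /metrics output is plain text with one sample per line, so it is easy to inspect programmatically as well. Here is a minimal Python sketch that fetches an endpoint and picks out values. The function names are illustrative, and the parser is deliberately simple: it treats the last space-separated field as the value and does not handle lines with an optional trailing timestamp.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict:
    """Map each sample line ('name{labels} value') to a float, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE comments
        key, _, value = line.rpartition(" ")
        try:
            samples[key] = float(value)
        except ValueError:
            continue  # not a plain 'key value' sample line
    return samples


def fetch_metrics(url: str) -> dict:
    """Download a metrics page, e.g. http://localhost:9100/metrics, and parse it."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

For example, fetch_metrics("http://localhost:9100/metrics").get("node_load1") returns the one-minute load average reported by node_exporter, or None if the metric is missing.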


Configuring Grafana

Configuring Grafana is actually quite straight-forward, and is done in a few logical steps:

  1. Connecting to Grafana and setting a new admin-password.
  2. Selecting a data-source (Prometheus).
  3. Downloading the Dashboard presets (2 different ones).
  4. Setting up alerting (optional additional dashboard).
  5. Configuring ssl, creating new user accounts, other settings etc. (not covered in this guide — information about it is readily available in official documentation at the Grafana homepage).

Now, let's get started.

  • Open localhost:3000 (or IP-of-server:3000 after opening the port) and log in with username “admin” and password “admin”. Set a new, secure password.
  • Add a new datasource:
  • Select Prometheus
  • Enter the URL of the datasource (Prometheus) and then click Save & Test:
  • Create the first dashboard:

In the above screenshot insert “10007” in the input field named Grafana.com Dashboard (arrow 2).

  • Select a datasource (Prometheus) and click Import:

The dashboard named “Server Monitoring Dashboard” (ID 10007) should now be imported and the monitored servers visible:

Select one or multiple servers from the drop-down menu. The server names are fetched from the Prometheus .yml configuration file and can be configured as you like.

When you navigate away from this dashboard (next step) you should be asked whether to save your changes. Select Save.


  • The next step is to import the factomd-monitoring dashboard. Repeat the import-steps for dashboard ID 10008.

The factomd monitoring dashboard is now available in Grafana and should be automatically fed with all the relevant factomd-data from Prometheus.

In the upper right corner of the screen the time parameter for the displayed data can be set. Note that there is also an option for setting update interval (disabled by default).


Setting up Alerts to a private Discord server (optional)

In this final step we will set up push alerts from Grafana to a private Discord server. The setup is simple (about 10 minutes), and it provides a good warning system for indicating that something is off with the monitored servers.

  • To get started import dashboard with ID 10009.
  • Edit the top graph like this:
  • Configure and set the following parameters for monitoring disk space:
100.0 - 100 * (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

Notice that the disk space monitoring becomes active immediately, visually indicating used space on nodes (Yellow/Green) as well as the alert trigger limit (red line, 80% of disk space used).
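
The query is plain arithmetic on two node_exporter gauges: the bytes available on the root filesystem and its total size. The same calculation in Python, as a worked example with made-up numbers (the function name is illustrative):

```python
def disk_used_percent(avail_bytes: float, size_bytes: float) -> float:
    """Mirror of the query above: 100 - 100 * (avail / size)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100.0 - 100.0 * (avail_bytes / size_bytes)


# Illustrative numbers: 12 GB available on an 80 GB root filesystem.
used = disk_used_percent(12e9, 80e9)
print(f"{used:.1f}% used, alert: {used > 80.0}")  # 85.0% used, alert: True
```

With 15% of the disk available, usage is 85%, which is above the 80% trigger line and corresponds to less than 20% free space.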

  • Click on the “back arrow” in the top-left corner of the screen to get back to the alerting dashboard.
  • Repeat the above steps for the bottom graph:
Set the same parameters for this graph (“Queries to” and “Legend”) and use the query below:
100 - (avg by (job) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  • Click the “back arrow” to return to the Alerting dashboard, and it should look like this:

The next step is to configure the Webhook on your private Discord server.

  • Open up Discord settings:
  • Click “webhook” and then configure as shown in the screenshot below (copy the webhook url):
Note: You may use any private channel in your Discord server for this.
  • Open up Grafana again and set up Alerting as described below:

When you click “send test” you should get a test message in the appropriate channel on your Discord server:
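
If no message arrives, it can help to test the webhook independently of Grafana. The Python sketch below posts a plain-text message to a Discord webhook using only the standard library. The function names and the User-Agent string are illustrative, and the webhook URL is the one you copied from the Discord settings.

```python
import json
from urllib.request import Request, urlopen


def build_webhook_request(webhook_url: str, message: str) -> Request:
    """Build a POST request carrying a plain-text message for a Discord webhook."""
    body = json.dumps({"content": message}).encode("utf-8")
    return Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Set explicitly; requests using Python's default agent may be rejected.
            "User-Agent": "grafana-webhook-check/0.1",
        },
        method="POST",
    )


def send_test_message(webhook_url: str, message: str = "Webhook test") -> int:
    """Send the message and return the HTTP status code."""
    with urlopen(build_webhook_request(webhook_url, message), timeout=10) as response:
        return response.status
```

A 2xx status code from send_test_message, together with the message showing up in the channel, indicates that the webhook itself works and that any remaining problem is on the Grafana side.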

Alerting via Grafana is now set up, and you will get alerts according to the variables set in the alerting dashboard. These may be configured by changing the values indicated below:


After having set up Prometheus/Grafana according to this guide one has a good starting point for delving deeper into the world of server monitoring. There are a host of different dashboards available at the Grafana dashboards page, and the possibilities for customization are basically endless.

Please visit us at the TFA Discord if you have any questions or input to the dashboards.

— Best

The Factoid Authority