Monitoring Substrate node (Polkadot, Kusama, parachains) — Validator guide

bLd Nodes · Apr 8, 2021 · 15 min read

Edit 2023: this article has become outdated and I am no longer maintaining this dashboard. For proper monitoring, I strongly suggest referring to the new monitoring stack article here, which comes with a brand new dashboard that is much more relevant than this one.

Monitoring dashboard Polkadot Essentials

The Polkadot ecosystem has attracted incredible attention in the past few months, and this is just the beginning. Many individuals have joined the adventure and set up their own validator node, which is extraordinary for decentralization. However, maintaining a validator node is a huge responsibility, as you are basically securing millions of dollars on the blockchain. The health and security of your node have to be your top priority as a validator.

This guide provides helpful monitoring and alerting content for validators. The examples use the Plasm Network node, but most of the configuration is exactly the same for a Polkadot, Kusama, or any other Substrate-based node (including parachains).

However, monitoring is of course not everything: security has to be considered very carefully. You should review our secure SSH guide to implement basic security for your connection.

Here are the different steps we will walk through:

  • General understanding
  • Set SSH tunneling
  • Installation
  • Configuration
  • Setting services
  • Launch and activate services
  • Test Alert manager
  • Run Grafana dashboard

You can also find extra sections at the end of this tutorial:

  • Advanced usage of Prometheus
  • Useful tips
  • Troubleshooting

For convenience, all parameters that you should replace with your own value, or that may change over time, are identified in bold inside code blocks:

Example of a value you should change

This guide was created using Ubuntu 20.04 LTS with a Plasm node on the server side and Debian 10 Buster on the client side.

General understanding

Here is what our final configuration will look like at the end of this guide.

Node monitoring modules
  • Prometheus is the central module; it pulls metrics from different sources to provide them to the Grafana dashboard and Alert Manager.
  • Grafana is the visual dashboard tool that we access from the outside (through SSH tunnel to keep the node secure).
  • Alert Manager listens to Prometheus metrics and pushes an alert as soon as a threshold is crossed (CPU % usage for example).
  • Your Substrate node (any Polkadot based blockchain) natively provides metrics for monitoring.
  • Node exporter provides hardware metrics for the dashboard.
  • Process exporter provides process metrics for the dashboard (optional).

Since you are running a production node, it is very important not to expose open ports to the outside (especially an HTTP port). A secure way to avoid that is to set up SSH tunneling, so let's start with that.
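Before we start, a quick way to check that nothing unexpected is already exposed is to list the listening sockets with ss (it ships with Ubuntu, no extra install needed). Anything bound to a public interface other than your SSH port deserves a second look:

sudo ss -tlnp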

Set SSH tunneling

Grafana runs an HTTP server on your node, so we shouldn't access it directly from the outside.

SSH tunneling is considered a safe way to carry traffic from your node to your local computer (or even your phone). The principle is to make the SSH client listen on a specific port of your local machine, encrypt the traffic through the SSH protocol, and forward it to the target port on your node.

SSH tunnel

Of course, you could also configure Grafana to run an HTTPS server, but we do not want to expose another open port. Since our data will be encrypted with SSH, we do not need HTTPS.

Once we have finished installing Grafana on our node, we will access it through this address on our local machine: http://localhost:2022

If you are using Putty to connect, jump directly to this part.

Open SSH

When using OpenSSH on your client machine, the arguments look like this:

-L 2022:localhost:3000
  • -L for a local port forwarding
  • 2022 is the local port we arbitrarily chose (please use a different unused local port inside the range 1024–49151)
  • 3000 is Grafana’s port

Assuming you already followed our article on securing SSH access, you will connect to the node with your private key; the full command will look like this:

ssh -i ~/.ssh/id_ed25519 <user@server ip> -p 2021 -L 2022:localhost:3000
  • id_ed25519 is our local private key
  • 2021 is the custom SSH port we configured to connect to our node

Great, now once we have finished installing Grafana on our node, we will access it through this address on our local machine:

http://localhost:2022

Automating OpenSSH connection

Remembering all parameters of the ssh command can really be a pain, especially if you have several nodes to maintain.

So, let's just create a little config file with our parameters:

touch ~/.ssh/config
nano ~/.ssh/config

In this file, we will add the parameters for our node, including port forwarding:

Host node
  HostName 66.66.66.66
  Port 2021
  User bld
  IdentityFile ~/.ssh/id_ed25519
  LocalForward 2022 localhost:3000
  ServerAliveInterval 120
  • HostName is your node's IP address.
  • Port 2021 is the SSH port we use to connect to our node (we changed it and closed the default port 22, remember?).
  • We also add a keep-alive message sent every 2 minutes to keep the session active.
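With this file in place, OpenSSH reads ~/.ssh/config automatically, so opening the connection together with the tunnel becomes a single command:

ssh node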

Putty

As Putty is a very popular client available on many operating systems, here is where you can configure local port forwarding. You do not need this part if you use OpenSSH to connect.

Putty port forwarding

Inside the SSH > Tunnels menu, just add the local port and destination, then click Add.

  • 2022 is the local port we arbitrarily chose (please use a different unused local port inside the range 1024–49151)
  • 3000 is Grafana’s port

Don’t forget to save the session.

Installation

To save your precious time, we added && at the end of each line of the long blocks so they run as a single paste. Of course, you can remove those and copy/paste the lines one by one if you like to suffer :-)

Let’s start with the prerequisites:

sudo apt update && sudo apt upgrade
sudo apt install -y adduser libfontconfig1

Download the latest releases. Please check the Prometheus, Node exporter, Process exporter, Alert manager, and Grafana download pages for the current version numbers.

wget https://github.com/prometheus/prometheus/releases/download/v2.32.0/prometheus-2.32.0.linux-amd64.tar.gz &&
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz &&
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz &&
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz &&
wget https://dl.grafana.com/oss/release/grafana_8.3.3_amd64.deb

Extract the downloaded files:

tar xvf prometheus-*.tar.gz &&
tar xvf node_exporter-*.tar.gz &&
tar xvf process-exporter-*.tar.gz &&
tar xvf alertmanager-*.tar.gz &&
sudo dpkg -i grafana*.deb

Create the /etc/prometheus directory and copy the extracted files into place (binaries into /usr/local/bin, console files into /etc/prometheus):

sudo mkdir -p /etc/prometheus &&
sudo cp ./prometheus-*.linux-amd64/prometheus /usr/local/bin/ &&
sudo cp ./prometheus-*.linux-amd64/promtool /usr/local/bin/ &&
sudo cp -r ./prometheus-*.linux-amd64/consoles /etc/prometheus &&
sudo cp -r ./prometheus-*.linux-amd64/console_libraries /etc/prometheus &&
sudo cp ./node_exporter-*.linux-amd64/node_exporter /usr/local/bin/ &&
sudo cp ./process-exporter-*.linux-amd64/process-exporter /usr/local/bin/ &&
sudo cp ./alertmanager-*.linux-amd64/alertmanager /usr/local/bin/ &&
sudo cp ./alertmanager-*.linux-amd64/amtool /usr/local/bin/

Install the Alert manager plugin for Grafana:

sudo grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource

Create dedicated users:

sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus &&
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter &&
sudo useradd --no-create-home --shell /usr/sbin/nologin process-exporter &&
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager

Create directories for Prometheus, Process exporter and Alert manager:

sudo mkdir /var/lib/prometheus && 
sudo mkdir /etc/process-exporter &&
sudo mkdir /etc/alertmanager &&
sudo mkdir /var/lib/alertmanager

Change the ownership for all directories:

sudo chown prometheus:prometheus /etc/prometheus/ -R &&
sudo chown prometheus:prometheus /var/lib/prometheus/ -R &&
sudo chown prometheus:prometheus /usr/local/bin/prometheus &&
sudo chown prometheus:prometheus /usr/local/bin/promtool &&
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter &&
sudo chown process-exporter:process-exporter /etc/process-exporter -R &&
sudo chown process-exporter:process-exporter /usr/local/bin/process-exporter &&
sudo chown alertmanager:alertmanager /etc/alertmanager/ -R &&
sudo chown alertmanager:alertmanager /var/lib/alertmanager/ -R &&
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager &&
sudo chown alertmanager:alertmanager /usr/local/bin/amtool

Finally, clean up the download directory:

rm -rf ./prometheus* &&
rm -rf ./node_exporter* &&
rm -rf ./process-exporter* &&
rm -rf ./alertmanager* &&
rm -rf ./grafana*

Exhausting, right? We grouped the whole installation once and for all; now let's have some fun with the configuration.

Configuration

Prometheus

Let’s edit the Prometheus config file and add all the modules in it:

sudo nano /etc/prometheus/prometheus.yml

Add the following code to the file and save:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "substrate_node"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9615"]
  - job_name: "node_exporter"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "process-exporter"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9256"]
  • scrape_interval defines how often Prometheus scrapes targets, while evaluation_interval controls how often the software will evaluate rules.
  • rule_files sets the location of the alerting rules we will add next.
  • alerting contains the Alert manager target.
  • scrape_configs contains the services Prometheus will monitor.

You can notice that in the first scrape job, Prometheus monitors itself.
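If your node is already running, you can verify right away that it exposes metrics on the port Prometheus will scrape (9615 is the default Prometheus port of Substrate based nodes); if nothing comes back, the metrics endpoint is probably disabled or listening on a different port:

curl -s http://localhost:9615/metrics | head -n 20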

Alert rules

Let’s create the rules.yml file that will give the rules for Alert manager:

sudo touch /etc/prometheus/rules.yml
sudo nano /etc/prometheus/rules.yml

We are going to create 2 basic rules that will trigger an alert in case the instance is down or the CPU usage crosses 80%. Add the following lines and save the file:

groups:
- name: alert_rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 5 minutes."

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Host high CPU load (instance bLd Kusama)"
      description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

The criteria for triggering an alert are set in the expr: part. To create your own alerts, you're going to have to learn and test the different metrics exposed to Prometheus by the services we are setting up. There are (almost) infinite possibilities for personalizing your alerts.
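As an illustration, here is one extra rule in the same style, a sketch built on the standard node_exporter filesystem metrics, that warns when the root filesystem falls below 10% free space. It goes under the same rules: list, with the same indentation as the two alerts above:

  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Less than 10% disk space left on /\n VALUE = {{ $value }}"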

As this part can be time-consuming to learn and build, we have shared a list of alerts we like to use. Please feel free to reach out to us if you have an interesting one you would like to add.

You should also have a look at alerts provided by Parity.

Then, check the rules file:

promtool check rules /etc/prometheus/rules.yml

And finally, check the Prometheus config file:

promtool check config /etc/prometheus/prometheus.yml

Process exporter

Process exporter needs a little config file telling it which processes it should take into account:

sudo touch /etc/process-exporter/config.yml
sudo nano /etc/process-exporter/config.yml

Add the following code to the file and save:

process_names:
- name: "{{.Comm}}"
  cmdline:
  - '.+'
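Note that the catch-all above ({{.Comm}} with a .+ cmdline) exports metrics for every process on the machine. If you prefer to track only your node binary, process-exporter also accepts a comm matcher; a narrower variant (assuming the binary is named plasm) would look like this, at the cost of losing the other processes in the dashboard's process panels:

process_names:
- name: "{{.Comm}}"
  comm:
  - plasm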

Gmail setup

We will use a Gmail address to send the alert emails. For that, we will need to generate an app password from our Gmail account.

Note: we recommend using a dedicated email address for your alerts.

Google has this bad habit of changing its interface pretty often, so instead of giving you the detailed steps here and making this guide outdated next month, we are cowardly sending you to the Gmail app password procedure page :-)

Gmail app password generated

The result will look like this (sorry for the French screenshot, I was a little too lazy to change my whole account language setting). Copy the password and save it for later.

Alert manager

The Alert manager config file is used to set the external service that will be called when an alert is triggered. Here, we are going to use the Gmail notification created previously.

Let’s create the file:

sudo touch /etc/alertmanager/alertmanager.yml
sudo nano /etc/alertmanager/alertmanager.yml

Add the Gmail configuration to it and save the file:

global:
  resolve_timeout: 1m

route:
  receiver: 'gmail-notifications'

receivers:
- name: 'gmail-notifications'
  email_configs:
  - to: 'mydedicatednodealertaddress@protonmail.com'
    from: 'bldnodes@gmail.com'
    smarthost: 'smtp.gmail.com:587'
    auth_username: 'bldnodes@gmail.com'
    auth_identity: 'bldnodes@gmail.com'
    auth_password: 'yrymyemufalyjing'
    send_resolved: true

Of course, you have to change the email addresses and replace the auth_password with the one generated from Google previously (you didn't seriously think we were going to leave this password active, right? :-))

Here you can notice we use different addresses for the sender and the receiver. This is actually a useful little trick that lets you install a dedicated application (Protonmail is very cool) to receive alerts even on your phone. This way, you will know within a minute if something goes wrong with your node, without being disturbed by other emails!

Note: the email notification is just an example; you can push notifications to many different services! Have a search on DuckDuckGo (enough with big G here) and you will love what you can find.
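Before moving on, you can sanity-check the Alert manager configuration file with amtool, which we copied to /usr/local/bin during the installation step; it will complain about any syntax or routing mistake:

amtool check-config /etc/alertmanager/alertmanager.yml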

Setting services

Starting all the programs manually is such a pain, especially with everything we have here. So we are going to take a few minutes to create the systemd services.

Creating those services gives you a fully automated setup that restarts everything on its own if your node reboots.

I know, it takes a while to create them one by one, but after all, you're maintaining a node; you can't be a lazy person :-)

Node

Let's start simple: in case you didn't set up a service for your node yet, this is highly recommended.

Warning: you should not do this if your node is actively validating, unless you know exactly what you are doing. This will cause a chain resync because we change the chain storage directory.

Note: once again, this guide was made with a Plasm node. If you use Polkadot, Kusama, or any other Substrate node, you just have to adapt the bold values.

Create a dedicated user for the node binary and copy the binary to /usr/local/bin:

sudo useradd --no-create-home --shell /usr/sbin/nologin plasm &&
sudo cp ./plasm /usr/local/bin &&
sudo chown plasm:plasm /usr/local/bin/plasm

Create a dedicated directory for the chain storage data:

sudo mkdir /var/lib/plasm &&
sudo chown plasm:plasm /var/lib/plasm

Create and open the node service file:

sudo touch /etc/systemd/system/plasm.service &&
sudo nano /etc/systemd/system/plasm.service

Add the lines matching your node configuration:

[Unit]
Description=Plasm Validator

[Service]
User=plasm
Group=plasm
ExecStart=/usr/local/bin/plasm \
--validator \
--rpc-cors all \
--name <Your Validator Name> \
--base-path /var/lib/plasm
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target
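No extra flag is needed for monitoring: Substrate based nodes already serve Prometheus metrics on localhost:9615 by default, which is exactly the target configured in prometheus.yml. Only if Prometheus ever runs on a separate machine would the node need the --prometheus-external flag (plus a firewall rule restricting who can reach the port); a hypothetical variant of the ExecStart block would then look like this:

ExecStart=/usr/local/bin/plasm \
--validator \
--rpc-cors all \
--name <Your Validator Name> \
--base-path /var/lib/plasm \
--prometheus-external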

Reload the daemon, then start and check the service:

sudo systemctl daemon-reload
sudo systemctl start plasm.service
sudo systemctl status plasm.service

If everything is working fine, activate the service:

sudo systemctl enable plasm.service

In case of trouble, check the service log with:

journalctl -f -u plasm -n100

Note: if your chain was previously synced somewhere else, purge it:

/usr/local/bin/plasm purge-chain

For the next ones, we are going to do it all in a row.

Prometheus

Create and open the Prometheus service file:

sudo touch /etc/systemd/system/prometheus.service &&
sudo nano /etc/systemd/system/prometheus.service

Add the following lines:

[Unit]
Description=Prometheus Monitoring
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Node exporter

Create and open the Node exporter service file:

sudo touch /etc/systemd/system/node_exporter.service &&
sudo nano /etc/systemd/system/node_exporter.service

Add the following lines:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Process exporter

Create and open the Process exporter service file:

sudo touch /etc/systemd/system/process-exporter.service &&
sudo nano /etc/systemd/system/process-exporter.service

Add the following lines:

[Unit]
Description=Process Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=process-exporter
Group=process-exporter
Type=simple
ExecStart=/usr/local/bin/process-exporter \
--config.path /etc/process-exporter/config.yml

[Install]
WantedBy=multi-user.target

Alert manager

Create and open the Alert manager service file:

sudo touch /etc/systemd/system/alertmanager.service &&
sudo nano /etc/systemd/system/alertmanager.service

Add the following lines:

[Unit]
Description=AlertManager Server Service
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file /etc/alertmanager/alertmanager.yml \
--storage.path /var/lib/alertmanager \
--web.external-url=http://localhost:9093 \
--cluster.advertise-address='0.0.0.0:9093'

[Install]
WantedBy=multi-user.target

Wow, that was a lot of configuration, right? Good news: if you did everything correctly, we are ready to fire up the engine and test it.

Grafana

The Grafana service is automatically created during the installation of the deb package; you do not need to create it manually.

Launch and activate services

Run a daemon reload so systemd takes the new services into account:

sudo systemctl daemon-reload

Start the services:

sudo systemctl start prometheus.service &&
sudo systemctl start node_exporter.service &&
sudo systemctl start process-exporter.service &&
sudo systemctl start alertmanager.service &&
sudo systemctl start grafana-server

And check that they are working fine, one by one:

systemctl status prometheus.service
systemctl status node_exporter.service
systemctl status process-exporter.service
systemctl status alertmanager.service
systemctl status grafana-server

A service working fine should look like this:

Prometheus service running

When everything is okay, activate the services!

sudo systemctl enable prometheus.service &&
sudo systemctl enable node_exporter.service &&
sudo systemctl enable process-exporter.service &&
sudo systemctl enable alertmanager.service &&
sudo systemctl enable grafana-server
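As a final check, systemctl can report the state of all five units in one command; each of them should come back as active:

systemctl is-active prometheus node_exporter process-exporter alertmanager grafana-server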

Test Alert manager

Run this command to fire an alert:

curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"Test"}}]' localhost:9093/api/v1/alerts

Check your inbox, you have a surprise:

Mail alert from Alert manager

You will always receive a Firing alert first, then a Resolved notification to indicate the alert isn’t active anymore.
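You can also list the currently firing alerts from the command line with amtool, pointing it at the local Alert manager:

amtool alert query --alertmanager.url=http://localhost:9093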

Run Grafana dashboard

Now is the time to get the most visual part: the monitoring dashboard.

From the browser on your local machine, connect to the custom port on localhost that we have set at the beginning of this guide:

http://localhost:2022
Grafana login page

Enter the default user admin and password admin, then change the password.

Grafana home page

Add data sources

Open the Settings menu:

Settings

Click on Data Sources:

Data sources

Click on Add data source:

Data source search

Select Prometheus:

Prometheus configuration

Just fill the URL with http://localhost:9090 and click Save & Test.

Then add a new data source and search for Alert manager:

Data source search

Fill the URL with http://localhost:9093 and click Save & Test.

Alert manager configuration

Now you have your two data sources set up like this:

Import the dashboard

Open the New menu:

Click on Import:

Import Dashboard

Select our favorite dashboard 13840 (we created this one just for you :-)) and click Load:

Dashboard import settings

Select the Prometheus and AlertManager sources and click Import.

Dashboard selection

In the dashboard selection, make sure you select:

  • Chain Metrics: polkadot for a Polkadot/Kusama node, substrate for any other parachain node
  • Chain Instance Host: localhost:9615 to point to the chain metrics scrape target
  • Chain Process Name: the name of your node binary

And here you go, everything is set!

Monitoring dashboard Polkadot Essentials

Easy, right? Just remember to save the dashboard once the parameters are set and working.

Save dashboard settings locally

Note: you can also consider Parity's dashboards for more advanced monitoring and analysis.

Conclusion

In this tutorial, we learned how to set up a full monitoring solution with the most useful modules: Prometheus, Node exporter, Process exporter, Alert manager, and Grafana.

There are many great guides all over the web that are much more detailed, but we wanted to provide an 'all-in-one' solution.

The dashboard we built is mostly a compilation of different existing ones that we customized, adding features we found interesting for Substrate node validators. If you like it, we would love for you to post a review on Grafana's website and, more importantly, send us your feedback so that we can improve it according to the community's needs.

  • Twitter : @bLdNodes
  • Mail : bldnodes@gmail.com
  • Matrix : @bld759:matrix.org

Enjoy your node!

Advanced usage of Prometheus

Prometheus offers an incredible number of options, but you need to be familiar with its query language to use them.

If you would like to test all tools provided by Prometheus, you will have to forward another local port to the server’s port 9090 (Prometheus service):

-L 2023:localhost:9090

Then access the Prometheus interface from your local port: http://localhost:2023

Prometheus interface

From here you can test queries, check status, alerts…
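For example, pasting this query in the expression field should return the best and finalized block heights reported by your node (the metric prefix is substrate_ for Plasm and most parachain nodes, polkadot_ for Polkadot/Kusama, matching the Chain Metrics choice made in the dashboard):

substrate_block_height{status=~"best|finalized"}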

Useful tips

You should always start by checking the service(s):

systemctl status prometheus.service

A service working fine should look like this:

Prometheus service running

After changing a service file, you always have to run a daemon reload:

sudo systemctl daemon-reload

You can get a longer log history of your node process by using:

journalctl -f -u <node service name> -n100
  • <node service name> : polkadot-validator, plasm… use the name of your node service file (without the .service)
  • -n100 : number of lines to display

Troubleshooting

The list below is under construction and will be updated with feedback from the community, so please reach out and share; it will help others!

Port already in use

You may have started a program manually, creating a conflict with your service. If you get this type of error, check which processes are using the port (for example here, port 9090 for Prometheus):

sudo lsof -i:9090

If you see something here, you can kill it:

sudo lsof -ti:9090 | sudo xargs kill -9

Cannot listen to port (OpenSSH)

If you didn't exit a previous connection properly and start an SSH connection again, you may encounter this message:

channel_setup_fwd_listener_tcpip: cannot listen to port: 2022

In this case, on the client side, close the SSH connection and kill the port forwarding process that is still running:

lsof -ti:2022 | xargs kill -9

bLd Nodes

Validator node service for Polkadot ecosystem blockchains. Security and reliability first. Currently running Kusama & Plasm Network.