Global Monitoring Infrastructure for Polkadot and Kusama
Running validator infrastructure is a lot of responsibility.
Node runners need to ensure the availability, security, and redundancy of their nodes, and have backup plans for their backup plans to make sure validating services run without issue. This story will look at the architecture CertHum has built for the global monitoring of its validator nodes.
A robust monitoring infrastructure is essential for the delivery of validating services. With a best-of-breed monitoring setup you can:
- Keep track of compute, storage and network resources, and get ahead of potential resource constraints
- Monitor and report on validator chain functions
- Alert and notify when there is an issue, helping to avoid potential slashing penalties and loss of nominators
If you are setting up your own monitoring infrastructure and haven’t done so yet, check out bLd Nodes’ fantastic guide, which has great step-by-step instructions for installing and configuring Prometheus, Alertmanager, and Grafana. We’ve used some of their alerting rules in our own implementation.
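As a starting point, a minimal Prometheus scrape configuration for a validator host might look like the sketch below. The IP addresses and job names are illustrative (not CertHum's actual targets); port 9615 is the default Prometheus metrics port exposed by Substrate-based nodes, and 9100 is the usual node_exporter port for host-level metrics.

```yaml
# prometheus.yml — minimal illustrative sketch
global:
  scrape_interval: 15s

scrape_configs:
  # Chain-level metrics exposed by the Substrate node itself
  - job_name: "substrate_node"
    static_configs:
      - targets: ["203.0.113.10:9615"]   # example validator address
  # Host-level metrics (CPU, disk, network) via node_exporter
  - job_name: "node_exporter"
    static_configs:
      - targets: ["203.0.113.10:9100"]
```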
Centralized Monitoring Architecture
CertHum uses two different hosting providers in different regions to deliver Polkadot and Kusama validator nodes. One provider runs production services, and the second acts as a warm standby with fully synced chains. The warm standby is ready for immediate failover by updating the validator session keys on the respective chains, and although it runs on scaled-down resources, those are still sufficient for authoring blocks. To provide a neutral witness for both hosting providers, CertHum runs centralized monitoring out of Microsoft Azure.
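A failover of this kind can be sketched as follows. The hostname and port are illustrative: the node must expose its RPC interface locally (9944 is the default RPC port on recent node versions; older versions served HTTP RPC on 9933), and `author_rotateKeys` generates a fresh set of session keys in the standby node's keystore.

```shell
# Run on the standby node: generate and store new session keys via RPC.
PAYLOAD='{"id":1,"jsonrpc":"2.0","method":"author_rotateKeys","params":[]}'
curl -s -H "Content-Type: application/json" -d "$PAYLOAD" http://localhost:9944
# The returned hex blob is then submitted on-chain with the session.setKeys
# extrinsic (e.g. via the polkadot-js apps UI), pointing the validator
# identity at the standby node to complete the failover.
```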
Azure’s hyperscale cloud services are some of the best in the world, and can be relied on for consistent delivery when compared to smaller cloud and VPS providers. A single Azure VM is multi-purposed to run Prometheus, Alertmanager, and Grafana. The following diagram depicts this architecture.
There are a few points on the above diagram that are worth noting:
- Source-IP filtering is used to block connectivity to the Prometheus scraping ports — allowing access only from the CertHum monitoring server
- SSH tunneling and source-IP filtering are used for connectivity between CertHum networks and the Azure VM for access to the Grafana, Alertmanager, and Prometheus GUIs
- Frequent backups of the Azure VM to geo-redundant storage (GRS) ensure the VM can be recovered to an entirely different region, if needed
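The first two points can be sketched as firewall and tunnel configuration on a validator host. All addresses and hostnames below are illustrative placeholders, not CertHum's, and the example assumes a `ufw` firewall; the same policy can be expressed in iptables or a cloud security group.

```shell
# Allow Prometheus scrapes (port 9615) only from the monitoring server's IP,
# and reject the port for everyone else.
sudo ufw allow from 198.51.100.20 to any port 9615 proto tcp
sudo ufw deny 9615/tcp

# Instead of exposing the Grafana GUI publicly, forward it over an SSH tunnel:
# local port 3000 reaches Grafana listening on the monitoring VM.
ssh -N -L 3000:localhost:3000 admin@monitor.example.com
```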
Having a monitoring infrastructure is great, but if you aren’t notified when there is a problem, you are missing one of the most important reasons for standing that infrastructure up in the first place.
To handle the notification piece, you can see on the bottom left of the diagram that CertHum has implemented PagerDuty for alerting.
PagerDuty is a SaaS offering that provides email, SMS, and phone call notifications of monitoring alerts generated from Alertmanager (and also has a lot of other great features). If a CertHum validator node drops the number of peers below a certain threshold, or if a node goes offline, or if any of the other rules which are configured in Alertmanager are breached, we are immediately notified.
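Alert rules of this kind can be sketched as below. The metric name follows the Substrate Prometheus exporter, but the thresholds and durations are illustrative assumptions, not CertHum's actual values; Alertmanager then routes firing alerts to PagerDuty through a `pagerduty_configs` receiver holding the PagerDuty integration key.

```yaml
# alert-rules.yml — illustrative thresholds, not CertHum's production values
groups:
  - name: validator
    rules:
      # Fires when a node's peer count stays below 3 for three minutes
      - alert: LowPeerCount
        expr: substrate_sub_libp2p_peers_count < 3
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Validator {{ $labels.instance }} has fewer than 3 peers"
      # Fires when Prometheus cannot scrape a target at all
      - alert: NodeDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is unreachable"
```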
This architecture gives us the confidence that we are providing the Polkadot and Kusama communities with validator nodes that they can feel safe in nominating. This should also provide other node runners with ideas on how they can deploy their own centralized monitoring infrastructure.
There are some improvements which can be made to our design and we are in the process of putting them in place, but we would love to hear your ideas for a better architected design. Feedback is always welcome!