Published in CertHum

Global Monitoring Infrastructure for Polkadot and Kusama

Running validator infrastructure is a lot of responsibility.

Node runners need to ensure the availability, security, and redundancy of their nodes, and have backup plans for their backup plans, to make sure validating services run without issue. This story looks at the architecture CertHum has built for global monitoring of its validator nodes.

A robust monitoring infrastructure is essential for the delivery of validating services. With a best-of-breed monitoring setup you can:

  • Keep track of compute, storage and network resources, and get ahead of potential resource constraints
  • Monitor and report on validator chain functions
  • Alert and notify when there is an issue, helping to avoid potential slashing penalties and loss of nominators

If you haven’t yet set up your own monitoring infrastructure, check out the fantastic bLd Nodes’ guide, which has great step-by-step instructions on installing and configuring Prometheus, Alertmanager, and Grafana. We’ve used some of their alerting rules in our own implementation.

Centralized Monitoring Architecture

CertHum uses two hosting providers in different regions to deliver Polkadot and Kusama validator nodes. One provider runs production services, and the second serves as a warm standby with fully synced chains. The warm standby nodes are ready for immediate failover by updating the validator session keys on their respective chains; they run with scaled-down resources that are still sufficient for authoring blocks. To provide a neutral witness for both hosting providers, CertHum runs centralized monitoring out of Microsoft Azure.

Azure’s hyperscale cloud services are some of the best in the world, and can be relied on for consistent delivery when compared to smaller cloud and VPS providers. A single Azure VM is multi-purposed to run Prometheus, Alertmanager, and Grafana. The following diagram depicts this architecture.

CertHum Global Monitoring Infrastructure

There are a few points on the above diagram that are worth noting:

  • Source-IP filtering blocks connectivity to the Prometheus scraping ports, allowing access only from the CertHum monitoring server
  • SSH tunneling and source-IP filtering are used for connectivity between CertHum networks and the Azure VM for access to the Grafana, Alertmanager, and Prometheus GUIs
  • Frequent backups of the Azure VM to geo-redundant storage (GRS) ensure the VM can be recovered to an entirely different region, if needed
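Source-IP filtering of this kind is normally enforced at a firewall or cloud security group rather than in application code, but the rule itself is simple enough to sketch. The following Python is a minimal illustration, not CertHum's actual configuration: the address (taken from a documentation range) and the name `is_allowed` are hypothetical.

```python
import ipaddress

# Hypothetical allowlist: only the monitoring server may reach the scrape ports.
# 203.0.113.10 comes from a documentation range, not a real CertHum address.
ALLOWED_SOURCES = [ipaddress.ip_network("203.0.113.10/32")]


def is_allowed(source_ip: str, allowed=ALLOWED_SOURCES) -> bool:
    """Return True if source_ip falls inside any allowlisted network."""
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        # Anything that is not a valid IP address is rejected outright.
        return False
    return any(addr in net for net in allowed)
```

Everything not explicitly listed is denied, which mirrors the default-deny posture described in the first bullet.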

Having a monitoring infrastructure is great, but if you aren’t notified when there is a problem then you are not taking advantage of one of the most important reasons for standing that infrastructure up in the first place.

To solve for the notification piece, you can see on the bottom left of the diagram that CertHum has implemented PagerDuty for alerting.

PagerDuty is a SaaS offering that provides email, SMS, and phone call notifications for monitoring alerts routed through Alertmanager (and also has a lot of other great features). If the peer count of a CertHum validator node drops below a certain threshold, if a node goes offline, or if any of the other configured alerting rules is breached, we are immediately notified.
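To make the peer-threshold case concrete, here is a rough Python sketch of the check and of the event body that PagerDuty's Events API v2 accepts. In the setup described here, Prometheus and Alertmanager do this work; the code only illustrates the logic. The metric name `substrate_sub_libp2p_peers_count`, the threshold of 10, the routing key, and all function names are assumptions made for illustration.

```python
# Assumed values for illustration only.
PEER_METRIC = "substrate_sub_libp2p_peers_count"  # assumed metric name
MIN_PEERS = 10  # hypothetical threshold


def read_metric(exposition: str, name: str):
    """Return the first sample value for `name` in Prometheus text format, or None.

    Simplified parser: ignores label values and timestamps.
    """
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            metric = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            metric, _, rest = line.partition(" ")
        if metric == name:
            return float(rest.split()[0])
    return None


def build_event(routing_key: str, node: str, summary: str) -> dict:
    """Build a 'trigger' event body in the PagerDuty Events API v2 shape."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": node, "severity": "critical"},
    }


def check_node(node: str, exposition: str, routing_key: str = "HYPOTHETICAL_KEY"):
    """Return an event dict if the node looks unhealthy, else None."""
    peers = read_metric(exposition, PEER_METRIC)
    if peers is None:
        # No sample at all is treated like an offline node.
        return build_event(routing_key, node, f"{node}: peer metric missing")
    if peers < MIN_PEERS:
        return build_event(
            routing_key, node, f"{node}: peer count {peers:g} below {MIN_PEERS}"
        )
    return None
```

Calling `check_node("validator-1", scraped_text)` returns `None` for a healthy node and a dictionary ready to be serialized as JSON otherwise; actually sending it is left out of the sketch.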

This architecture gives us the confidence that we are providing the Polkadot and Kusama communities with validator nodes that they can feel safe in nominating. This should also provide other node runners with ideas on how they can deploy their own centralized monitoring infrastructure.

There are some improvements that can be made to our design, and we are in the process of putting them in place, but we would love to hear your ideas for a better-architected design. Feedback is always welcome!

CertHum provides Web3 infrastructure and is building for a decentralized future. At CertHum, we provide services that support and enhance the functioning of top tier blockchains. We want to share what we learn with you.

Jim Farley

Jim is the founder of CertHum Inc.
