PANIC for Polkadot: A Monitoring and Alerting Solution for Nodes

Dylan Galea
Simply Staking
Jan 23, 2020

When running online services, ensuring their high availability is of the utmost importance. In the case of Polkadot, validators need to be online to author new blocks for the relay chain; otherwise, they risk getting their funds slashed. One can easily see the importance of having a monitoring and alerting tool that gives useful alerts, helping the node operator prevent dangerous scenarios such as the disabling of their validator. This is where PANIC comes into the picture.

What is PANIC?

PANIC is a lightweight yet powerful open-source monitoring and alerting solution developed by Simply VC. It was first built and open-sourced for the Cosmos ecosystem. Given the great feedback we received and our belief that it is a valuable tool, we decided to develop PANIC for Polkadot. The project also received support from the Web3 Foundation Grants Program.

The goal of PANIC is to help validator node operators easily monitor the behavior and state of their validators by sending useful alerts. But that is not all: PANIC can also be used to monitor non-validator full nodes, the blockchain(s) the nodes are connected to, and GitHub repositories.

Although PANIC is an alerting tool at its core, it also delivers a great user experience, including the capability to interact with it using Telegram commands. Telegram commands serve two purposes: giving the operator some control over the behavior of PANIC, and querying data. This will become clearer as we go along in this article.

This article consists of two main sections. First, we will present some of the important and useful alerts that can be raised by PANIC. In the second section, the design of PANIC will be presented from a high-level perspective, discussing how its constituent components interact with each other, and how alerting is done over different channels according to severity.

Description of Alerts

As already pointed out in the introduction, PANIC can be used to monitor nodes (both validators and non-validator full nodes), the blockchain(s) the nodes are connected to, and GitHub repositories. In what follows, we give details on some of the alerts sent by PANIC for each type of monitored object.

Node

When we talk about the node alerts that PANIC can raise, we have to distinguish between the alerts that can be raised for validators and those that can be raised for non-validator full nodes. In the current implementation, the latter are a subset of the former: every alert available for a full node can also be raised for a validator. The following is a list of some alerts that can be raised for both non-validator full nodes and validators, and some alerts that are specific to validators.

Alerts specific to validators: validator is active in current session, validator is disabled in current session, validator has been slashed, validator bonded balance increased/decreased.

Alerts for non-validator full nodes: node is down, node is (no longer) syncing with chain, node number of peers increased/decreased.

Blockchain

Blockchain alerts give information about events related solely to the blockchain. For example, when a new referendum is created, PANIC raises an INFO alert informing the user.

GitHub

In the current implementation there is only one alert in this category, raised whenever a new repository release is issued.
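For illustration, such a check can be done against the public GitHub releases API. The following is a minimal sketch, not PANIC's actual monitor code, and the repository name is just an example:

```python
# Fetch the latest release tag of a repository via the public GitHub API,
# so that a change between monitoring rounds can trigger a new-release alert.
import requests

def get_latest_release(repo: str) -> str:
    url = f"https://api.github.com/repos/{repo}/releases"
    releases = requests.get(url, timeout=10).json()
    # GitHub returns releases newest-first; an empty list means no releases.
    return releases[0]["tag_name"] if releases else ""

last_seen = get_latest_release("paritytech/polkadot")  # example repository
```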

High-Level Design

PANIC High-Level Design

The above image gives a high-level view of PANIC and its interaction with external components. In what follows, we dissect this design so that by the end of this section you will have a basic understanding of how PANIC operates internally and how it interacts with the outside world. Let us start by focusing on the monitors.

Monitors

Monitors are a crucial component of PANIC. The job of the monitors is to retrieve data from the object being monitored (node, blockchain or GitHub repository) and to pass this data to the reactive alerters. A question you might ask is: how is this data obtained from the nodes? Fortunately, the polkadot.js team has developed a very robust API written in JavaScript to query data from Substrate nodes (including Polkadot nodes). However, since PANIC is written in Python, this API could not be used directly. Hence, a custom JavaScript API server was built as an intermediary component between the Polkadot API and PANIC.
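To make this concrete, here is a minimal sketch of how a monitor might fetch data through such an intermediary. The server address, endpoint path and parameters below are illustrative assumptions, not PANIC's actual interface:

```python
# Hypothetical query to the custom JS API server, which relays the request
# to the polkadot.js API and on to the Substrate node's WebSocket endpoint.
# The URL and route are assumptions for illustration only.
import requests

API_SERVER = "http://localhost:3000"  # assumed address of the JS API server

def get_system_health(node_ws_url: str) -> dict:
    response = requests.get(
        f"{API_SERVER}/api/rpc/system/health",
        params={"websocket": node_ws_url},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # e.g. {'peers': 25, 'isSyncing': False}
```

Having discussed the data retrieval process, it is now time to discuss the different types of monitors implemented in PANIC.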

In the current implementation, three types of monitors are included: the node, blockchain and GitHub monitors. As you can imagine, the node monitor monitors a node, the blockchain monitor monitors a blockchain, and so on. More importantly, when PANIC starts executing, monitors run in parallel, and the number of monitors started up depends on what is being monitored. For example, if five nodes from two different blockchains and three GitHub repositories were given to PANIC for monitoring, PANIC starts five node monitors, two blockchain monitors and three GitHub monitors. This also helps to increase fault isolation.
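As a rough sketch of this start-up logic (assuming, for illustration, that each monitor runs in its own process; the real start-up code may differ):

```python
# One monitor process per monitored object, so a crash in one monitor does
# not bring down the rest. The monitor functions are illustrative stubs.
from multiprocessing import Process

def node_monitor(node): print(f"monitoring node {node}")
def blockchain_monitor(chain): print(f"monitoring chain {chain}")
def github_monitor(repo): print(f"monitoring repo {repo}")

def start_monitors(nodes, chains, repos):
    targets = ([(node_monitor, n) for n in nodes]
               + [(blockchain_monitor, c) for c in chains]
               + [(github_monitor, r) for r in repos])
    processes = [Process(target=fn, args=(arg,)) for fn, arg in targets]
    for p in processes:
        p.start()  # all monitors run in parallel from here on
    return processes
```

Calling start_monitors with five nodes, two chains and three repositories would therefore spawn ten monitor processes in total.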

Having discussed the different types of monitors, it is now time to present the alerters, and how they interact with Redis and the channels to send alerts to the operator.

From Monitoring to Alerting

So far we have discussed how data flows from nodes, blockchains and GitHub repositories to specific monitors, and then to the reactive alerters. Now we will discuss the role of the different types of alerters and how they are connected to the other components of PANIC. In PANIC, the alerters can be subdivided into two disjoint categories: reactive and proactive.

Let us first consider the reactive alerters. In PANIC there are three of them: the node, blockchain and GitHub alerters. A separate alerter is assigned to each monitor when PANIC starts, based on the type of the monitor. The job of each reactive alerter is to receive data from its assigned monitor, compare it with the data saved locally from the previous monitoring round, and alert if necessary through the enabled channels. After doing this, the reactive alerter stores the new data both in its local memory and in Redis. Data is stored in Redis so that if PANIC is restarted, the alerter can continue from where it left off and, more importantly, so that it can be queried via Telegram commands.

For a better understanding of the job of the reactive alerters, consider the following example. Suppose that the number of peers of a node decreases. In the next monitoring round, the node monitor associated with this node extracts the data and passes it on to the assigned node alerter. When the node alerter compares the new data with its local data, it notices that the number of peers has decreased. As a result, the node alerter sends a meaningful alert to the operator via the enabled channels.
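Below is a sketch of such a round on the alerter side, assuming the redis-py client and a send_alert helper that stands in for the real channels (the alert severity is also chosen purely for illustration):

```python
# One reactive-alerting round: compare the new data with the previous
# round's data, alert on changes, then persist the new state.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def send_alert(severity: str, message: str) -> None:
    print(f"[{severity}] {message}")  # stand-in for PANIC's real channels

def process_node_data(node_name: str, new_data: dict) -> None:
    key = f"node:{node_name}"
    raw = r.get(key)
    old_data = json.loads(raw) if raw else None

    if old_data is not None and new_data["peers"] < old_data["peers"]:
        send_alert("INFO", f"{node_name}: peers decreased "
                           f"({old_data['peers']} -> {new_data['peers']})")

    # Persisting to Redis lets a restarted alerter continue where it left
    # off, and lets Telegram status commands query the latest state.
    r.set(key, json.dumps(new_data))
```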

The other type of alerter in PANIC is the proactive alerter. In contrast with the reactive alerters, proactive alerters do not receive data from the monitors, and therefore do not generate alerts based on state changes in a node, blockchain or GitHub repository. Currently there is only one proactive alerter in PANIC: the periodic alive reminder. The periodic alive reminder notifies the operator via the enabled channels that PANIC is still running. This is very useful when no alerts have been raised for some time, leaving the operator wondering whether PANIC is still working.
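A minimal sketch of such a reminder follows; the interval and the mute flag are assumptions for illustration:

```python
# Periodic alive reminder: sleep for a configured interval, then emit an
# INFO alert unless the operator has muted it (e.g. via a Telegram command).
import time

MUTE = False     # in PANIC this is toggled via Telegram; assumed here
INTERVAL = 3600  # seconds between reminders; an assumed value

def periodic_alive_reminder(send_alert) -> None:
    while True:
        time.sleep(INTERVAL)
        if not MUTE:
            send_alert("INFO", "PANIC is still running.")
```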

Having described the alerters, it is now time to delve into more detail about the alerting channels and the severity of alerts.

Channels and Alert Severity

What alerting channels does PANIC support? Are all channels used for all types of alerts? Before discussing the different channels supported by PANIC, it is paramount to first discuss the different types of alert severities.

As you might imagine, not every change in state is critical in nature. As a result, not every alert sent by PANIC is critical. For example, a new GitHub repository release is not as critical in nature as the downtime of a validator. Due to this, PANIC uses four different types of severities for alerts:

  • INFO: Alerts of this type have zero to little severity but consist of information that may still be important to acknowledge. Example: increase in bonded balance.
  • WARNING: Alerts of this type require attention as they may be a warning of an incoming critical alert. Example: validator is not elected to validate for the next session.
  • CRITICAL: Alerts of this type are severe in nature. Example: validator has been slashed.
  • ERROR: Alerts of this type are triggered by abnormal events and range from zero to high severity based on the error raised.

Let us now answer the questions raised at the beginning of this subsection. At the time of writing, PANIC supports the following five alerting channels:

  • Twilio: Alerts are raised in the form of a phone call.
  • Telegram: Alerts are sent via a Telegram bot to a Telegram chat.
  • E-mail: Alerts are sent as emails using an SMTP server, with an option for authentication.
  • Console: Alerts are printed to standard output.
  • Log: Alerts are logged to an alerts log.

Alerts of all severities may be sent through the Telegram, e-mail, console and log channels. On the other hand, Twilio is only used for alerts that are critical in nature.
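In code, this routing rule might look like the following sketch, where the channel objects are assumed to expose a send(severity, message) method:

```python
# Severity-based routing: every enabled channel receives every alert,
# except Twilio, which is reserved for CRITICAL alerts.
from enum import Enum

class Severity(Enum):
    INFO = "INFO"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"
    ERROR = "ERROR"

def route_alert(severity: Severity, message: str, channels: dict) -> None:
    for name, channel in channels.items():
        if name == "twilio" and severity is not Severity.CRITICAL:
            continue  # Twilio phone calls are for critical alerts only
        channel.send(severity.value, message)
```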

By default, the console and log channels are always enabled as these require no configuration and are important for when the operator wants to check what is happening during monitoring and alerting. That being said, it is up to the user to decide whether to enable the Twilio, Telegram and the e-mail channels.

As mentioned in the introduction, a cool feature of PANIC is that the user can interact with it via Telegram commands. The following subsection gives a brief description of the different commands that can be issued.

Telegram Commands

Apart from sending alerts to the operator, Telegram bots are also used to accept and handle pre-defined commands sent by the operator. Using Telegram commands, the operator can check the status of PANIC, snooze or unsnooze calls, mute or unmute the periodic alive reminder, and conveniently get Kusama CC3 (Polkadot in the future) explorer links to validator lists, blocks, and transactions. The image below gives a summary of all the commands that can be handled.

Acceptable Telegram Commands
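As a rough illustration of how such commands might be dispatched internally (the handlers and their replies below are made-up stubs, not PANIC's actual implementation):

```python
# Map command strings received from the Telegram bot to handler functions.
# The command names mirror those in the summary above; replies are stubs.
def cmd_status(): return "All monitors and alerters are running."
def cmd_snooze(): return "Phone-call alerts snoozed."
def cmd_unsnooze(): return "Phone-call alerts re-enabled."

COMMANDS = {
    "/status": cmd_status,
    "/snooze": cmd_snooze,
    "/unsnooze": cmd_unsnooze,
}

def handle_command(text: str) -> str:
    handler = COMMANDS.get(text.strip())
    return handler() if handler else "Unrecognized command, try /help."
```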

What’s Next?

Although we think that PANIC will be an indispensable tool for validator operators within the Polkadot ecosystem, we will not stop there. The next step for us is to provide an even better user experience. This will be achieved by developing a web user interface that can easily show the status of the nodes and the alerter. The user interface will also give the operator the ability to switch specific alerts on and off, providing a more customizable experience. In addition, the user interface will make it easier to set up and (re-)configure PANIC.

If you have any questions, comments or suggestions, please do not hesitate to contact us. You can use any contact method listed in the About Us section of our website, or contact me directly on Telegram at @dylan_galea.

Links

Who we are:

Resources:

Code:
