Playbook For Cosmos Validators: Node Architecture Choices

Nov 8, 2018

Keep your Atoms safe!

This is the first, introductory part of a series of posts exploring staking and validation as a Cosmos validator. With the launch of the Cosmos Hub approaching, we see a fast-growing Cosmos validator community. In the latest Gaia-8001 testnet, more than 200 validators secure the distributed network with Tendermint. It’s the first time in history that a BFT consensus algorithm has been used at this scale without significant liveness issues.

Table of Contents:

Part I. Anti-DDoS solution for validators

Part II. Key Management System for validators

Part III. Set up a relay network between validators

The Cosmos Hub is based on a consensus engine called Tendermint, which relies on a set of validators to secure the network. The role of a validator is to run a full node and participate in consensus by broadcasting votes that contain cryptographic signatures made with its private key. Validators commit new blocks to the blockchain and receive revenue in exchange for their work. They must also participate in governance by voting on proposals. Validators are weighted according to their total stake.

Common attacks on a validator node:

Distributed denial of service (DDoS) attack: In the Cosmos network, a validator node can be the target of a DDoS attack. Its fixed IP address and internet-facing RESTful API port make it vulnerable. A DDoS attack can halt the vote messages between validators and prevent blocks from being committed. Such attacks have prompted the exchange Bitstamp to halt Bitcoin trading.
Compromise of keys: The most valuable asset of a validator is the keys it uses to sign blocks. Even if the keys are secured, as long as they can be used from the validator (for example, to sign blocks), an attacker who controls the validator can get anything they want signed with them. Likewise, any copies of the keys in backups or on other support systems could be compromised and used to clone the validator to malicious ends.
Trusted links: Validator systems need to communicate with sentries and support systems. These trusted communication links could be exploited to gain access to the validator. In particular, these links are typically how the validator system is accessed for ongoing administration; if an attacker has access to the system that logs into the validator (such as the sysadmin desktop), they can at a minimum piggyback on that remote session.
Tendermint network vulnerability: Validators and sentries at a minimum need to run the Tendermint network services. Any vulnerabilities here could be exploited by an attacker (either directly, or via malicious transactions, consensus messages, or blocks) to gain access to those systems.

Risk Control Methods

To protect a validator node, one common solution is to set up sentry nodes. A sentry node is just a full node that shields the validator from DDoS attacks by constantly relaying the validator’s signed messages to the public network. Because only the sentries are exposed, an attack floods them rather than the validator, and its effect can be mitigated.

More discussion can be found here: https://cosmos.network/docs/validators/security.html#sentry-nodes-ddos-protection
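As a rough illustration of the sentry idea, the sketch below renders the peer-related settings that a validator and its sentries might place in Tendermint's `config.toml`. The keys `pex`, `persistent_peers`, `private_peer_ids`, and `addr_book_strict` are taken from the Cosmos sentry-node documentation linked above, not from this post; the node IDs, private addresses, and port 26656 are placeholders assumed for the example.

```python
def peer_string(nodes):
    """Join (node_id, host, port) tuples into an id@host:port list."""
    return ",".join(f"{node_id}@{host}:{port}" for node_id, host, port in nodes)


def validator_p2p_settings(sentries):
    """Settings for the validator: talk only to its own sentries."""
    return {
        "pex": False,  # do not gossip or discover peers
        "persistent_peers": peer_string(sentries),
        "addr_book_strict": False,  # allow private (non-routable) addresses
    }


def sentry_p2p_settings(validator, other_peers=()):
    """Settings for a sentry: gossip publicly, but never reveal the validator."""
    return {
        "pex": True,
        "persistent_peers": peer_string([validator, *other_peers]),
        "private_peer_ids": validator[0],  # keep the validator's ID out of gossip
        "addr_book_strict": False,
    }


def to_toml(settings):
    """Render a flat dict as TOML key/value lines."""
    def fmt(value):
        if isinstance(value, bool):
            return "true" if value else "false"
        return f'"{value}"'
    return "\n".join(f"{key} = {fmt(value)}" for key, value in settings.items())


# Hypothetical node IDs and private addresses, for illustration only.
VALIDATOR = ("validatorid", "10.0.0.2", 26656)
SENTRIES = [("sentryid1", "10.0.0.10", 26656), ("sentryid2", "10.0.0.11", 26656)]

validator_toml = to_toml(validator_p2p_settings(SENTRIES))
sentry_toml = to_toml(sentry_p2p_settings(VALIDATOR))
print(validator_toml)
print(sentry_toml)
```

The point of the split is that the validator never takes part in peer exchange, while each sentry does but is told to keep the validator's ID private.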

I have reasoned about four methods of controlling access, in increasing order of resilience:

1. Firewall whitelisting. Tendermint’s port is closed by default and opened only to a whitelist of static peer IP addresses. The disadvantage is that these nodes still communicate over the public internet and are no less vulnerable to DDoS attack than any other sentry. There is security by obscurity, since the public IP addresses should not be gossiped, but once an IP address is discovered it becomes vulnerable, and it is problematic to change because out-of-band coordination with peers is required.
2. VPN connectivity. VPN connectivity can be established between relay nodes. This can be network-to-network IPSec, which is supported by major cloud platforms and most firewalls, host-to-host WireGuard, or any technology stack mutually supported by the two relay node operators. This still uses the public internet, but gaiad does not need a public IP address, so discovering IP addresses to target with DDoS is harder.
3. VPC peering. For peers hosted within the same cloud platform, and where the feature exists, VPC peering can be established. This is possible within GCP and within AWS; I am not familiar with other platforms. The advantage over methods 1 and 2 is that there is no exposure to the public internet. The disadvantage is that both relay nodes need to be hosted on the same cloud platform.
4. Private link. Private connectivity can be established between validator operators. SDNs like MegaPort could be used at reasonable cost. I think this is unlikely to be something anyone wants to get involved with on testnets, but it becomes a reasonable option on mainnet.
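To make the firewall whitelisting option more concrete, here is a minimal sketch that builds `iptables` commands accepting the P2P port only from a list of known peers and dropping everything else. The port 26656 and the peer addresses (taken from documentation-reserved ranges) are assumptions for the example; a real deployment would use its own port, addresses, and firewall tooling.

```python
import ipaddress

P2P_PORT = 26656  # assumed Tendermint P2P port; adjust to your node's config


def allowlist_rules(peer_ips, port=P2P_PORT):
    """Build iptables commands that accept the port only from listed peers."""
    rules = []
    for ip in peer_ips:
        # Raises ValueError if the string is not a valid IPv4 address.
        addr = ipaddress.IPv4Address(ip)
        rules.append(
            f"iptables -A INPUT -p tcp -s {addr} --dport {port} -j ACCEPT"
        )
    # Any other traffic that reaches the P2P port is dropped.
    rules.append(f"iptables -A INPUT -p tcp --dport {port} -j DROP")
    return rules


# Documentation-range addresses standing in for real peers.
rules = allowlist_rules(["203.0.113.10", "198.51.100.7"])
print("\n".join(rules))
```

Order matters here: the ACCEPT rules are appended before the final DROP, so trusted peers match first. Changing a peer's address means regenerating the rules on every node that lists it, which is the coordination cost mentioned above.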

The anti-DDoS options are as follows:

Single Node Validator Setup

This is the simplest way to set up a validator node, and many nodes in the Gaia testnet use this kind of architecture. The validator can only use its firewall to protect itself. This approach is considered unsafe, though you could use firewall whitelisting to accept connections only from trusted peers. If an IP address is discovered it becomes vulnerable, and it is problematic to change because out-of-band coordination with peers is required.

Pro: Easy to implement
Con: Inflexible setup

Single Layer Sentry Node Setup

In this solution, the validator node hides behind a single layer of sentry nodes. Only the sentry nodes use the public internet; the validator node does not need a public IP address, so discovering IP addresses to target with DDoS is harder. The sentry nodes can be scaled automatically, so the setup can recover easily from a DDoS attack.

Pro: Effective at mitigating DDoS attacks
Con: Once an attacker gains access to the private network, they can attack the validator node.

Two Layer Sentry Node Setup

In this setup, the validator node hides behind its private sentry nodes, which introduces an extra layer of protection. The private sentries are also full nodes, but they connect to the validator through a trusted link. The validator’s votes are sent to the private sentries and then relayed to the public sentries.

The validator node normally sits in a local data center that provides HA redundancy. Google Cloud and Amazon AWS both offer direct connection capabilities, and the operator should establish VPN connections between the data center and its private sentry nodes. VPC peering between private sentries and public sentries provides extra protection; this is possible within GCP, within AWS, and on many other platforms.

Pro: This is similar to the classical backend/frontend separation of services in a corporate environment.
Con: Introduces the VPC as a possible point of failure and increases operating cost
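One way to compare the layouts described so far is to count how many links an attacker coming from the public internet has to traverse before reaching the validator. The sketch below models each setup as a small directed graph and runs a breadth-first search over it. The host labels are invented for the example, and the model ignores everything except reachability.

```python
from collections import deque


def hops_to(graph, start, target):
    """Return the fewest links from start to target, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Each key lists the hosts it can open a connection to (labels are hypothetical).
single_node = {"internet": ["validator"]}
single_layer = {
    "internet": ["sentry-a", "sentry-b"],
    "sentry-a": ["validator"],
    "sentry-b": ["validator"],
}
two_layer = {
    "internet": ["public-sentry-a", "public-sentry-b"],
    "public-sentry-a": ["private-sentry"],
    "public-sentry-b": ["private-sentry"],
    "private-sentry": ["validator"],
}

for name, graph in [("single node", single_node),
                    ("single layer", single_layer),
                    ("two layer", two_layer)]:
    print(name, hops_to(graph, "internet", "validator"))
```

Each extra layer adds one more host that must be compromised or flooded, at the price of one more component to operate.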

Relay Network Setup

Relay nodes work like sentry nodes but connect over a private network. The idea is to let validator nodes connect only to the relay nodes, and then let the relay nodes connect to public sentries. Relay nodes can also connect to each other in a private network.

Pro: The validator is less likely to go offline
Con: More duplicated data and more configuration required

The most interesting thing from now on will be the performance test results for each relay node option. Each validator can then choose the best fit for their own “security/performance risk appetite”. Let’s test more in Gaia-9000.

Full Node Backup Method

As the height of the blockchain grows, the size of its data will continue to grow, so it’s wise to back up blockchain data regularly. The backup process is as follows:

Stop your Gaia node
Make a snapshot of your node’s data folder and compress it
Copy it to the correct folder, for example $HOME/.gaiad/
Start the node

Note: if you want the node to start faster, skip the pruning action and start with:

gaiad start --pruning=everything
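The snapshot and copy steps above could be scripted roughly as follows. This is only a sketch using Python's standard `tarfile` module; the function names are made up for the example, and it assumes the node has already been stopped and that the data folder lives somewhere like `$HOME/.gaiad/data`.

```python
import tarfile
from pathlib import Path


def snapshot_data_dir(data_dir, archive_path):
    """Compress a stopped node's data folder into a .tar.gz archive."""
    data_dir = Path(data_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries relative to the folder name, e.g. "data/...".
        tar.add(data_dir, arcname=data_dir.name)
    return Path(archive_path)


def restore_snapshot(archive_path, node_home):
    """Unpack the archive under the node's home folder (e.g. $HOME/.gaiad/)."""
    node_home = Path(node_home)
    node_home.mkdir(parents=True, exist_ok=True)
    # Only unpack archives you created yourself or otherwise trust.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(node_home)
    return node_home


# Example usage (run only while gaiad is stopped; paths are assumptions):
# archive = snapshot_data_dir(Path.home() / ".gaiad" / "data", "gaia-data.tar.gz")
# restore_snapshot(archive, Path.home() / ".gaiad")
```

Keeping the folder name inside the archive means the restore step can target the node's home folder directly, which matches the copy step in the list above.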

For a more detailed explanation using a GCP cloud server, read this post:

Guide for creating Cosmos Sentry nodes quickly from a Snapshot on Google Cloud (Part 2)

forum.figment.network

In the next article, I will talk about key management for validators.

Happy Validating 💗

About the author

I am a blockchain engineer working on the IRISnet project. Currently, I am the testnet coordinator for IRIS and a Cosmos validator. I regularly write down my thoughts. I am also interested in cryptoeconomics and mechanism design. To read more, go to my website.