“The biggest reason for failures in IT environments is human error. Redundancy tends to increase complexity, and complexity increases the chances of human error. So just increasing the number of redundant systems doesn’t necessarily increase your overall reliability. And this is the key to using redundancy effectively: you need to keep the complexity under control.”
As a relatively new validator on the Cosmos network, but already a long-time baker on Tezos, we would like to lay out our high-redundancy staking environment. The probability of any device or component failing must always be treated as increasing over time. Should a validator misbehave, each of its delegators will be partially slashed in proportion to their delegated stake. This is why delegators should perform due diligence on validators before delegating, and spread their stake over multiple validators. It’s very important for us as a validator to keep your risk as a delegator as low as possible, aiming towards zero.
What are the slashing conditions?
If a validator misbehaves, their delegated stake will be partially slashed. There are currently two faults that can result in slashing of funds for a validator and their delegators:
- Downtime: If a validator misses more than 95% of the last 10'000 blocks (around 14 hours), they will get slashed by 0.01%.
- Double signing: If someone reports on chain A that a validator signed two blocks at the same height on chain A and chain B, and if chain A and chain B share a common ancestor, then this validator will get slashed by 5% on chain A.
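To make the two conditions concrete, here is a minimal sketch that turns the numbers from the list above into a small calculation. The function names and the 1'000 ATOM delegation are our own illustrative assumptions; the actual slashing logic lives in the Cosmos Hub itself and is not reproduced here.

```python
from decimal import Decimal

# Parameters quoted in the list above.
SIGNED_BLOCKS_WINDOW = 10_000
MAX_MISSED_FRACTION = Decimal("0.95")
DOWNTIME_SLASH = Decimal("0.0001")   # 0.01%
DOUBLE_SIGN_SLASH = Decimal("0.05")  # 5%


def is_downtime_fault(missed_blocks: int) -> bool:
    """True if more than 95% of the last 10'000 blocks were missed."""
    return missed_blocks > SIGNED_BLOCKS_WINDOW * MAX_MISSED_FRACTION


def slashed_amount(stake: Decimal, fraction: Decimal) -> Decimal:
    """Amount removed from a stake for a given slashing fraction."""
    return stake * fraction


if __name__ == "__main__":
    stake = Decimal("1000")  # hypothetical delegation, in ATOM
    print("downtime slash:", slashed_amount(stake, DOWNTIME_SLASH))
    print("double-sign slash:", slashed_amount(stake, DOUBLE_SIGN_SLASH))
```

With these numbers, a delegation of 1'000 ATOM would lose 0.1 ATOM for a downtime fault and 50 ATOM for a double-signing fault, which is why the second case deserves far more care.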
Equipment to protect our nodes against downtime
- Two physical DELL R710 nodes with Intel 5500, 32GB RAM, 2TB SSD (2)
- Two physical SOPHOS firewalls to secure our network (2)
- Two internet providers connected to our nodes (Swisscom primary, Salt secondary) with an additional USB GSM module (2)
- Two HSM Ledger Nano S (one actively signing blocks, second as backup)
- Two DELL PowerConnect 2848 stacked (1)
- Uninterruptible power supply smart-ups 3000 XL (3)
In summary, this means we can handle the following failures:
- power failures and power outage
- internet connection failure / downtime
- firewall downtime
- DDOS attacks
- hardware failures (node, firewall, switch)
- OS or software issues on our node -> switching to node 2
For very rare and unexpected scenarios such as fire or earthquake, we can set up a new validator in a very short time from the backed-up images of our node and run it in a different datacenter. As we have iDRAC configured, we can also reach our nodes over an IP address even when they are powered off. In addition to the image, we regularly back up our blockchain data from $HOME/.gaiad/ to an encrypted backup.
“In addition to running a Cosmos Hub node, validators should develop monitoring, alerting and management solutions.” — Cosmos Network
This is exactly what we do with PRTG Network Monitor. PRTG monitors every aspect of our IT infrastructure. An alert is sent as soon as a failure occurs, be it internet downtime, a hardware failure, node inactivity, etc. Every alert is forwarded to our Telegram bot, which informs us about the current state on our smartphones.
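The forwarding step can be pictured roughly as follows. This is only a sketch of how an alert could be pushed to a Telegram bot through the Bot API's `sendMessage` method; it is not our PRTG configuration. The bot token, chat ID, sensor name and message format are placeholders chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_alert_request(bot_token: str, chat_id: str, sensor: str, status: str):
    """Return (url, form-encoded body) for a Bot API sendMessage call."""
    url = f"{TELEGRAM_API}/bot{bot_token}/sendMessage"
    text = f"[ALERT] {sensor}: {status}"  # message format is an assumption
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return url, body


def send_alert(bot_token: str, chat_id: str, sensor: str, status: str,
               timeout: float = 10.0) -> bool:
    """POST the alert to Telegram; True if the API answers with ok=true."""
    url, body = build_alert_request(bot_token, chat_id, sensor, status)
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))
```

A monitoring hook would then call something like `send_alert(token, chat_id, "node 1", "down")`, with the token and chat ID taken from a secret store rather than from the script itself.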
Prevention against double signing
Being slashed for double signing is very rare but still a risk. Double signing means being slashed by 5%, far more than the 0.01% for a downtime of 10'000 blocks. This is why we worked out a solution that reduces this risk to nearly zero as well.
Our solution is pretty simple but works very well. Remember that it is important “to keep the complexity under control”. Our second node only gets access to priv_validator.json and priv_val_state.json once we are absolutely sure that the validator on node 1 is no longer running. This can be verified very easily through the iDRAC API or by checking the node status. priv_val_state.json acts as a kind of double-sign protection: the current block height can be compared between node 1 and node 2, and the file copied, before node 2 starts validating. Switching to our second node can therefore be done easily and safely.
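The guard described above can be sketched as follows. This is a simplified illustration and not our production script. It assumes that the state file is JSON with a `height` field, that a copy of node 1's last state file is reachable from node 2, and that the "is node 1 running?" check (iDRAC API or node status) has already been reduced to a boolean. The function names are hypothetical.

```python
import json
import shutil
from pathlib import Path


def read_height(state_file: Path) -> int:
    """Read the last signed height (assumed JSON field: "height")."""
    with state_file.open() as handle:
        return int(json.load(handle)["height"])


def activate_standby(node1_is_running: bool, node1_state: Path,
                     node2_state: Path) -> bool:
    """Copy the signing state to node 2 only if node 1 is confirmed down.

    Returns True if node 2 may start validating, False otherwise.
    """
    if node1_is_running:
        return False  # never allow two active signers
    if node2_state.exists() and read_height(node2_state) > read_height(node1_state):
        return False  # node 2 is ahead of node 1: unexpected, investigate first
    shutil.copyfile(node1_state, node2_state)
    return True
```

The important design choice is that the function refuses by default: node 2 only receives the state file, and with it permission to sign, after every check has passed.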
We simulated a hardware failure on Friday, 10 May 2019, to go through all the necessary steps. We missed only 43 blocks, which is around 5 minutes of not being online. Our scripts worked perfectly: they check with an API request whether node 1 is running before starting to validate on the second node. Double signing was never at risk. It’s also clear to us that in a real scenario we could not react as quickly. But because we know exactly how to handle such scenarios, our downtime will be far less than 10'000 blocks if a bigger problem occurs, and double signing will not be at risk.
Delegate to Cosmos Suisse — cheap and secure
Delegate and support us by using:
- Lunie.io, Ledger Nano S required → staking guide here
- Cosmostation, mobile wallet → staking guide here
If you don’t feel safe about your current validator, remember that you can redelegate your staked coins instantly the first time. You don’t need to unbond for that. Feel free to join Cosmos Suisse based in Crypto-Valley, Switzerland.
It also makes sense to split your stake over multiple validators. Splitting across multiple validators means better risk management for the delegator and also helps the network become more decentralized.
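As a rough illustration of the risk-management argument, the sketch below computes the loss from a single double-signing event when a stake is split equally. It assumes an equal split and that exactly one of the chosen validators is slashed by the 5% mentioned earlier; the function name and the 1'000 ATOM figure are illustrative only.

```python
from decimal import Decimal

DOUBLE_SIGN_SLASH = Decimal("0.05")  # 5%, as stated in the slashing conditions


def loss_if_one_slashed(total_stake: Decimal, validators: int,
                        slash_fraction: Decimal = DOUBLE_SIGN_SLASH) -> Decimal:
    """Loss when one of `validators` equally funded validators is slashed."""
    if validators < 1:
        raise ValueError("validators must be at least 1")
    return total_stake / validators * slash_fraction


if __name__ == "__main__":
    stake = Decimal("1000")  # hypothetical total stake, in ATOM
    for count in (1, 2, 4):
        print(count, "validator(s):", loss_if_one_slashed(stake, count))
```

Under these assumptions, spreading 1'000 ATOM over four validators limits the loss from one double-signing event to 12.5 ATOM instead of 50 ATOM.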