SDBS #15 | Stabilizing the Network: Asynchronous Block Validation

Published in stealthsend, Mar 23, 2019, 7 min read

Nodes validate timestamps against other block timestamps, not against any internal clock

This week focused on creating a more stable qPoS network, where nodes can reliably stay on the same chain even when they individually or collectively have connectivity problems. To improve network stability, I had to make significant changes to registry synchronization by abandoning its reliance on the system clock. Instead, qPoS will use peer-based information and chain trust to resolve timing conflicts. These approaches resemble well-known solutions in distributed systems.

— — — — — — —

Abandoning Clock-Based Synchronization

Earlier in this blog series, in SDBS #4, I described how several loops, each running in its own thread, would coordinate to both produce and validate new blocks. The two central loops of this scheme were the block production loop and the registry synchronization loop. The block production loop monitors the system clock until it is the client’s turn to mint a block. This monitoring takes the form of periodically checking the system clock time against the staker queue. The registry synchronization loop ensures the staker queue is always up to date. Any new blocks are checked in a third loop called the message handler loop, also running in its own thread. These blocks are validated against the queue, which should be up to date.

It turns out this system works fine for a single node, assuming that any multithreading is properly handled. However, a network of nodes will have two types of problems that make this approach exceedingly fragile. First, the system clocks are not guaranteed to stay synchronized. It is possible to strongly suggest the use of highly accurate network timekeeping protocols, such as NTP, but it is impossible to enforce the use of these protocols. Second, and harder to resolve in terms of synchronization, network connectivity can be broken, meaning that nodes are not guaranteed to stay in communication with the network at all times. This loss of connectivity can happen as part of the protocol, where nodes randomly drop and connect to peers to improve network topology. In other cases, a node may simply have internet issues, say during power outages or when its ISP has problems. Even sporadic internet issues can cause significant consensus problems.

At the very least, loss of connectivity means that blocks will be delayed for a node that has been temporarily disconnected. If the internal queue of a node advances while blocks are delayed, this node will assume that these blocks are late and reject them even if these blocks are perfectly valid. Obviously, these types of rejections are the fault of the node that lost connectivity and not the fault of the node which produced the blocks.

This scenario highlights that a node should not generally reject blocks that don’t agree with its system clock. Furthermore, a node should treat its own system clock as “just a suggestion”, relying on events received from its network to determine timing. In a blockchain, these timing events should be the blocks themselves. This treatment contrasts sharply with the approach I described in SDBS #4.

The new approach for qPoS uses asynchronous timing, where the queue (which is a type of clock) is advanced only when the client connects a new block to its blockchain. The timestamp of the connected block gives the queue its current time and causes the queue to advance accordingly. The queue then remains in this state until another new block is connected. Notice how this “new” logic is the opposite of the “old” logic I described in SDBS #4:

Old Logic: The internal queue advances according to the system clock. Blocks that do not arrive at the right times will be rejected.
New Logic: New blocks are validated without regard to the system clock. They carry timestamps, and these timestamps advance the queue.
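To make the contrast concrete, here is a minimal Python sketch of the new logic. It is an illustration only, not the Stealth client code: the names `Block`, `StakerQueue`, and `connect` are hypothetical, and the plain round-robin staker order is an assumption made to keep the example short. The point is that the queue never consults the system clock; only the timestamp of a connected block advances it.

```python
from dataclasses import dataclass

BLOCK_INTERVAL = 5  # seconds per slot, matching the five-second block spacing


@dataclass
class Block:
    staker_id: int
    timestamp: int  # seconds, set by the block producer


class StakerQueue:
    """Hypothetical queue: one staker per slot, in round-robin order (an assumption)."""

    def __init__(self, stakers, start_time):
        self.stakers = list(stakers)
        self.start_time = start_time
        self.current_slot = -1  # advanced only by connected blocks

    def slot_of(self, timestamp):
        # Map a block timestamp to its five-second slot.
        return (timestamp - self.start_time) // BLOCK_INTERVAL

    def staker_for(self, slot):
        return self.stakers[slot % len(self.stakers)]

    def connect(self, block):
        """Validate a block against the queue only; the system clock is never read."""
        slot = self.slot_of(block.timestamp)
        if slot <= self.current_slot:
            return False  # the timestamp does not move the queue forward
        if self.staker_for(slot) != block.staker_id:
            return False  # the producer does not own this slot
        self.current_slot = slot  # the block's timestamp advances the queue
        return True


queue = StakerQueue(stakers=[10, 11, 12], start_time=1000)
queue.connect(Block(staker_id=10, timestamp=1000))  # slot 0
queue.connect(Block(staker_id=10, timestamp=1015))  # slot 3; slots 1 and 2 were skipped
```

In this sketch a block that arrives late on the wall clock is still accepted, as long as its timestamp places it in a later slot than the last connected block and its producer owns that slot.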

One consequence of this new logic is that a node must trust block producers with all timing information. This seems like a lot of trust, especially when the validity of a block is contingent on its timing. If a node must trust block producers to keep time, how can the network ensure block production occurs at nearly exact five-second intervals?

On the most fundamental level, nodes that don’t produce blocks are simply observers. They don’t own stakers, which represent the operational stake in the network. Since observer nodes don’t own operational stake, they should be comfortable letting the producing nodes keep time and ensure that blocks come with good timing. Does this marginalize holders of XST who don’t run stakers? Perhaps, but this situation is no different from PoW (as with Bitcoin) or non-stakers in PoS (like exchanges). In short, plenty of precedent exists in cryptocurrencies for relying exclusively on consensus nodes to keep network time.

How does a producing node keep its peers in check? How can it produce blocks on time while waiting for delayed blocks from the network? If the queue doesn’t advance while a producing node is waiting for delayed blocks, how will the block production loop know it is time to produce a block?

The answer is that a node will produce a block on contingency, and even connect this contingent block to its own blockchain. If this block skips some time slots, then it essentially means that this node thinks one or more of its peers have missed blocks. Producing a block as if a peer has skipped a block is a sort of accusation, and a serious one at that. In fact, the qPoS protocol penalizes nodes whose immediate predecessors in the queue miss too many blocks.
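The contingency decision can be sketched as follows. This is a hedged illustration, not the actual block production loop: the function name `contingent_slot`, the round-robin staker order, and the slot arithmetic are all assumptions. Here the system clock is used only as a suggestion for when to produce, and the number of skipped slots is the implicit "accusation" described above. The penalty logic is not modeled.

```python
BLOCK_INTERVAL = 5  # seconds per slot


def contingent_slot(last_block_time, local_time, my_id, stakers, start_time):
    """Return (slot, skipped) if the local clock suggests it is this node's turn.

    Returns None if it is not this node's turn, or if the local clock has not
    moved past the slot of the last connected block.
    """
    last_slot = (last_block_time - start_time) // BLOCK_INTERVAL
    now_slot = (local_time - start_time) // BLOCK_INTERVAL
    if now_slot <= last_slot:
        return None  # nothing new to produce yet
    if stakers[now_slot % len(stakers)] != my_id:
        return None  # another staker owns the current slot
    # Slots between the last connected block and this one are treated as missed.
    skipped = now_slot - last_slot - 1
    return now_slot, skipped


stakers = [10, 11, 12]
# Last connected block was in slot 0; the local clock says slot 2, owned by staker 12.
decision = contingent_slot(1000, 1011, my_id=12, stakers=stakers, start_time=1000)
```

In the usage line, staker 12 would produce its block in slot 2 while implying that staker 11 missed slot 1. If staker 11's block was only delayed, the fork is resolved by chain trust.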

Suppose a node has minted a block on the assumption that its immediate predecessor has missed its block, and then receives the delayed blocks (that is, the blocks were delayed and not actually missed). The node will disconnect its own block and connect the delayed blocks, but only if the delayed blocks make a chain with more trust than its own. Chain trust is calculated using a metric known as staker weight, which is described in the qPoS whitepaper. This event, where one branch is disconnected and another connected, is called a reorganization. Nearly all cryptocurrencies use reorganizations to resolve forks.
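A reorganization of this kind can be sketched in a few lines of Python. The sketch is deliberately simplified and makes assumptions: blocks are plain dictionaries, the `weight` values are placeholders rather than the staker weight defined in the qPoS whitepaper, and the function names are hypothetical.

```python
def chain_trust(branch):
    # Placeholder trust metric: the sum of per-block weights (an assumption).
    return sum(block["weight"] for block in branch)


def fork_point(active, candidate):
    """Return the number of leading blocks the two chains have in common."""
    common = 0
    for a, b in zip(active, candidate):
        if a["hash"] != b["hash"]:
            break
        common += 1
    return common


def reorganize(active, candidate):
    """Return (new_active_chain, disconnected_blocks).

    The candidate branch replaces the active branch only if it carries
    strictly more trust after the fork point.
    """
    n = fork_point(active, candidate)
    old_branch, new_branch = active[n:], candidate[n:]
    if chain_trust(new_branch) <= chain_trust(old_branch):
        return active, []  # keep the current chain
    return active[:n] + new_branch, old_branch


genesis = {"hash": "g", "weight": 1}
mine = {"hash": "m", "weight": 3}     # contingent block that skipped a slot
delayed = {"hash": "d", "weight": 4}  # predecessor's block, delayed in transit
chain, disconnected = reorganize([genesis, mine], [genesis, delayed])
```

With these placeholder weights, the delayed block carries more trust, so the node disconnects its own contingent block and adopts the delayed one.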

I believe this approach to timekeeping has robust logical foundations. (1) It ultimately uses network events as a clock. (2) It allows input from each block producer’s internal clock, and the network timekeeping apparatus incorporates this input if the network agrees with the internal clock. (3) It uses chain trust to resolve disputes, where chain trust is based on the prior behavior of individual stakers. (4) Finally, it uses well-established methods to reorganize a chain when forks arise.

— — — — — — —

Suggested Timekeeping for System Clocks: NTP

The above discussion explains how the network clock will essentially be asynchronous, relying on block ordering and block trust. However, blocks must contain timestamps. The practical reason for timestamps in qPoS consensus is that the block producer is checked against the queue and the queue is organized chronologically. A chronological queue is required to ensure precise block spacings of five seconds.

For real-world practical reasons, and to reduce conflict on the Stealth network, block timestamping should be based on reliable timekeeping that is as consistent as possible with official time, such as that kept by NIST. For this reason, the existing Stealth network clock will be abandoned. Stealth presently uses a network clock where each node samples its peers for their clock times, then calculates an internal “clock drift” that the node applies to its system time. This means that all nodes on the Stealth network agree very well on the network clock, but this network clock is not necessarily synchronized with official time. In fact, for nearly three years, the Stealth network time has been behind official time by over 13 minutes and is now drifting closer to 15 minutes. This drift can be observed by visiting the Stealth explorer at chainz.cryptoid.info/xst/. If you are reading this post shortly after publication, you will notice that the newest block is 15 minutes old, as seen in the following screenshot:

The newest block on the Stealth network is not actually 15 minutes old. Rather, its timestamp is 15 minutes behind because the Stealth network clock has been drifting backwards for years without a correction. These erroneous timestamps have caused very few problems in the past, with the exception of interfering with some third-party services, most notably Ledger Wallet. These issues were fixed when timestamps were removed from transactions in Stealth version 2.2. Although it currently causes no identifiable problems, the disconcerting drift of block timestamps remains. Once clock drift is removed from the Stealth network timekeeping protocol upon the transition to qPoS, blocks will no longer have this discrepancy.
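To see how a peer-sampled network clock can agree with itself while disagreeing with official time, consider the following sketch. It assumes the drift is the median of the peers' offsets from the local clock; the post does not specify the aggregation rule, so the median and the function names here are assumptions for illustration only.

```python
from statistics import median


def clock_drift(local_time, peer_times):
    """Drift to apply to the system clock, taken as the median peer offset (assumed)."""
    offsets = [t - local_time for t in peer_times]
    return median(offsets) if offsets else 0


def network_time(local_time, peer_times):
    # The node's view of network time is its system time plus the drift.
    return local_time + clock_drift(local_time, peer_times)


# A node whose system clock matches official time joins a network
# whose peers all report times about 15 minutes (900 s) behind.
official = 10_000
peers = [9_100, 9_102, 9_098]
adjusted = network_time(official, peers)
```

In this example the correctly synchronized node adopts the peers' time and ends up 900 seconds behind official time. Nothing in the mechanism anchors the network to an external reference, which is consistent with the long-lived drift described above.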

For qPoS, we recommend moving to NTP for clock synchronization. Even the built-in Linux timesyncd will not be as accurate as running NTP. We will provide instructions describing how to set up NTP soon, but please note that block producers need to use a suitable hosting provider as outlined here. In summary, staker operators should run bare metal servers or VPS with a virtualized clock driver, remove or disable any timekeeping daemons like timesyncd, and install an NTP daemon, like ntpd.

— — — — — — —

Other Improvements This Week

This week I eliminated most useless qPoS block production, reducing CPU usage for a single staker on a low-end VPS to about 2% usage of a single core. Some useless production is inevitable in the case of microforks, which are a necessary aspect of every true blockchain.

I also added the testnet RPC exitreplay, which jumpstarts qPoS minting on a network with only a few nodes. The exitreplay RPC is not needed on mainnet, where block production is expected never to stop.

— — — — — — —

Hondo

— — — — — — —

Written by Stealth