Discontinuing Tendermint Core v0.35: A postmortem on the new networking layer
A retrospective on v0.35: what happened, how it happened, and the decisions we’ve made going forward
Tendermint Core v0.35 was released in November 2021. Among other work, v0.35 included the first phase of a peer-to-peer (p2p) refactor, which modified several abstractions with the aim of preparing the codebase for larger subsequent changes (i.e. libp2p). The refactor overhauled three components: the Router, responsible for dialing and accepting connections between peers and routing packets to and from modules; the Peer Manager, responsible for storing peer information and tracking the peer lifecycle; and the PEX Protocol, responsible for gossiping peer addresses between nodes.
Due to delays in the release, v0.35 wasn’t initially integrated with the SDK. It wasn’t until spring of 2022 that the Penumbra, Celestia and Vega teams began to adopt v0.35 in their testnets. The teams experienced substantial instability: the connected peer count fluctuated constantly, causing consensus to take multiple rounds to produce a block and to stall periodically. With the help of these teams, the Cosmos Consensus engineering team began to diagnose the symptoms.
Several problems were subsequently uncovered and resolved. First, the two PEX protocols, which should have been cross-compatible, sent different numbers of peer addresses, which caused nodes to disconnect from one another. Second, once disconnected, peers weren’t reconnecting, and operators observed unusually low peer counts. This was largely because the new parameters had never been properly calibrated. As it turned out, the new p2p router was dialing peers orders of magnitude slower than the legacy switch. Additionally, in some cases an error in either the dialing or the accepting thread caused that thread to stop entirely instead of continuing, so no new connections were being made.
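The stopped-thread failure can be illustrated with a minimal Go sketch. This is not the Tendermint router code: `dialFunc`, `fakeDial`, and the peer names are hypothetical stand-ins, and the only point is the difference between returning on the first error and recording the error and moving on.

```go
package main

import (
	"errors"
	"fmt"
)

// dialFunc is a hypothetical stand-in for a transport-level dial call.
type dialFunc func(addr string) error

// errRefused and fakeDial are illustrative only: "peer-b" always fails.
var errRefused = errors.New("connection refused")

func fakeDial(addr string) error {
	if addr == "peer-b" {
		return errRefused
	}
	return nil
}

// dialStopOnError mirrors the faulty pattern: the first error ends the loop,
// so every address after the failing one is never dialed.
func dialStopOnError(addrs []string, dial dialFunc) []string {
	var connected []string
	for _, a := range addrs {
		if err := dial(a); err != nil {
			return connected
		}
		connected = append(connected, a)
	}
	return connected
}

// dialContinueOnError records the failure and keeps dialing the rest.
func dialContinueOnError(addrs []string, dial dialFunc) ([]string, map[string]error) {
	var connected []string
	failed := make(map[string]error)
	for _, a := range addrs {
		if err := dial(a); err != nil {
			failed[a] = err
			continue
		}
		connected = append(connected, a)
	}
	return connected, failed
}

func main() {
	addrs := []string{"peer-a", "peer-b", "peer-c"}
	fmt.Println(dialStopOnError(addrs, fakeDial)) // prints "[peer-a]"
	ok, failed := dialContinueOnError(addrs, fakeDial)
	fmt.Println(ok, len(failed)) // prints "[peer-a peer-c] 1"
}
```

In a long-running routine the same idea usually pairs the `continue` with logging and a retry delay, which this sketch leaves out.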
To better understand the state of the networking layer, the engineering team developed a testing harness for deploying hundreds of geographically dispersed nodes. From these deployments, it was observed that newer nodes often struggled to connect with others. This was a flaw in the design, which had gone from separate caps on incoming and outgoing connections to a single cap on all connections. Coupled with a greedy dialing algorithm, this meant that existing nodes would fill all of their connection slots and leave no room for a new node to join. After reverting to two separate caps, nodes still had free slots and the network was again able to support larger, partially connected topologies.
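A toy simulation makes the effect of the cap design easier to see. The sketch below is not the Peer Manager logic: the types, the dialing order, and the numbers (14 nodes, six slots per node) are assumptions chosen for illustration. Existing nodes dial greedily, then a newcomer tries to join.

```go
package main

import "fmt"

// node tracks connection counts; all names here are hypothetical.
type node struct {
	inbound, outbound int
	peers             map[int]bool
}

// limits is a connection policy. With split == false only maxTotal applies.
type limits struct {
	split                   bool
	maxTotal                int
	maxInbound, maxOutbound int
}

func (l limits) canDial(n *node) bool {
	if l.split {
		return n.outbound < l.maxOutbound
	}
	return n.inbound+n.outbound < l.maxTotal
}

func (l limits) canAccept(n *node) bool {
	if l.split {
		return n.inbound < l.maxInbound
	}
	return n.inbound+n.outbound < l.maxTotal
}

// connect tries to open a connection from node a to node b.
func connect(nodes []*node, a, b int, l limits) bool {
	if a == b || nodes[a].peers[b] {
		return false
	}
	if !l.canDial(nodes[a]) || !l.canAccept(nodes[b]) {
		return false
	}
	nodes[a].outbound++
	nodes[b].inbound++
	nodes[a].peers[b] = true
	nodes[b].peers[a] = true
	return true
}

// newcomerConnections lets n nodes dial greedily (each dials every other node
// until a limit stops it), then counts how many connections a late joiner gets.
func newcomerConnections(n int, l limits) int {
	nodes := make([]*node, n+1)
	for i := range nodes {
		nodes[i] = &node{peers: map[int]bool{}}
	}
	for a := 0; a < n; a++ {
		for b := 0; b < n; b++ {
			connect(nodes, a, b, l)
		}
	}
	count := 0
	for b := 0; b < n; b++ {
		if connect(nodes, n, b, l) {
			count++
		}
	}
	return count
}

func main() {
	single := limits{maxTotal: 6}
	split := limits{split: true, maxInbound: 4, maxOutbound: 2}
	fmt.Println("single cap:", newcomerConnections(14, single)) // prints "single cap: 0"
	fmt.Println("split caps:", newcomerConnections(14, split))  // prints "split caps: 2"
}
```

With a single cap, greedy dialing packs the 14 nodes into two fully saturated groups, so the newcomer is refused everywhere. With split caps, each node can only consume two slots by dialing, so inbound capacity is left over for late joiners. Note that this only works when the inbound cap is larger than the outbound cap; with equal caps the existing nodes can still exhaust every inbound slot.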
The last major fault in the system was in the priority queue, which buffers and orders messages to be sent to peers. The queue wouldn’t push messages out unless a new message was added to it. This occasionally meant that messages which could have been delivered were blocked. After implementing a simpler priority queue, the team is currently able to run an internal 200-node network that exhibits behavior similar to v0.34.
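One way to avoid this class of stall is to let the sender pull from the queue whenever it is ready, so that delivery never depends on another enqueue. The sketch below shows that idea with Go’s `container/heap`. It is an assumption-laden illustration rather than the queue that shipped: `sendQueue`, `envelope`, the priority values, and the payload names are all hypothetical.

```go
package main

import (
	"container/heap"
	"fmt"
)

// envelope is a hypothetical queued message.
type envelope struct {
	priority int    // higher values are sent first
	seq      uint64 // insertion order, used to keep ties FIFO
	payload  string
}

// envelopeHeap implements heap.Interface, ordered by priority, then by seq.
type envelopeHeap []envelope

func (h envelopeHeap) Len() int { return len(h) }
func (h envelopeHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h envelopeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *envelopeHeap) Push(x interface{}) { *h = append(*h, x.(envelope)) }
func (h *envelopeHeap) Pop() interface{} {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// sendQueue buffers outgoing messages. It is not safe for concurrent use;
// a real implementation would add locking or run inside a single goroutine.
type sendQueue struct {
	h       envelopeHeap
	nextSeq uint64
}

func (q *sendQueue) Enqueue(priority int, payload string) {
	heap.Push(&q.h, envelope{priority: priority, seq: q.nextSeq, payload: payload})
	q.nextSeq++
}

// Dequeue returns the highest-priority message, or false if the queue is empty.
// It can be called at any time, independent of Enqueue.
func (q *sendQueue) Dequeue() (string, bool) {
	if q.h.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.h).(envelope).payload, true
}

// drain empties the queue without requiring any further Enqueue calls.
func drain(q *sendQueue) []string {
	var out []string
	for {
		p, ok := q.Dequeue()
		if !ok {
			return out
		}
		out = append(out, p)
	}
}

func main() {
	q := &sendQueue{}
	q.Enqueue(1, "block part")
	q.Enqueue(5, "vote")
	q.Enqueue(1, "tx")
	fmt.Println(drain(q)) // prints "[vote block part tx]"
}
```

All three messages leave the queue in priority order even though nothing else is enqueued after them, which is exactly the case that was stalling in v0.35.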
Looking back at the failure to deliver a stable v0.35, the most apparent cause was the lack of continuity within the team. The engineer who designed the new system, the engineer who implemented it, and the team that finally released v0.35 were all different. The new team that inherited the codebase was under the impression that the refactor was internal plumbing and was unaware of the extent to which the behavior had changed. It was also overly reliant on the small-scale e2e tests for detecting regressions.
To prevent a recurrence, the team has spent a lot of time updating the QA process, which will now include large, long-lived networks. These can provide greater insight into overall performance. Furthermore, the team is looking to better coordinate releases with the SDK, which is essential for imitating production-like environments.
The Cosmos Consensus Council, as part of its risk mitigation strategy, has decided to sunset v0.35. The features in that release (as well as v0.36) will be re-evaluated and broken down into smaller releases. You can read more about the near-term future of Tendermint Core here.
Finally, thanks to all the teams that worked alongside us. Your cooperation and feedback are extremely valuable in moving Tendermint Core forward.
The Cosmos Consensus Council is a community initiative, formed and coordinated by contributors to Tendermint Core from across the Cosmos Ecosystem. While the council makes recommendations on the product roadmap and helps coordinate development efforts, it is not associated with or led by the Interchain Foundation or Interchain GmbH.