Improving Liveness and Reliability in the Interchain Stack
Liveness is a crucial concept in the Interchain Stack and in distributed systems generally, because it affects operational reliability, scalability, and performance. A live network keeps its nodes communicating effectively and consistently, which reduces downtime, helps detect failures, and improves overall system performance. Because liveness matters so much, developers working with the Interchain Stack often make assumptions about its liveness guarantees. However, given the inherent challenges of decentralized blockchain development, the stack components do not provide distributed performance guarantees today. Let’s explore these challenges, the concepts of liveness and safety in the Interchain Stack, and some common liveness misconceptions, for the benefit of builders and the sustainable growth of the ecosystem.
Safety Over Liveness — A Core Design Principle
Liveness in the Interchain Stack — comprising Cosmos SDK, CometBFT, CosmWasm, CosmJS, and IBC — is the property that enables the system’s continuous operation and availability, ensuring all nodes can communicate with each other effectively and consistently. While liveness is important in distributed systems where various nodes work together to maintain the integrity and functionality of the network, the concept of safety over liveness remains a core design principle of the Interchain Stack and most Layer One blockchain technologies in use today.
The Interchain Stack’s robust architecture and strong performance track record can lead to misconceptions about liveness and the distributed performance guarantees the Stack provides. Applications built on those misconceptions can be unreliable and unsafe. Prioritizing safety over liveness is essential for the overall health and security of the Interchain Stack, enabling it to provide a reliable and scalable foundation for building innovative blockchain applications and securing billions of dollars of digital assets.
Challenges with Liveness and the Interchain Stack
Decentralization relies on multiple parties to compute, transmit, and store chain data while reaching a unified consensus to produce blocks. This introduces unknown environmental and runtime factors over which the Stack has no control, making it challenging to provide performance guarantees at the protocol level. Given the complexities of building distributed systems, promising such liveness guarantees would be irresponsible, and the Stack has never made such claims.
When working with the Interchain Stack, developers must not make liveness assumptions and should design their applications accordingly. Specifically:
- Developers should not assume consistent block throughput or timely events. Variability and potential delays of unknown frequency and duration may occur. This should be accounted for during application design.
- Developers should not rely on a chain’s past performance to remain at the same level in the future. They should account for possible changes when designing their systems to ensure stability and reliability.
- Development teams should implement mechanisms that manage timeouts, delays, and retries so that applications remain functional under varying network conditions.
- Developers should consider resource constraints and optimize the use of compute resources, storage, and bandwidth during implementation. This will help create more efficient and reliable applications.
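The timeout and retry recommendation above can be sketched on the client side as follows. This is a minimal illustration, not part of the Interchain Stack or CosmJS: the names `RetryOptions`, `withTimeout`, `backoffDelay`, and `retryWithBackoff` are hypothetical, and the exponential backoff policy is one reasonable choice among several.

```typescript
// Hypothetical helpers for illustration; none of these names come from CosmJS.

/** Options controlling how long to wait and how often to retry. */
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  attemptTimeoutMs: number; // give up on a single attempt after this long
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound on the backoff delay
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Rejects if `work` does not settle within `ms` milliseconds.
 * Note: this only stops waiting; it does not cancel the underlying work.
 */
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

/** Delay before retry number `retry` (0-based): base * 2^retry, capped. */
function backoffDelay(retry: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** retry);
}

/** Runs `op` until it succeeds or `maxAttempts` attempts have failed. */
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await withTimeout(op(), opts.attemptTimeoutMs);
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelay(attempt, opts));
      }
    }
  }
  throw lastError;
}
```

A wrapper like this suits idempotent reads such as queries. Retrying a transaction broadcast needs more care, since an attempt that timed out on the client may still be included in a later block.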
Strategies to Mitigate Liveness Issues
As the interchain evolves and more use cases and needs emerge, the Interchain Stack steward teams are examining strategies to mitigate liveness issues at the protocol level and exploring opportunities to enhance the system’s components to optimize performance. Some examples follow.
CometBFT
CometBFT, the consensus engine designed for the interchain and beyond, provides several strategies to address liveness issues caused by spam or traffic surges. These mitigations help maintain network performance and block production times. One such strategy involves reducing the default BlockParams.MaxBytes from 21 MB to 4 MB, limiting block sizes and reducing the impact of spammy transactions.
CometBFT also incorporates several other measures to enhance its resilience and predictability in the face of transaction surges and improve the reliability of CometBFT-based networks. These include:
- Adjusting timeout parameters for proposing larger blocks
- Maintaining consistent node configurations across the network
- Avoiding large mempools that can negatively impact performance
- Delegating mempool responsibility to the application layer through an optional “nop” feature
- Implementing a backpressure mechanism to manage transaction flow
- Supporting async mempool updates from block.Commit
Future Improvements in CometBFT
The CometBFT team is working on further improvements to mitigate liveness issues caused by spam surges and overloaded mempools. One of these improvements introduces Quality of Service (QoS) guarantees for the mempool protocol. This design aims to prevent the pressure from spreading to other components within CometBFT by adding mechanisms that allow nodes to drop some transactions in certain circumstances. The QoS approach is known as “mempool lanes,” where transactions with higher priority will have stronger guarantees and better predictability in terms of propagation and inclusion in a block.
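To make the lane idea concrete, here is a toy model in TypeScript. It is not CometBFT’s implementation (CometBFT is written in Go), and every name in it (`LanedMempool`, `add`, `reap`, and the lane names in the usage note) is hypothetical. It assumes each lane has a fixed capacity, that a transaction arriving at a full lane is dropped, and that higher-priority lanes are drained first when building a block. The real scheduling policy may differ.

```typescript
// Toy model of a mempool split into priority lanes (illustrative only).

interface LaneSpec {
  name: string;
  priority: number; // larger number = higher priority
  capacity: number; // maximum number of pending transactions in this lane
}

interface Lane extends LaneSpec {
  queue: string[]; // pending transactions, oldest first
}

class LanedMempool {
  // Lanes are kept sorted from highest to lowest priority.
  private readonly lanes: Lane[];

  constructor(specs: LaneSpec[]) {
    this.lanes = specs
      .map((spec) => ({ ...spec, queue: [] as string[] }))
      .sort((a, b) => b.priority - a.priority);
  }

  /** Returns false, dropping the transaction, if the lane is unknown or full. */
  add(laneName: string, tx: string): boolean {
    const lane = this.lanes.find((l) => l.name === laneName);
    if (lane === undefined || lane.queue.length >= lane.capacity) {
      return false;
    }
    lane.queue.push(tx);
    return true;
  }

  /** Removes and returns up to `maxTxs` transactions, highest priority first. */
  reap(maxTxs: number): string[] {
    const selected: string[] = [];
    for (const lane of this.lanes) {
      while (selected.length < maxTxs && lane.queue.length > 0) {
        selected.push(lane.queue.shift() as string);
      }
    }
    return selected;
  }
}
```

With a small "oracle" lane at priority 10 and a larger "default" lane at priority 1, a flood of default-lane traffic fills only its own lane, so oracle transactions are still admitted and are reaped first.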
Another important feature is the Dynamic Optimal Graph (DOG) gossip protocol for the mempool, which is designed to improve transaction dissemination efficiency and reduce duplicate transactions. This algorithm extends the base FLOOD protocol by eliminating cycles in transaction dissemination, resulting in reduced bandwidth usage and better performance under high-traffic conditions. The CometBFT team has observed a 75% reduction in transaction dissemination bandwidth in initial testing.
Both of the above features will be incorporated in v1 and may be backported to v0.37 and v0.38. Developers using older versions of CometBFT should plan their transition to v0.38 and then to v1, as the upgrade process can be difficult. The team is available to support network upgrades and help mitigate spam issues, and can be reached on their Telegram and Discord. Additional resources can be found at the end of this post.
Hermes IBC Relayer
Hermes, an open-source relayer built in Rust, is a key component in ensuring liveness within the interchain ecosystem by relaying IBC datagrams between chains. This relayer scans chain states, builds transactions, and submits them to the involved chains, enabling smooth communication across the network.
To mitigate liveness issues, Hermes incorporates several strategies to manage network traffic and avoid overloading. For instance, it sets maximum size limits for the memo and receiver fields in ICS20 packets and rejects packets that exceed the configured sizes, with defaults of 32 KiB for memos and 2 KiB for receivers. Hermes also allows operators to specify packet sequences that should not be cleared, which keeps the relayer from processing problematic transactions and improves efficiency. Finally, Hermes provides options for selectively clearing packets via a command-line interface (CLI), giving operators more control over transaction flow.
These measures, along with robust logging and debugging, enhance Hermes’ ability to maintain reliable IBC communications and support the scalability of the Interchain Stack.
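The field-size and excluded-sequence filters described above amount to simple checks before relaying. The sketch below models them in TypeScript for illustration only. Hermes itself is written in Rust, and the names here (`FieldLimits`, `shouldRelay`, `clearableSequences`) are hypothetical, not Hermes APIs or configuration keys. Only the 32 KiB and 2 KiB defaults come from the text.

```typescript
// Illustrative model of relayer-side packet filtering (not Hermes code).

const KIB = 1024;

interface FieldLimits {
  maxMemoBytes: number;
  maxReceiverBytes: number;
}

// Defaults as stated above: 32 KiB for memos, 2 KiB for receivers.
const DEFAULT_LIMITS: FieldLimits = {
  maxMemoBytes: 32 * KIB,
  maxReceiverBytes: 2 * KIB,
};

interface TransferFields {
  receiver: string;
  memo: string;
}

/** Size of a string in bytes when encoded as UTF-8. */
function byteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

/** True if both fields fit within the limits; oversized packets are rejected. */
function shouldRelay(
  fields: TransferFields,
  limits: FieldLimits = DEFAULT_LIMITS,
): boolean {
  return (
    byteLength(fields.memo) <= limits.maxMemoBytes &&
    byteLength(fields.receiver) <= limits.maxReceiverBytes
  );
}

/**
 * Drops the sequences an operator has excluded from packet clearing.
 * Sequences are plain numbers here for simplicity.
 */
function clearableSequences(
  pending: readonly number[],
  excluded: ReadonlySet<number>,
): number[] {
  return pending.filter((sequence) => !excluded.has(sequence));
}
```

Measuring in bytes rather than characters matters because a memo containing multi-byte UTF-8 characters can exceed a byte limit well before its character count does.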
CosmJS
CosmJS, a client-side library for interacting with the blockchain, contributes indirectly to maintaining liveness in the ecosystem through its integration with other tools like Chain Registry and Cosmos Kit. These tools help keep live nodes available for interactions.
The Chain Registry maintains an up-to-date list of public endpoints for various blockchain nodes, providing redundancy if one node goes down. Similarly, Cosmos Kit leverages this registry to cycle through different RPC endpoints if a node failure occurs. As a result, applications built with CosmJS can keep functioning when a node fails, because they automatically find and connect to another live node.
While CosmJS does not directly address backend liveness guarantees, it plays an essential role in the broader context by ensuring robust client-side interactions. Good error-handling practices in applications that use CosmJS, such as implementing retry mechanisms and showing user-friendly error messages, can mitigate the impact of liveness issues and improve the overall user experience. By maintaining a smooth and reliable client-side interface, CosmJS indirectly supports the liveness and usability of applications built with the Interchain Stack.
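The endpoint cycling described above can be approximated with a small failover helper. This is a sketch under assumptions, not Cosmos Kit’s actual logic: `withFailover` is a hypothetical name, the endpoint list is supplied by the caller, and endpoints are simply tried in order.

```typescript
// Hypothetical failover helper; not an API of CosmJS, Cosmos Kit, or Chain Registry.

interface FailoverResult<T> {
  endpoint: string; // the endpoint that answered
  value: T; // whatever `call` returned for that endpoint
}

/**
 * Tries `call` against each endpoint in order and returns the first success.
 * If every endpoint fails, throws one error summarizing all failures.
 */
async function withFailover<T>(
  endpoints: readonly string[],
  call: (endpoint: string) => Promise<T>,
): Promise<FailoverResult<T>> {
  const failures: string[] = [];
  for (const endpoint of endpoints) {
    try {
      return { endpoint, value: await call(endpoint) };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${endpoint}: ${reason}`);
    }
  }
  throw new Error(
    `all ${endpoints.length} endpoints failed: ${failures.join("; ")}`,
  );
}
```

In an application, `call` might connect a CosmJS client to the endpoint and query the chain height, for example with `StargateClient.connect` followed by `getHeight`. Collecting the per-endpoint failures also makes it easier to show a useful message when no node is reachable.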
Final Thoughts
To keep applications live and reliable and to support the sustainable growth of the ecosystem, developers must understand the challenges of distributed systems, design their applications accordingly, work together, and follow the recommendations above. The Interchain Stack steward teams continue to explore opportunities to enhance system components, mitigate liveness issues, and optimize performance, and they need your help.
Community feedback can shape the development of the Interchain Stack and contribute to building a more reliable and scalable foundation for life-changing software and a brighter future for the humans who use it. Join our Discord channel and follow us on X to stay updated on the latest developments, ask questions, and share your thoughts with the community. Let’s craft the future of the interchain together.