Testing Tezos networks at scale

Roland Dowdeswell
Published in The Aleph
Oct 8, 2021

Software resilience testing ensures that applications behave as expected in real-world, potentially chaotic conditions. A large system like Tezos includes many functional tests as part of its CI/CD pipeline, and functional tests are a great way to ensure that an implementation matches its specification. As part of a comprehensive testing plan, however, the non-functional aspects of a system must be exercised too. Non-functional testing addresses areas like security, reliability, recovery, stability, usability, and scalability. Resilience testing complements our other testing because a large, realistic deployment of Tezos nodes may exhibit bugs that do not occur in smaller-scale tests.

We have been building the capability to perform resilience tests on realistic networks of nodes. To construct large networks of nodes quickly, we use tezos-k8s, a project that uses Kubernetes to enable various runtime scenarios for a cluster of nodes. We have already identified and fixed bugs based on this work, and the work is very much ongoing: we are continuously improving our frameworks to create more realistic scenarios. One goal is to provide ways to reproduce issues that have occurred during real-world network events, so that we can ensure they do not recur in future versions of the protocol. One scenario we are working to reproduce is the incident during the Granada transition, where the chain slowed down and an emergency rollout of software fixes to node operators was required.

Testing Emmy*

In April 2021, we set ourselves the task of testing the protocol upgrade from Emmy+ to Emmy*. Our objective was simple: create a network of at least 400 nodes, with some load on it and a network diameter of around 5, and ensure that it performed no worse than Emmy+. Specifically, we checked that the network stayed up, blocks continued to bake, there were no chain reorganisations, most blocks were baked at priority zero, and most endorsements were included.
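
To make these criteria concrete, here is a minimal sketch of the kind of per-run health check they suggest; the BlockRecord shape and field names are our own illustration, not the actual format of our reports:

```python
# Minimal sketch, assuming hypothetical per-block records gathered from
# node RPCs; the field names are illustrative, not our report format.
from dataclasses import dataclass

@dataclass
class BlockRecord:
    level: int                  # block level in the chain
    priority: int               # priority the block was baked at
    endorsements_included: int  # endorsement slots filled in this block
    endorsement_slots: int      # total endorsement slots available

def network_health(blocks: list[BlockRecord]) -> dict:
    """Summarise one test run against the criteria described above."""
    levels = [b.level for b in blocks]
    # "Blocks continued to bake": levels should advance without gaps.
    gaps = sum(1 for a, b in zip(levels, levels[1:]) if b != a + 1)
    priority_zero = sum(1 for b in blocks if b.priority == 0) / len(blocks)
    included = sum(b.endorsements_included for b in blocks)
    slots = sum(b.endorsement_slots for b in blocks)
    return {
        "level_gaps": gaps,
        "priority_zero_ratio": priority_zero,
        "endorsement_inclusion_ratio": included / slots,
    }

# Example: two healthy blocks, then one baked at priority 2 with
# many endorsements missing.
run = [BlockRecord(1, 0, 32, 32), BlockRecord(2, 0, 31, 32), BlockRecord(3, 2, 20, 32)]
print(network_health(run))
```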

In order to isolate the change, we compared Emmy* to Emmy+ at the same point in the git tree by creating two branches that differed only in whether they contained the code for Emmy+ or for Emmy*.

To ensure that we had a network with a diameter of around 5, we implemented a private-node patch to tezos-k8s which made each pod optionally start both a private node and a public node. In this setting, the baker attaches only to the private node, and the private node attaches only to its public node. We also started each public node with --connections 10 to further increase the diameter of the network.
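
To sanity-check that such a topology really lands near a diameter of 5, one can model it with networkx (purely illustrative; this is not part of our tooling):

```python
# Illustrative sketch: model the overlay as a random 10-regular graph of
# public nodes, with one private node hanging off each public node.
import networkx as nx

PUBLIC_NODES = 400
CONNECTIONS = 10  # each public node started with --connections 10

g = nx.random_regular_graph(CONNECTIONS, PUBLIC_NODES, seed=42)

# The baker attaches only to the private node; the private node attaches
# only to its public node, adding one hop at each end of a path.
for i in range(PUBLIC_NODES):
    g.add_edge(f"private-{i}", i)

print("diameter:", nx.diameter(g))  # typically around 5 or 6 here
```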

For each run, we generated a set of reports summarising, among other things, block priorities and endorsement inclusion.

As these reports showed, in the network running Emmy*, a considerable fraction of endorsements were not included. So, together with Nomadic Labs, we instrumented the code and discovered:

  • a mempool issue where operations could be lost if more than 50 operations were stacked, typically after a new head was set (see !2980),
  • a race condition where endorsements could be dropped because they arrived in the mempool before the block they endorsed (see !2980; a sketch of the buffering idea follows this list), and
  • baker performance that needed to be improved to accommodate the faster block generation and increased number of endorsements (see !2746).
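
The race condition above is a classic ordering hazard in a gossip network: an endorsement can outrun the block it refers to. Here is a minimal sketch of the buffering idea, as our own illustration rather than the actual Octez fix:

```python
# Illustration only (not the Octez implementation): buffer endorsements
# whose branch (the block they endorse) has not been seen yet, instead
# of rejecting them outright.
from collections import defaultdict

class Mempool:
    def __init__(self):
        self.known_blocks: set[str] = set()
        self.pending: dict[str, list] = defaultdict(list)  # branch -> endorsements
        self.applied: list = []

    def on_endorsement(self, branch: str, endorsement) -> None:
        if branch in self.known_blocks:
            self.applied.append(endorsement)
        else:
            # The endorsement raced ahead of its block: park it rather
            # than dropping it.
            self.pending[branch].append(endorsement)

    def on_new_block(self, block_hash: str) -> None:
        self.known_blocks.add(block_hash)
        # Replay any endorsements that arrived before their block.
        self.applied.extend(self.pending.pop(block_hash, []))

pool = Mempool()
pool.on_endorsement("B1", "endorsement-for-B1")  # arrives early
pool.on_new_block("B1")                          # block catches up
assert pool.applied == ["endorsement-for-B1"]
```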

These issues were not new in Emmy*, but the changes to the protocol significantly increased their likelihood and/or impact. After fixing these issues, we finally got a reasonable percentage of endorsements included.

These issues were fixed by Nomadic Labs in v9.2.

Next steps

As we have seen, our first round of large-scale testing proved useful and uncovered a number of issues, but, unfortunately, our simulation did not detect the combination of bugs that led to the Granada incident. Together with Nomadic Labs, we are working on being able to consistently reproduce the bugs that were seen. To do this, we have identified several areas for improvement.

We need to run more heterogeneous networks, so we have improved tezos-k8s to be able to select different images for different nodes, which lets us run different versions of Octez side by side. We also added support for TezEdge for additional heterogeneity.
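
As a hypothetical sketch of what such a mix might look like (the image tags below are examples only; the real node-to-image mapping is configured through tezos-k8s itself):

```python
# Hypothetical sketch: assign a mix of node images round-robin across the
# network. See the tezos-k8s documentation for how images are actually
# configured.
from itertools import cycle

IMAGES = [
    "tezos/tezos:v9.2",       # a recent Octez release
    "tezos/tezos:v9.1",       # an older Octez release
    "tezedge/tezedge:latest", # TezEdge, for implementation diversity
]

def assign_images(node_count: int) -> dict[str, str]:
    """Map node names to images, cycling through the mix."""
    images = cycle(IMAGES)
    return {f"node-{i}": next(images) for i in range(node_count)}

print(assign_images(5))
```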

In our testing above, we patched tezos-k8s to run a private node behind each public one. We are currently designing a way to specify the network topology so that we can create more realistic networks. This topology will include specifying the latency on links between the nodes so that we can accurately simulate a global deployment of Tezos.
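
As a hypothetical sketch of what a latency-annotated topology makes possible (the spec format is our illustration, not an existing tezos-k8s feature), one can estimate worst-case propagation delay across the network:

```python
# Hypothetical sketch: a topology spec that annotates each link with a
# one-way latency in milliseconds, grouped into three "regions".
import networkx as nx

links = [
    ("eu-0", "eu-1", 5), ("eu-1", "us-0", 80),
    ("us-0", "us-1", 5), ("us-1", "ap-0", 150),
    ("ap-0", "ap-1", 5), ("ap-1", "eu-0", 120),
]

g = nx.Graph()
for a, b, ms in links:
    g.add_edge(a, b, latency=ms)

# Worst-case shortest-path delay between any pair of nodes, assuming each
# link adds its configured latency.
delays = dict(nx.all_pairs_dijkstra_path_length(g, weight="latency"))
worst = max(max(d.values()) for d in delays.values())
print(f"worst-case propagation delay: {worst} ms")
```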

We also need to ensure that the chain we are testing is realistically large. We are adding the ability to run our tests starting with a pre-loaded snapshot of mainnet. In order to make this work, we are circumventing the signature checks so that we can assign bakers to existing accounts without knowing their private keys. We hope that this will allow us to more accurately reproduce issues that could occur in the wild.
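
Conceptually, this amounts to a test-only escape hatch in signature verification; the sketch below is our own illustration, not the actual Octez patch:

```python
# Conceptual illustration only (not the Octez code): on an isolated test
# network, verification can be short-circuited so that test bakers can act
# for mainnet accounts whose private keys we do not hold.
import hashlib
import hmac

TEST_MODE_SKIP_SIGNATURES = True  # hypothetical flag set by the test harness

def toy_signature_check(key: bytes, signature: bytes, message: bytes) -> bool:
    # Stand-in for real Ed25519 verification, for illustration only.
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(signature, expected)

def check_signature(key: bytes, signature: bytes, message: bytes) -> bool:
    if TEST_MODE_SKIP_SIGNATURES:
        return True  # accept everything: any baker may sign for any account
    return toy_signature_check(key, signature, message)

# A forged "signature" is accepted because the test network skips checks.
assert check_signature(b"unknown-key", b"not-a-real-signature", b"block header")
```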

Ultimately, we are striving to create testing environments that are significantly more representative of the real world and therefore more useful to the network.

Watch for updates

Our resilience testing and its supporting infrastructure are under active development. Future posts will detail some of the platform tools and techniques used to build out this capability. We also plan to post updates on any interesting results we discover during the next protocol update.
