Hierarchy of Tests

Anton Kaliaev · Tendermint Blog · Mar 11, 2021 · 7 min read
Hierarchy of tests in Tendermint

Just as people in northern countries dress in multiple layers of clothing, it is common practice to wrap complex systems in multiple layers of tests. More layers of clothes give you a better chance of not catching a cold, whereas more layers of tests give you a higher probability of catching a bug before the system goes live.

You can read about the basic levels (“layers”) on Wikipedia.

Note: In this article, I am going to talk about the Go implementation of Tendermint, referred to as Tendermint for simplicity.

A little bit of history (2014–2019)

Since the beginning of the project (2014), Tendermint has had unit and integration tests. (Remember that unit tests are usually used to check individual functions and live close to the code. Integration tests verify the interaction between two or more software components.) However, coverage was minimal. These tests provided developers with a sense that the code was working and allowed them to iterate quickly.

After approximately two years of development (2016), end-to-end tests (see “system testing” on Wikipedia) were added in the form of Bash scripts. (Unlike integration tests, end-to-end tests verify a completely integrated system.) End-to-end tests further boosted developers’ confidence in the implementation by checking common scenarios (continuously making blocks, fast syncing, restarting a node, p2p networking, ABCI application failure).

In September 2017, Tendermint was verified using Jepsen. Jepsen tests that Tendermint conforms to its linearizability guarantees. (If you don’t know what Jepsen or linearizability are, don’t worry. We will discuss them later.) No significant issues were found. You can read the full report for details.

In March 2019, Cosmos Hub (a major piece of the Cosmos Network) was launched. It was a huge success. But as we all know, “with great power comes great responsibility.” 🕷 It was time to double down on Tendermint Core’s overall resiliency and stability.

Last year and the present (2020–2021)

At present, we have the following levels:

  • Unit tests
  • Integration tests
  • End-to-end tests
  • Maverick tests ensuring the slashing model works (validators get punished for duplicate votes)
  • Model-based tests for light client
  • Fuzz tests
  • Jepsen tests
Conceptual view of tests in Tendermint

Let’s talk about each of them. I am going to skip unit and integration tests (although important, most people are familiar with them already).

1) End-to-end tests

As part of the stability effort, a new end-to-end suite was developed last year, and the Bash tests were removed. The new suite is written in Go (programming in Bash is no fun), has a clear API, and uses a human-readable configuration specified in TOML.

We have also set up a nightly CI (Continuous Integration) job, which runs the test suite against various network configurations (randomly generated each time) and reports any failures back to Slack.

For example, CI runs the suite against a fixed network configuration for each Pull Request on GitHub.
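It is not reproduced in full here, but a testnet definition looks roughly like the sketch below; the node names and values are illustrative rather than copied from the actual file:

    # Illustrative sketch of an e2e testnet definition (TOML).
    initial_height = 1000

    [node.validator01]
    perturb = ["restart"]     # restart this node while the test is running

    [node.validator02]
    perturb = ["disconnect"]  # temporarily disconnect this node from its peers
    # act in a byzantine way (e.g., double prevote) at height 1018
    misbehaviors = { 1018 = "double-prevote" }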

The complete file can be found here.

The perturb setting introduces different perturbations, such as restarting a node or disconnecting it from its peers. The misbehaviors setting is used to test slashing; under the hood, it uses the Maverick node to act in a byzantine way. We will talk about the Maverick node next.

2) Maverick tests ensuring slashing model


Before last year, only one integration test was defined, which ensured that duplicate votes (a validator voting for two different blocks at the same height/round) lead to punishment. That wasn’t enough. That’s why the Maverick node was created. Think of it as a regular node, except it can be configured to manifest byzantine behavior at different stages of consensus.

At each of these stages (e.g., propose, prevote, precommit), the node can exhibit misbehavior. This allows us to cover more byzantine scenarios.
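To give a flavor of the approach, here is a minimal sketch of how per-stage misbehavior hooks could be wired up; the type and field names below are hypothetical and not the actual Maverick API:

    package main

    import "fmt"

    // Misbehavior bundles optional hooks that replace the honest behavior at a
    // given consensus step (a hypothetical sketch, not the real Maverick API).
    type Misbehavior struct {
        Name           string
        EnterPrevote   func(height int64, round int32)
        EnterPrecommit func(height int64, round int32)
    }

    // doublePrevote sketches a validator that equivocates during the prevote step.
    var doublePrevote = Misbehavior{
        Name: "double-prevote",
        EnterPrevote: func(height int64, round int32) {
            // A real node would sign and broadcast two conflicting prevotes
            // for the same height/round here.
            fmt.Printf("double prevote at height=%d round=%d\n", height, round)
        },
    }

    func main() {
        doublePrevote.EnterPrevote(1018, 0)
    }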

3) Model-based tests for light client

The Informal Systems team has been doing an amazing job bridging formal verification and regular testing together.

Formal verification is the act of proving the correctness of an algorithm with respect to a certain formal specification. Source: https://en.wikipedia.org/wiki/Formal_verification

The main problem with formal verification, as I see it, is that an implementation may not (exactly) match the formal specification. In that case, you might wrongly presume that your implementation is correct. In short, formal verification can be used to check your ideas, but it doesn’t check your implementation.

The team at Informal Systems is trying to solve this problem by using the Apalache model checker to generate counterexamples based on the TLA+ model of an algorithm (e.g., the light client). These counterexamples are then used to generate concrete unit tests. That way you have some guarantee that your implementation is correct.
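As a rough sketch of the idea (the fixture format, file name, and helper below are hypothetical, not the ones used in the repository), such generated counterexamples can drive ordinary Go unit tests:

    package mbt

    import (
        "encoding/json"
        "os"
        "testing"
    )

    // Fixture mirrors one counterexample exported from the model checker.
    type Fixture struct {
        Description string   `json:"description"`
        Input       []string `json:"input"`   // simplified stand-in for light blocks
        Verdict     string   `json:"verdict"` // e.g. "SUCCESS" or "INVALID"
    }

    func TestModelBased(t *testing.T) {
        data, err := os.ReadFile("testdata/counterexamples.json")
        if err != nil {
            t.Fatal(err)
        }
        var fixtures []Fixture
        if err := json.Unmarshal(data, &fixtures); err != nil {
            t.Fatal(err)
        }
        for _, f := range fixtures {
            t.Run(f.Description, func(t *testing.T) {
                if got := verify(f.Input); got != f.Verdict {
                    t.Errorf("got %q, want %q", got, f.Verdict)
                }
            })
        }
    }

    // verify is a placeholder for calling the implementation under test
    // (e.g., the light client verifier).
    func verify(input []string) string {
        if len(input) == 0 {
            return "INVALID"
        }
        return "SUCCESS"
    }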

You can learn more about their work by watching this presentation (and reading the accompanying paper).

The brilliant thing is that the test fixtures can be used to test both the Go and Rust Tendermint implementations. You can find the Go model-based tests and documentation here.

4) Fuzz tests

Somewhere around 2017, we wrote a few fuzz tests. (Fuzzing is a testing technique where a special program, a fuzzer, generates random bytes and calls the function under test with those bytes as an argument. If the function panics, then, most likely, you have found a bug! You can read more about fuzzing on Wikipedia.) The go-fuzz library was used to test various inputs like the mempool’s CheckTx, the RPC server, etc.
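For illustration, a go-fuzz target is simply a package-level function with the signature shown below; JSON decoding stands in here for whichever component is actually under test:

    package fuzz

    import "encoding/json"

    // Fuzz is the entry point go-fuzz calls with generated inputs.
    // It returns 1 when the input is "interesting" (parsed successfully)
    // and 0 otherwise, following the go-fuzz convention. JSON decoding is
    // used as a simple stand-in for the real code under test
    // (e.g., an RPC request handler).
    func Fuzz(data []byte) int {
        var req map[string]interface{}
        if err := json.Unmarshal(data, &req); err != nil {
            return 0
        }
        return 1
    }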

Last month (Feb. 2021), we moved them into the main repository and added a nightly CI job, which runs the go-fuzz fuzzer against each input for 10 minutes and reports any crashers (byte sequences leading to a panic) back to Slack.

Here is the list of things we currently test:

  • [mempool] CheckTx (with kvstore ABCI application)
  • [p2p] PEX address book AddAddress
  • [p2p] PEX reactor Receive
  • [p2p] SecretConnection Read and Write
  • [rpc] JSON-RPC server handler
  • [consensus] WAL encoder and decoder
  • [pubsub] query implementation

The suite is not complete and will undergo some changes in the future. Check out #5959 for details.

5) Jepsen tests

Screen capture from Kyle Kingsbury’s keynote talk at Scala Days 2017

Jepsen is a framework for verifying distributed systems. Jepsen has been used to verify everything from eventually-consistent commutative databases to linearizable coordination systems to distributed task schedulers. It can also generate graphs of performance and availability, helping you characterize how a system responds to different faults. Source: https://github.com/jepsen-io/jepsen/blob/main/README.md

Jepsen tests give us some assurance that Tendermint produces linearizable history in the presence of:

  1. Network partitions
  2. Clock skews
  3. Crashes
  4. Changing validators
  5. Truncating logs

Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations: e.g., if operation A completes before operation B begins, then B should logically take effect after A. Source: https://jepsen.io/consistency/models/linearizable

To test this, a new ABCI application called Merkleeyes was developed. It’s a distributed key-value store. Jepsen sends a mix of read, write, and compare-and-set transactions to Tendermint and optionally introduces failures (partitions, crashes, etc.). Afterwards, it analyzes the resulting history for consistency violations.
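Conceptually, the checker compares that history against a sequential model of the store. A simplified, illustrative model of a single compare-and-set register (not the actual checker code) might look like this:

    package model

    // Register is a simplified sequential model of a single key: the
    // linearizability checker asks whether the observed history could have
    // been produced by some ordering of these operations.
    type Register struct {
        value int
    }

    func (r *Register) Read() int   { return r.value }
    func (r *Register) Write(v int) { r.value = v }

    // CAS sets the value to new only if the current value equals old,
    // and reports whether the swap happened.
    func (r *Register) CAS(old, new int) bool {
        if r.value != old {
            return false
        }
        r.value = new
        return true
    }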

NOTE: end-to-end tests check that Tendermint recovers after crashes, but they do not check the transaction history. The Jepsen test suite can be viewed as an extension in this regard.

In Feb. 2021, we successfully updated this test suite to the latest Jepsen version 🚀. The code can be found here: https://github.com/tendermint/jepsen.

We also now have a new GitHub workflow, which runs a given Jepsen test (one can configure the workload, nemesis, concurrency, etc.). It will be used to test all upcoming releases.

In the future, when the tendermint-rs full node is ready, I expect it to take advantage of this suite as well.

If you want to learn how Jepsen works, I highly recommend watching this video from Carnegie Mellon University.

Future (2021–)

In the future, we are hoping to continue improving our test suite.

This will most likely include more model-based tests. For example, model-based tests for fastsync would give us some assurance that fastsync a) always terminates and b) only syncs “valid” blocks in the presence of a single valid peer. #4457

Another idea is to evaluate “Twins: White-Glove Approach for BFT Testing” #4953.

We’re always on the lookout for ways to build more confidence in the correctness of our Tendermint implementation. If you have any ideas, please reach out and talk to us!

Appendix A: Jepsen alternatives

As far as we know, there is a single open-source alternative to Jepsen: Porcupine. It’s written in Go and is supposed to be faster. However, we ultimately decided to proceed with Jepsen for three reasons:

  1. We already had the old test suite, written by Jepsen’s author in 2017.
  2. Jepsen is more powerful (it can introduce network splits, crashes, WAL truncation, etc.)
  3. Jepsen has commercial support behind it (many DBs are using it) and it is in active development (Elle checker appeared just recently).

The only downside is that Jepsen is written in Clojure, and there are not many Clojure folks out there (even fewer who know Jepsen) 😛.

If you have experience using Porcupine, please let us know in the comments below or reach out to me directly on Twitter.

Special thanks to Sean King and Tess Rinearson for reviewing this article 🙏
