Reliable Systems Series: Model-Based Testing

This is the first article in a series on techniques I’ve found useful for making my projects more reliable. These techniques are used in the distributed systems, database, automotive, embedded, and aerospace fields, but if you build services, user interfaces, or generally anything stateful, I think you will find something useful along the way.

When people speak about these techniques in public, it is usually in an academic context. The organizations that tend to use these techniques for practical engineering often don’t make their internal documents, talks or slide decks available. Contrary to popular belief, we can apply most of these techniques to a broad range of projects. They can help us to ship faster and at lower cost.

But it does not take a PhD in formal methods and a budget of $10,000 per line to get rapid results that significantly improve the quality of our projects.

Model-Based Testing

Jumping right in, I’d like to start with a technique that I’m awed by daily. By describing our expectations using clear models, we can automatically generate tests with less bias and far more coverage than what a human would have the lifespan to create.

If you want to see test code right away, jump to the “Writing the Model” section below. The sections before that will talk about QuickCheck, justifications for using property testing, and the general thought process around combining models with property testing.

When I say “model” in this context, I don’t mean a formal specification or anything. I mean a bit of code that behaves in a way that can be used to track our expectations of the thing we’re testing. For a database, we might be able to model our expectations of what get and insertrequests should return by using a hashmap to act as a “double checker”.

Once we have a model, we generate a sequence of operations to be applied to both the model and our implementation. When they diverge in behavior, we know we have a bug in either our implementation, our model, or both! That’s it!

It’s OK to Feel Skeptical

If you are anything like the dozens of friends that I have excitedly explained this technique to, you are probably skeptical at this point. It is entirely natural to feel this way until the actual moment where this technique gives you a specific sequence of operations that causes a bug to jump out in some code you’re writing. When that happens, there is a good chance you will feel like you’ve had one of those magical programmer enlightenment experiences that completely changes how you think about everything thereafter. It definitely has for me!

This technique can uncover incredibly subtle bugs, in large part because it generates random sequences of operations without the biases of the human that created the implementation.

But the even more magical part is what happens when you combine this technique with shrinking functionality. With only about 50 lines of total testing code you can not only have a powerful fuzzer-like test, but it will automatically simplify a failing test that it finds, too. This gives you a minimized sequence of specific operations that reproduce your bug! It will write your regression tests for you.

In our example below, we will use QuickCheck to generate sequences of operations. We will apply each operation to our model and our implementation. When we notice divergeance between the two, our QuickCheck property fails. QuickCheck will randomly remove operations from the sequence until it finds a minimized subsequence that still causes our model and implementation to diverge in behavior. Then we have a new regression test that should reproduce our issue! A gift from the machine!!!

What Is QuickCheck?

(Feel free to skip down to the section titled “QuickCheck + Models = ❤” if you’re already familiar with QuickCheck!)

QuickCheck is a popular flavor of property testing library, with implementations in many languages. Property testing (also known as generative testing), is a kind of software testing where you write a function that tests whether a given property holds for randomly generated instances of that function’s arguments. For example, one could write a property is_false(input: bool) -> bool where we simply return the input value. This property will fail to hold when the input false is provided as an argument. A more realistic property might be one that asserts that serialization code is able to convert an object into JSON and back again such that it produces an equivalent value to the original input. QuickCheck will provide random instances of the object that you want to serialize, and if deserialize(serialize(object)) != object then our property will fail, letting us know that our implementation is unsound for the particular object that QuickCheck provided as an argument.

QuickCheck implementations usually allow users to provide “generators” that produce an instance of a desired input. It then feeds this generated input to the property function, and if the property fails to hold, QuickCheck will tell us about the specific instance that failed. Good implementations will also include “shrinking” functionality that can simplify failing inputs. This lets us generate complex inputs for testing our property against, like a long sequence of random client requests, then boil it down to a specific failing subsequence, making it easy to debug. QuickCheck just wrote a regression test for us :)

Why is this useful?

  1. We are essentially building a fuzzer, dramatically increasing the coverage that is possible with a relatively small amount of test code.
  2. QuickCheck will generate test cases that you may never have thought to check in a more traditional unit test. But once QuickCheck has found a failing case, you can copy and paste the input into the body of a new regression test. It writes your regression tests for you, and the shrinker keeps them simple! Your new machine-generated regression tests are deterministic to the extent that your implementation is deterministic.
  3. By randomly generating more complex inputs, our tests are less vulnerable to the pesticide paradox, and longer test runs can keep turning up bugs over a long period of time, keeping your engineers busy with deterministically replayable (far easier to fix) bugs.
  4. The generated QuickCheck test cases are not necessarily deterministic (look at how the random number generator is seeded for the implementation available to your languages of choice), so we can basically get the best of deterministic and non-deterministic testing by turning failing inputs into regression tests. This is particularly nice for inputs that fail less frequently, as QuickCheck may only generate a few dozen to a few thousand inputs by default. This is almost always configurable, which is particularly nice for running longer burn-in tests when desired.
  5. Because we, the implementors, are not thinking up specific test cases, our biases do not impact the test quality as much. The machine is only impacted by the biases of our generators, which is usually a significant improvement. QuickCheck frequently generates test cases that amaze me, both in terms of the subtlety of the bugs that they uncover, and the seemingly-magical ability to shrink the failing case down to something I can reason about easily to fix the bug.

QuickCheck + Models = ❤

So, how do we apply QuickCheck to a big nasty system? This was unclear for me for a long time. Some people I respected had been praising it for finding interesting bugs in a complex system they were building, and I wanted to apply it to a service I was working on. But it wasn’t clear how I could apply it to anything more complex than serialization libraries. For me, the turning point was when I started working in Erlang at work, and was exposed to the way people often do stateful modeling in the Erlang ecosystem. Reading a paper by John Hughes, Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane was also a real eye-opening experience for me, as was Testing Monadic Code with QuickCheck[warning: postscript]. This post is essentially trying to present the techniques in those papers in a way that people outside the Erlang and Haskell communities can readily apply, even if they don’t have cool stuff like PULSE at their disposal.

The approach I’ll outline is language-agnostic, but my examples will be in Rust. If the language you’re using at the moment has a nice QuickCheck implementation (particularly one with shrinking), it will probably make your life a bit easier, but it’s not a requirement that you use QuickCheck for this at all.

QuickCheck is only used here as a library for generating a sequence of random inputs and then shrinking the failing sequence to a smaller, more easily understandable sequence if a failure is detected. If QuickCheck does not exist for the language you’re using right now, you may be able to write something suitable in 25–100 lines of code. Some QuickCheck implementations are capable of doing much more than what we’ll be using it for, but here we are focusing on attacking our own mental models in a lower bias way, rather than the specifics of the tool used for demonstration.

I hope this post finds its way to the next wave of folks who want to make stuff with fewer bugs, and are lucky enough to be working with the freedom to invest in reliability. I want to live in a future of your making.

Enough Blabbering, here’s the recipe! DO IT NOW!

  1. Decide to implement a thing.
  2. Create a toy model of how it should work.
  3. Write a test asserting identical behavior between your implementation and your toy model, given a randomly-generated sequence of operations that will be applied to both.
  4. Build the thing. when the model and implementation diverge, your assumptions have been invalidated. Your implementation may be incorrect, your model may be incorrect, or both are incorrect.

Building a Tree

Let’s go! Building a simple tree will illustrate this technique nicely. I’m going to use lightly-edited code that Ethan Frey wrote the other day while we were learning about Rust together. Here’s a simplified skeleton for the Tree that will compile on Rust 1.21.

The code for this article is also available at https://github.com/spacejam/quickcheck-tut/.

src/lib.rs:

If you’d like to follow along in your text editor, but haven’t used Rust before, I recommend installing it by piping this completely harmless shell script into your terminal. Options are values that may or may not actually be present, like a well-typed null/nill/undefined. Boxes are values that are on the heap, and may persist after the function that created them returns, if the function returns them. Ord means you can compare the ordering of two values against each other, which is useful for traversing and inserting items into trees. <K, V> means that the key and value types can be anything that fits the other requirements (the Key type needs to be Orderable). #[derive(Default)] lets us automatically generate a constructor that creates a fresh Tree for us.

Writing the Model

Now let’s write the model! But… how? This is the interesting part.

The model has a dual purpose. First, it is to identify divergence between your intentions and your implementation. But it can also be extremely helpful as a cognitive aid for reasoning about what your system actually does. It can often be a great thing for a new engineer working on a system to use to learn how a complex system should behave in different situations, without having to ford outdated documentation or thousands of lines of code written under tight deadlines with readability taking a seat behind functionality. So, don’t skimp on code quality “because it’s just test code”. If written clearly, it could serve as great guide to the system for anyone trying to reason about its behavior.

For our use case of a binary tree, the model is actually pretty easy. We are writing a tree. Rust has a well-tested tree built-into the standard library, BTreeMap, so we can use it as a “reference” model to check against. If we were writing a database, BTreeMap might also be a useful reference for that. If you just have a stateful service that stores stuff, maybe a HashMap is all you need. If you’re testing an LRU cache, maybe you can model it using a Vec and just do an O(n) scan over it when you want to push an element to the recently-accessed end of the cache.

The goal here is not performance at scale, but the ability to hold the model in your mind with ease. It’s OK to do O(N²) stuff here when N is guaranteed to be small by the input bounds of the test. The less brain cycles on your part, the better. Let the machine suffer if it means you can get to the root of surprising behavior faster. You will need as much free brainpower as possible for debugging the slew of problems these potent tests may send our way.

A model is often the first thing I write before starting the implementation of many of my projects. By letting the model guide the implementation, it’s TDD with a single test. When it goes green, you should ask yourself if the model fully describes the problem you’re trying to solve.

To get underway, let’s create a thing that represents different kinds of operations on our Tree :

enum Op {
// our K and V are both of type u8
Insert(u8, u8),
Get(u8),
}

Now let’s generate a random sequence of them, using QuickCheck, and then apply each operation to both the model and our implementation, returning false from the property function implementation_matches_model when they diverge:

tests/quickcheck.rs:

and, don’t forget, a Cargo.toml file in the root directory of the project, containing our dependencies:

[package]           
name = "btree"
version = "0.1.0"

[dev-dependencies]
quickcheck = "0.6"
rand = "0.4"

Here’s how our overall directory structure should look:

λ tree                    
.
├── Cargo.toml
├── src
│ └── lib.rs
└── tests
└── quickcheck.rs
2 directories, 3 files

Let’s see these tests in action!

λ cargo test                                                                                                     
Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
Running target/debug/deps/btree-9ef8f49ee1976a3b

running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/quickcheck-dc970f46510fa339                                                       

running 1 test
test implementation_matches_model ... FAILED

failures:
---- implementation_matches_model stdout ----
thread 'implementation_matches_model' panicked at '[quickcheck] TEST FAILED. Arguments: ([Insert(0, 192), Get(0)])', /home/t/.cargo/registry/src/github.com-1ecc6299db9ec823/quickcheck-0.6.0/src/tester.rs:171:27
note: Run with `RUST_BACKTRACE=1` for a backtrace.
failures:
implementation_matches_model
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out
error: test failed, to rerun pass '--test quickcheck'

This is a lot of text. The most interesting part for me is:

[quickcheck] TEST FAILED. Arguments: ([Insert(0, 192), Get(0)])

This is telling us that after we insert a value into both the implementation and the model, get operations for that same key, 0, diverged. Our BTreeMap model returned a different result than our implementation. At least one of them must be wrong! Maybe we should implement those stubs :P We have just mechanically generated our first regression test that we can work against if we so choose.

All Watched Over By Machines Of Loving Grace

Let’s take a step back and appreciate what just happened. We wrote around 50 lines of test code, and it is capable of exercising a huge number of possible codepaths, while also verifying that the code behaves according to a simplified model that we can use for reasoning about the system.

Even more awesome, is that when it finds a sequence of 100 operations that cause a divergence between your implementation and your model, it will reduce those 100 operations to the minimum ordered subset of those same operations required to cause the divergence, making the issue far easier to debug. Out of the 100 operations in the test it first ran (and failed), it narrowed them down to the specific 2 that trigger the divergence! It does this by dropping out specific elements of the sequence and running the test again, until the test passes or there are no more untried elements to remove. Implementations may vary, but should roughly match this behavior.

I frequently copy the failing sequences that QuickCheck found as standalone regression tests by plugging them directly into the property function every time my tests are run. This way I get some nice nondeterminism that randomly increases the coverage over time, while deterministically checking to make sure that the bugs I fixed in the past have not broken again in fun and exciting ways.

Tests like this are gifts that just keep giving. They don’t just test for the bugs you can imagine. By working with randomly generated test cases, we can take a significant amount of human bias out of testing. It can be extremely humbling the first time you add QuickCheck like this to a module you’ve written, and you spend the next few days, weeks, or even months chasing down all of the bugs it manages to shake out, but you will probably emerge on the other side as a stronger engineer.

Oh yeah, and if you’re the kind of person who hates writing tests, you just punted almost all of the work to a machine! It can feel downright eerie sometimes how quickly a complex bug that you would never have thought to write tests for is discovered with the help of a model-based test like this.

You Have Been Challenged!!!!

¡¡¡¡If you are up for a challenge, I challenge you to get this tiny test to pass by implementing the Tree’s get and insert methods!!!!

Unless you spend all day long writing verified data structures in Rust (all 2 of you :P), I expect that this test will find several surprising flaws in your implementation as you go along! I certainly didn’t get very far before this was the case, despite feeling like I know trees fairly intimately!

Dealing with New Abstractions

Sometimes there isn’t such a clean off-the-shelf data structure, library, or even external system you can use as a reference model from your QuickCheck test. Think about the properties you actually care about. Sometimes it’s much easier to write two or more simple models that track specific system behavior than trying to build a model-of-everything that ends up looking a lot like a second implementation of the system.

Sometimes you also don’t have a single possible expected result that the implementation should produce, but you can still construct a range of valid possibilities. For my database, I have a model-based test that will inject failures around specific disk operations and see how the system reacts overall. Sometimes it’s not clear whether a specific value will be present or not after simulating a crash, so the model stores a boolean communicating certainty along with the possibly expected value.

You might not need to do anything other than check a few invariants, like “only one master has the ability to publish state to a majority of the system” or “the data before a shutdown matches the data after a restart.” I got a ton of mileage out of a single test for a CRDT library by changing the number of participating fake nodes in the system and expecting the state to converge correctly for all nodes after applying a sequence of randomly applied mutations from different nodes.

If it’s hard to model or check specific invariants of a system, it could be a symptom that the architecture is unsound, leading to unreliable implementations and high human costs for debugging over time. If you can’t describe it in a model or set of invariants, good luck getting humans to understand what should be happening.

After getting hooked on the technique of testing systems with operations/events applied to them, I’ve tended to architect my systems in ways that can be “single-stepped” and then “replayed” when issues are found. This means restricting nondeterministic behavior or increasing the observability of causal relationships at runtime whenever possible. We don’t have to eliminate nondeterminism completely, but we do want to replay things in an order that will cause the issue to reliably show up again so that it can be verified to be absent after an attempted fix. But usually it’s far easier to attack nondeterminism than to intervene dynamically at runtime to induce a particular causal partial ordering for reproducing the issue, so choose wisely.

I have a personal organizer with a terminal UI that I test only with a single QuickCheck test that spews random inputs, and the code is littered with asserts. In some ways this is like using fuzzing to find ways to trigger memory corruption. I’ve found a good QuickCheck model can yield a ridiculously high test code:test coverage ratio. A single model-based test can replace most TDD-style unit tests, with far less code and far higher coverage. It also pushes you to prioritize the quality of error information that your system outputs, as you may rely on this more to get to the bottom of interesting bugs uncovered by the machine.

That’s it For Now!

Model-based testing is an extremely powerful technique. It tests far more possibilities than we could have imagined, avoiding our biases as the creators of the system. It automatically generates regression tests, saving us lots of time trying to reproduce subtle issues. It can be used as an incredibly powerful TDD technique that lets you basically write one test to build the prototype from. It punts so much work to the machine to do on our behalf, and gives us a nice feedback loop when things don’t work out as we expected. It has dramatically improved my confidence in my own projects in the database, user interface, and distributed systems spaces.

Thank you for your interest in this, I really want to live in a world where more people are familiar with these techniques for creating projects that are more reliable :)

In my next post in the Reliable Systems series, I will focus on simulation: a powerful technique for building hardened distributed systems. We will write an implementation of CASPaxos, using QuickCheck to generate partitions and delays that simulate realistic network conditions. We will write a property that asserts clients observe linearizable histories. All while running the system in accelerated time such that we can run several orders of magnitude more tests in a couple seconds (on one laptop) than if we were to spin up a Jepsen cluster.

Many thanks to Adam Crane, Peter Kolloch, Matthias Nehlsen, Alex Clemmer, Daiyi and Yoshua Wuyts for putting up with my (even more) painfully unfocused early version, and giving me the feedback that ultimately led me to split this into an entire series, and Ethan Frey for letting me use the code from our Rust pairing session for this.