Property-based testing: what it is and how we use it at Accurx

Russell Dunphy
Published in Accurx · 14 min read · Mar 28, 2024

Property-based testing is a testing paradigm that lets us describe “properties” of our code that should hold true for any valid input. It’s a powerful tool, but one for which I personally have rarely encountered a good use case. However, when writing the engine that powers Accurx’s new “Unified Inbox”, we recently encountered just that.

In this post I’ll cover:

  • What property-based testing is
  • When it’s useful and when it isn’t
  • How and why we use it at Accurx
  • The gnarly bug it helped us uncover

Example-based testing

When most of us think about writing tests, we are usually thinking of “example-based tests”. We come up with a bunch of examples and write a test for each one. We think really hard about all the different kinds of inputs our code might encounter, and we come up with an example for each one. We hope we’ve thought of all the edge cases that might break our code 🤞🏻🤞🏻🤞🏻.

For example, given a concatenate function like this one:

export const concatenate = <T>(xs: T[], ys: T[]): T[] => {
  const result: T[] = [];
  for (const x of xs) {
    result.push(x);
  }
  for (const y of ys) {
    result.push(y);
  }
  return result;
};

We might write unit tests like this:

describe("concatenate", () => {
  test("it can concatenate two empty lists", () => {
    expect(concatenate([], [])).toEqual([]);
  });

  test("it can concatenate an empty list with a non-empty list", () => {
    expect(concatenate([], [2, 3])).toEqual([2, 3]);
  });

  test("it can concatenate two non-empty lists", () => {
    expect(concatenate([1], [2, 3])).toEqual([1, 2, 3]);
  });

  test("it can concatenate lists containing duplicates", () => {
    expect(concatenate([1, 1], [2, 3])).toEqual([1, 1, 2, 3]);
  });
});

Again, we do this by thinking really hard about all the different kinds of inputs our code might encounter, and coming up with an example for each one.

Property-based testing

Property-based testing, by contrast, allows us to focus on thinking about the properties we want to hold true for our code. What’s great about this is that we don’t have to think up the edge cases that might cause these properties to break. As a result, “property-based testing” can sometimes find bugs that “example-based testing” can’t.

We write a property-based test by creating “arbitraries” — property-based testing jargon for generators that produce inputs to our functions — and then writing a function that takes the values those arbitraries generate, calls our code with them, and makes assertions on the result.

For example, if we were to write property-based tests for our concatenate function, they might look like this:

describe("concatenate", () => {
  test("the length of concatenated lists should be the sum of the lengths of its inputs", () => {
    fc.assert(
      fc.property(
        // we specify two "arbitraries"
        fc.array(fc.anything()), // a random length array with random contents
        fc.array(fc.anything()), // a random length array with random contents

        // and write a function that takes values generated by those arbitraries...
        (xs, ys) => {
          // ...runs them through our code...
          const zs = concatenate(xs, ys);
          // ...and asserts that the given property holds true
          expect(zs.length).toEqual(xs.length + ys.length);
        }
      )
    );
  });

  test("the result of concatenate should contain every element of its input lists", () => {
    fc.assert(
      fc.property(
        // we specify two "arbitraries"
        fc.array(fc.anything()), // a random length array with random contents
        fc.array(fc.anything()), // a random length array with random contents

        // and write a function that takes values generated by those arbitraries...
        (xs, ys) => {
          // ...runs them through our code...
          const zs = concatenate(xs, ys);
          // ...and asserts that the given property holds true
          expect(every(xs, (x) => zs.includes(x))).toEqual(true);
          expect(every(ys, (y) => zs.includes(y))).toEqual(true);
        }
      )
    );
  });
});

NB. The examples in this post use fast-check, a Javascript library for property-based testing.

Given these tests, the testing framework will generate a ton of random inputs. It will then call our concatenate function with each set of inputs it generates and assert that the property we are testing holds true.

Here, we specify that the inputs to our concatenate function should be two arrays of random length, with random contents, using the fast-check “arbitraries” fc.array(fc.anything()), so we’ll get a mix of empty arrays, long arrays, short arrays, arrays containing numbers, strings, nulls, booleans, objects, other arrays, anything you can think of. All the edge cases we thought of for our example based tests will be covered, along with many more.

I’ve called these inputs “random”, but that’s not strictly true. There is randomness in how they’re generated, but they are also adversarial: creators of property-based testing frameworks know the kinds of inputs that often break code in the language they’re testing, so they make sure their frameworks generate those kinds of inputs first. We’ll see some interesting examples of this later on.

Shrinking


When a property-based testing framework finds an input that causes a property to fail, it doesn’t stop there. Because the inputs generated by the framework can be very large and messy (to give the best chance of breaking something), looking immediately at the input that caused the test to fail is likely to be pretty confusing, and not that helpful in diagnosing why the test failed. Instead, property-based testing frameworks try to “shrink” the input that caused the failure. “Shrinking” is a process of progressively trying to make the input smaller and simpler while still causing the property to fail.

For example, if the input that caused a property-based test to fail was an array, the framework might try removing elements from that array to see if the test still fails. If the array contains numbers, it might try changing those numbers to 0 to see if the test still fails. In this way, the framework tries to pare down the input to the simplest possible input that will still make the test fail. This makes it much, much easier to diagnose why the test is failing.
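To make the idea concrete, here's a toy sketch of shrinking for array inputs. It is far simpler than fast-check's real implementation, but the principle is the same: keep trying smaller inputs that still fail.

```typescript
// A toy shrinker for array inputs: repeatedly try dropping one
// element at a time, keeping any smaller array that still fails
// the property.
const shrinkArray = <T>(failing: T[], stillFails: (xs: T[]) => boolean): T[] => {
  let current = failing;
  let madeProgress = true;
  while (madeProgress) {
    madeProgress = false;
    for (let i = 0; i < current.length; i++) {
      const candidate = [...current.slice(0, i), ...current.slice(i + 1)];
      if (stillFails(candidate)) {
        current = candidate;
        madeProgress = true;
        break;
      }
    }
  }
  return current;
};

// e.g. the property "no element is negative" fails on a messy
// input, but shrinks down to a single-element counterexample:
shrinkArray([3, -1, 7, -5], (xs) => xs.some((x) => x < 0)); // => [-5]
```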

Let’s look at a couple more examples:

Testing an includes function

For the sake of this post, let’s pretend that Javascript arrays don’t already have an includes method we can use, and we need to write our own. We might start with an implementation that looks like this:

const includes = <T>(xs: T[], v: T) => {
  return !!xs.find((x) => x === v);
};

Seems reasonable? There’s actually a bug in that code which, if you’ve felt this pain before, you might be able to guess…

Let’s write some property-based tests for it:

import fc from "fast-check";

describe("includes", () => {
  test("it returns true if the value is found in the list", () => {
    fc.assert(
      fc.property(
        fc.array(fc.anything()),
        fc.anything(),
        fc.array(fc.anything()),
        (prefix, v, suffix) => {
          const xs = [...prefix, v, ...suffix];
          expect(includes(xs, v)).toEqual(true);
        }
      )
    );
  });

  test("it returns false if the value is not found in the list", () => {
    fc.assert(
      fc.property(
        fc
          .anything()
          .chain((x) =>
            fc.tuple(
              fc.constant(x),
              fc.array(fc.anything().filter((v) => v !== x))
            )
          ),
        ([x, xs]) => {
          expect(includes(xs, x)).toEqual(false);
        }
      )
    );
  });
});

These tests will flush out the bug pretty quickly. Depending on the random seed the tests ran with, you’ll probably end up with one of these failing inputs:

includes([0], 0)         // => false
includes([false], false) // => false
includes([""], "")       // => false

See the bug now? Caught out by Javascript’s “falseyness” again 😢.

How do we fix this? We need to deal with falsey values specifically, perhaps by doing something like this:

const includes = <T>(xs: T[], v: T) => {
  for (const x of xs) {
    if (x === v) return true;
  }
  return false;
};

Testing a groupBy function

Let’s write and test another function. This time, a function that groups together objects in an array based on a key in those objects. (Again, let’s pretend this function doesn’t already exist in lodash and other libraries).

Here’s our function definition:

const groupBy = <T>(xs: T[], f: (x: T) => string): Record<string, T[]> => {
  const result: Record<string, T[]> = {};
  for (const x of xs) {
    const k = f(x);
    result[k] = result[k] ?? [];
    result[k].push(x);
  }

  return result;
};

Seems reasonable enough. Let’s write some property-based tests for it:

import fc from "fast-check";
import { every, values } from "lodash";

describe("groupBy", () => {
  test("every element of the input list can be found in the output object", () => {
    fc.assert(
      fc.property(fc.array(fc.record({ k: fc.string() })), (xs) => {
        const grouped = groupBy(xs, (x) => x.k);
        expect(values(grouped).flat().length).toEqual(xs.length);
        expect(every(xs, (x) => grouped[x.k].includes(x))).toEqual(true);
      })
    );
  });
});

Running this test might give you a shock — it did me. Here’s the kind of failing case I encountered:

groupBy([{ foo: 'toString' }], (x) => x.foo)  // => uh oh
groupBy([{ foo: '__proto__' }], (x) => x.foo) // => 🙅🏻‍♀️🙅🏻‍♀️🙅🏻‍♀️

A helpful reminder that Javascript objects are not just data structures but objects, with methods and other special-case reserved keys, and that using untrusted user data as the keys of an object can have unpredictable results… 😬

How do we fix this? The bug here isn’t in our groupBy function per se - it’s not reasonable to expect groupBy to handle these cases. Instead, we can change our test arbitraries to exclude these special-case strings, and make it the responsibility of client code never to use user-provided data as the keys of objects.
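One way to do that is to filter the generated keys. Here's a hedged sketch (isSafeObjectKey is an illustrative helper, not from our codebase):

```typescript
// Reject any key that already exists on Object.prototype
// ("toString", "__proto__", "hasOwnProperty", ...).
const isSafeObjectKey = (key: string): boolean => !(key in Object.prototype);

// It could then constrain the arbitrary in our groupBy test:
// fc.record({ k: fc.string().filter(isSafeObjectKey) })
```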

When should you use property-based tests?

When should you use property-based tests? Most of the time, you shouldn’t. Property-based tests execute hundreds of times for each test run. If all your tests were property-based tests, your test suite would be incredibly slow.

Where property-based tests shine is when you have complex logic and simple properties that you always want to hold true.

For example, serializing a Javascript object to a YAML file is pretty complicated. So is deserializing a YAML file back into Javascript. But a simple property you could reasonably want to hold true is that anything you write to YAML, you can read back from YAML. As such, this is a good candidate for property-based testing.

(This is a real-life example: you can read the author of fast-check's description of finding a bug in a popular javascript YAML library, along with other useful info on property-based testing, here.)

Where does Accurx use property-based tests?

We use property-based tests to test one very specific part of our codebase: the Redux store that powers our new “Unified Inbox”. We use them here because these reducers have complex logic for which we want some simple properties to hold true.

The most important reducer used to update the inbox’s Redux store is called processUpdates: it is responsible for processing updates to several different entities from a variety of sources - both API responses and websocket events - and merging them into the Redux store in such a way that the client has a consistent and correct view of the world.

To achieve this, there are two very important properties it must uphold:

  1. it must be idempotent, as we may receive duplicate updates due to at-least-once delivery
  2. it must be resistant to updates coming in out of order, as websocket events and API responses can arrive in different orders for a variety of reasons

Because Redux asks you to model your reducers as pure functions over immutable state, it turned out it was very easy to write property-based tests for these properties. The hardest part was writing a custom “arbitrary” to generate valid updates; once we had that, the tests themselves were easy.

Here’s the test that checks our updates are idempotent:

describe("Idempotency", () => {
  test("processUpdates calls are idempotent", () => {
    fc.assert(
      fc.property(
        // a random seed
        fc.nat(),
        // a random length array of randomly generated "inbox updates"
        fc.array(gen.inboxUpdates()),
        (seed, updatesArray) => {
          // we use a pseudorandom int generator with a seed
          // so test runs are replicable
          const random = pseudoRandomIntGenerator(seed);

          // we create two redux stores
          const store1 = createStore();
          const store2 = createStore();
          for (const [, updates] of updatesArray) {
            // we dispatch each update a single time to store1
            store1.dispatch(actions.processUpdates(updates));
            // we dispatch each update between 1 and 3 times to store2
            const duplicateCount = random(4, 1);
            for (let i = 0; i < duplicateCount; i++) {
              store2.dispatch(actions.processUpdates(updates));
            }
          }
          // and we assert that both store1 and store2 have
          // ended up with exactly the same state
          expect(store1.getState()).toEqual(store2.getState());
        },
      ),
    );
  });
});
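pseudoRandomIntGenerator is a small in-house helper. Here's a sketch of how such a helper might look, using the well-known mulberry32 algorithm (our real implementation may differ; the signature matches the usage above):

```typescript
// A seeded integer generator: random(maxExclusive, min) returns
// an integer in [min, maxExclusive), deterministically for a
// given seed.
const pseudoRandomIntGenerator = (seed: number) => {
  let state = seed >>> 0;
  return (maxExclusive: number, min = 0): number => {
    // one mulberry32 step, producing a float in [0, 1)
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    const unit = ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    return min + Math.floor(unit * (maxExclusive - min));
  };
};
```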

And here’s the test that checks our reducer can handle out of order updates:

describe("Out of order updates", () => {
  test("processUpdates with out of order calls will converge on the same `conversations` state", () => {
    fc.assert(
      fc.property(
        fc.nat(), // a random seed
        fc.array(gen.inboxUpdates()),
        (seed, updatesArray) => {
          // we use a pseudorandom shuffler with a seed
          // so test runs are replicable
          const shuffle = pseudoRandomShuffler(seed);
          // we create two redux stores
          const store1 = createStore();
          const store2 = createStore();
          // we dispatch each update in order to store1
          for (const [, updates] of updatesArray) {
            store1.dispatch(actions.processUpdates(updates));
          }
          // we shuffle the updates, then dispatch them
          // in random order to store2
          for (const [, updates] of shuffle(updatesArray)) {
            store2.dispatch(actions.processUpdates(updates));
          }
          // and we assert that both store1 and store2 have
          // ended up with exactly the same state
          expect(store1.getState()).toEqual(store2.getState());
        },
      ),
    );
  });
});
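pseudoRandomShuffler is another small in-house helper. One possible sketch is a Fisher-Yates shuffle driven by a seeded PRNG (mulberry32 again, as an assumption; the real implementation may differ):

```typescript
// A seeded shuffler: pseudoRandomShuffler(seed) returns a function
// that shuffles an array deterministically, without mutating it.
const pseudoRandomShuffler = (seed: number) => {
  let state = seed >>> 0;
  const nextUnit = (): number => {
    // one mulberry32 step, producing a float in [0, 1)
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
  return <T>(xs: T[]): T[] => {
    const result = [...xs];
    // classic Fisher-Yates: swap each element with a random
    // not-yet-visited position
    for (let i = result.length - 1; i > 0; i--) {
      const j = Math.floor(nextUnit() * (i + 1));
      [result[i], result[j]] = [result[j], result[i]];
    }
    return result;
  };
};
```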

What challenges did we face?

I mentioned that the hardest part of writing these tests was creating a custom “arbitrary” for generating random updates to the state. This was challenging because we needed to create a bunch of different entities — conversations, patients, users, and more — and link them all together. This is technically possible in fast-check using fc.chain(), which lets you generate arbitraries based on the results of other arbitraries, but in our experience it made the tests unusably slow. So in the end we went for a less than ideal, slightly hacky solution, in which we generated the entities separately and then combined them using fast-check's map. This is a slightly simplified version of what we ended up with:

export const inboxUpdates = ({
  minConversations,
}: { minConversations?: number } = {}) =>
  fc
    .tuple(
      // random seed so we can have replicable test runs
      // with pseudorandom behaviour
      fc.integer({ min: 0 }),

      // randomly generated arrays of conversations,
      // patients, users and teams
      fc.record<InboxUpdates>({
        conversations: fc.uniqueArray(conversation(), {
          minLength: minConversations ?? 0,
          selector: (c) => c.id,
        }),
        patients: fc.uniqueArray(patient(), {
          minLength: 1,
          selector: (c) => c.patientId,
        }),
        users: fc.uniqueArray(user(), {
          minLength: 1,
          selector: (c) => c.id,
        }),
        teams: fc.uniqueArray(team(), {
          minLength: 1,
          selector: (c) => c.id,
        }),
      }),
    )
    .map(([seed, updates]): [number, InboxUpdates] => {
      /**
       * This mapping function links up the generated conversations
       * with the generated users, teams and patients.
       * We have to do it this way rather than using fc.chain
       * because fc.chain causes issues with shrinking, making
       * the tests time out. (See <https://fast-check.dev/docs/core-blocks/arbitraries/combiners/any/#chain>
       * for details.)
       * So we start by generating conversations
       * that could have any random assignee.id or patientId
       * values, and then we overwrite them with valid ids from the
       * generated patients, users, and teams.
       *
       * The end result is a consistent set of concierge updates,
       * and a random seed, which we also return so it
       * can be logged by the test on failure.
       */
      const random = pseudoRandomIntGenerator(seed);
      const { users, teams, patients } = updates;
      const conversations = updates.conversations.map((c) => {
        let assignee: AssigneeSummary;
        let patientId: string | null;
        if (c.assignee.type === "User") {
          assignee = {
            type: "User",
            // pick a random user as the assignee
            id: users[random(users.length)].id,
          };
        } else if (c.assignee.type === "Team") {
          assignee = {
            type: "Team",
            // pick a random team as the assignee
            id: teams[random(teams.length)].id,
          };
        } else {
          assignee = c.assignee;
        }
        if (c.patientId !== null) {
          // pick a random patient
          patientId = patients[random(patients.length)].patientId;
        } else {
          patientId = null;
        }
        return { ...c, assignee, patientId };
      });
      return [seed, { conversations, patients, users, teams }];
    });

What bugs did we find?

We found one or two bugs in our handling of out-of-order updates, but nothing major. One slightly frustrating thing about property-based testing is that it often throws up “bugs” caused by invalid inputs that you know are never going to happen in real life, and so (in most systems) would not be worth coding defensively around. So at first, though we were happy with the extra level of safety and confidence the tests provided, we weren’t blown away by the results: we hadn’t found anything particularly significant. Then, about a week later, one of our property-based tests started failing on a PR…

The PR made changes to an area of the code that is particularly fiddly. When users visit particular areas of the inbox, we show them a “conversation list”: a subset of their conversations with a specific filter and sort applied. For performance reasons, we keep this list sorted in the Redux store, and when we receive new or updated conversations we insert and remove them using custom binary search functions that make use of lodash's orderBy function - which allows us to do binary search across multiple fields in both ascending and descending order.

The test asserted that, no matter what order we received conversation updates in, and no matter how many duplicate updates we encountered, we would always end up with the conversations in the correct order and with no duplicates.

But all of a sudden, we started seeing duplicate conversations causing the test to fail. In particular, we were seeing duplicates when sorting a list of conversations by comparator functions that could return false.

Why were our comparator functions returning false? Our inbox handles a bunch of different kinds of message, including individual messages, batch messages, and scheduled messages. Some of our sort comparators make sense for some of those message types, but not all. For example, we might want to sort our messages by scheduledSendDate - a field that’s only present on scheduled messages. Because our sort comparators have to work on any message type, in cases where the comparator doesn’t really make sense, we chose an arbitrary value to return instead: false.

This, it turned out, was a bad idea. Read on to find out why…

A bug in lodash?

I love lodash's orderBy function. It’s a really succinct way of expressing multi-field ordering that allows you to specify different sort directions for the individual fields. For example:

const xs = [
  { a: 4, b: null },
  { a: 5, b: null },
  { a: 7, b: "something" },
  { a: 1, b: "something else" },
];

_.orderBy(xs, ['b', 'a'], ['asc', 'desc'])
// returns:
[
  { a: 7, b: "something" },
  { a: 1, b: "something else" },
  { a: 5, b: null },
  { a: 4, b: null },
]

The above code sorts xs first by the field b ascending, and then a descending. Two of the b fields are null, and though sorting null with some strings arguably doesn’t make complete sense, lodash picks a reasonable default - strings first, then nulls - and sorts the input consistently by this value first, and then the a field (in descending order).

A very different thing happens if you try to sort by a field which could be either a string or false, however:

const xs = [
  { a: 4, b: false },
  { a: 5, b: false },
  { a: 7, b: "something" },
  { a: 1, b: "something else" },
];

_.orderBy(xs, ['b', 'a'], ['asc', 'desc'])
// returns:
[
  { a: 7, b: "something" },
  { a: 5, b: false },
  { a: 4, b: false },
  { a: 1, b: "something else" },
]

What is going on here?? Well, it turns out that if you try to order by a field which could contain a string OR false, lodash completely ignores that field when sorting. As you can see from the example above, the result is ordered by the a field only. b is not taken into account at all.

While it’s reasonable (again) to say that there’s no canonically “correct” way to sort an array containing strings and false values, at least in the string + null example above, lodash picked an option and was consistent with it.

The fact that orderBy ignores fields that include both string and false values completely broke our binary search functions - which was what led to duplicates appearing in our conversation list. The fix: raise an issue on the lodash repo and change our comparator functions to return a different arbitrary value for messages for which the comparator doesn’t make sense: null.
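Here's a hypothetical sketch of that change (the message shape and helper name are illustrative, not our real code):

```typescript
type Message =
  | { type: "scheduled"; scheduledSendDate: string }
  | { type: "individual"; sentAt: string };

// Before, this returned false for non-scheduled messages, which
// made lodash's orderBy silently ignore the whole sort field.
// Returning null instead gives lodash a value it sorts consistently.
const scheduledSendDateKey = (m: Message): string | null =>
  m.type === "scheduled" ? m.scheduledSendDate : null;
```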

We would never have thought to write an example based test for this. It would have been an extremely rare edge case in production. But if/when it had happened, we would have been completely lost trying to diagnose it.

Conclusion

Property-based tests are a powerful tool for flushing out edge cases you would never have thought of yourself. They are a completely different testing paradigm to the typical example-based unit tests most of us are used to. However, they are best used only when you have complex logic for which simple properties must hold true.
