Randomized Testing of Cloud Spanner

Jay Corbett
8 min read · Aug 20, 2020


One of the secrets behind Cloud Spanner quality is randomized testing. SQL databases like Cloud Spanner have complex APIs. Complete unit test coverage for all the corner cases of SQL constructs and schema features, let alone their interactions, is nearly impossible. We use randomized testing to greatly augment our hand-written test cases and find bugs we were not clever enough to imagine.

Here, I will give an overview of the kinds of randomness we introduce and how we leverage this randomness in both unit tests and large-scale system tests. A key part of our strategy is the ability to validate the results of these random operations using a Spanner emulator and Spanner’s strong consistency guarantees.

Playing Dice

In randomized tests, we mutate Spanner databases with random transactions, which comprise inserts, updates, and/or deletes performed using the write API or DML statements. Transactions can also contain reads and/or SQL queries. We model transactions as a sequence of actions, and use modular action generators to assemble a transaction from a sequence of randomly generated actions.
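
To make this concrete, here is a minimal sketch of that structure in Python; the action types and generator names are illustrative, not Spanner’s actual internals:

```python
import random

def gen_write(rng, schema):
    # One modular generator: produce a single random write action.
    table = rng.choice(list(schema))
    op = rng.choice(["INSERT", "UPDATE", "DELETE"])
    return (op, table)

def gen_read(rng, schema):
    # Another generator: produce a single random read action.
    table = rng.choice(list(schema))
    return ("READ", table)

def gen_transaction(rng, schema, max_actions=5):
    # A transaction is just a sequence of randomly generated actions.
    generators = [gen_write, gen_read]
    return [rng.choice(generators)(rng, schema)
            for _ in range(rng.randint(1, max_actions))]

rng = random.Random(42)  # seeded for reproducibility
schema = {"Users": ["id", "name"], "Orders": ["id", "user_id"]}
print(gen_transaction(rng, schema))
```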

The most complex action generator is the Random Query Generator (RQG), which generates random SQL queries and DML statements in the subset of GoogleSQL supported by Cloud Spanner. The RQG uses reflection on the database schema to understand the functions, operators, types, and relationships between tables, so the vast majority of statements it generates are valid (i.e., this is not a fuzz tester). Reads and writes are generated using simpler generators that read and/or mutate random columns of randomly selected rows with random values.
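
A toy illustration of schema-driven generation, far simpler than the real RQG: because the generator only ever references tables and columns it finds in the schema, the SQL it emits is almost always valid.

```python
import random

# Illustrative schema; the real RQG reflects on the live database schema.
SCHEMA = {"Users": ["id", "name", "age"], "Orders": ["id", "user_id", "total"]}

def gen_query(rng):
    # Pick a real table, a subset of its real columns, and a real
    # predicate column, so the resulting statement is well-formed.
    table = rng.choice(list(SCHEMA))
    cols = rng.sample(SCHEMA[table], k=rng.randint(1, len(SCHEMA[table])))
    pred_col = rng.choice(SCHEMA[table])
    return f"SELECT {', '.join(cols)} FROM {table} WHERE {pred_col} IS NOT NULL"

rng = random.Random(7)
print(gen_query(rng))
```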

The random rows and values used in these actions are selected using a Random Data Generator (RDG), which can pick “interesting” values for the operations. We would like the majority of transactions to succeed, but since both the schema and the kind of operation impose constraints on what values are acceptable, completely random values would cause many transactions to fail. For example, an Update mutation requires that the row already exists, so the write generator will ask the RDG for the key of an existing row for use in these mutations. Similarly, choosing a value for a column for which there is a UNIQUE INDEX requires that the value not exist in any other row. The RDG extracts a set of constraints from the schema, the actions in the transaction, and a sample/range of the current table contents, then uses a constraint solver (CVC4) to compute a set of values satisfying the constraints (if possible), falling back on random values if a solution cannot be found within a deadline.
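
The solve-or-fallback flow might look like the following sketch, where `toy_solve` is a brute-force stand-in for the real constraint extraction and CVC4 binding, and all names are hypothetical:

```python
import random

def pick_values(rng, variables, predicates, solve, timeout_s=1.0):
    # Ask the solver for values satisfying all constraints.
    solution = solve(variables, predicates, timeout_s)
    if solution is not None:
        return solution
    # Fallback: purely random values; the resulting transaction may simply
    # fail, which is acceptable for a minority of cases.
    return {v: rng.randint(0, 10**6) for v in variables}

def toy_solve(variables, predicates, timeout_s):
    # Tiny brute-force search domain; a real solver scales far beyond this.
    for guess in range(1000):
        assignment = {v: guess + i for i, v in enumerate(variables)}
        if all(p(assignment) for p in predicates):
            return assignment
    return None

rng = random.Random(1)
# e.g., a UNIQUE INDEX constraint: the new value must avoid existing values.
existing = {3, 4, 5}
print(pick_values(rng, ["col"], [lambda a: a["col"] not in existing], toy_solve))
```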

In addition to random transactions, we also generate random schema changes using a Random Schema Generator (RSG). Tests start with a fixed schema. Periodically, the RSG takes the current schema and produces a series of random modifications to it, e.g., adding/dropping tables/columns/indexes, adding/dropping check/foreign key constraints, or modifying various schema options. The change statements themselves are usually valid, but may fail depending on the contents of the database (e.g., adding a unique index requires the indexed values already be distinct).
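
A sketch of what such a generator might look like; the DDL forms shown are a tiny subset of what the real RSG covers, and the names are illustrative:

```python
import random

def gen_schema_change(rng, schema):
    # Choose a random kind of DDL statement against a random existing table.
    kind = rng.choice(["ADD_COLUMN", "DROP_COLUMN", "ADD_INDEX"])
    table = rng.choice(list(schema))
    if kind == "ADD_COLUMN":
        return f"ALTER TABLE {table} ADD COLUMN c{rng.randint(0, 99)} INT64"
    if kind == "DROP_COLUMN" and schema[table]:
        return f"ALTER TABLE {table} DROP COLUMN {rng.choice(schema[table])}"
    return f"CREATE INDEX Idx{rng.randint(0, 99)} ON {table}({rng.choice(schema[table])})"

rng = random.Random(3)
schema = {"Users": ["id", "name"]}
print(gen_schema_change(rng, schema))
```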

Finally, we also generate API calls to change the partitioning and/or replication of the tables in a random way. This should not affect any read/query results, but much of the complexity of executing transactions on Spanner results from the distribution of the data across multiple servers and the resulting coordination necessary to provide a single system image, so such changes are interesting, especially in larger system tests.

Knowing Right from Wrong

Simply running random transactions may uncover bugs that cause servers to crash or return unexpected errors, but ideally we would like to know that the results of reads and queries are consistent with the preceding sequence of mutations (our users are certainly counting on this). The relatively weak consistency model provided by most distributed databases makes this a challenging problem. By contrast, Spanner always provides external consistency: every transaction has a timestamp, and reads/queries at a timestamp will reflect all committed transactions up to that timestamp. This property allows us to validate read/query results efficiently, as we will explain below.

We have implemented an in-memory emulator for Spanner that we call MiniSpanner (note that this is different from the public Cloud Spanner Emulator). MiniSpanner implements only Spanner’s data model: the schema and current contents of tables and indexes. It does not implement concurrency, distribution, replication, or the interfaces to low-level storage used by Spanner. MiniSpanner is what is known as an oracle in software testing, intended to answer one question: if we apply this series of mutations to Spanner and then run this read/query, what should the results be?

Because MiniSpanner is much simpler than Spanner, we can trust it to give the correct answer (though it is also validated with thousands of hand-checked queries). A SQL join in Spanner involves selecting an up-to-date replica, streaming results between processes, handling cases where a process restarts, collecting all the data across partitions, etc. A join in MiniSpanner is just a couple of nested for loops over the maps holding the table contents.
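
For example, an inner join over in-memory maps can be as simple as this sketch (illustrative; the real MiniSpanner code is internal):

```python
# Nested-loop inner join in the spirit of MiniSpanner's oracle; tables are
# plain in-memory maps from primary key to row dict.
def nested_loop_join(left, right, left_col, right_col):
    out = []
    for lrow in left.values():
        for rrow in right.values():
            if lrow[left_col] == rrow[right_col]:
                out.append({**lrow, **rrow})
    return out

users = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
orders = {10: {"oid": 10, "user_id": 1}, 11: {"oid": 11, "user_id": 2}}
print(nested_loop_join(users, orders, "id", "user_id"))
```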

For the Spanner read API, MiniSpanner can compute the exact set of rows and columns returned. For SQL queries, MiniSpanner computes a representation of all possible results. Some SQL constructs may not precisely determine the ordering of their results, but we can represent all valid partial orderings for result matching. Other SQL constructs (e.g., the RAND() function or a LIMIT without a sort) can produce non-deterministic results: in this case, we simply accept any possible result.
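
For the fully unordered case, result matching reduces to multiset equality, as in this sketch (the real matcher also handles partial orderings):

```python
from collections import Counter

# Order-insensitive result matching: when a query does not fully determine
# row order, compare the result sets as multisets of rows.
def rows_match_unordered(actual, expected):
    return Counter(map(tuple, actual)) == Counter(map(tuple, expected))

print(rows_match_unordered([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
```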

Random Walks One Step at a Time

The simplest randomized tests we run leverage a test Spanner, which has a full Spanner stack in-process, using the local disk for persistent storage. These single-process tests repeat the following steps until some transaction or time limit is reached (a minimal sketch follows the list):

  1. Generate a random transaction.
  2. Execute the transaction on the test Spanner.
  3. Execute the transaction on MiniSpanner.
  4. Compare the results of steps 2 and 3. The transaction should have succeeded or failed on both test Spanner and MiniSpanner, and if the transaction performed reads/queries, the results from Spanner should be equivalent to the results from MiniSpanner.
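
Here is a self-contained sketch of that loop; `FakeDB` stands in for both the in-process test Spanner and MiniSpanner, and all names are illustrative rather than the real framework’s API:

```python
import random
from dataclasses import dataclass

@dataclass
class Result:
    status: str
    rows: list

class FakeDB:
    # Stand-in for either database under comparison.
    def __init__(self):
        self.data = {}
    def execute(self, txn):
        for op, key, value in txn:
            if op == "PUT":
                self.data[key] = value
        return Result("COMMITTED", sorted(self.data.items()))

def gen_transaction(rng):
    return [("PUT", rng.randint(0, 9), rng.random()) for _ in range(3)]

rng = random.Random(0)
spanner, mini = FakeDB(), FakeDB()
for _ in range(100):
    txn = gen_transaction(rng)           # 1. generate a random transaction
    real = spanner.execute(txn)          # 2. run it on (fake) test Spanner
    oracle = mini.execute(txn)           # 3. run it on the (fake) oracle
    assert (real.status, real.rows) == (oracle.status, oracle.rows)  # 4. compare
```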

We call this test structure parallel emulation. Some parallel emulation tests might throw in additional operations between transactions, such as random schema changes or changes to the partitioning of the data. Replication changes are not interesting for these tests since test Spanner is usually configured to keep only one replica for efficiency.

The randomness in these tests is pseudorandom, based on a seed chosen randomly at the start of the test. If the test fails for a particular seed, the seed can be fed back into the test to rerun the same sequence of transactions and reproduce the failure. In fact, all the transactions we generate are pseudorandom functions of a transaction key, which uniquely identifies the transaction within the test. We will use this property below.
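
A sketch of this keyed pseudorandomness (names and seed derivation are illustrative):

```python
import random

# Every transaction is a deterministic function of (test seed, transaction
# key), so recording just the key is enough to regenerate the transaction.
TEST_SEED = 1234  # chosen randomly at test start and printed for reproduction

def rng_for_transaction(txn_key):
    # Derive a per-transaction generator from the test seed and the key.
    return random.Random(f"{TEST_SEED}:{txn_key}")

# The same key always yields the same transaction.
assert rng_for_transaction(42).random() == rng_for_transaction(42).random()
```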

Concurrent Random Walks

Parallel emulation does not exercise concurrency and so will not find a whole class of bugs resulting from races. For these, we run tests that have multiple threads executing random transactions in parallel against the same database. The threads are not coordinated in any way: any threads might execute a transaction that reads/writes any row of any table at any time. In this situation, how can we know the expected result of a read/query?

As we mentioned above, Spanner provides very strong consistency semantics, and this allows us to validate transactions precisely and efficiently using transaction logging:

  1. We add a special log table to the schema that is never the target of the random transactions we generate.
  2. When we execute a mutating transaction, we add an extra mutation recording the transaction’s key and commit timestamp to the log table. We leverage the atomicity of Spanner’s transactions here: the log is written if and only if the transaction is committed.

Whenever we commit a transaction that performed reads/queries and we want to validate those results, we do the following (sketched in code after the list):

  1. Read all the logs in the log table up through the commit timestamp of the transaction.
  2. Replay the logs, in commit timestamp order, on a MiniSpanner. Since the log contains the transaction’s key, it can be used to regenerate the mutations in the transaction.
  3. Run the read/query on MiniSpanner and compare the results to those returned by Spanner.
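
Putting the logging and replay together, a self-contained sketch might look like this; the log table is modeled as a timestamp-sorted list, and `regenerate` stands in for rebuilding a transaction’s mutations from its key:

```python
# The log table is a list of (commit_ts, txn_key) sorted by commit timestamp;
# regenerating mutations from the key works because transactions are
# pseudorandom functions of that key.
def replay_state(log, read_ts, regenerate):
    state = {}
    for commit_ts, txn_key in log:
        if commit_ts > read_ts:
            break                       # external consistency: later txns invisible
        for key, value in regenerate(txn_key):
            state[key] = value          # apply the regenerated mutations
    return state

# Toy usage: transaction k writes row k with value k * 10.
log = [(100, 1), (200, 2), (300, 3)]
regen = lambda k: [(k, k * 10)]
assert replay_state(log, read_ts=250, regenerate=regen) == {1: 10, 2: 20}
```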

Spanner’s external consistency allows our test framework to know the precise set of transactions whose effects should be visible to any committed transaction.

These tests may also perform schema changes in parallel with the transactions. Schema changes in Spanner take effect atomically at a specific timestamp. We log these changes in a separate log table, and when we replay transactions in the logic above, we apply the schema change on MiniSpanner at the correct place in the sequence.

Unlike parallel emulation tests, these tests have nondeterminism from the scheduling of the threads, so failures can be more challenging to reproduce. Although we run some single-process tests using this technique, we use it most heavily in our large-scale system tests, which we’ll discuss next.

Random Walks at Scale

Although test Spanner is useful and can reproduce many kinds of errors, the real Spanner service (production Spanner) runs in a very different environment. Production Spanner depends on several other systems in Google’s Cloud (e.g., a distributed file system), most of which are mocked out in test Spanner. Production Spanner runs across many machines and is configured to keep several replicas of any data for availability. Most importantly, production Spanner is subject to many kinds of faults, from file system errors to network issues.

To have confidence that production Spanner will serve correct results in the presence of faults, we run large-scale system tests that create a Spanner instance similar to the production service and hit this instance with randomly generated load (transactions, schema changes, partitioning and replication changes) as well as faults (e.g., process crashes, file errors/corruptions, RPC delays/errors, unavailability of a region). These system test instances, backed by hundreds of processes, are much smaller than the production service, but allow us to run between hundreds of thousands and millions of transactions in a 12-hour test.

To scale up, we run transactions in many threads across many test clients. As before, these threads are not coordinated in any way. For validation, we use a sharded variant of transaction logging. Since the size of the database generated by hours of writes from many clients is far too large to fit into the memory of any given test client, we create a logical partition of the database into many log scopes such that the contents of a log scope can easily fit into memory (e.g., tens of megabytes). Conceptually, each log scope has its own log table, and we can reconstruct the state of a log scope by replaying only the logs for that log scope. At a high level, the sharded validation works as follows:

  1. Partition the database into log scopes and create log tables.
  2. When executing a mutating transaction, add extra mutation(s) recording the transaction’s key and commit timestamp to each log table whose log scope was mutated by the transaction.

After committing a transaction that performed reads/queries (see the sketch after this list):

  1. Read all the logs for the log scopes that were read/queried, up through the commit timestamp of the transaction.
  2. Replay the logs, in commit timestamp order, on a MiniSpanner.
  3. Run the read/query on MiniSpanner and compare the results to those returned by Spanner.
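
A sketch of the sharded variant: each log scope has its own log, and validating a read replays only the scopes it touched (all names illustrative):

```python
def validate_scoped_read(scopes_read, logs_by_scope, read_ts, regenerate):
    state = {}
    for scope in scopes_read:
        # Each scope's log fits comfortably in a test client's memory.
        for commit_ts, txn_key in logs_by_scope[scope]:
            if commit_ts > read_ts:
                break                   # later transactions must be invisible
            for key, value in regenerate(txn_key, scope):
                state[key] = value      # apply the regenerated mutations
    return state

# Toy usage: each scope's log holds (commit_ts, txn_key) in timestamp order.
logs_by_scope = {"Users[0,1000)": [(100, 7)], "Users[1000,2000)": [(150, 8)]}
regen = lambda k, scope: [((scope, k), k)]
print(validate_scoped_read(["Users[0,1000)"], logs_by_scope, 200, regen))
```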

One important restriction is that the actions in a transaction must be confined to a single log scope. Log scopes are typically a range of rows in a table, so this is easy for the random generators to ensure. For SQL queries, we always add a WHERE clause that limits the row key to the range of a particular log scope, so some kinds of queries (e.g., “SELECT COUNT(*) FROM Table”) are not supported in these system tests.
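
For example, scoping a generated query might amount to appending a key-range predicate, as in this sketch (column name and bounds are illustrative, and it assumes the base query has no WHERE clause of its own):

```python
# Confine a generated query to one log scope by appending a key-range
# predicate over the table's row key.
def scope_query(base_query, key_col, lo, hi):
    return f"{base_query} WHERE {key_col} >= {lo} AND {key_col} < {hi}"

print(scope_query("SELECT COUNT(*) FROM Users", "UserId", 1000, 2000))
```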

By combining many kinds of random operations, randomly injected faults, relatively large scale, and long run times, these tests find very interesting bugs triggered only by specific sequences of events or interactions. On the downside, they can be difficult to debug.

Conclusion

Cloud Spanner uses randomized testing to explore the enormous space of possible transactions, schemas, and system states (e.g., partitioning). These tests are in addition to unit and integration tests, performance tests, and version compatibility tests. Randomized tests are run continuously and at scale to maintain Cloud Spanner’s exceptional quality.
