Breaking the database CI speed limit with Spawn

Published in Spawn DB · 5 min read · Feb 8, 2021

It can often feel like your continuous integration (CI) environment has an artificial speed limit when databases are involved. Databases hosted on shared infrastructure to support CI are likely to slow you down, either through contention for the resource or because access is artificially limited to stop concurrent tests from interfering with one another.

Pets, not cattle

A lot of the time, this comes down to the database being treated as a pet, not cattle. Hand-crafted, long-lived environments with carefully curated test data tend to be the most obvious way to create these kinds of CI-supporting databases, as there often isn't a better option, particularly when that CI database might contain huge volumes of data.

The move towards databases-as-cattle in CI is gathering pace with the growing adoption of containerised database environments. But this still comes with its own set of problems:

Setup code needs to be written to bring these up
Care needs to be taken to tear these down once finished with
Networking needs to be set up appropriately to connect to them
Data volumes need to be somehow copied in, or restored when the instance comes up
Each database engine has its own set of quirks for starting up, including appropriate “readiness checks”

It can take a while to get all of the above right, and sometimes not everything can be solved adequately. Large datasets are a particular problem: even with containerisation, the data volume needs to be handled carefully and replicated for isolation.
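To make the "readiness check" point concrete: in a container-based setup without Spawn, the check is typically a retry loop around a probe command. The following is a minimal sketch, not any particular team's implementation. `wait_until_ready` is a hypothetical helper name, and the host and port in the usage comment are placeholders.

```shell
# Retry a probe command until it succeeds or the attempts run out.
# Usage: wait_until_ready <max_attempts> <interval_seconds> <command...>
wait_until_ready() {
  local max_attempts="$1"
  local interval="$2"
  shift 2
  local attempt=1
  # Run the probe, discarding its output; only the exit status matters.
  until "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "not ready after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
}

# Example (placeholder host and port; pg_isready ships with PostgreSQL):
# wait_until_ready 30 2 pg_isready -h localhost -p 5432
```

Each engine needs its own probe command here, which is exactly the kind of engine-specific code that tends to accumulate in CI scripts.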

Cattle, not pets

This is where Spawn can help. Spawn is specifically built for instantly provisioning databases from code in your source control repository, regardless of the size of the data in that database. It also introduces a common paradigm for provisioning database instances for multiple database engines, so there’s no need to know the intricacies of how to bring these different environments up.

We’ve already seen Spawn enable a team to cut their CI setup code by 50% and their overall test time by 30%. This blog post will go into how that has been achieved.

Hooking up Spawn in CI environments

Let’s take a look at an example of how Spawn can be configured in a CI pipeline.

This example application is a straightforward 3-tier app (React frontend, .NET Core Web API, backed by a Microsoft SQL Server account database and a PostgreSQL application database):

The example application we’re building a CI pipeline for, which will use Spawn

The details aren’t hugely important, but we’ve got a variety of tests written that need to leverage the database. We want to run those integration and migration tests in CI, and this is where Spawn can help.

Provisioning databases for schema migration tests

Testing that your schema migrations on a branch will successfully apply to production without data loss is crucial, but doing this can often be hard as dev or CI environments rarely contain the same data as production. Restoring backups of prod is possible for these kinds of tests, but it can take an extremely long time to do it depending on the overall size.

In our case, production backups are taken on a nightly basis. The big difference is that these backups are used to create Spawn Data Images, which allow instant provisioning of unlimited copies of the database (called Spawn Data Containers), regardless of the overall size.

Data Images are defined via a YAML file and these are recreated on a nightly basis so the CI tests have an up-to-date copy of the production database.

We can now use spawnctl in a script to provision these databases for use in our CI pipeline:

# Provision a data container per database and read back its connection string
todoContainerName=$(spawnctl create data-container --image todo-list:latest -q --lifetime 15m)
todoConnectionString=$(spawnctl get data-container "$todoContainerName" -o json | jq -r '.connectionString')
accountContainerName=$(spawnctl create data-container --image account:latest -q --lifetime 15m)
accountConnectionString=$(spawnctl get data-container "$accountContainerName" -o json | jq -r '.connectionString')
# Quote the connection strings so they are passed as a single argument
./todo-schema-migration-test.sh "$todoConnectionString"
./account-schema-migration-test.sh "$accountConnectionString"

Above we provision the two databases, run our schema migration tests, and then leave it up to Spawn to clean up the databases after 15 minutes. That means every time this pipeline runs, we end up with completely isolated database instances, instantly provisioned from the latest production backup for our tests, which are then automatically cleaned up once we’re finished. There’s also no code written to poll databases for readiness, or any database-engine-specific code written for provisioning them. Spawn does all of the heavy lifting for you: just ask for a database, and when spawnctl returns it’s good to go.
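Since the create-then-get pair is repeated for each database, it can be wrapped in a small function. This is only a sketch: `provision_connection_string` is a hypothetical helper name, it uses nothing beyond the spawnctl and jq invocations from the script above, and the check for an empty or `null` result is an assumption about how a failed lookup would surface.

```shell
# Create a data container from an image and print its connection string.
# Usage: provision_connection_string <image> [lifetime]
provision_connection_string() {
  local image="$1"
  local lifetime="${2:-15m}"
  local name conn
  name=$(spawnctl create data-container --image "$image" -q --lifetime "$lifetime") || return 1
  conn=$(spawnctl get data-container "$name" -o json | jq -r '.connectionString') || return 1
  # jq -r prints the literal string "null" when the field is missing.
  if [ -z "$conn" ] || [ "$conn" = "null" ]; then
    echo "no connection string returned for container '$name'" >&2
    return 1
  fi
  printf '%s\n' "$conn"
}

# Example, using the image name from the script above:
# todoConnectionString=$(provision_connection_string todo-list:latest)
# ./todo-schema-migration-test.sh "$todoConnectionString"
```

Failing early when no connection string comes back keeps a provisioning problem from showing up later as a confusing test failure.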

Provisioning databases for integration tests

We can also leverage Spawn for integration tests. In this case, we’ve hooked up Spawn with some C# NUnit integration tests that hit the API and subsequently cause changes in the databases:

Constructing the dependencies with the Spawn data container connection strings

Here, we’re using a Spawn client in the “SetUp” method for each test to create a Spawn data container. This will be used in the API Controller’s store to interact with the database. We then run each test, which exercises the controller behaviour and asserts its correctness.

We’ve taken advantage of parallelisation here. We’re provisioning an independent data container for every test in this fixture. This means that there’s no concern over tests modifying the state of the database while others are reading from it and ending up with unexpected state. It also means we can complete our tests quicker, as everything runs in parallel.
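The same idea carries over to the pipeline script: because each job has its own data container, independent jobs such as the two schema migration tests can run side by side. Below is a minimal shell sketch of that pattern, not taken from the demo repository. `run_in_parallel` is a hypothetical helper, and it assumes each job is a command string you control.

```shell
# Run each argument as a shell command in the background,
# then wait for all of them and fail if any job failed.
run_in_parallel() {
  local pids=""
  local failed=0
  local cmd pid
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"
  done
  # Wait on each job individually so every exit status is checked.
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}

# Example, using the script names from the migration test step:
# run_in_parallel \
#   "./todo-schema-migration-test.sh '$todoConnectionString'" \
#   "./account-schema-migration-test.sh '$accountConnectionString'"
```

Waiting on each process ID separately, rather than calling a bare `wait`, is what lets the helper report a failure from any one job.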

Show me the code

If you want to take a look at this in action, check out the open source spawn demo repository on GitHub.

Summary

As demonstrated, Spawn has unshackled our CI pipeline from the chains of a shared, long-lived database. Instead, we now have completely isolated, temporary databases provisioned independently for every single pipeline that gets run.

The tests can now also be parallelised, provisioning independent Spawn data containers for each test that gets run. This can greatly reduce the overall time taken for your CI pipelines, as you’re no longer constrained by bottlenecked access to a single shared database.

If you’d like to get access to Spawn to take advantage of these benefits, then get started now at https://spawn.cc.