Kubernetes testing: from Minikube to multi-tenant clusters

Matthew Flatt
Published in Spawn DB
Mar 3, 2021 · 5 min read

Testing Spawn has always been problematic; with so many moving parts and a reliance on technologies such as Kubernetes, unit tests will never be enough to give us confidence that our application is production quality. This blog post takes a look at our application, the problems and constraints we face when testing, and what we have done in the last few months to take a flaky 2.5–3 hour test run down to a consistent 25–30 minute run that is much more representative of production.

Our application

Spawn is a cloud-hosted service that delivers databases on demand for dev, CI and testing workflows, complete with data, in a matter of seconds.

Behind the scenes, every database is a container, and Kubernetes orchestrates them as well as hosting the application itself. A dependency external to the cluster adds further complexity: a virtual machine with a worker installed that handles storage. Third-party software such as NATS passes messages between the components, and the main testing mechanism is BATS, used for full system tests.
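For flavour, a full system test in BATS looks something like the sketch below. The spawnctl subcommands and flags here are illustrative placeholders, not Spawn's actual CLI surface:

```bash
#!/usr/bin/env bats
# Minimal sketch of a full system test. The spawnctl subcommands and
# flags below are hypothetical stand-ins, not Spawn's real CLI.

setup() {
  # Give each test its own database so tests can run independently.
  DB_NAME="bats-test-db-$RANDOM"
}

@test "create a database and check it is reachable" {
  run spawnctl create data-container --name "$DB_NAME"
  [ "$status" -eq 0 ]

  run spawnctl get data-container "$DB_NAME"
  [ "$status" -eq 0 ]
  [[ "$output" == *"Running"* ]]
}

teardown() {
  # Clean up even if the test failed.
  spawnctl delete data-container "$DB_NAME" || true
}
```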

Problem

In the early days of Spawn development, the suite of BATS tests would be run by the developer against their running development instance of Spawn, consisting of VirtualBox, Vagrant and Minikube. Due to resource constraints on the development machine, running more than one test at a time was not possible; with a complete test run taking over two hours and blocking any other development work, this was not a long-term solution.

To offload testing from the development machines, the same process was replicated in an Azure DevOps pipeline: spinning up a VM, installing everything required, then running the application and the tests, all on the same VM. While this worked, it had several issues:

  • Tests took too long: 2.5+ hours to bring up infrastructure and run all tests
  • Differences from production: Minikube with everything on the same VM was not representative
  • Flaky infrastructure: components would often fail to create
  • Code for test infrastructure: a setup different to both dev and production meant extra code for bringing up/tearing down infrastructure, retrieving logs, etc.
Single VM testing process

Options

The time taken to get test results and the reliability of the test environment were the biggest drivers behind making a change. There were several options considered:

1. Bigger VM for Minikube

  • ✅ Would allow tests to be run in parallel, reducing testing time
  • ✅ Allow multiple test runs in parallel
  • ❌ Would expect the same test setup flakiness
  • ❌ Not representative of production
  • ❌ Continued need for extra infrastructure code

2. Create Kubernetes cluster per test run

  • ✅ Would allow tests to be run in parallel, reducing testing time
  • ✅ Allow multiple test runs in parallel
  • ✅ Representative of production
  • ❌ Need code to automate creation of cluster
  • ⏳ Would take time to build a usable cluster

3. Single Kubernetes testing cluster, one test run at a time

  • ✅ Would allow tests to be run in parallel, reducing testing time
  • ❌ Only one test run at a time
  • ✅ Expect same reliability as production
  • ✅ Representative of production
  • ✅ Majority of code reused from production pipelines

4. Single Kubernetes testing cluster, isolated deployments

  • ✅ Would allow tests to be run in parallel, reducing testing time
  • ✅ Allow multiple test runs in parallel
  • ✅ Expect same reliability as production
  • ✅ Representative of production
  • ✅ Majority of code reused from production pipelines
  • ❌ Production code changes needed

Solution

Given the above options, we chose to pursue a single testing Kubernetes cluster that could be deployed to multiple times in parallel. This approach ticked all the boxes, with the only potential downside being that changes would need to be made to production code. In the main, this meant removing hard-coded namespace values from some of the services, which wasn't a bad thing. We made some decisions before starting the work (a sketch of the resulting per-run deployment follows the diagram below):

  • Single Kubernetes testing cluster, isolated deployments
  • Cluster always running
  • VM always running for storage worker
  • Additional services (logging, monitoring, etc.) installed into the cluster and always running
  • Same deployment mechanism as production
  • Application deployed into unique namespace(s) per test run
  • Dependencies deployed with the application (database, NATS, etc.)
Kubernetes cluster with isolated deployment per namespace
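To make the isolated-deployment idea concrete, here is a minimal sketch of what a per-run deployment script might look like. The namespace prefix, manifest path and NATS Helm chart are assumptions for illustration; the real pipeline reuses the production deployment code:

```bash
# Sketch: deploy one isolated instance of the application per test run.
# BUILD_BUILDID is Azure DevOps' build id; fall back to a random id so
# the script also works locally.
RUN_ID="${BUILD_BUILDID:-local-$RANDOM}"
NAMESPACE="spawn-test-$RUN_ID"

kubectl create namespace "$NAMESPACE"

# Dependencies are deployed alongside the application, inside the same
# namespace (assumes the NATS chart repo has been added; chart name and
# manifest path are illustrative).
helm install "nats-$RUN_ID" nats/nats --namespace "$NAMESPACE"
kubectl apply --namespace "$NAMESPACE" -f ./deploy/

# Run the suite against this namespace, then tear the whole thing down.
NAMESPACE="$NAMESPACE" bats ./tests
kubectl delete namespace "$NAMESPACE"
```

Note that `kubectl apply --namespace` only works cleanly if the manifests don't set their own namespace, which is exactly why the hard-coded namespace values had to go.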

Results

The results of this work have been overwhelmingly positive, with the highlights being:

Test run time reduced to 25–30 mins

  • Quicker build and deployment (10–15 mins) and parallelised tests mean test results are available in 25–30 minutes, compared to 150–180 minutes
  • Test run speed is now limited by a single, longer-running test
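As a side note, bats-core can run test files in parallel when GNU parallel is installed; a minimal invocation (the job count and path are illustrative):

```bash
# Run up to 8 test files concurrently
# (requires bats-core and GNU parallel on the PATH).
bats --jobs 8 ./tests
```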

More consistent results

  • Reduced issues with test setup and the tests themselves have made test runs more consistent
  • Still work to do on some shared cluster resources

Found and fixed issues with Spawn caused by load

  • Some technical debt around scaling that had been put off was paid off
  • Uncovered and fixed issues that would have been hit in the future as the service scaled

Removed hard-coded namespaces from code

  • Required to allow multiple deployments to the same cluster; see the sketch below
  • Might come in useful later down the line:
      • Blue/Green deployments
      • Customer environments
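We haven't shown the actual service changes, but the general shape of replacing a hard-coded namespace is to resolve it at runtime instead; a hypothetical shell-level sketch:

```bash
# Hypothetical sketch: resolve the namespace at runtime rather than hard
# coding it. In-cluster pods can read their own namespace from the
# service account mount, or have it injected as POD_NAMESPACE via the
# Kubernetes downward API (fieldRef: metadata.namespace).
NAMESPACE="${POD_NAMESPACE:-$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)}"

# Dependencies are then addressed by namespace-qualified DNS names
# rather than assuming a fixed namespace, e.g. for NATS:
NATS_URL="nats://nats.${NAMESPACE}.svc.cluster.local:4222"
```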

Catching more deployment issues before merge

  • A more representative version of production, using the same deployment mechanism
  • Caught some issues previously missed with the old Minikube tests

More likely to run tests before merge

  • Having to wait 3 hours for test results often led to them not being run
  • Reduced time and improved consistency mean they are run for every product-related PR

Added confidence in Kubernetes upgrades

  • Upgrading Kubernetes versions on the testing cluster and running tests multiple times increased confidence when upgrading production
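In practice that means upgrading the testing cluster first and re-running the suite a few times before touching production; assuming an AKS cluster, something like:

```bash
# Hypothetical: upgrade the testing cluster first (resource group,
# cluster name and version are placeholders), then re-run the suite.
az aks upgrade \
  --resource-group spawn-testing \
  --name spawn-test-cluster \
  --kubernetes-version 1.20.7

bats --jobs 8 ./tests
```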

Pinned down and fixed occasional bugs

  • Occasional test failures became a code problem, rather than a testing environment problem
  • A stable cluster integrated with our usual logging tools made it easier to track down and fix these issues

Conclusion

While the headline result is a massive reduction in the time it takes for a test run to complete, the process has improved both Spawn itself as well as the development experience. Our lead time and confidence in code we are releasing have both improved, allowing us to deliver value to our users more quickly and consistently.

How are you testing your Kubernetes based applications? Is it time to move to a full size cluster if you aren’t already?

Spawn is currently open for beta users; get started here.
