Reducing Serverless Integration Test Runtime on CI — From 40 minutes to 7
Some napkin maths: if each developer on a three-person team ran the tests 5 times per ticket for 2 tickets a day, then every minute of CI runtime costs about 130 dev hours per year (assuming roughly 260 working days).
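The napkin maths can be checked in a few lines of Python. The team size, runs per ticket and tickets per day come from the sentence above; the 260 working days a year (52 weeks × 5 days) is an assumption that the 130-hour figure implies.

```python
# Napkin maths: yearly cost of one CI minute for a small team.
# WORKING_DAYS_PER_YEAR is an assumption (52 weeks x 5 days).
DEVS = 3
RUNS_PER_TICKET = 5
TICKETS_PER_DAY = 2
WORKING_DAYS_PER_YEAR = 260


def dev_hours_per_ci_minute(devs, runs_per_ticket, tickets_per_day, working_days):
    """Developer hours spent waiting per year for each minute of CI runtime."""
    runs_per_day = devs * runs_per_ticket * tickets_per_day
    minutes_per_year = runs_per_day * working_days  # one minute waited per run
    return minutes_per_year / 60


hours = dev_hours_per_ci_minute(
    DEVS, RUNS_PER_TICKET, TICKETS_PER_DAY, WORKING_DAYS_PER_YEAR
)
print(f"Each CI minute costs about {hours:.0f} dev hours per year")
```

Swapping in your own team size and habits gives a quick feel for how much a faster pipeline (or a pricier CI plan) is worth.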
Recently I was working on a Serverless project which ran Integration tests on its microservices in its CI pipeline.
In particular, we ran BDD (Behaviour Driven Development) tests which, for those unfamiliar, might look like this:
Scenario: I update the job of a user
Given a set of users
| name | job |
| Rob | Developer |
| Simba | Outcast |
| Woody | Friend |
When I update the “job” of the user named “Simba” to “King”
And I fetch the list of users
Then I will see a list with the following attributes
| name | job |
| Rob | Developer |
| Simba | King |
| Woody | Friend |
These scenarios were written in a business-friendly language, where each instruction was tied to a function that carried out the logic it expressed.
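To illustrate how an instruction can be tied to a function, here is a minimal sketch of a step registry in plain Python. It is not the framework we used: `step`, `run_step` and the in-memory `context` are hypothetical stand-ins, and a real step would call the deployed service rather than edit a dictionary.

```python
import re

# Hypothetical mini step registry: maps step text patterns to Python functions.
STEPS = []


def step(pattern):
    """Register a function as the implementation of steps matching `pattern`."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator


def run_step(context, text):
    """Find the first registered pattern matching `text` and call its function."""
    for pattern, func in STEPS:
        match = pattern.fullmatch(text)
        if match:
            return func(context, *match.groups())
    raise LookupError(f"No step definition for: {text}")


@step(r'I update the "(\w+)" of the user named "(\w+)" to "(\w+)"')
def update_user(context, field, name, value):
    # A real step would call the service under test; here we edit a dict.
    context["users"][name][field] = value


@step(r"I fetch the list of users")
def fetch_users(context):
    context["result"] = [
        {"name": name, **attrs} for name, attrs in context["users"].items()
    ]


context = {"users": {"Rob": {"job": "Developer"}, "Simba": {"job": "Outcast"}}}
run_step(context, 'I update the "job" of the user named "Simba" to "King"')
run_step(context, "I fetch the list of users")
print(context["result"])
```

The captured groups in each pattern become the function's arguments, which is what lets one function serve many scenarios.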
The tests ran against an actual environment and made real calls to our service.
Therefore, we needed a new environment every time we ran our tests to avoid concurrent tests making conflicting calls (e.g. one test deletes a user while another expects to see that user).
Luckily, with a Serverless approach, creating new environments is straightforward.
While these tests are valuable, the common counterargument is that they take too much time to set up, seed and run.
Before and After
We ran into this issue on our project where our Integration tests were taking over 40 minutes to run on our CI.
The negatives of this are obvious: backend features took significantly longer to ship, a flaky test could cost us hours, and code reviews became a bit laxer to avoid re-running the tests.
So we tackled this problem and managed to reduce our BDD Integration test time from roughly 40 minutes to under 7.
We used a few different tactics to achieve this:
- Reusing existing environments (~17 mins saved in deploy/teardown)
- Parallelising the BDD tests (~15 mins saved in test)
- Config Tweaks (~2 mins saved overall)
Reusing existing environments/stacks (~17 mins saved in deploy/teardown)
The creation of our stacks was taking up to 18 minutes per CI build. Granted, this is an above-average creation time, as we were assigning a custom domain name to each of our stacks (which has other benefits).
Regardless, the creation of a new stack takes significantly longer than an update to an existing stack, indicating we could save time by reusing stacks.
Lock Table of Available Stacks
The number of builds running at any one time varies, so to stay flexible we created a “lock” table of our stacks. In essence, it lets a build claim a stack that is currently available and lock it so that no other build can use it at the same time. This ended up saving us roughly 17 minutes per CI build.
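One way to make such a claim atomic, assuming the table lives in DynamoDB, is a conditional update inside a transactional write. The sketch below only builds the request; the table name and the `stackName` and `available` attributes are hypothetical and not taken from the actual service.

```python
def build_claim_request(table_name, stack_name):
    """Build kwargs for a DynamoDB TransactWriteItems call that claims a stack.

    The ConditionExpression makes the write fail if another build has already
    set `available` to false, so only one build can win the claim.
    Table and attribute names are hypothetical.
    """
    return {
        "TransactItems": [
            {
                "Update": {
                    "TableName": table_name,
                    "Key": {"stackName": {"S": stack_name}},
                    "UpdateExpression": "SET #avail = :claimed",
                    "ConditionExpression": "#avail = :free",
                    "ExpressionAttributeNames": {"#avail": "available"},
                    "ExpressionAttributeValues": {
                        ":claimed": {"BOOL": False},
                        ":free": {"BOOL": True},
                    },
                }
            }
        ]
    }


request = build_claim_request("bdd-lock-table", "stack-a1b2c3")
print(request["TransactItems"][0]["Update"]["ConditionExpression"])

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("dynamodb").transact_write_items(**request)
# A build that loses the race gets a TransactionCanceledException
# and should try another stack.
```

Putting the availability check in the write itself, rather than reading first and writing second, is what stops two builds from claiming the same stack.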
How it works
You deploy the Serverless microservice (4 lambdas and a DynamoDB table) from the bdd-lock-table repo. The table stores your available stacks.
When your CI runs and needs a stack, it checks whether there are any available stacks for its repo (by calling a lambda on the /get-available route).
If there is:
- It claims a stack (/claim-stack) and marks it as not available (using a transact write to ensure only one job can claim a stack at a time)
- It then updates the stack with its code and runs its tests
- After (success or fail) it marks the stack as available again (/release-stack)
If there isn't:
- It creates a new entry in the table with a random stack name (/create-stack)
- It creates a new stack with this name and runs its tests
- After (success or fail) it marks the stack as available so that future builds can also use this new stack (/release-stack)
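The steps above can be sketched with an in-memory stand-in for the lock table. This is an illustration, not the code from the repo: `InMemoryLockTable` and `run_build` are hypothetical names, the separate /get-available and /claim-stack calls are folded into one method, and a `threading.Lock` plays the role of the transactional write.

```python
import threading
import uuid


class InMemoryLockTable:
    """Stand-in for the lock table service; the real one sits behind HTTP routes."""

    def __init__(self):
        self._stacks = {}  # stack name -> True if available
        self._lock = threading.Lock()  # plays the role of the transactional write

    def claim_or_create(self):
        """Claim a free stack, or register a new one. Returns (name, is_new)."""
        with self._lock:
            for name, available in self._stacks.items():
                if available:
                    self._stacks[name] = False  # corresponds to /claim-stack
                    return name, False
            name = f"stack-{uuid.uuid4().hex[:8]}"  # corresponds to /create-stack
            self._stacks[name] = False
            return name, True

    def release(self, name):
        with self._lock:
            self._stacks[name] = True  # corresponds to /release-stack


def run_build(table, deploy, run_tests):
    """One CI job: get a stack, deploy to it, test, and always release it."""
    name, is_new = table.claim_or_create()
    try:
        deploy(name, "create" if is_new else "update")
        return run_tests(name)
    finally:
        table.release(name)  # runs on success or failure


deploys = []
table = InMemoryLockTable()
for _ in range(3):  # three builds, one after another
    run_build(table, lambda name, kind: deploys.append(kind), lambda name: True)
print(deploys)  # ['create', 'update', 'update']
```

Only the first build pays for a create deploy; the later ones find the released stack and update it. The `finally` block matters: a failed test run must still release its stack, or the pool slowly drains.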
After your CI has run for a few days/weeks with normal development, it should have experienced a busy period and therefore have enough stacks created (which are cheap with Serverless) that any new job can always claim an existing stack.
The deploy required for any Integration test will now nearly always be an update deploy rather than a create deploy.
Implementing it for your project
I’ve created two repos with more information about how to set this up.
- The code to create your own lock table service can be found in the bdd-lock-table repo. It contains a Postman collection and a sample script to run in your CI build, which executes the above logic.
- An example setup of a repo that uses the table in its CI can be found in the sls-bdd-python-optimised-ci repo.
Parallelising the BDD tests (~15 mins saved in test)
As our service grew over time, so did the time the tests themselves took to run, so we looked at parallelising them.
Parallelisation of any job running on CI can be a quick win, not just for long running integration tests.
Implementing in CircleCI
We were using CircleCI, which provides a good toolset to enable parallelisation. The “parallelism” key runs the same job on a given number of containers.
Note: it will depend on your plan whether this is available to you (see my napkin maths at the top of this article to see if the upgraded plan is worth it to you).
- image: circleci/python:3.7.4-node
name: Run BDD tests
CircleCI provides a “tests” command which will deal with ensuring each container runs different tests.
First, it allows you to specify a glob pattern to determine which files are relevant for testing.
There are a number of provided methods for splitting the files among the containers. We chose the “--split-by=filesize” method, since file size is generally a good indicator of how long a BDD test will take.
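If you had a “yarn test:bdd” command, you could run the following in your “ci-bdd.sh” script. Every container runs the same script, but each one receives a different subset of the feature files (this assumes your test:bdd script reads the file list from standard input; otherwise pass the list as arguments, for example via xargs):
circleci tests glob "features/**/*.feature" | circleci tests split --split-by=filesize | yarn test:bdd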
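To make the splitting idea concrete, here is a rough sketch of dividing feature files among containers by size. It is a simple greedy approximation, not CircleCI's actual algorithm, and the file names and sizes are made up. CircleCI does expose each container's position through the `CIRCLE_NODE_INDEX` and `CIRCLE_NODE_TOTAL` environment variables; the defaults below are only there so the sketch runs locally.

```python
import os


def split_by_size(file_sizes, total_nodes):
    """Greedily assign files, largest first, to the currently lightest bucket."""
    buckets = [[] for _ in range(total_nodes)]
    loads = [0] * total_nodes
    # Sort by size descending, then by name, so every container
    # computes exactly the same split.
    ordered = sorted(file_sizes.items(), key=lambda item: (-item[1], item[0]))
    for name, size in ordered:
        lightest = loads.index(min(loads))
        buckets[lightest].append(name)
        loads[lightest] += size
    return buckets


# Made-up feature files and sizes in bytes.
sizes = {
    "features/users.feature": 900,
    "features/jobs.feature": 500,
    "features/auth.feature": 400,
    "features/search.feature": 300,
}

node_total = int(os.environ.get("CIRCLE_NODE_TOTAL", "2"))
node_index = int(os.environ.get("CIRCLE_NODE_INDEX", "0"))
my_files = split_by_size(sizes, node_total)[node_index]
print(my_files)
```

Because the split is deterministic, each container can compute it independently and pick its own bucket without any coordination.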
In our case we had 10 parallel jobs which would each claim an existing stack, do an update deploy, run 10% of the tests and then release the stack.
This cut the deploy and test part of our pipeline from 20 minutes to about 6.
We have cut the time to run tests on our Pull Requests from about 40 minutes to 7 minutes.
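If each developer did one backend ticket a day, this would save each of them about 15–20 working days a year, even with a single CI run per ticket (33 minutes × roughly 260 working days ≈ 143 hours, or about 18 eight-hour days).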
Ours was an extreme case, but these two methods can shave valuable minutes off most CI pipelines.
Try them out to see how they help you.