Treat your Build Pipeline as a Product
How we applied product design principles to iterate on our build pipeline, saving up to 13 hours of engineer time each day.
This is a blog format of the talk I originally presented at Melbourne Ruby in January 2017. A recording of the 20min talk is on YouTube.
Engineers don’t like doing repetitive tasks that involve copious amounts of manual labour, especially manual testing. Most teams will eventually come up with some kind of automated testing, and most will have it set up to run automatically when code is pushed.
This approach is generally known as a build pipeline, or an aspect of Continuous Integration (CI).
However, the usability of a build pipeline differs significantly between teams — even within the same company. It’s often an afterthought, and everyone is terrified of making any changes to it in case it breaks. Or nobody really understands how it works because the person who set it up left the company years ago.
Builds get flakier. Build time starts to hit the half-hour mark. There’s not enough build server capacity, so your build doesn’t start for another three hours.
I know from personal experience that a build pipeline that’s slow and unreliable is a major drag on productivity and morale, which is why we’re trying something different at myDr: we treat our build pipeline as a product.
Why is a build pipeline a product?
I believe that everything that provides value to users is a product.
In this case, a build pipeline provides value to the team by automating tests and decreasing the likelihood of bugs hitting production, and thus the primary users of a build pipeline are the engineers.
Why treat the build pipeline as a product?
Product design principles and processes have worked well for creating quality products. Why not apply them to tooling?
First, let’s take a step back and think about the primary user flow for a build pipeline.
Build Pipeline User Flow
The user flow tends to look something like this:
- Push a change
- Wait for the build to complete
- If it’s broken, figure out what broke and why
- Fix the problem
- Push the fix
- Rinse and repeat until the build passes
With this in mind, let’s look at our first cut of the build pipeline for myDr Go. But first, a quick overview of our stack for context.
myDr Go Stack Overview
- Rails backend, deployed inside containers to a DC/OS cluster
- React frontend, deployed to mobile platforms via Cordova. ES6, Flow, GraphQL, styled-components
- Buildkite for CI
I decided to focus on integration testing first as it provides the most confidence that all the pieces work together.
We initially selected Nightwatch, a JS integration testing tool, as it was popular and in active development; since the frontend is all JS, it seemed a natural fit.
We used docker-compose to boot the backend, the frontend, and the test container.
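That first iteration's orchestration looked roughly like this. This is a sketch only: the service names, paths, and images are illustrative, not our actual compose file.

```yaml
version: "2"
services:
  backend:                     # Rails API
    build: ./backend
  frontend:                    # React app, served for the tests
    build: ./frontend
  selenium:
    image: selenium/standalone-chrome
  tests:                       # integration test runner
    build: ./integration-tests
    depends_on:
      - backend
      - frontend
      - selenium
```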
There were many problems with this first iteration:
- It took a significant amount of effort for an engineer to understand what broke and why, as one needed to read the test source to understand what happened before it failed.
- The failure screenshots had to be downloaded separately to be viewed.
- The tests had to be run locally to see the JS errors.
- We were unable to test our core flow of a video consultation.
- Writing the test itself was painful. Needing to chain expectations and specify manual waits did not help with the usually-painful task of writing integration tests. Worst of all, a function named setValue would actually append a value to a field rather than simply set it, and the issue from 2014 was closed without a fix: https://github.com/nightwatchjs/nightwatch/issues/4.
From this iteration, we came up with some goals.
Goal: Decrease effort to understand failure.
Where “effort” is a combination of the time it takes, and the cognitive load required (i.e. how much thinking an engineer needs to do) to understand the failure.
Goal: Make writing tests not suck.
Writing integration tests tends to involve a painful cycle of writing, running, waiting, making a small change, and running again.
With the goals in mind, we went back to the drawing board and did a bunch of thinking to help us reach our goals. A week later, the pipeline looked like this:
The biggest (and probably most controversial) change was switching from Nightwatch to RSpec/Capybara for testing. It was a hard decision, but ultimately Capybara’s maturity and the ease of writing tests won out; our frontend engineers are comfortable writing the tests too, as they’re straightforward and less verbose.
We added step-by-step output via Ruby’s TracePoint API, which allows engineers to quickly understand what happened before the failure.
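As a rough illustration of the idea, here’s a minimal TracePoint sketch that records the calls leading up to a failure. The ConsultationPage class is hypothetical, and the real implementation filters on spec file paths rather than a single class:

```ruby
# A hypothetical page object standing in for real Capybara helpers.
class ConsultationPage
  def visit;       end
  def start_video; end
end

# Record every call on the page object while the "test" runs.
steps = []
trace = TracePoint.new(:call) do |tp|
  steps << "#{tp.defined_class}##{tp.method_id}" if tp.defined_class == ConsultationPage
end

page = ConsultationPage.new
trace.enable do
  page.visit
  page.start_video
end

p steps  # => ["ConsultationPage#visit", "ConsultationPage#start_video"]
```

On a failure, dumping a buffer like this gives the engineer the sequence of actions without reading the test source.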
We added inline screenshots, which allow engineers to see what the failure looks like visually without needing to run the tests locally.
We were finally able to automate a full consultation test (including video call over WebRTC) by using multiple sessions in Chrome via Selenium.
We also added the ability to interactively write tests via a console which accepts RSpec/Capybara commands and logs successful commands to a buffer which can be dumped and copied straight into a spec. This allowed me to write a 100+ line comprehensive spec that tests our “happy” user flow in a couple of hours and let me keep most of my sanity.
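A pared-down sketch of that console’s core trick: evaluate each command against the live session and keep only the lines that ran cleanly. SpecRecorder and its API are hypothetical names for illustration, not our actual implementation.

```ruby
# Hypothetical, pared-down version of the interactive test-writing console.
class SpecRecorder
  attr_reader :buffer

  def initialize(context)
    @context = context  # e.g. an object exposing Capybara's DSL
    @buffer  = []
  end

  # Evaluate a line of RSpec/Capybara code; record it only if it ran cleanly.
  def run(line)
    @context.instance_eval(line)
    @buffer << line
    true
  rescue StandardError
    false
  end

  # Dump the buffer, ready to paste into a spec.
  def dump
    @buffer.join("\n")
  end
end
```

In the real console, the context is the live Capybara session and a readline loop feeds run until the flow is complete.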
However, there were still issues.
The actual error and backtrace appeared at the end of the output (default RSpec behaviour), so we had to scroll through a lot of output to find the actual issue, making failures time-consuming to understand.
The build notifications were also an issue. Email is noisy enough as it is, and generally requires people to check their inbox rather than be notified. The default Buildkite Slack integration is noisy too, as it broadcasts everyone’s builds, not just your own.
With this, we added another goal:
Goal: Decrease the effort required to know the build failed.
Collapsed-by-default build output. We used Buildkite’s collapsible build output feature to hide successful tests, and gained test durations for free. See https://github.com/chendo/buildkite-rspec-formatter for more details.
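Buildkite treats output lines starting with "--- " as collapsed-by-default groups and "+++ " as expanded ones, so the formatter’s job boils down to something like this sketch (not the real gem’s code):

```ruby
# Not the real gem's code: a stripped-down illustration of emitting
# Buildkite section markers around each example's output.
class BuildkiteSectionOutput
  def initialize(io = $stdout)
    @io = io
  end

  # "--- " starts a group that is collapsed by default.
  def passed(description)
    @io.puts "--- #{description}"
  end

  # "+++ " starts a group that is expanded, so failures surface immediately.
  def failed(description, details)
    @io.puts "+++ #{description}"
    @io.puts details
  end
end
```

Hooked into RSpec as a custom formatter, this hides every green example behind a one-line heading while failures stay fully visible.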
Need-to-know Slack notifications. We built a bot that sends the relevant person a direct message in Slack when their build fails, when they’re mentioned on GitHub, or when someone comments on their PR.
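The Slack side of such a bot boils down to turning a build event into a chat.postMessage payload addressed to the engineer’s user ID. The event shape and the user lookup here are assumptions for illustration, not our bot’s actual code:

```ruby
# Sketch of the bot's Slack side: build a chat.postMessage payload that
# DMs the engineer who pushed. Event fields and IDs are illustrative.
require "json"

def build_failure_dm(event, slack_user_id)
  {
    channel: slack_user_id,  # posting to a user ID opens a direct message
    text: "Your build failed: #{event[:pipeline]} (#{event[:branch]}) #{event[:web_url]}"
  }
end

payload = build_failure_dm(
  { pipeline: "mydr-go", branch: "feature/video", web_url: "https://example.com/builds/1" },
  "U123ABC"
)
puts JSON.generate(payload)
```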
Again, there are still things to improve.
We still need to do manual testing to verify that nothing was broken visually.
Build time was starting to hit the 15-minute mark as we added more test coverage. We run multiple build agents per node, and as the frontend tests are far more CPU-intensive than the backend tests, concurrent frontend builds on the same host would grind it to a halt and tests began failing due to timeouts. The team could no longer trust the build result.
Thus, more goals.
Goal: Faster feedback cycle
This one shouldn’t require any explanation.
Goal: Must be able to trust results
The lack of confidence that errors were actual problems meant that engineers would assume that any build error they encountered was the fault of a flaky build, not an actual bug they introduced.
Goal: Prevent undesired visual changes
The less manual testing we have to do, the better.
After quite a bit of research, we came up with the following improvements.
Visual regression testing with Percy.io. There are many players in the visual regression space, but we needed one that worked well with our RSpec/Capybara stack.
Percy.io fit the bill as it has Capybara support, making it trivial to add to our pipeline.
Test parallelisation. We use Knapsack Pro to parallelise our test suite across multiple Buildkite agents, and restructured our 4-core build agents so that each host runs 3 normal agents and 2 tagged with cpu=high. All frontend integration tests are scheduled to run on cpu=high nodes, so we no longer have builds crawling to a halt due to over-provisioning. We also decreased build times from ~15min back down to ~5min.
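In Buildkite terms, the restructure amounts to tagging agents and targeting them from the pipeline steps, roughly like this (labels, commands, and parallelism here are illustrative, not our actual pipeline file):

```yaml
steps:
  - label: "frontend integration tests"
    command: "bin/ci/integration"
    parallelism: 4          # Knapsack Pro splits the suite across these jobs
    agents:
      cpu: high             # only schedule on the cpu=high-tagged agents
  - label: "backend specs"
    command: "bin/ci/rspec"
```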
It’s difficult to measure how much cognitive load these improvements have prevented, and we don’t have enough data points as to how much time we saved, but we can do some rough estimates:
- Inline steps before failure without looking at test source: 30s–2min effort saved
- View screenshots inline rather than downloading: 15–60s effort saved
Total: 2–7min saved per failure (rounded up to 2min)
- Test parallelisation: 5–10min saved per build
- Collapsed build output: 30–60s saved scanning output per build
- Build notifications rather than ‘polling’: 30s–5min waiting saved per build
- Automated consultation call testing: 5–10min manual testing saved per build
Total: 11–26min saved per build.
- Interactive test writing: 5–30min+ running and waiting saved per new test
- PR/issue notifications rather than ‘polling’: 30s–5min waiting saved per PR/issue comment
- Visual regression testing: 20min–1hr error-prone manual testing saved per release
In the last 30 days, we ran 633 builds for an average of 31.65 builds per working day. Let’s round it down to 31.
If our estimations are correct, then the improved pipeline saves us 5–13 hours per day. Even if we assume that all the builds are green, and we only need to do the automated call testing on a release, we’re still saving at least 2.8 hours per day.
Let’s assume 20% of builds fail, with an average of 3 failures each. That’s 37min–2hr saved on just understanding errors alone.
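Those numbers follow from the per-failure estimate above; a quick calculation (with the high end rounding down to the ~2hr in the text):

```ruby
# Rough estimate check: ~31 builds/day, 20% failing, ~3 failures per
# failed build, 2-7 minutes of understanding-effort saved per failure.
failed_builds = 31 * 0.20           # ~6.2 failing builds per day
failures      = failed_builds * 3   # ~18.6 failures per day
low_minutes   = failures * 2
high_minutes  = failures * 7
puts "#{low_minutes.round}min to #{(high_minutes / 60.0).round(1)}hr saved per day"
```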
Applying product design principles to our build pipeline enabled us to think critically about the goals we want from it. With this knowledge, spending a small portion of our development capacity on improving the build pipeline let us save up to 13 hours a day and keep our engineers happy.
Give your tooling some love. It’s worth it!