Treat your Build Pipeline as a Product

How we applied product design principles to iterate on our build pipeline, saving up to 13 hours of engineer time each day.

This is a blog format of the talk I originally presented at Melbourne Ruby in January 2017. A recording of the 20min talk is on YouTube.

Engineers don’t like doing repetitive tasks that involve copious amounts of manual labour, especially performing manual testing. Most teams will eventually come up with some kind of automated testing, and most will have this set up to run automatically when code is pushed.

This approach is generally known as a build pipeline, or an aspect of Continuous Integration (CI).

However, the usability of a build pipeline differs significantly between teams — even within the same company. It’s often an afterthought, and everyone is terrified of making any changes to it in case it breaks. Or nobody really understands how it works because the person who set it up left the company years ago.

Builds get flakier. Build times start to hit the half-hour mark. There’s not enough build server capacity, so your build doesn’t start for another three hours.

Sound familiar?

I know from personal experience that a build pipeline that’s slow and unreliable is a major drag on productivity and morale, which is why we’re trying something different at myDr: we treat our build pipeline as a product.


Why is a build pipeline a product?

I believe that everything that provides value to users is a product.

In this case, a build pipeline provides value to the team by automating tests and decreasing the likelihood of bugs hitting production, and thus the primary users of a build pipeline are the engineers.

Why treat the build pipeline as a product?

Product design principles and processes have worked well for creating quality products. Why not apply them to tooling?

First, let’s take a step back and think about the primary user flow for a build pipeline.

Build Pipeline User Flow

The user flow tends to look something like this:

  • Push a change
  • Wait for the build to complete
  • If it’s broken, figure out what broke and why
  • Fix the problem
  • Push the fix
  • Rinse and repeat until the build passes

With this in mind, let’s look at our first cut of the build pipeline for myDr Go. But first, a quick overview of our stack for context.

myDr Go Stack Overview

  • Rails backend, deployed inside containers to a DC/OS cluster
  • React frontend, deployed to mobile platforms via Cordova. ES6, Flow, GraphQL, styled-components
  • Buildkite for CI

Iteration #1

I decided to focus on integration testing first as it provides the most confidence that all the pieces work together.

We initially selected Nightwatch, a JS integration testing tool, as it was popular and in active development; it also made sense because our frontend is all JS.

We use docker-compose to boot the backend, frontend, test container, and selenium.
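As a rough illustration, the compose file might look something like this (an assumed sketch, not our actual config; the service names, images, and paths are all illustrative):

```yaml
# Illustrative docker-compose sketch: backend, frontend, selenium,
# and a test-runner container wired together. Names are hypothetical.
version: "2"
services:
  backend:
    build: ./backend
    environment:
      RAILS_ENV: test
  frontend:
    build: ./frontend
    depends_on:
      - backend
  selenium:
    image: selenium/standalone-chrome
  tests:
    build: ./integration-tests
    depends_on:
      - frontend
      - selenium
    environment:
      SELENIUM_URL: http://selenium:4444/wd/hub
```

The useful property is that one `docker-compose up` boots the whole system the same way locally and on CI.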

Nightwatch output

There were many problems with this first iteration:

  • It took a significant amount of effort for an engineer to understand what broke and why, as one needed to read the test source to understand what happened before it failed.
  • The failure screenshots had to be downloaded separately to be viewed.
  • The tests had to be run locally to see the JS errors.
  • We were unable to test our core flow of a video consultation.
  • Writing the test itself was painful. Needing to chain expectations and specify manual waits did not help with the usually-painful task of writing integration tests. Worst of all, a function named setValue would actually append a value to a field rather than simply set it, and the issue from 2014 was closed without a fix: https://github.com/nightwatchjs/nightwatch/issues/4.

From this iteration, we came up with some goals.

Goal: Decrease effort to understand failure.

Where “effort” is a combination of the time it takes, and the cognitive load required (i.e. how much thinking an engineer needs to do) to understand the failure.

Goal: Make writing tests not suck.

Writing integration tests tends to involve a painful cycle: write, run, wait, make a small change, and run again.

Iteration #2

With the goals in mind, we went back to the drawing board and did a bunch of thinking to help us reach our goals. A week later, the pipeline looked like this:

Iteration #2 build output

The biggest (and probably most controversial) change was switching from Nightwatch to RSpec/Capybara for testing. It was a hard decision, but ultimately Capybara’s maturity and the ease of writing tests won out. Our frontend engineers are comfortable writing the tests too, as they’re straightforward and less verbose.

We added JavaScript console output, by far the most useful feature of this iteration: the error in the log was generally enough information for the engineer to fix the problem without re-running the test locally.

We added step-by-step output via Ruby’s TracePoint API, which allows engineers to quickly understand what happened before the failure.
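The mechanism can be sketched roughly like this (a stripped-down stand-alone version, not our actual formatter, which scopes the trace to spec files and attaches it to the failing example’s output):

```ruby
# Minimal sketch of step tracing with Ruby's TracePoint.
STEPS = []

tracer = TracePoint.new(:line) do |tp|
  # Only record lines from this file, not from library internals.
  STEPS << "#{File.basename(tp.path)}:#{tp.lineno}" if tp.path == __FILE__
end

tracer.enable do
  user  = "jane@example.com" # imagine these lines are Capybara steps
  total = user.length + 1
end

# After a failure, the tail of STEPS shows what executed beforehand.
puts STEPS.join("\n")
```

When a test blows up, the last few recorded entries are exactly the “what happened before it failed” context an engineer would otherwise reconstruct by reading the test source.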

We added inline screenshots, which allow engineers to see what the failure looks like visually without needing to run the test locally.

We were finally able to automate a full consultation test (including video call over WebRTC) by using multiple sessions in Chrome via Selenium.

Old screenshot of an automated multi-session test that ensures WebRTC video calls work.

We also added the ability to interactively write tests via a console which accepts RSpec/Capybara commands and logs successful commands to a buffer which can be dumped and copied straight into a spec. This allowed me to write a 100+ line comprehensive spec that tests our “happy” user flow in a couple of hours and let me keep most of my sanity.
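The core of that console can be sketched as follows (a simplified, hypothetical stand-in: the real console evaluates Capybara/RSpec commands against a live session, while this version just evaluates plain Ruby against a binding):

```ruby
# Sketch of an interactive test-writing console: evaluate each command,
# and keep only the ones that succeed in a buffer for pasting into a spec.
class SpecRecorder
  attr_reader :buffer

  def initialize(context = binding)
    @context = context
    @buffer = []
  end

  # Evaluate a line of Ruby; record it only if it didn't raise.
  def run(line)
    result = @context.eval(line)
    @buffer << line
    result
  rescue StandardError => e
    warn "skipped (#{e.class}): #{line}"
    nil
  end

  # Dump the successful commands, ready to paste into a spec body.
  def dump
    buffer.join("\n")
  end
end

recorder = SpecRecorder.new
recorder.run("a = 2 + 2")      # succeeds, recorded
recorder.run("undefined_call") # raises NameError, skipped
recorder.run("a * 10")         # succeeds, recorded
puts recorder.dump
```

Because failed commands never make it into the buffer, the dump is a clean sequence of steps that are already known to work.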

However, there were still issues.

The actual error and backtrace were at the end of the output (default RSpec behaviour), so we had to scroll through a lot of output to find the real issue, making errors time-consuming to understand.

The build notifications were also an issue. Email is noisy enough as it is, and generally relies on people checking their inbox rather than being actively notified. The default Buildkite Slack integration is noisy too, as it posts everyone’s builds, not just your own.

With this, we added another goal:

Goal: Decrease the effort required to know the build failed.

Iteration #3

Collapsed-by-default build output. We used Buildkite’s collapsible build output feature to hide successful tests, and gained test durations for free. See https://github.com/chendo/buildkite-rspec-formatter for more details.

Only see output for tests that failed.
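The underlying trick is Buildkite’s log grouping convention: any log line starting with “--- ” opens a collapsed group, and “+++ ” opens an expanded one. A stand-alone sketch of the idea (the real version hooks into RSpec’s formatter API; this illustrative class just shows the log convention):

```ruby
require "stringio"

# Sketch: emit Buildkite group markers so passing examples collapse
# and failing examples stay expanded with their output visible.
class BuildkiteSections
  def initialize(io = $stdout)
    @io = io
  end

  def example_finished(description, passed:, output: "")
    marker = passed ? "---" : "+++"
    @io.puts "#{marker} #{description}"
    @io.puts output unless output.empty?
  end
end

log = StringIO.new
sections = BuildkiteSections.new(log)
sections.example_finished("signs in successfully", passed: true)
sections.example_finished("books a consultation", passed: false,
                          output: "expected to see 'Confirmed'")
puts log.string
```

Buildkite renders each marker line as a clickable group header, so green tests fold away and failures are front and centre.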

Need-to-know Slack notifications. We built a bot that sends the relevant person a direct message in Slack when their build fails, when they’re mentioned on GitHub, or when someone comments on their PR.

Targeted issue and build notifications on Slack.
Pull request comment with code context.

Per-step JavaScript logs. Rather than trying to figure out which line caused what output, show it after each step.

Per-step JavaScript console output.
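The draining logic looks roughly like this. With the Selenium Chrome driver, `browser.manage.logs.get(:browser)` returns the console entries accumulated since the last call, so calling it after every step attributes output to the step that produced it (the fake browser objects below are stand-ins so the sketch runs without a real browser):

```ruby
# Print any browser console entries produced since the last call.
def print_console_output(browser, io: $stdout)
  browser.manage.logs.get(:browser).each do |entry|
    io.puts "  [js:#{entry.level}] #{entry.message}"
  end
end

# Stand-in objects mimicking the Selenium log API for this sketch:
LogEntry = Struct.new(:level, :message)

class FakeLogs
  def get(_type)
    [LogEntry.new("SEVERE", "TypeError: undefined is not a function")]
  end
end

class FakeManage
  def logs
    FakeLogs.new
  end
end

class FakeBrowser
  def manage
    FakeManage.new
  end
end

print_console_output(FakeBrowser.new)
```

Hooked in after each step, this interleaves JS errors with the step that triggered them, instead of dumping everything at the end.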

Again, there are still things to improve.

We still need to do manual testing to verify that nothing was broken visually.

Build time was starting to hit the 15-minute mark as we added more test coverage. We run multiple build agents per node, and as the frontend tests are very CPU-intensive compared to the backend tests, concurrent frontend builds on the same host would grind it to a halt, causing tests to fail due to timeouts. The team could no longer trust the build result.

Thus, more goals.

Goal: Faster feedback cycle

This one shouldn’t require any explanation.

Goal: Must be able to trust results

The lack of confidence that errors were actual problems meant that engineers would assume that any build error they encountered was the fault of a flaky build, not an actual bug they introduced.

Goal: Prevent undesired visual changes

The less manual testing we have to do, the better.

Iteration #4

After quite a bit of research, we came up with the following improvements.

Visual regression testing with Percy.io. There are many players in the visual regression space, but we needed one that worked well with our RSpec/Capybara stack.

Percy.io fit the bill as it has Capybara support, making it trivial to add to our pipeline.

Percy’s delta view where it highlights visual changes.
Clicking on the right screenshot toggles between diff and actual view.
What a spec looks like with Percy.io snapshots and multi-session helpers.

Test parallelisation. We use Knapsack Pro to parallelise our test suite across multiple Buildkite agents, and restructured our 4-core build agents so that each host runs 3 normal agents, and 2 tagged with cpu=high. All frontend integration tests are scheduled to run on cpu=high nodes, so we no longer have builds crawling to a halt due to over-provisioning. We also decreased build times from ~15min back down to ~5min.
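In Buildkite pipeline terms, the setup looks roughly like this (an illustrative fragment, not our exact config; Knapsack Pro picks up the parallel job index from Buildkite’s environment variables, and the `cpu: high` agent tag matches how we tagged the heavier agents):

```yaml
# Illustrative Buildkite pipeline fragment; labels and counts are assumptions.
steps:
  - label: "backend specs"
    command: bundle exec rake knapsack_pro:rspec
    parallelism: 3
  - label: "frontend integration specs"
    command: bundle exec rake knapsack_pro:rspec
    parallelism: 2
    agents:
      cpu: "high"   # only scheduled on agents tagged cpu=high
```

The agent tag is what stops CPU-hungry frontend builds from piling onto the same host and starving each other.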

Impact

It’s difficult to measure how much cognitive load these improvements have prevented, and we don’t have enough data points as to how much time we saved, but we can do some rough estimates:

Per failure:

  • Inline steps before failure without looking at test source: 30s–2min effort saved
  • View screenshots inline rather than downloading: 15–60s effort saved
  • JavaScript console/errors in output rather than running test locally, connecting to VNC and inspecting console: 60s–5min running and waiting saved

Total: 2–7min saved per failure (lower bound rounded up to 2min)

Per build:

  • Test parallelisation: 5–10min saved per build
  • Collapsed build output: 30–60s saved scanning output per build
  • Build notifications rather than ‘polling’: 30s–5min waiting saved per build
  • Automated consultation call testing: 5–10min manual testing saved per build

Total: 11–26min saved per build.

Others:

  • Interactive test writing: 5–30min+ running and waiting saved per new test
  • PR/issue notifications rather than ‘polling’: 30s–5min waiting saved per PR/issue comment
  • Visual regression testing: 20min–1hr error-prone manual testing saved per release

In the last 30 days, we ran 633 builds for an average of 31.65 builds per working day. Let’s round it down to 31.

If our estimations are correct, then the improved pipeline saves us 5–13 hours per day. Even if we assume that all the builds are green, and we only need to do the automated call testing on a release, we’re still saving at least 2.8 hours per day.

Let’s assume 20% of the builds fail, with an average of 3 failures each. That’s 37min–2hr saved on understanding errors alone.
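Those estimates are easy to sanity-check with some back-of-the-envelope arithmetic, using the 31-builds-per-day figure from above:

```ruby
# Back-of-the-envelope check of the savings estimates.
builds_per_day = 31

# Per-build savings of 11-26 minutes, converted to hours/day:
per_build = [11, 26].map { |m| (builds_per_day * m) / 60.0 }
puts "per-build savings: #{per_build.map { |h| h.round(1) }.join('-')} hours/day"

# 20% of builds failing with ~3 failures each, at 2-7 minutes per failure:
failures_per_day = builds_per_day * 0.2 * 3
per_failure = [2, 7].map { |m| (failures_per_day * m).round }
puts "failure-understanding savings: #{per_failure.join('-')} minutes/day"
```

The per-build items alone land at roughly 5.7–13.4 hours per day, which is where the “5–13 hours” headline figure comes from.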

Summary

Applying product design principles to our build pipeline enabled us to think critically about what we want from it. With this knowledge, spending a small portion of our development capacity on improving the build pipeline let us save up to 13 hours a day and keep our engineers happy.

Give your tooling some love. It’s worth it!