Reducing Inertia’s continuous integration build time by over 75% using build stages, caching, and job concurrency

Robert Lin
Jun 10, 2018 · 7 min read

Travis CI is the continuous integration tool most of the teams at UBC Launch Pad — UBC’s student-run software engineering club — use to handle their projects’ CI needs. We currently have 14 Travis-enabled projects, 8 of which have been active in the last month.

At Launch Pad, I have mostly been working on Inertia, our in-house continuous deployment tool. Ever since we first set up a Travis CI pipeline, Inertia has had rather long build times: there are a number of integration tests in the codebase, and the build process involves several Docker image builds that can take quite a while.

Yesterday, however, Launch Pad’s new Programming Language Team came over and asked if we could cancel a few of our builds, since they seemed to be holding up the team’s pull request builds. That’s when I realized just how long Inertia builds were taking: upwards of 90 minutes total. Yikes.

This post will go over the steps I took to remedy this situation and talk about how to take advantage of Travis’s caching, build stages, and job concurrency features.

Contents

  • Inertia’s CI Pipeline
  • Getting Rid of Jobs
  • Caching Assets via Build Stages
  • Taking Advantage of Job Concurrency
  • Results

🚚 Inertia’s CI Pipeline

For some context, our old build procedure went something like this:

  1. Install dependencies and utilities like a coverage reporter and set of binaries used for linting
  2. Build and start a Docker image for a “mock virtual private server” for e2e testing of our bootstrap process
  3. Build an Inertia daemon Docker image, also for testing our bootstrap process
  4. Execute tests, including unit tests, bootstrap tests, and build simulations, which together take about 7 minutes

This procedure is repeated for a matrix of environment variables, each representing a VPS platform we wanted to support, defined in our .travis.yml configuration:

env:
  matrix:
    - VPS_OS=ubuntu VERSION=16.04
    - VPS_OS=debian VERSION=9.3
    - VPS_OS=centos VERSION=7
    - VPS_OS=amazon VERSION=latest
    - VPS_OS=ubuntu VERSION=latest
    - VPS_OS=debian VERSION=latest
    - VPS_OS=centos VERSION=latest
# allow_failures belongs under the top-level matrix key, not under env
matrix:
  allow_failures:
    - env: VPS_OS=ubuntu VERSION=latest
    - env: VPS_OS=debian VERSION=latest
    - env: VPS_OS=centos VERSION=latest

Each of these entries generates a “job”, each of which runs concurrently (up to a limit) and does the same thing. The allow_failures entry represents jobs that will run, but are allowed to fail without marking the entire build as failed.

This means that for each of these entries, Travis runs through our entire pipeline, even though the only difference is in a single bootstrap test. It also runs our linter and builds an Inertia daemon Docker image once per job, even though both are identical across jobs. We also realized the allow_failures jobs were a bit useless: they never really failed for any legitimate reason, and most VPS services did not offer such bleeding-edge operating systems anyway. This gave me a good place to start.

🏃 Reducing Travis Build Runtime

1. Getting Rid of Jobs 😱

This first step was pretty straightforward: removing the allow_failures jobs from our build matrix. This easily shaved more than 10 minutes off the time a typical build takes to finish and lopped a whopping 36 or so minutes off each build’s total runtime (the sum across all jobs). 🔪 While these builds were nice to have, they weren’t really contributing any real development value, so away they go.
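
As a rough sketch, and assuming nothing else in the configuration changed at this step, the trimmed matrix would look something like the following. The three VERSION=latest entries and the whole allow_failures list are gone, leaving four jobs instead of seven.

```yaml
# Sketch of the build matrix after dropping the allowed-to-fail jobs.
# Only the entries from the original matrix that were required to pass remain.
env:
  matrix:
    - VPS_OS=ubuntu VERSION=16.04
    - VPS_OS=debian VERSION=9.3
    - VPS_OS=centos VERSION=7
    - VPS_OS=amazon VERSION=latest
```

Three jobs removed for roughly 36 minutes saved works out to about 12 minutes per job.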

2. Caching Assets via Build Stages 📦

This step requires taking advantage of Travis’s caching and build stage features. The latter is currently in beta (and has been for a while), but I’ve found that it works pretty well.

Caching is typically used to persist data across sequential builds to skip lengthy steps that do not often change, such as installing Ruby dependencies. I’m not sure how I feel about that, however — I prefer each build to go through the same steps as every other build.

Instead, I opted to use caching in combination with Travis’s build stages: while jobs run concurrently, stages run sequentially, so a dedicated stage seemed as good a place as any to prepare and cache all the shared dependencies of each test job before starting Inertia’s test scripts.

To do this, first set up the directories you want cached:

cache:
  directories:
    - vendor
    - images
    - daemon/web/node_modules

Then, set up a job and stage that will handle populating these caches:

jobs:
  include:
    - stage: precache
      script: true
      install:
        - go get github.com/golang/dep/cmd/dep
        - dep ensure            # cache vendor
        - make testdaemon-image # cache images
        - make web-deps         # cache node_modules

To use this cache, make sure:

  • all jobs that require these dependencies run in stages after the precache stage
  • all jobs that require these dependencies have the exact same env and language setup as the precache job, since Travis keys each cache on these properties and any deviation sends the job looking for a different cache
  • none of the installation handled in the precache stage is repeated in later jobs
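
To make the second point concrete, here is a minimal sketch of how a cache hit and a cache miss might look side by side. The stage name `test` and the job-level `env` entry are hypothetical and only there for illustration; they are not taken from Inertia's configuration.

```yaml
jobs:
  include:
    # Populates the cached directories once, before any tests run.
    - stage: precache
      script: true
      install:
        - dep ensure
    # Same language and env as the precache job, so Travis should restore
    # the cache that precache just uploaded.
    - stage: test   # hypothetical stage name
      script: go test ./...
    # Hypothetical counter-example: the extra env entry changes the
    # properties Travis uses to look up the cache, so this job would start
    # without the precached directories.
    - stage: test
      env: VPS_OS=debian
      script: go test ./...
```

If a job really needs a per-platform value, one option is to pass it as a command argument inside the job's steps rather than as a Travis env entry, so the cache lookup stays the same.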

When you run your next Travis build, you should see something like the following in your build logs:

The precache stage is followed by all your other stages — note how each job’s setup (language and environment) is identical.

And that’s it! You now have a build stage dedicated to setting up your build dependencies — installation no longer needs to be repeated for each job, since each will pull the required assets from the cache. For Inertia, this took 3 to 4 minutes off each build, which saves more than 12 minutes of total runtime per build. 🙈 Since a test build occurs in this stage as well, any compilation error will immediately fail and stop the entire build before any tests start, which could potentially save quite a lot of time as well.

3. Taking Advantage of Job Concurrency 🚆

While build stages run sequentially, all jobs within a build stage will run at the same time. It’s pretty common practice to take advantage of this by parallelizing not just different environment setups, but entirely different tasks as well.

In Inertia’s case, our tests have 3 main components:

  • the bulk of the test suite (unit and integration)
  • one test that runs through Inertia’s bootstrap process
  • linting and static analysis

Only the second component requires a different environment setup, and the last component is entirely separate from the rest — each can be run in parallel, and the first component only needs to be run once. So I decided on the following setup:

Travis cache is populated by the precache stage, and each job of the test stage will run concurrently.

To set this up in our Travis configuration:

jobs:
  include:
    # ...
    - &test
      stage: tests and static analysis
      install:
        - make testenv VPS_OS=ubuntu VPS_VERSION=16.04
        - bash test/docker_deps.sh
      before_script: make testdaemon-scp
      script:
        - go test -race -coverprofile=coverage.out ./...
      after_success:
        - go get github.com/mattn/goveralls
        - goveralls -coverprofile=coverage.out
    # Make sure bootstrap works on various other VPS platforms
    - &bootstraptest
      <<: *test
      install: make testenv VPS_OS=debian VPS_VERSION=9.3
      script: go test ./... -v -run 'TestBootstrapIntegration'
      after_success: true
    - <<: *bootstraptest
      install: make testenv VPS_OS=centos VPS_VERSION=7
    - <<: *bootstraptest
      install: make testenv VPS_OS=amazon VPS_VERSION=latest
    # Run linter and static analysis
    - install: bash test/lint_deps.sh
      script: make lint
Note how the bootstraptest jobs inherit the &test job’s stage name and before_script, but override its install, script, and after_success: instead of running the entire test suite like the first job does, they just run one test. The last job manages linting and static analysis, and runs within the same stage.
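
To see what the YAML merge keys are doing, it can help to write one of the aliased jobs out by hand. The sketch below is my own expansion of the CentOS job, assuming standard merge-key behaviour where keys set directly on a job take precedence over merged ones; it is what the config should resolve to, not something copied from Travis output.

```yaml
# Hand-expanded equivalent of:
#   - <<: *bootstraptest
#     install: make testenv VPS_OS=centos VPS_VERSION=7
- stage: tests and static analysis                         # inherited from &test
  before_script: make testdaemon-scp                       # inherited from &test
  install: make testenv VPS_OS=centos VPS_VERSION=7        # set on this job
  script: go test ./... -v -run 'TestBootstrapIntegration' # from &bootstraptest
  after_success: true                                      # from &bootstraptest
```

The merge is shallow: a key such as install replaces the inherited value outright rather than appending to it, which is why the docker_deps.sh step from the first job does not carry over.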

In order: precache, all tests, bootstrap tests (737.3 ~ 737.5), and the linter.

🛌 Results

With these changes I got some pretty sizeable improvements — granted, a lot of these improvements were only possible thanks to some poor decisions I made regarding the old Travis configuration, but oh well.

A Travis build is triggered each time a commit is pushed, and if a pull request is open for that branch, two builds are triggered (a PR build and a branch build), so these improvements are really going to add up.

The new configuration should also give much better feedback — based on which job failed, we will quickly be able to tell what went wrong, whether it was the linter or one of the bootstrap tests. Since the precache stage runs the daemon image build, in the event of a compilation error the whole build will quickly fail and stop as well.

Hopefully this will mean we won’t be holding up everyone else’s builds as much anymore 😛


ubclaunchpad.com — UBC’s leading software engineering club.
Stay in touch: Facebook | Instagram | LinkedIn
