Building our own Continuous Integration System

As young software companies grow and mature, most reach a point where they decide to adopt continuous integration: a practice in which developers integrate changes made to the project or codebase very frequently. The benefits of continuous integration (CI) are well known: higher code quality, reduced friction when deploying, and less integration risk among team members. One of the goals of Oscar’s Product Infrastructure team (consisting of Wyatt Anderson, Maura Killeen, and myself) is to make other engineers more productive. We’re constantly trying to improve the development workflow.

We ran our first pre-deployment tests with Jenkins, an open-source automation server, which at the time had limited ability to parallelize our test workload. Later, to run our test suite in parallel, we adopted Kochiku, an open-source project from Square, the payments company. Kochiku's primary selling point is its focus on distributing the test workload across many machines.

A Good Short Term Solution

For the most part, Kochiku worked as intended. It was composed of three parts:

  1. A web app which let us visually inspect our builds and repositories before deploying them
  2. Partitioner jobs which divided our builds and tests into distributable parts
  3. Workers that ran the individual parts of a build

These pieces together allowed Kochiku to run multiple builds in parallel, and gave us a high level of confidence in the quality of our codebase.

However, as Oscar grew and the number of builds created per day increased, we began to experience problems with our Kochiku installation more frequently:

  1. Kochiku is written in Ruby, which Oscar’s tech stack does not include. We also didn’t deploy it according to Ruby’s best practices.
  2. Oscar uses a mono-repository, and outgrew Kochiku’s design limits. As our codebase increased in size, Kochiku became overloaded more often. When this happened, the web app loaded too slowly, or builds failed to complete. We added more resources to address this issue, but we found ourselves having to do so every few months as more engineers were hired.
  3. Other miscellaneous errors arose which could only be solved by manually logging into a Kochiku server.

As a result, the Product Infrastructure team decided that it was time to rewrite our continuous integration system. We carried over the key concepts that made Kochiku work — the web app, partitioner, and workers — but made the rest of it our own.

A slick UI

Using our component library, we built the web app for our Continuous Integration system relatively quickly. We named it Foreman, as it allows engineers to kick off builds, inspect test output logs, and check on the status of in-progress builds. With Kochiku, the web app would load all data before displaying any information to a user. Foreman loads all data asynchronously, allowing the web app to respond more quickly.

And because we built this ourselves, we were able to add any features we deemed necessary, such as a slick progress bar, helpful tooltips, and build part filtering capabilities.

A smart partitioner

At Oscar, we use Pants, a build system developed by Twitter, to manage our codebase and to break projects down into individual build parts. A key feature of Pants is its ability to detect which dependent code could be affected when a file in the codebase is modified. This allows us and our CI system to determine which parts need to be rebuilt and retested when integrating changes.
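The idea of finding affected code can be illustrated with a reverse walk over a dependency graph. The following is a minimal sketch, not Pants' actual API or algorithm: the function name `affected_targets` and the example graph are hypothetical, and the graph is assumed to be a plain mapping from each target to the targets it depends on.

```python
from collections import defaultdict, deque


def affected_targets(dependencies, changed):
    """Return the changed targets plus everything that transitively depends on them.

    `dependencies` maps a target name to the targets it depends on directly
    (a hypothetical, simplified stand-in for a real build graph).
    """
    # Invert the graph: for each target, record who depends on it directly.
    dependents = defaultdict(set)
    for target, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(target)

    # Breadth-first walk from the changed targets up through their dependents.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical graph: target -> targets it depends on.
deps = {
    "app": ["lib_auth", "lib_util"],
    "lib_auth": ["lib_util"],
    "lib_util": [],
    "docs": [],
}
print(sorted(affected_targets(deps, ["lib_util"])))  # ['app', 'lib_auth', 'lib_util']
```

In this toy graph, a change to `lib_util` affects `lib_auth` and `app`, while `docs` is left alone and would not need to be rebuilt or retested.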

Previously, whenever a commit was submitted for code review, Kochiku’s partitioner would determine all parts that differed between that change and its parent commit. However, even when the author updated their revision with a very minor change, the old partitioner would recalculate all the parts that differed, and the workers would have to rerun every one of them. This slowed down the CI system significantly, and left many engineers waiting for hundreds of build parts to be partitioned and run, even for the slightest update to their revision.

To improve this process, we had the new partitioner calculate the minimum subset of parts that must be rerun on subsequent updates to a given revision. For example, if an initial revision changes parts A, B, and C, the new partitioner sends parts A, B, and C to be built, just like the old partitioner. But if the author then updates only part C after code review (leaving parts A and B untouched), the old partitioner would recreate parts A, B, and C, while the new partitioner sends only part C to be rebuilt.
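The A, B, C example can be expressed in a few lines. This is only a sketch under stated assumptions: the post does not say how the partitioner decides that a part is unchanged, so the code assumes each part carries a fingerprint (for instance, a hash of its inputs), and the name `parts_to_run` is hypothetical.

```python
def parts_to_run(current, previous=None):
    """Return the names of the parts that need to run for this update.

    `current` and `previous` map a part name to a fingerprint of that part's
    inputs. The fingerprint scheme is an assumption made for illustration.
    """
    if previous is None:
        # First submission of the revision: every affected part must run.
        return set(current)
    # Later updates: run only parts that are new or whose fingerprint changed.
    return {name for name, fp in current.items() if previous.get(name) != fp}


initial = {"A": "a1", "B": "b1", "C": "c1"}
updated = {"A": "a1", "B": "b1", "C": "c2"}  # only part C was touched

print(sorted(parts_to_run(initial)))           # ['A', 'B', 'C']
print(sorted(parts_to_run(updated, initial)))  # ['C']
```

Results for parts A and B could then be carried over from the earlier run rather than recomputed.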

Because we implemented the partitioner ourselves, we were able to optimize the partitioning logic to fit our needs, incorporating information from our build graph to eliminate redundant or unnecessary test runs.

An efficient builder

The last part of our CI system that we rebuilt was the workers themselves: the system that actually builds the parts and reports the results of each test run. The old system suffered from a few inefficiencies:

  1. Unnecessary overhead: for each part that was run, the worker had to first clone the repository. This was especially slow for us because we have a large monorepo.
  2. A stale cache: the cache was meant to reduce run time, but if it was populated incorrectly during one test run, future test runs would be affected and could fail spuriously.
  3. Resource competition: because tests were not run in isolation, one test run that required a lot of memory could starve other tests that were trying to run on the same machine.

In order to remedy these issues, we decided to base our new worker implementation on two key technologies — Docker and Nomad.

Docker

Docker is a platform for running applications in containers. A key concept in Docker is an image, which is essentially a stack of layers that represent changes in the file system. You can stack and base images off of each other, similar to how you can stack commits off of one another in your revision history.

To take advantage of the layer concept, we first build an image of our repository (the master branch) — this image is downloaded once to all of our CI machines. Then, whenever a revision has parts which need to be built, we create an image that contains only the differences between the revision and master.

Because the master image is already cached on each machine, we only incur the cost of the I/O for the differences in the revision, compared to the costly I/O that used to be involved with checking out the entire repo for every test run. This greatly reduces the issue with unnecessary overhead. Finally, once the tests are about to be run, the image will be downloaded once on the assigned machine, and a separate container will be created for each test run. This solves the problem of a stale cache — each Docker container created contains all the files and layers needed for that test to run.

By incrementally building each image and caching them, we reduce the amount of overhead at run time.
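As a rough illustration of the layering idea, the sketch below renders a Dockerfile that starts from a cached master image and adds only the files a revision touched. It is not the build tooling described in this post: the function name, the image tag `ci/repo:master`, and the `/repo` checkout path are all hypothetical, and only the standard `FROM`, `COPY`, and `RUN` Dockerfile instructions are assumed.

```python
def render_revision_dockerfile(base_image, changed_files, deleted_files=()):
    """Render a Dockerfile that layers a revision's changes on a master image.

    `base_image` is the (hypothetical) tag of the image built from master.
    Paths are assumed to be relative to the repository root and free of spaces.
    """
    # Start from the master image, which is already cached on each CI machine.
    lines = [f"FROM {base_image}"]
    # Copy in only the files that differ from master.
    for path in sorted(changed_files):
        lines.append(f"COPY {path} /repo/{path}")
    # Files removed by the revision must also be removed from the image.
    if deleted_files:
        removed = " ".join(f"/repo/{path}" for path in sorted(deleted_files))
        lines.append(f"RUN rm -f {removed}")
    return "\n".join(lines) + "\n"


print(render_revision_dockerfile(
    "ci/repo:master",
    changed_files=["src/billing/api.py", "src/billing/models.py"],
    deleted_files=["src/billing/legacy.py"],
))
```

Emitting one `COPY` per file keeps the sketch readable but creates one layer per file; a real setup would more likely stage the changed files in a single directory and copy it in one step.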

Nomad

Nomad is a tool for scheduling jobs on a cluster of machines. It allows us to dispatch thousands of tests at once, and decides when and where to run them. In addition, Nomad runs each job in isolation and lets us specify the maximum CPU and memory a task may use, which solves the problem of resource competition. Oscar uses Aurora for scheduling most production jobs and services, but we chose Nomad over Aurora for this project because of its excellent batch scheduler, its rich HTTP API, and the fact that it works well on ephemeral machines. Nomad’s scheduler allows us to prioritize and queue test jobs differently based on the total number of test jobs for a given build, so that builds with minor changes are prioritized over builds with more substantial changes that may need to run thousands of tests.
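One way to picture size-based prioritization is a function that maps a build's total number of test jobs to a job priority. This is a hedged sketch rather than the actual policy: the linear mapping, the thresholds, and the name `job_priority` are all assumptions, and the 10 to 90 band is chosen to sit inside the 1 to 100 range that Nomad job priorities accept, as far as we understand it.

```python
def job_priority(total_test_jobs, lowest=10, highest=90, large_build=1000):
    """Map a build's size to a scheduler priority: fewer test jobs, higher priority.

    All defaults are illustrative assumptions. Builds with `large_build` or
    more test jobs are clamped to the `lowest` priority.
    """
    if total_test_jobs < 0:
        raise ValueError("total_test_jobs must be non-negative")
    # Fraction of the way from a tiny build (0.0) to a large build (1.0).
    fraction = min(total_test_jobs, large_build) / large_build
    # Interpolate linearly from `highest` down to `lowest`.
    return round(highest - fraction * (highest - lowest))


print(job_priority(0))     # 90
print(job_priority(500))   # 50
print(job_priority(5000))  # 10
```

With a mapping like this, a small build submitted behind a very large one would still be scheduled promptly instead of waiting for thousands of queued tests.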

Looking ahead

Oscar’s continuous integration system is now quicker, more efficient, and more reliable than before. This didn’t happen all at once; Kochiku was replaced piece by piece, in order to provide a smooth transition and experience for our engineers. Having modular interim solutions for these large architectural changes minimizes friction for users.

Having succeeded in creating a system that is fast and scalable, we’ve set the foundation for our next goal: automatically scaling the cluster up and down with EC2 Spot Instances. Making this change will allow us to quickly increase the number of machines when demand is high and reduce it when demand is low, especially at night.

Of course, there are always things we can do to improve our CI system, but for now we’re glad that we’ve created something that gives other engineers one less thing to worry about.