Slaying the Slow CI Dragons

Aitizaz Khan · Published in motive-eng · 9 min read · Oct 15, 2021

In this post we reveal how we at KeepTruckin reduced the duration of our Ruby on Rails CI pipeline from more than an hour to less than 10 minutes. Although the end result was striking, our solution did not consist of one equally outstanding, dramatic improvement. Nothing was notably wrong with our system; it was simply enormous and growing. Instead, our solution consisted of many smaller iterative steps that together produced a dramatic reduction in CI processing time.

At KeepTruckin, we have several hundred engineers located around the world, all working on a Rails monolith that powers our mobile and web apps, along with our hardware devices in the field. Hundreds of thousands of truckers depend on our hardware and software for their day-to-day work and for moving safely across North America. Today the Puma server farm that runs this monolith and handles all of our daily operations serves about 10,000 requests per second at peak.

In most cases, a Rails app’s test suite is the ugliest part of the application. A growing team and a bigger codebase only exacerbate the problem. In the past few years, we saw significant growth in our customer base, our team and our codebase. It is no surprise, then, that all this growth caused our CI speed to degrade over time. The increasing duration of each build heavily impacted our developers’ productivity as well as our release process. On average, a single branch build of our Rails monolith started taking more than one hour.

CI build times before we reduced them

Refer to the image above. We used to run two nodes, each with a single process, on SemaphoreCI. Thread #2 shown above ran only unit tests, while everything else, including RuboCop, ran on Thread #1.

We couldn’t let our slow CI continue to impact the speed at which we ship changes to our customers, so we rolled up our sleeves and got to work.

First, We Split Test Files Across Multiple Nodes

Running all unit tests in a single Rake task can be very time consuming. In 2020, as part of a hackathon project, we implemented a quick solution that split test files across multiple nodes based on file size. We distributed tests across nodes using the following calculation:

  • N = number of nodes
  • S = total size of all test files

Then we created N buckets, each containing test files with total size <= S/N

Splitting test files across nodes based on file size
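The splitter can be sketched as a greedy fill (a hypothetical reconstruction, not our exact code): sort the files largest first and always drop the next file into the currently smallest bucket, which keeps each bucket's total size close to S/N.

```ruby
# Greedy file-size bucketing: assign each test file to the currently
# smallest bucket so every bucket's total size stays close to S/N.
def split_into_buckets(files_with_sizes, node_count)
  buckets = Array.new(node_count) { { files: [], size: 0 } }
  # Largest-first keeps the greedy fill well balanced.
  files_with_sizes.sort_by { |_, size| -size }.each do |file, size|
    bucket = buckets.min_by { |b| b[:size] }
    bucket[:files] << file
    bucket[:size] += size
  end
  buckets
end
```

For example, four files of sizes 50, 40, 30, and 20 KB split across two nodes end up as two buckets of 70 KB each.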

Lacking a better metric to use as a criterion, we stuck with this naive file-size-based approach. It gave us a very quick win, but the improvement didn't last long. As our company grew, more and more features were added, each with its own set of tests, and in just a few months the distributed jobs were bottlenecked again at 46 minutes.

CI build times not sufficiently reduced

46 minutes for a single build significantly harms developer productivity. We knew we could keep adding more nodes and increasing our CI cost, but that’s not a good long-term solution. We decided to dig deeper and come up with a better solution.

Next, We Introduced Parallel Testing

The first obvious solution was to introduce parallel testing at the node level. By running a single process per node, we were utilizing only 1/8th of the processing power we were paying for.

We were on Rails 5.2 at the time, and that version didn't offer native parallel testing. After a little research, we decided to use the parallel_tests gem.

Implementing parallel_tests improved our CI speed by ~55% with the same cost.

CI build times after implementing parallel tests

We Integrated Knapsack Pro

The build in the image above highlights a very interesting issue: runtimes are unevenly distributed. For example, Job #5 only took ~9.5 minutes, while Job #6 took almost 22 minutes. This proved that our initial hacky solution of splitting test files across nodes based on file size was way off the mark, and we needed a better way to split files.

We decided to integrate Knapsack Pro into our Rails app. While there is a free version of this gem that can calculate the run times of all your test files, buying the "Pro" version gave us an out-of-the-box solution for splitting tests equally among nodes based on run time, not file size. It worked well with our implementation of parallelism on the CI server.

We really like the queue mode functionality that Knapsack provides. It solves a very interesting and complex problem: tests are queued on the Knapsack Pro server, and every process on every CI node consumes the queue via the Knapsack Pro API until it is depleted. In this way, tests are optimally distributed across CI servers, avoiding bottlenecks caused by overworked servers.
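For reference, Knapsack Pro's Minitest hookup follows the gem's documented setup (your helper path and adapter may differ): bind an adapter in the test helper, then start queue mode per node via a rake task.

```ruby
# test_helper.rb: bind Knapsack Pro so each CI process pulls its next
# batch of test files from the Knapsack Pro queue until it is depleted.
require 'knapsack_pro'

knapsack_pro_adapter = KnapsackPro::Adapters::MinitestAdapter.bind
knapsack_pro_adapter.set_test_helper_path(__FILE__)
```

Each CI node then runs `bundle exec rake knapsack_pro:queue:minitest`, with `KNAPSACK_PRO_CI_NODE_TOTAL` and `KNAPSACK_PRO_CI_NODE_INDEX` identifying the node.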

Then We Split Large Test Files and Changed Node Configuration

Unfortunately, even with Knapsack Pro, we still weren't getting the big win we were looking for. After reviewing the run times of some of our tests, we realized that there were test files taking between 10 and 12 minutes to run. The setup steps alone (database setup, gem installation, and so forth) took approximately six minutes for each process. With a minimum of 12 minutes to run the tests, we were stuck at about 18 minutes.

We decided to split the large test files into smaller ones. Test files taking longer than five minutes had to be broken down, but this was a very tedious manual task. We had more than 800 test files and more than 30 of them were taking longer than five minutes to run.

We also realized that running eight processes on six nodes significantly overworked our nodes, sometimes resulting in nodes running out of memory. As an alternative, we ran six processes on eight nodes, and moved all the non-test-suite checks like RuboCop to a separate node.

The results were quite rewarding. Our build time was down to only ~12 minutes.

CI build time much better after splitting large test files and configuring nodes

And our average build time was down to ~14 minutes.

We Used TestProf and Deferred Our Garbage Collection

At an average of 14 minutes per build, we were in much better shape, but we understood that for further improvement, we’d have to make some application-level changes. A good rule of thumb is: when in doubt, always measure! So, we integrated TestProf to profile our test suite. The first thing that stood out was garbage collection.

TestProf’s profile of our test suite

Ruby uses a mark-and-sweep garbage collection strategy. Lines 2 (sweeping) and 3 (marking) in the image above show that a significant share of CPU cycles was consumed by garbage collection. After a little research, and inspired by this blog, we implemented a deferred garbage collection strategy that looks like this:

# This class implements deferred garbage collection, which reduces the time
# consumed by tests. By default it defers GC so that collection runs at most
# every 20 seconds; the interval can be overridden via the DEFER_GC env variable.
class DeferredGarbageCollection
  DEFERRED_GC_THRESHOLD = (ENV['DEFER_GC'] || 20.0).to_f

  @last_gc_run = Time.now.utc

  def self.start
    GC.disable if DEFERRED_GC_THRESHOLD > 0
  end

  # Checks whether the time since the last run exceeds the threshold and, if so,
  # enables garbage collection for a short cleanup. After that, garbage
  # collection is deferred again.
  def self.reconsider
    if DEFERRED_GC_THRESHOLD > 0 && Time.now.utc - @last_gc_run >= DEFERRED_GC_THRESHOLD
      GC.enable
      GC.start
      GC.disable
      @last_gc_run = Time.now.utc
    end
  end
end

# This module applies the deferred garbage collection strategy to every Minitest
# run: after each test it calls DeferredGarbageCollection.reconsider to decide
# whether GC needs to run.
module MiniTestDeferredGC
  def before_setup
    DeferredGarbageCollection.start
    super
  end

  def after_teardown
    super
    DeferredGarbageCollection.reconsider
  end
end
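The wiring into the test suite isn't shown above; a minimal sketch, assuming the two definitions above live in a file loaded by test_helper.rb, is to mix the module into the Minitest base class so the hooks run around every test:

```ruby
# test_helper.rb (sketch): make every test use the deferred-GC hooks.
require 'minitest/autorun'
# require_relative 'support/deferred_garbage_collection' # hypothetical path

class Minitest::Test
  include MiniTestDeferredGC
end
```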

We increased our garbage collection intervals to 20 seconds, and this reduced the CPU usage by almost 40 percent.

After implementing the new garbage collection strategy

Additional (Minor) Optimizations

Caching Dependencies

Reviewing our builds after the optimizations just discussed, we noticed that almost 40% of our time was devoted to setting up the app; that is, setting up databases, installing gems, etc. Running the bundle install command took two minutes on average. Because new gems are not introduced very often, it made sense to cache these dependencies. We used the dependency caching feature provided by Semaphore. This saved us about two minutes per build.

Bootsnap

To improve app load time, we used Bootsnap, built by Shopify. Bootsnap caches expensive computations on the first run and loads from that cache on subsequent runs, which is much faster. On its own this won't help on CI, because a fresh machine is spun up for every build. Our small workaround: we created a symbolic link between the Bootsnap cache directory and the general cache directory that is shared across all nodes.

mkdir -p $SEMAPHORE_CACHE_DIR/bootsnap/cache
mkdir -p $SEMAPHORE_PROJECT_DIR/tmp
ln -s $SEMAPHORE_CACHE_DIR/bootsnap/cache $SEMAPHORE_PROJECT_DIR/tmp
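For completeness, Bootsnap itself is enabled in the app rather than on CI; the standard wiring from the gem's README goes in config/boot.rb, so the cache lands in tmp/cache (the directory symlinked above):

```ruby
# config/boot.rb
ENV['BUNDLE_GEMFILE'] ||= File.expand_path('../Gemfile', __dir__)

require 'bundler/setup'  # Set up gems listed in the Gemfile.
require 'bootsnap/setup' # Speed up boot time by caching expensive operations.
```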

Simple Passwords in Tests

We use Devise for authentication, which uses bcrypt to compute password hashes. Bcrypt deliberately takes a while to compute a hash, and in tests we generally don't need that level of security. Instead, we built an extremely simple and fast stand-in "crypto" that just reverses the input string. Not computing real password hashes in the test environment saves a considerable amount of time.

module BCrypt
  class Password < String
    def initialize(encrypted)
      @encrypted = encrypted
    end

    def is_password?(unencrypted) # rubocop:disable Naming/PredicateName
      @encrypted == unencrypted.reverse
    end

    def self.create(unencrypted, **)
      unencrypted.reverse
    end
  end
end
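As a quick sanity check, the stub behaves like this (the module is duplicated here only so the snippet runs standalone):

```ruby
# Minimal copy of the test-only BCrypt stub: "hashing" is string reversal.
module BCrypt
  class Password < String
    def initialize(encrypted)
      @encrypted = encrypted
    end

    def is_password?(unencrypted)
      @encrypted == unencrypted.reverse
    end

    def self.create(unencrypted, **)
      unencrypted.reverse
    end
  end
end

hashed = BCrypt::Password.create('secret')                 # => "terces"
puts BCrypt::Password.new(hashed).is_password?('secret')   # prints "true"
```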

Final Results

After all of these small and large optimizations, we finally reached a comfortable point with respect to our test suite run time. For a large Rails monolith, we were now running our entire test suite in about eight minutes as opposed to over an hour.

New CI build times

Next Steps

  • Cache database: The majority of Rails apps require that we compile assets, migrate the database, and run bundle install before we can run any tests. About one third of CI was consumed by these tasks. We took care of the bundle install step, but the database is still created from scratch. It will be interesting to explore how we can cache an entire database and use that across nodes.
  • Run only impacted tests: In Rails, it's typical to run the entire test suite for every change. But what if a change impacts only a very small part of the code? As a potential solution, we did a quick hackathon project to create a map between each test file and the other Ruby files it touches. Once the map is generated, we can reverse it to identify which tests need to run for a particular change. This project has not shipped yet, but it should significantly reduce the CI cost and time per build for any Rails application.

Conclusion

For the past few months, we at KeepTruckin have been working to make the CI for our Rails application less brittle and much faster. In addition to speed, we have been constantly striving to improve the reliability of our test suite.

Tests can be sped up, but there is no magic bullet. Regardless of the technologies you choose, I hope this post serves as a useful guide for you to improve your own CI pipelines. If you have any questions, I’m happy to answer them on Twitter!

Acknowledgments

This project was a joint team effort and would not have been possible without the Application Architecture and DevProd teams at KeepTruckin. I would like to give a special thanks to Shahrukh Khan, who worked with me on improving our CI and provided constant support during the implementation of these updates.

Come Join Us!

Check out our latest KeepTruckin opportunities on our Careers page and visit our Before You Apply page to learn more about our rad engineering team.
