9 Steps to an Optimized Ruby on Rails Monolith Pipeline

stephskardal · Upstart Tech
Feb 14, 2022 · 12 min read

For the last couple of months at Upstart, we’ve invested in optimizing our continuous integration (CI) pipeline for our monolith codebase. This awesome cross-team effort has included infra folks, platform folks, and software engineers. While we have a number of growing microservices, a large part of our day-to-day work is in the monolith. In this post, I’ll break down the efforts we made to optimize our pipeline and improve our development life cycle.

Background

First, some background: our monolithic app runs on Ruby on Rails and our CI/CD pipeline runs on Jenkins. Prior to this concerted effort, our CI process took more than 50 minutes to run, triggered by a push to GitHub and concluding with test report generation. We also send real-time failure notifications to Slack. The build runs continuously to ensure stability and reliability as work from many software engineers merges into one monolithic codebase.

A simplified depiction of our Upstart monolith Jenkins pipeline.

Our 50-minute pipeline includes the following:

  • Regenerating dependencies
  • Building a production image in preparation for a deploy
  • Compliance and vulnerability checks
  • Running over 30,000 automated tests (ranging from unit tests to multi-page end-to-end flow tests, via RSpec and Cucumber)
  • Generating aggregated coverage data
  • Reporting coverage and test results

Let’s be real here: while we have a monolithic codebase that represents years of complex business logic covered by automated tests, 50 minutes is too long!

While some engineers are successful at context switching between builds, the long build time creates a very long feedback loop, which prevents quick iteration. When we wanted to roll forward a hotfix for a low-impact bug, the build wait time was roughly the length of a decent lunch break. Our ability to exercise continuous deployment has been limited in this 50-minute CI-build world.

Optimization: The 9 Step Journey

On to the nitty gritty details! In this post, I’ll share 9 steps we took to optimize our CI build time to get us from 50 minutes to less than 20 minutes.

The Not So Fine Print Details

Before I dig in, here are a few more technical details:

  • During our build, we have a number of containers that have different dependencies, serving different purposes.
  • Tests run on many test pods, or replicated test images. At the time I’m writing this post, we run all of our automated tests on over 200 test images. Each test image (pod) is connected to a single, unique, replicated seeded database.
  • Approximately 100 of those test pods run RSpec tests and the other 100 run feature tests via Cucumber and Capybara. We run unit tests on a single pod, as those are super speedy with no database connection.

Step 1: Follow the Pain, Measure It

Time Gain From Seeds Pain: ~4 minutes

Part one of the pipeline performance optimization started with acknowledging significant pain. For me, that pain lived in our data seeding.

“Seed” data is a set of mocked data that lives in our database and allows engineers to locally test many parts of our monolithic application. While we create data within many individual automated tests, we also reuse the seed data for much of our automated testing. The significant pain of multi-minute data seeding was the catalyst for me getting involved in this pipeline work, once I realized that our data seeding took more than 5 minutes.

Our data seeding leverages FactoryBot, a Ruby on Rails gem used for creating database objects. Ruby on Rails 6 now offers a way to do bulk inserts quickly via insert_all, and the activerecord-import gem offers similar functionality. Both tools replace many database INSERTs with one INSERT. I made code updates that looked like this:

# Before, with many inserts
FactoryBot.create(:an_engineer, name: "Zane")
FactoryBot.create(:an_engineer, name: "Todd")
FactoryBot.create(:an_engineer, name: "Steph")
# ... more engineers, you can never have enough

# After, with one insert
engineers = ["Zane", "Todd", "Steph"] # ... more engineers
Engineer.import [:name], engineers.map { |name| [name] }
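
The example above uses activerecord-import; for completeness, here is a hedged sketch of the same bulk insert with Rails 6’s built-in insert_all (assuming an Engineer model with a name column). Note that insert_all skips Active Record validations and callbacks.

# Same idea with Rails 6's insert_all (no extra gem needed)
Engineer.insert_all([
  { name: "Zane" },
  { name: "Todd" },
  { name: "Steph" }
])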

This seems like an easy place to start, right? After this update shaved significant time from the seeds, I examined the remaining data on a fancy spreadsheet to identify opportunities for optimization. The goal of my research was to answer the questions:

  • “Where is there the most pain?”
  • “What pain should be targeted for high impact improvements?”

Zooming out, the greater team began to compile DataDog dashboards to help us identify the largest bottlenecks in our CI process. In addition to DataDog, we also have a number of profiling tools via test-prof, which can be run locally to profile at a per-test level. We began to piece together the various bottlenecks of the CI process, targeting reduction of the “longest pole” of each stage.
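
As a rough illustration of that local profiling workflow (this setup is a generic sketch, not our exact configuration), test-prof’s profilers are opt-in and activated per run:

# spec/spec_helper.rb (sketch; assumes the test-prof gem is in the Gemfile)
require "test_prof"

# Profilers are then enabled per run via environment variables, for example:
#   EVENT_PROF="sql.active_record" bundle exec rspec   # time spent in SQL, per example group
#   FPROF=1 bundle exec rspec                          # factory usage profiling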

Example DataDog metrics for optimization and assessment.

This all seems pretty obvious now that we’ve done the work, but the takeaway here was to combine pain-driven development with metrics to identify actionable, high-impact changes. There are many tools available to instrument your pipeline should you embark on a similar optimization effort.

Step 2: Cache Things

Time Gain From Caching Things: ~3 minutes

This also seems obvious now, but cache things in your CI build that are cache-worthy! If you are a software engineer, you may already know of the term “caching”, which means storing data somewhere it can be accessed more quickly. In the context of Ruby on Rails, caching can mean a number of things.

In our CI build, we identified a number of additional artifacts that could be cached to expedite the build. Cache-worthy artifacts might include objects that do not change within a build or between builds. We extended the existing cached artifacts to include:

  • a base test image, with installed gem dependencies
  • precompiled frontend assets, synced to each test pod from S3
  • seeds, in SQL form, decoupled from Rails
  • test splitting artifacts — “instructions” for each test pod

In our CI build, we leveraged a storage solution for each of those cached items: we used AWS S3 and Harbor IO, and stored cached seeds directly in our code repository. The takeaway from this step is to use the storage mechanisms available to you to cache artifacts that would otherwise be recreated in each build.
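
To make the “seeds, in SQL form, decoupled from Rails” idea concrete, here is a hedged sketch; the task name, file paths, and the PostgreSQL and Rails 6.1+ assumptions are mine, not necessarily how our pipeline is wired. The idea: seed once, dump to SQL, and let test pods restore the dump with plain psql instead of re-running Rails seeding.

# lib/tasks/cached_seeds.rake (hypothetical sketch; assumes PostgreSQL and Rails 6.1+)
namespace :ci do
  desc "Dump the seeded test database to SQL so test pods can restore it without Rails"
  task dump_seeds: :environment do
    db_name = ActiveRecord::Base.connection_db_config.database
    system("pg_dump --no-owner --format=plain #{db_name} > tmp/cached_seeds.sql") ||
      abort("pg_dump failed")
    # The dump can then be pushed to a cache (e.g. S3) and restored on each pod with psql.
  end
end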

Step 3: Parallelize Things

Time Gain From Parallelizing Things: ~3 minutes

Next, we identified a number of items that could be parallelized. If you are a parent, parallelization could mean, simultaneously:

  • get one kid putting their shoes on and brushing their hair
  • battle another kid to finish getting dressed
  • get a third kid to finish their breakfast

In the software world, this means running things in parallel that do not depend on each other.

Simple depiction of moving steps in the pipeline based on the dependencies, or splitting single tests in two and shifting in parallel.

We asked ourselves which steps could be shifted earlier and which steps could be split into two. For our particular pipeline, we were already doing a number of steps in parallel, but the following items were further parallelized:

  • Test splitting: programmatically determining which tests run on which pods, once it could be decoupled from Rails as a dependency. On the feature test side, for example, this was done using the gherkin Cucumber parsing gem (sketched at the end of this section).
  • Test parsing: test reports had been generated in sequence for the three types of tests (unit, RSpec, and feature tests). We were able to run these in parallel to shave off time.

Parallelization is well understood in the software world. We were able to apply it further in this project after reexamining or removing dependencies.
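
To make the test-splitting piece concrete, here is a heavily simplified sketch of counting Cucumber scenarios per feature file without booting Rails. Our real implementation uses the gherkin parsing gem rather than this naive line scan, and the paths here are illustrative.

# scripts/count_scenarios.rb (simplified illustration only)
scenario_counts = Dir.glob("features/**/*.feature").to_h do |path|
  count = File.readlines(path).count { |line| line.strip.start_with?("Scenario") }
  [path, count]
end

# The counts can then feed the per-pod test splitting "instructions" cached in Step 2,
# with no Rails dependency in sight.
puts scenario_counts.sort_by { |_path, count| -count }.first(5).to_h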

Step 4: Background Things

Time Gain From Backgrounding Things: ~7 minutes

Step 4 of our journey into pipeline optimization found us backgrounding items that consumed many minutes of the build. Infrastructure work was done to shift the work of generating code coverage and test reports to the background. CI build completion was decoupled from this report generation.

This made sense because these build artifacts were not necessarily required for every build. We also had other signals from the build that made the report output redundant.

The high-level technical solution here was:

  • Generate coverage and test report artifacts during the test builds.
  • Push artifacts to S3.
  • On the “Results” step, kick off a background job that retrieves the artifacts and generates the reports outside of the scope of the build.
  • Send a notification to software engineers so they can access the reports upon completion.
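
A hedged sketch of what that background step might look like follows; the job framework (Sidekiq here), the bucket layout, and the report generation details are assumptions for illustration rather than our actual implementation.

# app/jobs/test_report_job.rb (illustrative sketch only)
require "aws-sdk-s3"
require "fileutils"

class TestReportJob
  include Sidekiq::Worker

  def perform(build_id, bucket, key)
    target_dir = "tmp/reports/#{build_id}"
    FileUtils.mkdir_p(target_dir)
    archive = File.join(target_dir, "artifacts.tgz")

    # Retrieve the artifacts pushed to S3 during the test builds.
    Aws::S3::Client.new.get_object(bucket: bucket, key: key, response_target: archive)
    system("tar -xzf #{archive} -C #{target_dir}")

    # ... aggregate coverage, generate the test report, then notify engineers ...
  end
end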

Step 5: Remove Things Altogether

Time Gain From Removing Things: >8 minutes

Significant gains were found at Step 5, when we decided to remove things altogether.

Raw Rails Logs

As noted above in Step 4, we identified a number of build artifacts that were created by each test pod, and some of those signals were redundant. We found that Rails logs were being generated in each build and sent to S3. These logs were redundant with the test results and had a much lower signal-to-noise ratio.

Prior to the addition of test reporting tools with better usability, we may have relied on these Rails logs for debugging purposes. Now, we no longer link to the Rails logs in our build, and no one knew terabytes of them existed on S3.

We made the change to stop Ruby on Rails logging in our test pods:

# config/environments/test.rb
if ENV["CI"]
  config.logger = Logger.new(nil)
  config.log_level = :fatal
end

Since no one knew they existed, no one was upset when they were removed! This change did not apply to local test and development environments.

Passing Test Output

In our build, we leverage a static build of Allure (Test Report), a test report tool, to display test failures in an interactive format. To support Allure, we had added middleware gems allure-ruby, allure-cucumber, and allure-rspec which leverage RSpec and Cucumber callbacks to generate test output for all test runs.

Unfortunately, the middleware was found to be heavy-handed — we were generating test output (as JSON or XML) for >30,000 tests per build, accumulating to more than 300,000,000 single test artifacts per month at our build frequency. These artifacts include passing test output. Software engineers such as myself were feeling this pain on the Allure user interface (UI) side as the UI tried to load all of this static data when all I really wanted to know was which tests failed on my build.

A technical solution was put in place to remove the middleware that generated test output for every passing test. Test output artifacts were generated for failing and pending tests only, while test count totals continued to be passed to Allure for display in the UI. This change shaved minutes off the build and minimized S3 file transfer.
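
As an illustration of the “failures and pending only” idea, here is a generic RSpec formatter sketch, not the actual change we shipped; it records artifacts only for non-passing examples and would be wired up via RSpec’s --format and --out options.

# spec/support/failures_only_formatter.rb (illustrative sketch only)
require "json"
require "rspec/core"

class FailuresOnlyFormatter
  RSpec::Core::Formatters.register self, :example_failed, :example_pending, :stop

  def initialize(output)
    @output = output
    @records = []
  end

  def example_failed(notification)
    @records << { id: notification.example.id, status: "failed",
                  message: notification.exception.message }
  end

  def example_pending(notification)
    @records << { id: notification.example.id, status: "pending" }
  end

  # Write one small JSON artifact per pod instead of per-test output for 30,000+ tests.
  def stop(_notification)
    @output.write(JSON.pretty_generate(@records))
  end
end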

A depiction of test report output from allure, showing test totals and pending tests. There were no failing tests in this mock build — hooray!

The ultimate takeaway here is that artifact generation in CI is neither free nor useful when the data it produces is unactionable or unused.

Additional Items Removed

In addition to the big wins described above, we identified a number of additional things to remove:

  • Eagerly loaded seed data not used in testing
  • Redundant tests
  • Files not necessary on test pods that were already coming from a cache
  • Files not necessary on the test image (via sparse checkout)
  • Database profiling metrics (moved to optional only)
  • Redundant build images

Step 6: Mitigate Rails Loading Pain

Time Gain From Improved Loading: ~2 minutes

Modest gains were made by mitigating the known challenge of Rails loading speed. When this project began, we were round-robin dividing all of our tests between pods and calling multiple RSpec commands per test pod, e.g.:

bundle exec rspec this_test.rb:1 that_test.rb:2 another_test.rb:3
# ... a lot more invocations like this ...
bundle exec rspec different_test.rb:1 another_one.rb:3

The problem with the implementation above is that we were taking the hit of loading Rails with every call to RSpec. In our case, that was a fairly significant cost because our application and dependency set is quite large in the monolith.

To mitigate this loading issue, Spring was added to preload Rails for our tests. While Spring is a point of contention in the Rails community because of its unpredictable behavior in development environments, we added Spring in CI only, since the ephemeral test pods never have changing code.
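
For reference, a minimal sketch of the Gemfile side of this; how the gems are grouped and gated to CI is an assumption about one possible wiring, not necessarily ours.

# Gemfile (sketch)
group :test do
  gem "spring"
  gem "spring-commands-rspec" # adds a `spring rspec` command
end

Test pods can then invoke spring rspec ... instead of bundle exec rspec ..., so the Rails boot cost is paid once per pod rather than once per RSpec command.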

Additional work was done to stop auto-requiring slow-loading gems and require them only in the files where they were used:

# Gemfile
gem "slow_loader", require: false

# some_class.rb
require "slow_loader"

class SomeClass
  # do a thing with SlowLoader
end

Step 7: Better Test Pod Load Balancing

Time Gain From Better Pod Load Balancing: >4 minutes

What’s next on our exciting journey of optimizing the pipeline? Better load balancing. In our case, this meant moving away from round-robin test assignment to data-driven load balancing based on expected run time.

Below is a snippet of the initial load balancing. From the engineer who wrote this code: “This should be an interview question!”

module PseudoLoadBalancer
  Bag = Struct.new(:contents, :total_cost)

  # block returns the "cost" of each content item
  def self.balance(max_number_of_bags, contents, &block)
    bags = (1..max_number_of_bags).map { Bag.new([], 0) }
    contents.sort_by(&block).reverse_each do |entry|
      # add each entry to the bag with the smallest total cost so far
      bag = bags.min_by(&:total_cost)
      bag.contents << entry
      bag.total_cost += block.call(entry)
    end
    bags
  end
end
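
For illustration, here is how the balancer above might be invoked with per-file runtimes recorded from a previous build; the file names and timings are made up.

# Hypothetical inputs: spec files mapped to prior runtimes in seconds.
runtimes = {
  "spec/models/loan_spec.rb"               => 180.5,
  "spec/features/application_flow_spec.rb" => 420.0,
  "spec/lib/payment_calculator_spec.rb"    => 12.3
}

pods = PseudoLoadBalancer.balance(2, runtimes.keys) { |file| runtimes[file] }
pods.each_with_index do |pod, index|
  puts "pod #{index}: #{pod.contents.join(', ')} (#{pod.total_cost}s)"
end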

We also do dry test runs on all of our tests to divide them up. Updates were made to reduce the dry run dependencies as well. Further iterations on load balancing have been implemented while I’ve been writing this post.

Step 8: Resource Management

Time Gain From Better Resource Management: >1 minute

We’re getting close to the last couple of steps here in our journey! We addressed a few things related to resource management:

  • First, we made a great amount of progress on in-test improvements with better data mocking (e.g. FactoryBot build vs. create; see the sketch after this list). The create method builds an object and saves it to the database, while build instantiates the object without saving it. A number of very expensive objects were identified and moved to stubbed or mocked data in many of these cases. This work lands in the long-term bucket; we continue to iterate here.
  • Second, we identified a major bottleneck: a single database instance was serving all 200 test pod databases. We adjusted our build to use more database instances and plan to iterate on this further.
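
A quick sketch of the build-versus-create distinction in a spec; the :borrower factory name here is hypothetical.

# create persists a row; build and build_stubbed do not touch the database.
borrower = FactoryBot.create(:borrower)        # INSERTs a row, slow at scale
borrower = FactoryBot.build(:borrower)         # in-memory object, never saved
borrower = FactoryBot.build_stubbed(:borrower) # acts persisted without hitting the DB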

At the time of writing this article, we are still learning more about what our pain points are on the resource management side, trying to answer the following questions:

  • What is the most performant database instance and test pod configuration?
  • What is the cost-benefit analysis on this most performant configuration?
  • Are there any low-hanging fruit to address the resource constraints?
  • Can we focus on in-app code performance to alleviate some of these resource constraints?

Step 9: Leave Things Behind

Time Gain From Leaving Things Behind: 0 minutes

After all is said and done, we expect to leave a small number of changes behind, because we’ve reached a point of diminishing returns. We want to shift our focus to high impact, positive change. This list includes:

  • Mock out a service that is only used in a subset of tests. I bailed on this work, for example, because it did not show significant gains.
  • Shift to a team-owned process that supports further in-test optimizations. We want to continue shifting tests left and move to better data mocking throughout.
  • Perform additional tech debt cleanup that does not directly impact pipeline optimization.
  • Complete usability improvements for issues that do not appear to cause significant productivity limitations.

It’s normal in the agile world to leave things behind. Any one of these left-behind items might bubble up to a bigger problem in the future.

Conclusion

I hope you enjoyed this journey into a better monolith pipeline! As I’m publishing this blog post, our build run times are hovering around 17 minutes, and we expect them to drop a bit more with a few more adjustments. To be entirely honest, while the format of this story was a nice 9-step process, the reality is that much of this work was done in parallel and iteratively to land on Step 9.

The reality of our optimization work was parallel and iterative.

Our engineering team is now experiencing a nearly 3x faster feedback loop in the monolith. For those of us working remotely, we can push code, take the dog out, make coffee, eat a snack, send a Slack message, and then see what our build looks like!

If you are interested in joining the team at Upstart, we are hiring across the board, including remote engineers and engineering managers (tell them I sent you :)). We have lots of interesting problems to solve in and out of our monolith, and we are expanding our technology stack in many exciting ways. Find more at https://www.upstart.com/careers.
