It’s always greener on the other pipeline

Jason Brown
Checkr Engineering
Sep 22, 2021 · 6 min read

How we improved CI/CD pipeline reliability with analytics, new cataloging and fixing processes, optimized infrastructure, and a shared knowledge of flaky test root causes.

Checkr’s co-founder Jonathan Perichon wrote in a 2017 blog post:

Long builds slow down deployments and cause undesired context switching. As someone who strives for developer productivity, this is a pet peeve of mine. And I’m sure I’m not alone…

Since then the R&D organization has grown 6x to 200 people, and annual background checks now exceed 30 million, so it's even more important to have fast and reliable Continuous Integration/Continuous Delivery (CI/CD) pipelines. While the majority of development now occurs in microservices (Expungements, Assess, HealthScreenings, RegionCompliance, etc.), over a third still occurs in the Checkr Monolith. Indeed, in the last year 3,200 PRs were opened (an average of 13 per working day), and the service was deployed to production environments 865 times (an average of 4 per working day).

Unfortunately, as the number of unit tests has grown to 27,000, so too has the number of “flaky” tests. A flaky test is one that produces a non-deterministic result, either because of time, ordering, shared state, or another issue. Over some months in 2020, flaky tests caused a majority of Monolith pipelines to require at least one manually retried CI job. This slowed developer productivity and may have negatively impacted response times to production incidents.

With shared ownership of so many tests, how could Checkrs come together to improve the reliability of the CI/CD pipeline?

Discussing pipeline reliability (among other topics) during a roundtable with our CEO

Workgroups

Checkr has the concept of workgroups. From the initial RFC:

A workgroup is a cross-team initiative where members from different teams in R&D work together on a particular focus area. The workgroup could be organized around different purposes that drive positive impact across the organization. The workgroup’s goal is to do exploratory work that is not tied to a particular team’s timeline on product or feature deliverables.

Previous workgroups include API Owners, Quality Engineering, Architecture, and Taskr (our task-orchestration platform based on Temporal). A new CI-tests workgroup was organized with an executive sponsor (Paul Baines) and a group of a dozen or so core members and observers. The workgroup set out to build improved CI/CD Analytics, develop a process for cataloging and fixing tests, optimize infrastructure, and communicate a shared knowledge of how we’ve fixed previous tests.

CI/CD Analytics

One of the workgroup’s first initiatives was building a CI/CD performance dashboard. Developers would periodically ask in the Slack #eng channel, “Feels like tests are currently preventing new deployments, are others experiencing this too?” We had built an Analytics Dashboard for customers to “pinpoint a potential issue before it becomes one,” and it was clear we needed a similar one internally for CI/CD performance.

GitLab API

Checkr migrated from GitHub to GitLab in 2019 (some more discussion of that decision here). To calculate historical CI/CD pipeline reliability, we pulled data from the GitLab API and grouped it by job, job status, and date. This first attempt at CI/CD analytics verified that there was room to improve pipeline reliability. However, since the data was pulled in batches, the tool was not helpful for monitoring realtime CI/CD performance and was eventually superseded by an automated solution leveraging Snowflake.

Room for improvement identified 🎯
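For a sense of what that batch pull looked like, here is a minimal sketch using the GitLab REST API from Ruby’s standard library. The project ID, token environment variable, and page size are illustrative assumptions, not our actual configuration, and it only walks the most recent page of pipelines.

```ruby
# Sketch: pull recent pipeline jobs from the GitLab API and tally counts
# by (job name, status, date). Project ID and token are placeholders.
require "net/http"
require "json"
require "uri"

GITLAB_URL = "https://gitlab.example.com/api/v4"
PROJECT_ID = 42                        # hypothetical project id
TOKEN      = ENV.fetch("GITLAB_TOKEN") # hypothetical env var

def gitlab_get(path, params = {})
  uri = URI("#{GITLAB_URL}#{path}")
  uri.query = URI.encode_www_form(params)
  req = Net::HTTP::Get.new(uri, "PRIVATE-TOKEN" => TOKEN)
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
  JSON.parse(res.body)
end

counts = Hash.new(0)

# Walk the most recent master pipelines and tally their jobs.
gitlab_get("/projects/#{PROJECT_ID}/pipelines", ref: "master", per_page: 100).each do |pipeline|
  gitlab_get("/projects/#{PROJECT_ID}/pipelines/#{pipeline["id"]}/jobs", per_page: 100).each do |job|
    date = job["created_at"].to_s[0, 10] # YYYY-MM-DD
    counts[[job["name"], job["status"], date]] += 1
  end
end

counts.sort.each do |(name, status, date), n|
  puts format("%-10s %-30s %-8s %d", date, name, status, n)
end
```

Even this simple grouping by job, status, and date was enough to show which test jobs were failing most often, just not in realtime.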

Snowflake

Checkr uses Snowflake as a central hub for the analytics, insights, and dashboards that power data-driven decisions. Snowflake is used across Checkr teams, including Customer Success, Candidate Experience, Operations, and R&D, and it has centralized access to many data sources. For this initiative, GitLab merge, pipeline, and job data was added to our Snowflake data warehouse, which has become the place developers go to see near-realtime CI/CD analytics.

Once we could measure it (in near-realtime), we could improve it.

Datadog

In July 2021, Datadog released Datadog CI Visibility, and we signed up for the beta. It calculates a “failure rate” and has been helpful for finding and fixing flaky tests. We’re still experimenting with this promising tool.

Cataloging, quarantining and fixing tests

With CI/CD analytics in place, we were finally able to see all test jobs and filter for failed ones on the master branch. The CI-tests workgroup then unveiled a proposed process for dealing with flaky tests.

The lifecycle of a flaky test now looks like this:

  1. A flaky test causes a failed CI test job and shows up in our analytics
  2. It is added to the JIRA ci-reliability support ticket dashboard
  3. Teams prioritize these support tickets in current and future sprints

Infrastructure Optimizations

In addition to cataloging and fixing tests, a related effort has focused on optimizing CI/CD infrastructure. We use Knapsack to split the 27,000 unit tests into 15 jobs that run in parallel in CI in 8–10 minutes. The jobs would sometimes fail from resource starvation when too many of them happened to land on the same Kubernetes host. By bumping up each job’s requested CPU, the scheduler now places the right number of jobs on a host, resulting in fewer failures. We also noticed spikes in flaky tests immediately after Knapsack auto-rebalancing, so we’ve moved to periodic, as-needed bucket rebalancing instead.
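For context, the Knapsack split itself is mostly wiring. The sketch below follows the gem’s standard minitest setup; the node count and CI command shown in the comments are illustrative, not copied from our pipeline.

```ruby
# test/test_helper.rb (sketch): bind Knapsack's minitest adapter so each
# CI node runs only its share of the test suite, based on a timing report.
require "knapsack"
knapsack_adapter = Knapsack::Adapters::MinitestAdapter.bind
knapsack_adapter.set_test_helper_path(__FILE__)

# Rakefile (sketch): expose the knapsack:minitest rake task.
#   require "knapsack"
#   Knapsack.load_tasks if defined?(Knapsack)
#
# Each of the parallel CI jobs then runs roughly:
#   CI_NODE_TOTAL=15 CI_NODE_INDEX=$INDEX bundle exec rake knapsack:minitest
```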

Infrastructure optimizations have improved CI/CD duration as well. Each unit test job used to create a database, load the db schema, and seed it with records, which took over 2 minutes. A cache was added for PRs that do not touch the db directory, cutting that step to under 3 seconds. The pipeline now takes 23 minutes to reach a mergeable PR, and once the code is merged, it takes another 30 minutes, across a series of pipeline stages, to deploy to Production environments.
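One hypothetical way to picture that cache (our actual pipeline implementation differs, and the paths, database name, and commands below are invented for illustration): key a dump of the prepared test database on a digest of the db/ directory, and only rebuild when that digest changes.

```ruby
# Hypothetical sketch of caching the prepared test database, keyed on db/.
require "digest"
require "fileutils"

db_digest = Digest::SHA256.hexdigest(
  Dir.glob("db/**/*").sort.map { |f| File.file?(f) ? File.read(f) : f }.join
)
dump_path = "tmp/ci-cache/test_db-#{db_digest}.sql"

if File.exist?(dump_path)
  # Cache hit: restore the already-migrated, already-seeded database (seconds).
  system("mysql checkr_test < #{dump_path}") || abort("restore failed")
else
  # Cache miss: build the database the slow way, then save a dump for next time.
  system("RAILS_ENV=test bundle exec rails db:create db:schema:load db:seed") || abort("setup failed")
  FileUtils.mkdir_p(File.dirname(dump_path))
  system("mysqldump checkr_test > #{dump_path}") || abort("dump failed")
end
```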

Lastly, Infrastructure is experimenting with and rolling out new tools including Okteto and Pipelines for merged results.

Test Failure Causes

The analytics, process, and infrastructure changes have allowed us to fix flaky tests quickly. What makes them hard to fix is that the CI test failures usually aren’t reproducible locally. To build a shared knowledge base, we’ve added GitLab and JIRA labels so developers can easily filter to 50+ prior examples.

There still remain some types of flaky tests for which we have not found a root cause. For those where we have, multi-threaded Timecop usage, MySQL ordering, and shared FactoryBot objects have been some of the most common causes.

Timecop multi-threaded

We use the Ruby Timecop gem for testing time-based code logic. Even within a given Knapsack bucket, tests are run in four threads, which sometimes causes Time.now calls to return unexpected values. The typical fix has been to add a requires_time_freezing method call at the top of the time-sensitive test.
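As an illustration only: Timecop.freeze and Timecop.return are the gem’s real API, but requires_time_freezing is a Checkr-internal helper, so the implementation below is a guess at its shape rather than the actual code, and the test class and times are made up.

```ruby
# Sketch of a minitest helper in the spirit of requires_time_freezing.
require "minitest/autorun"
require "timecop"

module TimeFreezingHelper
  # Class-level macro: pin Time.now for the duration of each test in the
  # class so time-sensitive assertions see a fixed clock.
  def requires_time_freezing(at: Time.utc(2021, 9, 22, 12, 0, 0))
    define_method(:setup)    { Timecop.freeze(at) }
    define_method(:teardown) { Timecop.return }
  end
end

class InvitationExpiryTest < Minitest::Test
  extend TimeFreezingHelper
  requires_time_freezing

  def test_expiry_is_seven_days_from_now
    expires_at = Time.now + 7 * 24 * 60 * 60 # stand-in for real domain logic
    assert_equal Time.utc(2021, 9, 29, 12, 0, 0), expires_at
  end
end
```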

MySQL ordering

Unlike unit tests in some languages and frameworks, unit tests in Ruby minitest make real database calls. This is good because it exercises the application code’s integration with the persistence layer, but not mocking that layer introduces CI complexity. Some tests assumed an order for the resources returned from the database, yet MySQL does not guarantee the order of returned rows. The fix for this type of issue has been to add .order(id: :asc) so the query always specifies an order, as in the sketch below.
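The model and statuses here are hypothetical, but the pattern is the real one: an assertion that implicitly depends on row order, fixed with an explicit ORDER BY.

```ruby
# Flaky: MySQL may return these rows in any order, so the assertion passes
# or fails depending on storage details.
reports = Report.where(candidate_id: candidate.id).to_a
assert_equal %w[pending complete], reports.map(&:status)

# Deterministic: specify the order explicitly in the query.
reports = Report.where(candidate_id: candidate.id).order(id: :asc).to_a
assert_equal %w[pending complete], reports.map(&:status)
```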

Shared FactoryBot objects

Checkr uses the Ruby factory_bot gem, rather than fixtures, for creating test data; with this many tests, fixtures would be unwieldy. However, some factories reused objects, so changes made in one test would cause assertions in another to fail. The fix has been to reduce instances of shared state between tests.
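The factories below are invented for illustration; the antipattern they show (a factory that grabs an existing record instead of building a fresh one) is the kind of shared state we mean.

```ruby
# Illustrative factories only; the Candidate and Account models are hypothetical.
FactoryBot.define do
  # Shared-state antipattern: every candidate built this way points at the
  # same Account row, so one test mutating that account can break another.
  factory :candidate_with_shared_account, class: "Candidate" do
    first_name { "Test" }
    account    { Account.first || FactoryBot.create(:account) }
  end

  # Fix: give each candidate its own freshly created account.
  factory :candidate, class: "Candidate" do
    first_name { "Test" }
    association :account
  end

  factory :account do
    name { "Acme, Inc." }
  end
end
```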

Journey continues

Despite all this, flaky tests still creep into Checkr’s CI/CD pipeline, slowing developer productivity. It’s particularly frustrating when it happens near the end of a sprint! We have a long way to go and will probably never be “done.” The good news is that the data indicates manually retried CI jobs are down 90% year-over-year 🚀. We attribute this progress to building CI/CD analytics, developing a process for cataloging and fixing tests, optimizing infrastructure, and communicating a shared knowledge of how we’ve fixed previous tests. Greener pipelines let us spend less time getting our code deployed to Production and more time building features that build a fairer future.
