How Airtable Manages Flaky Tests in a Large-Scale Monorepo

Thomas Wang
The Airtable Engineering Blog
Jan 12, 2023 · 8 min read

By: Airtable’s Developer Effectiveness and Quality Engineering Teams

Airtable serves critical use cases for large enterprise customers. Our customers depend on Airtable being reliable to power their businesses. To deliver a high-quality product, Airtable’s engineering team writes tens of thousands of tests, including unit tests, integration tests, and end-to-end tests. There are far too many tests for developers to run them all locally, so we leverage Continuous Integration (CI) infrastructure to run all of the tests whenever an engineer submits a pull request. We also use a merge queue to prevent regressions from landing in the main branch.

But in addition to maintaining a high-quality product, our developer infrastructure also needs to ensure high developer velocity, because our customers also expect us to continue delivering new features that will enable them to leverage our Connected Apps Platform for new use cases. One of the ways that we move fast is by putting almost all of our code into a single, large-scale monorepo. There are lots of benefits to our monorepo, but one of the drawbacks is that flaky tests slow down the entire development team. Flaky tests are those which can both pass and fail without any code changes. We’ve developed an approach to flaky tests that unblocks engineers while also ensuring we close the loop on finding and fixing regressions.

Some engineering organizations hold a firm commitment to ensuring that every single test is deterministic. While we agree that deterministic tests are of higher quality, we also take a pragmatic approach and accept that determinism is a hard property to guarantee, especially in a large-scale monorepo like the one Airtable is built in. We see value in end-to-end integration tests, including those that depend on timing, because they exercise more of the actual code that will run in production (as opposed to only mocks or fake implementations, for example).

Enable developer velocity while keeping tests reliable

Developers rely on CI to tell them whether their code changes are correct. There are two requirements for CI to enable developer velocity:

  1. CI results need to be correct: flaky test failures erode confidence in the test suite.
  2. CI needs to be fast: waiting for a long-running build breaks a developer’s flow.

On the infrastructure level, we’ve done several things to reduce test flakiness:

  1. Each test file runs in the same Docker container environment, and tests that run in parallel are isolated from each other.
  2. CI uses each test’s historical CPU utilization to decide how many CPU cores to allocate to its Docker container, so that a test does not start to become flaky because of CPU throttling. This also lets CI run builds faster by effectively utilizing the host machine to run tests in parallel (a simplified sketch of this kind of allocation follows this list).
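As a rough illustration of the second point, here is a simplified TypeScript sketch of how historical CPU utilization could drive per-container core budgets. The data shapes, the 1.2x headroom factor, and the planParallelBatch function are illustrative assumptions, not our actual scheduler.

```typescript
// A simplified sketch (not the real scheduler) of allocating CPU cores to
// per-test Docker containers based on historical utilization.

interface TestProfile {
  testFile: string;
  p95CpuCores: number; // historical 95th-percentile CPU usage, in cores
}

interface Allocation {
  testFile: string;
  cpuCores: number; // value that would be passed to `docker run --cpus=<n>`
}

/**
 * Greedily packs tests onto a host so that the total reserved cores never
 * exceed the host's capacity, leaving headroom so no container is throttled.
 */
function planParallelBatch(
  profiles: TestProfile[],
  hostCores: number,
): { batch: Allocation[]; remaining: TestProfile[] } {
  // Schedule the heaviest tests first so they are not starved at the end.
  const sorted = [...profiles].sort((a, b) => b.p95CpuCores - a.p95CpuCores);
  const batch: Allocation[] = [];
  const remaining: TestProfile[] = [];
  let reserved = 0;

  for (const profile of sorted) {
    // Reserve a little above the observed peak to absorb variance.
    const cpuCores = Math.min(hostCores, Math.ceil(profile.p95CpuCores * 1.2));
    if (reserved + cpuCores <= hostCores) {
      batch.push({ testFile: profile.testFile, cpuCores });
      reserved += cpuCores;
    } else {
      remaining.push(profile);
    }
  }
  return { batch, remaining };
}

// Example: plan one parallel batch on a 16-core host (hypothetical test files).
const { batch } = planParallelBatch(
  [
    { testFile: 'collab/undo.test.ts', p95CpuCores: 3.2 },
    { testFile: 'api/records.test.ts', p95CpuCores: 1.1 },
    { testFile: 'ui/gridView.test.ts', p95CpuCores: 6.5 },
  ],
  16,
);
console.log(batch);
```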

Given that we accept that tests can be flaky, we take some very basic steps to maintain a good trade-off between detecting regressions and enabling developer velocity:

  1. CI reruns a failed test for up to 3 attempts (subject to an overall timeout) in order to mitigate the impact of flaky tests on developer velocity (a sketch of this retry policy follows this list).
  2. We maintain a policy of quarantining any test that crosses a threshold for being too flaky. While we still run quarantined tests (so that we can monitor their reliability over time), they do not affect the result of the build. The quarantine policy keeps every developer’s builds unaffected by known flaky tests while the test owner works on a fix.
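Here is a minimal sketch of the retry policy in the first item, assuming a hypothetical runTest callback and timeout values; it shows the shape of the logic, not our real test runner.

```typescript
// A minimal sketch of the retry policy: rerun a failing test up to three
// attempts, but stop early if an overall deadline is exceeded.
// The runTest signature is a stand-in, not a real test runner API.

type TestResult = { passed: boolean; error?: string };

async function runWithRetries(
  runTest: () => Promise<TestResult>,
  maxAttempts = 3,
  overallTimeoutMs = 10 * 60 * 1000, // illustrative overall budget
): Promise<{ result: TestResult; attempts: TestResult[] }> {
  const deadline = Date.now() + overallTimeoutMs;
  const attempts: TestResult[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await runTest();
    attempts.push(result);

    // A pass on any attempt makes the test green, but every attempt is still
    // recorded so flakiness can be reported to analytics.
    if (result.passed) {
      return { result, attempts };
    }
    // Respect the build's overall time budget before retrying.
    if (Date.now() >= deadline) {
      break;
    }
  }
  return { result: { passed: false, error: 'exhausted retries' }, attempts };
}
```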

The Developer Effectiveness team has been responsible for keeping the main branch healthy by quarantining flaky tests. We’ve built tools and processes to help with managing flaky tests. In the next sections, we’ll discuss how we evolved our system and processes to manage flaky tests as we grow.

Run periodic builds to detect flaky tests

To quarantine flaky tests, we first need a reliable way to detect which tests fail intermittently, instead of relying on developers to report flakiness when their pull requests fail.

We set up a continuous build pipeline off the main branch to detect flaky tests. This pipeline runs the full test suite and reports any test retries and failures to an analytics platform. We build daily reports that highlight the flakiest tests over a recent period of time. We then quarantine those tests by submitting code changes and merging them into the main branch (later sections discuss how we eventually automated this process). Each test file in our monorepo has an owning team, and we have tooling to tag those owners to triage and fix the reliability issues.
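For illustration, here is a sketch of how such a daily report could be computed from the retry and failure events the periodic builds emit. The event and row shapes (TestRunEvent, FlakinessRow) are hypothetical, not the schema of our analytics platform.

```typescript
// A sketch of deriving a daily flakiness report from retry/failure events.

interface TestRunEvent {
  testId: string;
  ownerTeam: string;
  passedOnFirstAttempt: boolean; // false if the test needed a retry or failed
}

interface FlakinessRow {
  testId: string;
  ownerTeam: string;
  runs: number;
  unreliableRuns: number; // runs that did not pass on the first attempt
  unreliableRate: number;
}

function buildFlakinessReport(events: TestRunEvent[]): FlakinessRow[] {
  const byTest = new Map<string, FlakinessRow>();
  for (const event of events) {
    const row = byTest.get(event.testId) ?? {
      testId: event.testId,
      ownerTeam: event.ownerTeam,
      runs: 0,
      unreliableRuns: 0,
      unreliableRate: 0,
    };
    row.runs += 1;
    if (!event.passedOnFirstAttempt) {
      row.unreliableRuns += 1;
    }
    byTest.set(event.testId, row);
  }
  const rows = [...byTest.values()];
  for (const row of rows) {
    row.unreliableRate = row.unreliableRuns / row.runs;
  }
  // Surface the flakiest tests first so the owning teams can be tagged to triage.
  return rows.sort((a, b) => b.unreliableRate - a.unreliableRate);
}
```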

Confirm tests are flaky with stress tests

While tracking test reliability over time can help us detect flaky tests, we have noticed that test reliability can change quickly. The most common reason is code changes (for example, new race conditions, or fixes for them). Another is a temporary issue with a test dependency. When deciding whether or not to quarantine a test, it is not practical to wait for additional data from more periodic full builds.

To get a faster signal, we built the ability to stress test a single test case: we run the same test in our CI infrastructure hundreds of times and report the overall pass/fail rate. We use stress tests in our quarantine policy: if a test case fails too many attempts in a stress test (our threshold right now is 5%), we quarantine it. Developers can also trigger stress tests on their own feature branches to see whether their code changes make a test more reliable.
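A sketch of that check follows, with an assumed run count and a placeholder runOnce function; only the 5% threshold comes from the policy above.

```typescript
// A sketch of the stress-test decision: run one test case many times and
// decide whether it crosses the quarantine threshold. The runOnce callback is
// a placeholder for scheduling the runs on CI infrastructure.

const QUARANTINE_FAILURE_THRESHOLD = 0.05; // 5%, per the policy above
const STRESS_TEST_RUNS = 300; // "hundreds of times"; the exact count is illustrative

async function shouldQuarantine(
  runOnce: () => Promise<boolean>, // resolves true if the test passed
  runs = STRESS_TEST_RUNS,
): Promise<{ failureRate: number; quarantine: boolean }> {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    const passed = await runOnce();
    if (!passed) {
      failures += 1;
    }
  }
  const failureRate = failures / runs;
  return { failureRate, quarantine: failureRate > QUARANTINE_FAILURE_THRESHOLD };
}
```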

Automate the process

The quarantine tooling described above worked for a while, but it required a developer to open a pull request to add or remove a test from the quarantine list. This manual step led to two undesirable outcomes:

  1. Our overall build reliability suffered due to the delay between when our analytics detected that a test was flaky and when a person took action to quarantine it.
  2. Tests languished in the quarantine list even after they became reliable again.

Once we had confidence in our quarantine policies, we automated the process of adding and removing tests from the quarantine list:

How a test is transitioned between different states
  1. Whenever a test fails on the main branch build, the automation system triggers a stress test for that test case.
  2. If there are too many failures in the stress test, the automation system adds the test case to the quarantine list.
  3. Every CI build loads the latest quarantine list on start and suppresses failures for any test case on the list. Each quarantine entry records the Git commit at which the test started failing and the commit at which it was fixed, so CI does not suppress failures incorrectly (a sketch of this check follows below). CI still reports the pre-quarantine test results of the build to the automation system.
  4. The automation system keeps monitoring the reliability of quarantined tests in the main branch build and removes a test case from the quarantine list once there are sufficiently many builds without that test failing. We used acceptance-testing statistical analysis to calculate the number of passing builds required to un-quarantine a test.
Flaky test failures within the quarantine range do not impact feature branches
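To make steps 3 and 4 concrete, here is a hedged sketch of the commit-range suppression check and one way to derive the required number of passing builds from an acceptance-sampling style bound. The QuarantineEntry shape, the commitContains helper, and the specific formula are our illustrative assumptions, not the system's actual implementation.

```typescript
// (1) Suppress a quarantined test only for builds whose commit falls inside
//     the entry's quarantine range.
// (2) Compute how many consecutive passing builds are needed before
//     un-quarantining, using a simple acceptance-sampling style bound.

interface QuarantineEntry {
  testId: string;
  // Commit (on the main branch) at which the test started failing.
  brokenSinceCommit: string;
  // Commit at which a fix landed; undefined while the test is still broken.
  fixedAtCommit?: string;
}

/**
 * CI would call something like this when a test fails: the failure is
 * suppressed only if the build's base commit lies inside the quarantine range,
 * so branches that already contain the fix are not shielded from regressions.
 * `commitContains(ancestor, descendant)` is a hypothetical helper, e.g. backed
 * by `git merge-base --is-ancestor`.
 */
function shouldSuppressFailure(
  entry: QuarantineEntry | undefined,
  commitContains: (ancestor: string, descendant: string) => boolean,
  buildCommit: string,
): boolean {
  if (!entry) return false;
  const afterBreak = commitContains(entry.brokenSinceCommit, buildCommit);
  const beforeFix =
    entry.fixedAtCommit === undefined || !commitContains(entry.fixedAtCommit, buildCommit);
  return afterBreak && beforeFix;
}

/**
 * If a test truly failed at least `worstAcceptableFailureRate` of the time,
 * the chance of seeing n consecutive passes is (1 - rate)^n. Requiring that
 * probability to drop below `alpha` gives the minimum number of passing
 * builds before un-quarantine: n >= ln(alpha) / ln(1 - rate).
 */
function passingBuildsRequired(
  worstAcceptableFailureRate = 0.05,
  alpha = 0.05,
): number {
  return Math.ceil(Math.log(alpha) / Math.log(1 - worstAcceptableFailureRate));
}

console.log(passingBuildsRequired()); // 59 consecutive passing builds under these assumptions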

Hold test owners accountable for fixing flaky tests

It is not enough to just quarantine a flaky test: the test needs to be fixed by its owner so we can keep high test coverage. We’ve observed that once developers are notified about a flaky test failure, everyone is very responsive in root-causing and fixing the issue.

A key strength of Airtable is the ability to share data across the organization in a very flexible manner. We sync the quarantine list to an Airtable base and build Interfaces so that teams can easily visualize the list and how it has changed over time.
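As a sketch of what such a sync could look like, the snippet below writes quarantine entries through the Airtable Web API. The base, table, and field names are hypothetical; only the general request shape (POST /v0/{baseId}/{table} with a `records` array, at most 10 records per request) reflects the public API.

```typescript
// A minimal sketch of syncing quarantine entries into an Airtable base.

interface QuarantinedTestRecord {
  testId: string;
  ownerTeam: string;
  quarantinedAt: string; // ISO timestamp
}

async function syncToAirtable(records: QuarantinedTestRecord[]): Promise<void> {
  const baseId = process.env.AIRTABLE_BASE_ID!;
  const apiKey = process.env.AIRTABLE_API_KEY!;
  // "Quarantined Tests" is a hypothetical table name.
  const url = `https://api.airtable.com/v0/${baseId}/Quarantined%20Tests`;

  // The API accepts at most 10 records per request, so write in chunks.
  for (let i = 0; i < records.length; i += 10) {
    const chunk = records.slice(i, i + 10);
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        records: chunk.map((r) => ({
          fields: {
            'Test ID': r.testId,          // hypothetical field names
            'Owner Team': r.ownerTeam,
            'Quarantined At': r.quarantinedAt,
          },
        })),
      }),
    });
    if (!response.ok) {
      throw new Error(`Airtable sync failed: ${response.status}`);
    }
  }
}
```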

An Airtable Interface we built to visualize the list of quarantined tests

We use Automations to periodically publish a digest of quarantined tests to the Slack channels of each team that owns a test on the quarantine list. These notifications bring teams to the Interfaces, which contain all the information they need to debug why their test is flaky and how to stress test their fixes with the aforementioned stress-test tooling.
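For illustration only (our digests are built with Airtable Automations, not this script), here is a sketch of a per-team digest posted to Slack incoming webhooks; the webhook map and message format are assumptions.

```typescript
// An illustrative sketch of building and posting per-team quarantine digests.

interface QuarantinedTest {
  testId: string;
  ownerTeam: string;
  quarantinedAt: string;
}

async function postTeamDigests(
  tests: QuarantinedTest[],
  webhookByTeam: Record<string, string>, // hypothetical team -> webhook URL map
): Promise<void> {
  // Group quarantined tests by their owning team.
  const byTeam = new Map<string, QuarantinedTest[]>();
  for (const test of tests) {
    const list = byTeam.get(test.ownerTeam) ?? [];
    list.push(test);
    byTeam.set(test.ownerTeam, list);
  }

  for (const [team, teamTests] of byTeam) {
    const webhookUrl = webhookByTeam[team];
    if (!webhookUrl) continue;

    const lines = teamTests.map(
      (t) => `• ${t.testId} (quarantined ${t.quarantinedAt})`,
    );
    // Slack incoming webhooks accept a simple JSON payload with a `text` field.
    await fetch(webhookUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: `You have ${teamTests.length} quarantined test(s) awaiting fixes:\n${lines.join('\n')}`,
      }),
    });
  }
}
```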

This is one of many ways that we internally use Airtable to drive our own workflows to effectively connect many teams working towards the same goal.

Our results

Since enabling this automation, we have seen around 50% of all passing builds include at least one quarantined test failure (failures that would otherwise have been retried and could have failed the build). We’ve eliminated the manual intervention needed to quarantine flaky tests, which means newly introduced flaky tests are now quarantined within an hour instead of more than a day. At the same time, the rate of build failures developers perceive due to flaky tests has dropped from roughly 5% (with manual on-call responses) to less than 2%. The new quarantine approach also preserves the actual failure reasons instead of skipping test execution, so developers can investigate much more easily. We’ve also seen good engagement from engineers acting on their flaky tests, thanks to our Airtable Interfaces and the Automations that publish digests to Slack.

Future work

The system has been working well for us since its launch. That said, it’s still not perfect, and we plan to improve the user experience and efficiency in the near future:

  1. Faster stress tests, so the system can determine a test’s status more quickly.
  2. Real-time reporting of test attempt failures, to improve time-to-quarantine.
  3. Better culprit-finding and root cause analysis to help developers fix the issues.
  4. IDE integration to highlight flaky tests.
  5. Insights about frequently quarantined tests, surfaced in code review tooling.
  6. The ability to differentiate new errors in the same test, e.g., by using error stack traces.

Conclusions

Flaky tests are unavoidable in a large monorepo codebase. We’ve found that quarantining flaky tests is an effective approach to enabling developer velocity while keeping the codebase healthy. Managing quarantined tests requires insights from the data and tooling to automate the entire process, and sharing that data with owner teams helps connect everyone to the shared goal.
