Fixing Flaky Tests Once And For All

Jonathan Block
5 min read · Jan 31, 2019


When you add up the developer hours squandered dealing with unhealthy unit tests, the yearly financial losses at big tech companies from test flapping can easily soar into the millions of dollars. This post takes you through how that happens and how to keep the problem under control.

Modern software delivery practically demands that your code be automatically tested (CI) and automatically deployed (CD) to some degree, often with systems like CircleCI, Jenkins, TravisCI, TeamCity, Spinnaker, etc. Large codebases often have thousands of tests, and to avoid breaking something that previously worked, we run unit tests before merging to the master line and again before deploying to production.

Organizations with large codebases eventually have a test flap problem. “Flaps” are unit tests which waste everyone’s time because they fail for misleading reasons and cause confusion about the root cause of a problem.

Flaky (aka “flappy”) tests are usually the result of some defect in how a unit test is implemented. As an example, consider a “business hours” unit test that asserts that the real clock’s current hour is within normal business hours. Do you see the flaw? This test will not pass when run late at night.

An unhealthy business hours test like this one might slip through code review because it passes during business hours, when most code is written and reviewed. At night, when the deploy pipeline is running, that same test fails because it is running outside of normal business hours, and the deploy train gets blocked. The defect in this example is that the author of the test forgot to pin the system clock to a fixed date and time.
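
In Python, for example, a library like freezegun can pin the clock so the assertion no longer depends on what time of day CI happens to run. The is_within_business_hours helper below is hypothetical, a minimal sketch of the pattern rather than anyone's actual test:

```python
import datetime

from freezegun import freeze_time


# Hypothetical helper under test: True when the current hour falls inside
# a 9:00-17:00 business-hours window.
def is_within_business_hours():
    return 9 <= datetime.datetime.now().hour < 17


# Freezing the clock makes the assertion deterministic no matter what time
# of day the CI pipeline runs.
@freeze_time("2019-01-31 14:00:00")
def test_is_within_business_hours():
    assert is_within_business_hours()
```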

Consider another example: a test that sends HTTP requests to the Stripe REST API sandbox. Stripe’s sandbox is rate limited, so if your tests fire too many requests too quickly, they fail with rate limit errors. The defect in this scenario is that the tests send HTTP requests to Stripe’s sandbox at all. Instead, you want a test mock that avoids the HTTP calls entirely and returns fixed, known Stripe responses.
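
Here is a minimal sketch of that kind of mock in Python using unittest.mock; billing.charge_customer is a hypothetical function standing in for whatever code in your app actually calls Stripe:

```python
from unittest import mock

# Hypothetical function under test that calls stripe.Charge.create()
# internally; substitute your real billing code.
from billing import charge_customer


# Patching stripe.Charge.create means no HTTP request ever leaves the test,
# so Stripe's sandbox rate limits can never cause a failure.
@mock.patch("stripe.Charge.create")
def test_charge_customer(mock_create):
    mock_create.return_value = {"id": "ch_123", "status": "succeeded"}

    result = charge_customer(amount_cents=500, customer_id="cus_abc")

    assert result["status"] == "succeeded"
    mock_create.assert_called_once()
```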

You’re busy and you don’t have time to triage other people’s flaky code

Have you ever changed a README.md and then had a unit test fail? You ask yourself, “How could I have possibly broken that?”

Next, you investigate the test failure, and after wasting a certain amount of time, you conclude the test must just be flaky. Then you remember how busy you are and either retry the test with some kind of retry mechanism or simply rerun your CI in the hope that it passes next time.
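
That retry stopgap often looks something like the pytest-rerunfailures plugin, which reruns a failing test a few times before reporting it as failed. To be clear, this masks the flap rather than fixing it; the test below is a contrived illustration:

```python
import random

import pytest


# Requires the pytest-rerunfailures plugin. The test is retried up to three
# times before being reported as a failure, which hides (not fixes) the flap.
@pytest.mark.flaky(reruns=3)
def test_sometimes_passes():
    assert random.random() > 0.2
```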

Sometimes one test goes bad and the whole team sees it fail at the same time: dozens of engineers end up triaging the same problem simultaneously without even knowing it. This is where the losses in productivity and company money pile up.

What’s the best way to fix flappy tests?

I have thought about this problem for several years and have devised a general guideline for implementing a flaky test triage process. The first option relies on you handling every step manually; the second is a productized solution called Flaptastic.

Option 1: Manually Triaging Flappy Unit Tests

  1. Identify. You need some ability to quickly know that a certain test flaps. Traditionally, human engineers are the first to conclude that they have identified a test flap. Unfortunately, this is painful for the developer and expensive for the company. Some teams then manually track flappy tests in Jira or Google Sheets, which is a laborious process.
  2. Disable. As soon as you have detected a flappy test, it should be disabled. This stops further productivity losses for the rest of the team. It usually involves merging a pull request where the test is marked as skipped (see the sketch after this list).
  3. Fix. With the flap details in hand, you can now assign the triage process to your team/service’s on-call engineer. Whereas triaging site outages is an on-call’s top priority, the triage of flappy tests comes second.
  4. Re-enable. With a test flap fixed, the test can be safely reintroduced into the codebase.
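
In a pytest codebase, the “disable” step above is typically just a skip marker merged to master. The ticket reference below is hypothetical:

```python
import pytest


# Skipping the flaky test stops it from blocking everyone while the on-call
# engineer works on a real fix. FLAP-123 is a hypothetical tracking ticket.
@pytest.mark.skip(reason="Flaky: depends on wall-clock time, see FLAP-123")
def test_orders_close_during_business_hours():
    ...
```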

Option 2: Flaptastic

Flaptastic can streamline this process. The only requirements are that your code is on GitHub and that you use a CI system like CircleCI, Jenkins, etc.

With Flaptastic, the revised flow looks like this:

  1. Identify. Flaptastic posts a flap status alert directly on your GitHub pull request page, clearly and immediately identifying when a test flapped. This helps the software engineer avoid wasting time triaging code they did not break.
  2. Disable. Flaptastic offers a checkbox for each test that instantly disables it across all feature branches and pull requests. In other words, you do not need to create a new pull request to mark a test as skipped; the plugin applies your intention to skip the test to all CI/CD jobs as soon as you click the checkbox.
  3. Fix. Flaptastic guides your on-call engineer to fix the flaps causing the biggest disruptions first. Dashboards and metrics are available, along with Slack flap alerts and REST APIs for custom integrations. Ultimately, your on-call engineer will land on a Flaptastic flap detail page with all of the information they need to resolve the flap.
  4. Re-enable. Once the fixed test is merged back into the codebase, the checkbox can be unticked and the test will immediately be runnable again for everybody.

I love the art of making flaky tests healthy and would be happy to have a chat with you about your unit test health challenges.

Feel free to give Flaptastic a spin at https://www.flaptastic.com/

Jonathan Block

CI/CD Automation Engineer. Formerly Lyft API Platform. RideShare startup advisor. Creator of 1st ad server used @ Facebook in ’04. Investor. Dog owner.