Hunting Intermittent Tests

Dave Dash
Dec 13, 2015 · 5 min read

There’s a sad time in testing land when you have a test suite that you can’t rely on. Instead of running a suite and merging if it passes, you have to glance over the failed tests and say “Oh that one… that test is flaky.”

Then it becomes a class of tests: “oh, we can ignore the Elasticsearch tests.” Eventually we stop trusting the build and just merge, because it’s probably the Elasticsearch test that failed. At least I think that’s what my email said; the builds all start to blur together.

Enter the Huntsman


So I decided to hunt down intermittent tests. Qualitative judgments are hard to deal with:

  • The test suite is unreliable.
  • Elasticsearch is flaky.
  • My cat is mean to me.

While all of these statements might be true, it can be hard to fix them and actually feel like your fix did anything.

Measure Up

Here’s how I did that. We do a few things:

  1. Check if we’re busy building other things.
  2. Trigger the build.
  3. While the build runs, log the status of the last completed build to Datadog (see the sketch below).
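
Step three is the interesting one. A minimal sketch of that logging, assuming a Jenkins-style CI server and the dogapi gem (the URL, job name, and metric name are all made up):

require 'net/http'
require 'json'
require 'dogapi'

JENKINS_URL = 'https://ci.example.com'  # hypothetical CI server
JOB = 'main-suite'                      # hypothetical job name

# Ask Jenkins for its record of the last completed build.
uri = URI("#{JENKINS_URL}/job/#{JOB}/lastCompletedBuild/api/json")
build = JSON.parse(Net::HTTP.get(uri))

# Record 1 for a green build, 0 for anything else. Graphed over time,
# this shows how often the suite flakes.
status = build['result'] == 'SUCCESS' ? 1 : 0

dog = Dogapi::Client.new(ENV.fetch('DATADOG_API_KEY'))
dog.emit_point("ci.#{JOB}.green", status)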

The result of step three is a chart in Datadog: orange builds are bad, blue builds are good.

Sometime after noon our team fixed some tests, and failures became less frequent.

Test just what you need

We ended up creating an index, but never adding anything to it. That’s what ended up throwing us off. We used some well-placed test doubles to eliminate the need for Elasticsearch:

# A stand-in for an empty Elasticsearch result set.
dbl = double(:total => 0, :empty? => true, :map => nil)
allow(Search).to receive(:perform_search).and_return(dbl)

Our test never looked at the results; it just verified that certain rendering attributes were set.
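
For illustration, a spec in that style might look something like this (the controller, route, and template names are hypothetical):

# Hypothetical spec: with Search stubbed out, we assert only on what
# gets rendered, never on real Elasticsearch results.
describe SearchController do
  it 'renders results without touching Elasticsearch' do
    dbl = double(:total => 0, :empty? => true, :map => nil)
    allow(Search).to receive(:perform_search).and_return(dbl)

    get :index, :q => 'anything'

    expect(response).to render_template(:index)
  end
end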

Test Ordering

I ran into similar issues at Pinterest when I massively parallelized the test suite. At Pinterest we had no database backing our tests; the issue was with mocking and clean-up. In this case we were dealing with artifacts left over in the database.

rspec left a clue:

Randomized with seed 7918

I could run

rspec --seed 7918

to run the tests in the same order.

Combined with the --bisect option I was able to narrow the set of tests considerably, but bisect had a limitation: much of our intermittency was caused by data remaining in the database between tests, and that data was never cleaned up, even after full runs. So bisect served only as a guide.
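
For reference, the combined invocation (assuming rspec-core 3.3 or later, where --bisect shipped) is:

rspec --seed 7918 --bisect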

So fresh and so clean

I upgraded the database_cleaner gem and followed its suggested approach of using config.around to run every test inside the DatabaseCleaner.cleaning context. This looked clean, but many things broke because of the upgrade. Once we cleared out those issues we had a build that’s been fairly clean for the last four hours.
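
For reference, the pattern from the database_cleaner README looks roughly like this (the strategy choices here are illustrative):

RSpec.configure do |config|
  config.before(:suite) do
    # Wipe anything left over from earlier runs, then use transactions.
    DatabaseCleaner.clean_with(:truncation)
    DatabaseCleaner.strategy = :transaction
  end

  config.around(:each) do |example|
    # Run every example inside a start/clean pair.
    DatabaseCleaner.cleaning do
      example.run
    end
  end
end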

Competing Philosophies

So the world we live in means tests need to play nice. Right now the road we’ve taken is that each test assumes a clean state, and relies on the previous test to clean up after itself. This is fine, but it can make for a brittle test suite.

A competing philosophy might require each test to set up the clean state it needs before it runs. For example, a test that:

  • Adds two users, Trish and Hope, to the system
  • Verifies that a list of all users contains only Trish and Hope

can only work if we assume a clean state. In a world of badly behaved tests that leave artifacts around, we might need to add a first step:

  • Removes all users.

The tradeoff is that we pay a cost to remove all users even if there aren’t any users to remove.
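
A minimal sketch of that extra step, assuming a hypothetical ActiveRecord-style User model:

describe 'user listing' do
  it 'lists exactly the users we created' do
    User.delete_all  # pay the cleanup cost up front, needed or not
    trish = User.create!(:name => 'Trish')
    hope  = User.create!(:name => 'Hope')

    expect(User.all).to contain_exactly(trish, hope)
  end
end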

We’ll stick with the former philosophy for now until it becomes unbearable.


There really isn’t any magic with intermittent tests. All of our failures had logical reasons for failing. If you are plagued with this issue, the first step is to collect metrics around your tests, just as you might collect business or operational metrics.

After you collect data you’ll need to look at each test individually. The issues I’ve run into were:

  • Testing systems that weren’t warm (Elasticsearch)
  • The database not being cleaned
  • Test ordering being confusing
  • Mocks not clearing themselves

Good luck.


Written by

Dave Dash

DadOps 24/7 and DevOps Consultant. Formerly @Pinterest and @Mozilla