Rob Esris
Thirty Madison Engineering
5 min read · May 3, 2021


Order update failed. Validation error: ["User does not have a prescription for: [\"Finasteride, 1mg pill, 90 count\"]"].

The all-too-familiar words stared tauntingly back at me, a glaring red-on-black blemish on my otherwise verdant CircleCI build. Like any developer worth their bits, I found flaky tests irritating on a level roughly on par with rolling around in a bed of fresh fiberglass. Although, I would hasten to add, it's a testament to the overall reliability of Thirty Madison's rspec test suite that this one flake stood out so glaringly. Every week or so it would appear in a CI build and then generally disappear when the build was re-run. In the grand scheme of things it was a minor inconvenience, but the wasted time added up, and I really wanted to understand it, and then mercilessly destroy it.

I materialized in the Keeps engineering Slack channel in a cloud of low-budget CGI sparks. “I say, folks,” I began, tipping my fedora politely, “it’s that flake again.” Everyone stopped what they were doing and turned to me in horror, and I think somebody dropped a tray of glasses. It was the same reaction you see in a B movie when someone shows up to tell everyone that the unpleasant axe-wielding gentleman in the hockey mask is, in fact, still alive and roaming the woods outside. At least that’s what I assume was going on; this was all over Slack, so I had to fill in the visuals myself.

The Keeps team and I had been down this road before and thought we had fixed the issue, but it kept cropping up. There wasn’t anything particularly complicated or asynchronous going on in the spec, so a race condition didn’t seem a likely cause. Our CI build wasn’t configured to randomize the run order of specs in a given file, either, so that ruled out a type of order-dependence that would have otherwise been a good place to look.
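(For the curious, within-file randomization is a standard RSpec setting. A minimal sketch of what enabling it looks like in spec_helper.rb, per the usual rspec-core setup; we didn't have this turned on:)

```ruby
# Standard RSpec configuration for randomized example ordering. With this
# on, RSpec shuffles the run order and prints a seed you can pass back
# via --seed to reproduce a particular ordering.
RSpec.configure do |config|
  config.order = :random
  Kernel.srand config.seed # align other sources of randomness with RSpec's seed
end
```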

What remained was the possibility of order-dependence across whole spec files. Our CircleCI build splits the files of our large test suite into 8 parallel builds, which it tries to balance based on historical runtimes:

TESTS=$(circleci tests glob "spec/**/*_spec.rb" | circleci tests split --split-by=timings)

There is a degree of randomness in that the set of tests each of the 8 parallel builds runs can vary from build to build.

An earlier theory we had was that some data might be getting loaded into the test database and not cleared out properly, occasionally leaving duplicates of a certain record behind. If this theory were correct, the flake appeared whenever we pulled out the wrong duplicate during the test. We had tried tweaking one of our factories with this in mind, and for a while we thought that had solved the issue, but the flake kept appearing.

After helping to clean up the broken glass in the Slack channel, I made my exit, determined to get to the bottom of this puzzle. I'd just recently learned of RSpec's handy --bisect option, and, already being a big fan of git bisect, I decided to unleash it on our test suite.

rspec --bisect is a tool for surfacing order-dependent failures in rspec suites. As the name suggests, it will repeatedly split the examples in the files you feed it, running them until it ultimately belches out a one-line command that runs the minimal set of examples that will reproduce the failure.

So I tried running it on the subset of files that produced the failure on the CircleCI build:

bundle exec rspec --bisect <FILE_NAMES>

The problem here is that <FILE_NAMES> comprised several dozen files. It quickly became apparent that the process wouldn’t complete in a reasonable amount of time, so I had to figure out a way to narrow down the set of files to investigate.

Fortunately, in this case, the file containing the failing spec appeared close to the start of the list of spec files run in the build. I could assume that the spec files run after the failure were irrelevant, so this narrowed down the set of files to only about ten, a much more manageable number.

This took several minutes to run, but when it finished it came back with this:

The minimal reproduction command is:

rspec './spec/integration/keeps/update_treatment_flow_with_failed_invoice_spec.rb[1:2]' './spec/models/product_spec.rb[1:41:2:4:1]'

Now I was getting somewhere. This command runs specific examples in the two spec files (the bracketed IDs identify individual examples by their position in each file's group hierarchy), which should have reproduced the failure.

The thing is…it didn’t.

Instead, when I ran this command, the specs passed. After weeping into my scotch (neat) and blowing my nose into a handkerchief embroidered with the PHP logo, I decided to run the same command, but with the order of the files reversed:

rspec './spec/models/product_spec.rb[1:41:2:4:1]' './spec/integration/keeps/update_treatment_flow_with_failed_invoice_spec.rb[1:2]'

Incredibly, this worked. It caused the exact failure we had been seeing in the test suite and vastly narrowed down the places we needed to look for the cause. The matter of why the file order needed to be reversed was a more mysterious conundrum, but that’s for another blog post.

At this point I was unfortunately called away on top-secret Cove business and so the half-cracked case sat half-cracked, as it were.

A few days later, my friend and ally Nick (also of team Cove) was looking into some flakes as part of one of his tickets and, lo and behold, the infamous Finasteride Flake was one of them. He noted that he suspected flipper to be the culprit behind the flake. After initially chastising him for placing blame on a surely innocent cetacean from the 1960s, I realized he was actually referring to `flipper`, the Ruby gem that we used for managing feature flags. It used Redis to —

Redis.

And there it was. For the uninitiated, Redis is (in their own words) an "in-memory data structure store." We use it for caching data, and `flipper` uses it under the hood to store and retrieve flag values. As Nick astutely pointed out, order-dependent flakes are generally the result of some global state that isn't cleaned up properly. And sure enough, in the product_spec.rb file, there were multiple spots where we were setting flags without bothering to clean them up afterwards. In a test environment, we're actually using an in-memory hash rather than Redis for storing flag state, but the principle remains the same. Whereas database changes made in tests are rolled back by the test framework, there was nothing in place to clean up `flipper` flag state automatically, so any changes made in one test would be present for subsequent ones.
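To make the failure mode concrete, here's a minimal sketch of the pattern (the flag and spec names are illustrative, not our actual code):

```ruby
# Spec A enables a flag and never resets it. Flipper flags are global
# state, so the change outlives this example.
RSpec.describe Product do
  it "applies the new pricing rules" do
    Flipper.enable(:new_pricing)
    # ...assertions that exercise the flagged code path...
  end
end

# Any example that runs afterwards in the same process inherits the flag:
RSpec.describe "order updates" do
  it "validates the prescription" do
    Flipper.enabled?(:new_pricing) # => true, even though this spec never set it
    # ...assertions that may now fail, depending on which files ran first...
  end
end
```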

In the end, the solution turned out to be pretty easy. I clothed each of the previously naked `flipper` calls in a `with_flipper_flag` method we had lying around. This took care of setting each flag back to its initial value at the end of the example.
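I won't reproduce our helper verbatim, but a minimal sketch of the idea looks something like this (only the name comes from our codebase; the implementation details here are hypothetical):

```ruby
# Hypothetical sketch of a with_flipper_flag-style helper: set a flag for
# the duration of a block, then restore its original value no matter what.
def with_flipper_flag(flag, enabled: true)
  previous = Flipper.enabled?(flag) # remember the initial state
  enabled ? Flipper.enable(flag) : Flipper.disable(flag)
  yield
ensure
  # restore the original value even if the example raises
  previous ? Flipper.enable(flag) : Flipper.disable(flag)
end

# Usage in a spec:
it "applies the new pricing rules" do
  with_flipper_flag(:new_pricing) do
    # assertions run with the flag enabled; it's reset afterwards
  end
end
```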

Nick and I nodded solemnly to each other, and I pressed the button to deploy the fix. We walked away, casually putting on our Ray-Bans as the last flake in our test suite exploded dramatically behind us.

We didn’t look back.
