The Case of the Flaky Tests

Stephanie Trimboli
Published in Compass True North
6 min read · Jan 8, 2020
Not sure if the flaky tests are broken or the code is broken

Move Fast, Break Things

At Compass, we recently rolled out a new login flow. While rolling it out, we placed the new feature behind a feature flipper so we could maintain the current functionality until the new flow had been successfully rolled out to 100% of users. This is where our e2e tests, written with Cypress, came in. Unfortunately, we moved so fast developing the new feature that we left our e2e tests in the dust. The feature had been undergoing manual testing in staging for several weeks, so we were fairly confident it was bug-free and our e2e tests were just flaky. However, we could not in good conscience release a feature with failing e2e tests, so the flaky tests had to be addressed.

Not good, chief.

The Usual Suspects

We started off with the low-hanging fruit — making sure our tests were running in the same browser locally and in CI, which required changing from Electron to Chrome. We also disabled CORS for our e2e tests and debounced our most frequent API calls. These three things eliminated most of the errors, but not all.
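Debouncing collapses a burst of rapid calls (say, one per keystroke) into a single call once the input settles. A minimal stand-alone sketch, assuming a simple trailing-edge debounce (the 300 ms delay and `checkUserStatus` name are illustrative, not our actual code):

```javascript
// Minimal trailing-edge debounce: delays invoking `fn` until `delayMs`
// milliseconds have passed since the most recent call, so a burst of
// calls collapses into one.
function debounce(fn, delayMs) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);
    timer = setTimeout(() => fn.apply(this, args), delayMs);
  };
}

// Hypothetical API wrapper standing in for our real status check.
function checkUserStatus(email) {
  console.log(`checking status for ${email}`);
}

// Fire the status check only after the user pauses typing for 300 ms.
const debouncedCheck = debounce(checkUserStatus, 300);
```

With this in place, typing a 20-character email triggers one status check instead of 20.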

Better, but not there yet.

We turned our attention to the remaining tests. Our first step was adding explicit waits for our various API calls.
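In Cypress, an explicit wait means registering an intercept for the request and waiting on its alias before asserting, rather than hoping the response has already landed. A sketch of the pattern (the route, alias, and selectors here are invented for illustration, not our actual test code):

```javascript
// Register the intercept before the action that triggers the request.
cy.intercept('POST', '/api/user-status').as('userStatus');

cy.get('[data-test=email]').type('user@example.com');

// Block until the status call actually resolves, then assert on the UI.
cy.wait('@userStatus');
cy.get('[data-test=password]').should('be.visible');
```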

This resolved some of the remaining failures, but not all. The last two failures were intermittent: they would pass locally and fail in CI. We could not reproduce the errors the tests hit in CI in our staging environment, and QA signed off on the feature after extensive manual testing. With no other explanation, we assumed the tests were merely flaky, as UI tests can be.

Having (we thought) exhausted every other avenue, in an act of desperation we turned to the UI developer’s duct tape: adding arbitrary waits. This helped, but not enough. The tests were still flaky, even after adding an unconscionable number of waits.

Are you a good wait or a bad wait?

At this point, I’d been wrestling with these tests for weeks, and consistently green runs had become my white whale. I also didn’t want to be answering the question ‘Why do our tests take so long to run?’ until the end of time. I removed the waits and started again from the beginning, stepping through the code line by line to understand why the tests were failing.

Pride Goes Before a Fall

After much debugging and console.logging, I discovered the flaky tests were due to two bugs: one in the existing code for the current workflow and one in the new code I had written for the updated workflow. We had never encountered either bug in manual testing because humans don’t generally type that fast; automated testing, however, revealed a couple of race conditions where the state could get clobbered.

The Workflow (Current)

State 1 (Default)
State 2

By default, the UI assumes users are of Type A (State 1). When a user types a valid email address, the UI kicks off a call to a 3rd-party API to check the user’s status. If the call fails, the user is Type A and nothing changes. If the call succeeds, the user is Type B and the UI updates to State 2.

The Workflow (Updated)

State 1 (Default)
State 2
State 3

By default, the UI assumes users are of Type A (State 1). When a user types a valid email address, the UI kicks off a call to an internal API to check the user’s status. If that call succeeds, the user is Type C and the UI updates to State 2. If it fails, the UI kicks off a call to the 3rd-party API. If that call also fails, the user is Type A and nothing changes. If it succeeds, the user is Type B and the UI updates to State 3.
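The updated lookup can be sketched as follows (a minimal version with invented names; the real checks are network requests, so they are passed in here as functions to keep the logic testable):

```javascript
// Resolve a user's type: try the internal API first, then fall back to
// the slower 3rd-party API. Both checks are injected so the flow can be
// exercised with stubs; each is assumed to reject on failure.
async function resolveUserType(email, internalCheck, thirdPartyCheck) {
  try {
    await internalCheck(email);
    return 'C'; // internal API succeeded -> Type C (State 2)
  } catch {
    // internal API failed -> fall back to the 3rd party
  }
  try {
    await thirdPartyCheck(email);
    return 'B'; // 3rd-party API succeeded -> Type B (State 3)
  } catch {
    return 'A'; // both failed -> Type A (State 1, the default)
  }
}
```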

Complicating Factors

  • Dynamic state updating (we would eventually do away with this in v2 due to the large number of edge cases it created)
  • A very slow 3rd-party API call (we wanted to minimize the time the user spent waiting for the UI to refresh)

Bug 1: It’s Clobbering Time

This was a bug in the current workflow. When a user typed in their email, the 3rd-party API call kicked off. If the user was Type A, they didn’t need to wait for the UI to update, so they continued filling out the form. However, when the 3rd-party API call returned false, it reinitialized the form fields. This meant that when the user submitted the form, the fields would be empty and the API would return an error.

The Fix

I updated the 3rd party API call to reinitialize the form fields only if they did not already have values when it returned false.
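The guard can be sketched like this (the field names and the shape of the form state are invented for illustration):

```javascript
// Reinitialize the form only when the user hasn't typed anything yet.
// Intended to be called from the 3rd-party response handler when it
// returns false.
function maybeResetForm(formState) {
  const hasUserInput = Object.values(formState).some(
    (value) => value !== '' && value != null
  );
  if (hasUserInput) {
    return formState; // keep whatever the user has already typed
  }
  return { email: '', password: '' }; // safe to reinitialize
}
```

A slow response returning after the user has filled in the form now leaves their input alone instead of wiping it.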

Bug 2: 2 Fast 2 Furious

This was a bug in the updated workflow. When the email a user was typing first qualified as a valid email, we kicked off a call to our internal API. When that call failed, we kicked off a call to the 3rd-party API. Meanwhile, the user finished typing their email, so we kicked off another call to the internal API, which failed again, triggering a second call to the 3rd-party API. At this point, the first 3rd-party call returned true, the UI updated, and the user could submit the form. However, after they submitted, the second 3rd-party call returned and its handler reinitialized the form state, clobbering the submitted state.

The Fix

We were already using a count to lock the 3rd-party API call; I added a check that canceled subsequent calls when the count had not changed, in addition to canceling the current call.
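One common shape of this kind of counter lock, sketched with invented names (our production logic differed in its details): each call takes a ticket, and any response whose ticket is no longer current is discarded, so a slow earlier 3rd-party response can never clobber state written by a later one.

```javascript
let currentRequestId = 0;

// Wraps the 3rd-party check so only the most recent call is allowed to
// touch state. `apiCall` performs the request; `onResult` applies it.
async function checkThirdParty(email, apiCall, onResult) {
  const requestId = ++currentRequestId;
  const result = await apiCall(email);
  if (requestId !== currentRequestId) {
    return; // a newer call superseded this one: drop the stale response
  }
  onResult(result);
}
```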

Peace at Last

While it seemed unlikely that a human user would ever hit these cases, we couldn’t guarantee it. In addition, the 3rd-party API calls we relied on were very slow and very frequent, so making fewer of them meant better performance, more robust code, and fewer wasted resources. In the end, we had tests we could rely on, and we were able to release our feature with confidence.
