Fixing Flaky Time Based Unit Tests

EG Tech
Expedia Group Technology
6 min read · Oct 26, 2017

By Dan Levy

My name is Dan Levy and I work on the team responsible for the Expedia iOS application. We rely heavily on our continuous delivery pipeline to ship frequently to users, and a subset of its checks runs whenever a pull request is created. Over the last few months, many team members hit random build failures caused by a couple of tests that flipped between passing and failing, seemingly at random. Flaky tests can kill the effectiveness of your pipeline, so removing them is a high priority. To fix these tests I broke them down, forced them to fail reliably by inserting an artificial delay, and then refactored them to avoid the timing issue. Timing is only one of many causes of flaky tests, but whatever the cause, they are always worth investigating and solving.

To start, we observed that our pull request build failures were almost always due to one particular test, testDeepLinkWithOriginAndDepartureDateOnly (yes, we believe in descriptive test names…). We are big fans of CircleCI for our iOS builds, so we were getting far too familiar with messages like this:

Circle CI Build report showing a single test failing.

Though rerunning the build usually solves the issue, these flaky failures increase the time pull requests take to go green, require human intervention, and confuse developers trying to figure out what failed. After a few months of seeing these failures and doing some pre-work with Brennan (another member of our team), I decided it was time to remove the flakiness from these tests once and for all. Here is the methodology I followed.

1. Break the tests down for increased readability.

I first noticed that each test was checking many different requirements at once. Parsing a deeplink’s origin correctly and parsing its startDateComponents correctly are separate requirements, so I thought they should be tested separately. If both are checked in the same test, a failure of either one breaks the test, which makes it harder to figure out which requirement is broken. The test is also harder to read.

func testOneWayDeepLink() {
    let url = "expda://flightSearch?origin=MSP&destination=BOS&departureDate=2013-11-05&numAdults=2"
    let searchFormData = EBSearchFormData(fromURLString: url)
    let actualFlightSearchFormData = FlightSearchFormData(formData: searchFormData)
    XCTAssertEqual(actualFlightSearchFormData.origin, AirportSearchParams(searchText: "MSP", location: MockLocation(name: "MSP")))
    XCTAssertEqual(actualFlightSearchFormData.destination, AirportSearchParams(searchText: "BOS", location: MockLocation(name: "BOS")))
    XCTAssertEqual(actualFlightSearchFormData.startDateComponents, getDepartureDate(givenDateComponentsFromString("2013-11-05")))
    XCTAssertNil(actualFlightSearchFormData.endDateComponents)
    XCTAssertEqual(actualFlightSearchFormData.travelerSearchInfo, TravelerSearchInfo(adultCount: 2, childAges: []))
    XCTAssertEqual(actualFlightSearchFormData.roundTrip, false)
}

Thus, to start I broke all of the tests down into smaller, more focused and manageable pieces. I also could now give each test a better name. Here are a few of my broken down tests. The single test above became three tests.

func testThatOriginIsParsedProperlyForOneWayDeepLink() {
    let url = "expda://flightSearch?origin=MSP&destination=BOS&departureDate=2013-11-05&numAdults=2"
    let searchFormData = EBSearchFormData(fromURLString: url)
    let actualFlightSearchFormData = FlightSearchFormData(formData: searchFormData)
    XCTAssertEqual(actualFlightSearchFormData.origin, AirportSearchParams(searchText: "MSP", location: MockLocation(name: "MSP")))
}

func testThatDestinationIsParsedProperlyForOneWayDeepLink() {
    let url = "expda://flightSearch?origin=MSP&destination=BOS&departureDate=2013-11-05&numAdults=2"
    let searchFormData = EBSearchFormData(fromURLString: url)
    let actualFlightSearchFormData = FlightSearchFormData(formData: searchFormData)
    XCTAssertEqual(actualFlightSearchFormData.destination, AirportSearchParams(searchText: "BOS", location: MockLocation(name: "BOS")))
}

func testThatStartDateComponentsAreParsedProperlyForOneWayDeepLink() {
    let url = "expda://flightSearch?origin=MSP&destination=BOS&departureDate=2013-11-05&numAdults=2"
    let searchFormData = EBSearchFormData(fromURLString: url)
    let actualFlightSearchFormData = FlightSearchFormData(formData: searchFormData)
    XCTAssertEqual(actualFlightSearchFormData.startDateComponents, getDepartureDate(givenDateComponentsFromString("2013-11-05")))
}

2. Get the tests to fail by using sleep().

Now that I had the tests broken down I needed to figure out a way to force the flaky tests to fail. The frustrating part of flaky tests is that they are flaky and that makes them very difficult to debug. If you can force them to fail every time, fixing them is usually the easy part.

From some earlier investigation I knew that the tests failed because two NSDateComponents objects were not equal. Brennan and I had discovered that the production code was grabbing the current time to create one NSDateComponents object. Shortly after, our test code grabbed the current time again to create the control NSDateComponents object we would test against. Usually the two were equal, but if the test crossed a second boundary between the two reads, the second components differed and the test failed.
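The second-boundary effect can be reproduced deterministically. The following is a minimal sketch written in current Swift (Date, Calendar) rather than the older NSDate syntax used in this post; the helper secondLevelComponents and the chosen timestamps are assumptions made for illustration and are not taken from the Expedia codebase.

```swift
import Foundation

// Hypothetical helper: break a Date into components down to the second,
// using a fixed calendar and time zone so the result is reproducible.
func secondLevelComponents(of date: Date) -> DateComponents {
    var calendar = Calendar(identifier: .gregorian)
    calendar.timeZone = TimeZone(secondsFromGMT: 0)!
    return calendar.dateComponents([.year, .month, .day, .hour, .minute, .second], from: date)
}

// Two reads of "now" taken 0.2 s apart that land in the same second...
let firstRead = Date(timeIntervalSince1970: 1_000_000_000.1)
let secondRead = Date(timeIntervalSince1970: 1_000_000_000.3)

// ...and two reads, also 0.2 s apart, that straddle a second boundary.
let beforeBoundary = Date(timeIntervalSince1970: 1_000_000_000.9)
let afterBoundary = Date(timeIntervalSince1970: 1_000_000_001.1)

// The common case: the components match, so an equality assertion passes.
let equalWithinSecond =
    secondLevelComponents(of: firstRead) == secondLevelComponents(of: secondRead)

// The rare case: the same gap crosses a boundary and the components differ.
let equalAcrossBoundary =
    secondLevelComponents(of: beforeBoundary) == secondLevelComponents(of: afterBoundary)
```

The gap between the two reads is identical in both cases; only its position relative to the second boundary decides whether a comparison like this passes, which is why such a failure looks random on a build server.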

In order to make the failure reproducible, I inserted a sleep(1) into the tests between the critical points. This worked wonders. This test would now fail every time, since the components created in an earlier part of the test were different from the components created later in the test.

If you are experiencing flaky tests due to time, I highly recommend this method to force a flaky test to fail. Adding short delays can sometimes highlight all sorts of race conditions or timing issues.

func testThatStartDateComponentsAreParsedProperlyForOneWayDeepLink() {
    let url = "expda://flightSearch?origin=MSP&destination=BOS&departureDate=2013-11-05&numAdults=2"
    let searchFormData = EBSearchFormData(fromURLString: url)
    let actualFlightSearchFormData = FlightSearchFormData(formData: searchFormData)

    sleep(1)

    XCTAssertEqual(actualFlightSearchFormData.startDateComponents, getDepartureDate(givenDateComponentsFromString("2013-11-05")))
}

3. Simplify assertions to assert only data.

One piece that jumped out at me in my failing test was the right side of the XCTAssertEqual statement:

getDepartureDate(givenDateComponentsFromString("2013-11-05"))

This line was particularly confusing, because it masked what our actual expectations were for the startDateComponents. Those two chained methods could have done anything and didn’t make clear what the actual expected value was. After some investigation, I learned that those two methods were essentially copies of production code that returned the current date components if the date components created from the string were in the past. Our production code should do the thinking, not our test code. Knowing this information, I could rewrite the test more simply and with a more specific name:

func testThatStartDateComponentsInThePastGetBumpedUpToCurrentDateForOneWayDeepLink() {
    let expectedStartDateComponents = NSCalendar.gregorianCalendar()
        .components([.Year, .Month, .Day], fromDate: NSDate())

    let url = "expda://flightSearch?origin=MSP&destination=BOS&departureDate=2013-11-05&numAdults=2"
    let searchFormData = EBSearchFormData(fromURLString: url)
    let actualFlightSearchFormData = FlightSearchFormData(formData: searchFormData)

    sleep(1)

    XCTAssertEqual(actualFlightSearchFormData.startDateComponents, expectedStartDateComponents)
}

4. Extract Time to Fix the Flakiness.

In order to fix the tests, I had to extract the current date out as a dependency, so that I could control it. This involved finding out where we were generating the current date and passing it in as a parameter to the initializer. This allowed me to have absolute control over the test, no matter how long it took. After extracting the date and using the exact same date object for injection as well as for verification, the test passed.

The other added benefit is that when someone needs to change this production code in the future, it is easier to write new tests and easier to change existing ones. This is the definition of increased testability.

func testThatStartDateComponentsInThePastGetBumpedUpToCurrentDateForOneWayDeepLink() {
    // Capture "now" once and use the same object for injection and for verification.
    let now = NSDate()
    let expectedStartDateComponents = NSCalendar.gregorianCalendar().components([.Year, .Month, .Day], fromDate: now)

    let url = "expda://flightSearch?origin=MSP&destination=BOS&departureDate=2013-11-05&numAdults=2"
    let searchFormData = EBSearchFormData(fromURLString: url, now: now)
    let actualFlightSearchFormData = FlightSearchFormData(formData: searchFormData)

    sleep(1)

    XCTAssertEqual(actualFlightSearchFormData.startDateComponents, expectedStartDateComponents)
}

Once my test passed, I deleted the sleep(1) call and opened a pull request. Extracting the current date into an initializer parameter doesn’t just solve timing delays or race conditions; it can be used to solve a variety of time-related intermittent unit test failures.
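For readers who want to try the same idea outside our codebase, here is a minimal sketch of date injection in current Swift. DepartureDateResolver and its method are hypothetical stand-ins invented for this example, not the real FlightSearchFormData or EBSearchFormData, and the bump-to-today rule is simplified from the behaviour described above.

```swift
import Foundation

// Hypothetical stand-in for the production type (an assumption for illustration).
struct DepartureDateResolver {
    let calendar: Calendar

    // `now` defaults to the real clock so existing call sites keep working,
    // while a test can pass a fixed instant and control time completely.
    func startDateComponents(forRequested requested: Date, now: Date = Date()) -> DateComponents {
        // A requested date in the past is bumped up to the current date.
        let effective = requested < now ? now : requested
        return calendar.dateComponents([.year, .month, .day], from: effective)
    }
}

// A fixed calendar and time zone keep the example independent of the machine it runs on.
var utcCalendar = Calendar(identifier: .gregorian)
utcCalendar.timeZone = TimeZone(secondsFromGMT: 0)!
let resolver = DepartureDateResolver(calendar: utcCalendar)

let fixedNow = Date(timeIntervalSince1970: 1_509_000_000)      // 26 Oct 2017, 06:40 UTC
let pastDeparture = Date(timeIntervalSince1970: 1_383_609_600) // 5 Nov 2013, 00:00 UTC
let futureDeparture = fixedNow.addingTimeInterval(10 * 86_400) // 5 Nov 2017, 06:40 UTC

// The same `fixedNow` drives both the code under test and the expectation,
// so no amount of delay between the two can change the outcome.
let bumped = resolver.startDateComponents(forRequested: pastDeparture, now: fixedNow)
let untouched = resolver.startDateComponents(forRequested: futureDeparture, now: fixedNow)
```

Giving the parameter a default value is one possible design choice: it keeps production call sites unchanged, at the cost of making it easy to forget the injection in a new test.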

More Flaky Tests

With the biggest culprit out of the way, other flaky tests surfaced.

Circle CI report showing 2 more test failures.

Following the same methodology for those tests solved those as well.

In this case, a simple delay highlighted the problem with the test. Another very common source of intermittent failures is tests that fail when they run:

  • around midnight
  • at the end or beginning of the month
  • at the beginning or the end of the year
  • when run after (or before) a specific hard-coded date in the test
  • around or during a daylight saving time switch
  • around a leap day

If you have date-specific logic, you will want to pass in the date anyway so that you can test all of these conditions. It pays to extract and centralize references to “now” in your codebase so that you can control it in tests.
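One way to centralize “now” is a small clock abstraction. The sketch below is an assumption about how that could look, not a description of our production code: the Clock protocol, SystemClock, FixedClock, tomorrow(using:calendar:) and makeDate are all hypothetical names. It pins the clock to just before a leap day and just before a new year, two of the boundaries listed above.

```swift
import Foundation

// Hypothetical abstraction: the single place the codebase asks for "now".
protocol Clock {
    func now() -> Date
}

// Used in production: reads the real system time.
struct SystemClock: Clock {
    func now() -> Date { return Date() }
}

// Used in tests: always returns the instant it was created with.
struct FixedClock: Clock {
    let instant: Date
    func now() -> Date { return instant }
}

// Example of date-specific logic: the calendar day after "today".
func tomorrow(using clock: Clock, calendar: Calendar) -> DateComponents {
    let next = calendar.date(byAdding: .day, value: 1, to: clock.now())!
    return calendar.dateComponents([.year, .month, .day], from: next)
}

// Test helper for building fixed instants in a given calendar.
func makeDate(_ year: Int, _ month: Int, _ day: Int,
              hour: Int, minute: Int, calendar: Calendar) -> Date {
    let parts = DateComponents(year: year, month: month, day: day, hour: hour, minute: minute)
    return calendar.date(from: parts)!
}

var utc = Calendar(identifier: .gregorian)
utc.timeZone = TimeZone(secondsFromGMT: 0)!

// Pin the clock one minute before midnight on two awkward dates.
let leapDayEve = FixedClock(instant: makeDate(2016, 2, 28, hour: 23, minute: 59, calendar: utc))
let newYearsEve = FixedClock(instant: makeDate(2017, 12, 31, hour: 23, minute: 59, calendar: utc))

let afterLeapDayEve = tomorrow(using: leapDayEve, calendar: utc)
let afterNewYearsEve = tomorrow(using: newYearsEve, calendar: utc)
```

With the clock pinned, each boundary becomes an ordinary, repeatable test case instead of a failure that only appears when the build happens to run at the wrong moment.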

Exercise: Now you try it!

As a challenge to readers: go into the codebase you work on, find a test that could be improved, and fix it. It doesn’t have to be a flaky test; it can be a small tweak to your test code. While we often think our production code is most important, our tests make sure that code works, and they deserve the same love, if not more. It’s very easy to end up with large amounts of duplication in your tests over time if you don’t constantly improve and refactor them, which in turn discourages people from refactoring and improving your production codebase. Some possible improvements are:

  • Improve the name of the test to describe what it actually tests. E.g. testThing() becomes testThatThingDoesThisWhenBlahHappens().
  • Fix a flaky test.
  • Make a test easier to read by breaking it down into smaller tests — one for each requirement.
  • Find an asynchronous test and make it synchronous to reduce potential flakiness and build time.
