Flaky tests caused by a production bug: fix the flakiness, not the bug

As you may already know, at Doctolib we are regularly annoyed by flaky tests. If you are not familiar with flaky tests, i.e. tests that fail randomly, you should read this article. In short, even though each individual failure is unlikely, flaky tests run hundreds of times on the CI will cause the CI to fail regularly for no real reason at all.

illustration by Bailey McGinn

We recently tried to categorize all of the flaky tests we have corrected in the past few months. Here are the results based on the last twenty-nine flaky tests we corrected:

  • Test misc: various problems concerning the test alone, for instance a setup with duplicated data or setting a field content to 1,000 chars
  • Test order: asserting equality on randomly ordered lists (more on this problem here).
  • Test wait: not using the correct wait methods (more on this problem here).
  • Unknown: flakiness that disappeared on its own, without an explicit correction
  • Production bugs: the cases we will study in this post, representing roughly 25% of flakiness

Based on our experience, we have come to the conclusion that production bugs revealed by flaky tests should not be fixed. Instead, we should adhere to the following process:

  • First, ensure that the bugs are minor and that you can live with them.
  • Following that, make the test deterministic to remove any noise on the CI.
  • Finally, document the behavior and let the product owner (PO) decide on the correction priority.

In this article, we will illustrate this process based on three flaky tests caused by production bugs.

Bug 1: Do not touch a refreshing modal

At Doctolib, we store patient objects in two separate databases: PostgreSQL and ElasticSearch. ElasticSearch is used when a user wants to search for a patient, and PostgreSQL is used by other endpoints. Only searchable fields are stored in ElasticSearch, so when patient data is returned from a search endpoint, fields such as email are not included.

When a user clicks on a patient search result, the email will be missing from the modal, since the data comes from ElasticSearch. The modal needs these missing fields, so an Ajax call is triggered to fetch the patient’s email. During this call, the modal is displayed but user interactions are disabled:

Disabled modal during patient data refresh: email is empty and save button is disabled.

Once the Ajax call is complete, the modal is entirely filled, and buttons are enabled:

Modal after the patient data refresh.

However, we forgot to disable one component in the modal: the sidebar. If the user clicks on one of its buttons before the Ajax call completes, the action is executed with a “partial” patient, missing their email. In a test of referring a patient to another doctor (via the “Adresser ce patient chez un confrère” button in the sidebar), the next screen then finds the email missing and the test fails.

  • Ensure the bug is minor: Of course no real user will click on these buttons in less than 250 ms and the only real consequence would be to refer a patient with a first and last name but with no email.
  • Make the test deterministic: To correct the test flakiness, we simply added an assertion that waits for the appointment modal to be filled by the Ajax call before clicking on the “refer” button.
  • Document the behavior and let the PO decide: We also created a bug ticket, and the bug will be corrected in our standard bug flow. This is the exact same process as if the bug was coming from one of our clients.
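The wait-then-act fix can be sketched in plain Ruby. This is a toy model, not Doctolib’s actual code: `PatientModal`, `wait_until`, and the simulated Ajax thread are all illustrative.

```ruby
require "timeout"

# Toy model of the modal: an "Ajax call" fills the email asynchronously.
class PatientModal
  attr_reader :email

  def initialize
    @email = nil
    # Simulate the short Ajax refresh that fetches the missing fields.
    @ajax = Thread.new do
      sleep 0.05
      @email = "jane@example.com"
    end
  end

  def filled?
    !@email.nil?
  end

  def refer_patient!
    raise "partial patient: email is missing" unless filled?
    :referred
  end
end

# The fix: assert (with a deadline) that the modal is filled before acting,
# instead of clicking immediately and racing the Ajax call.
def wait_until(timeout: 1, interval: 0.01)
  Timeout.timeout(timeout) { sleep(interval) until yield }
end

modal = PatientModal.new
wait_until { modal.filled? } # deterministic: no race with the Ajax call
result = modal.refer_patient!
```

Without the `wait_until` line, `refer_patient!` would occasionally run before the fill completes, reproducing the flake.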

Bug 2: Zealous auto save feature

We have an administration page that enables our technical staff to modify some fields related to an agenda. This page autosaves upon modification. Once the save is successful, a flash message “Saved!” is displayed.

Administration page: the “Saved!” message is present (middle-right, in green)

Naturally, we have a test that:

  1. opens the page
  2. clicks on “prevent new patient creation”
  3. asserts presence of the “Saved!” message
  4. asserts the field was modified in the database

This test was flaky on the database modification assertion (step four). After investigating, we discovered that when someone opens the page, an Ajax call is immediately sent to save the fields, and this unwanted call also displays the “Saved!” message. Because the “Saved!” message is therefore always present, the test moves on directly to step four without waiting for the return of the Ajax call triggered by step two. On occasion, step four runs before step two finishes, and the test fails. This is an instance of a problem we have already described here.

  • Ensure the bug is minor: The first save is a no-op: nothing is modified. The only impacts are an unnecessary Ajax call and a visually strange “Saved!” message.
  • Make the test deterministic: Since the “Saved!” message is always present, we cannot wait for it to appear after the field modification. This is a case where we want to assert a database modification after a user action, but without any UI modifications. So, we simply added a loop around the database assertion. The test changed a bit, and can now be read as “when I modify a field, the modification should be saved in the next 10 seconds”.
  • Document the behavior and let the PO decide: In this case the PO told us not to bother creating a bug ticket, since the page is legacy.
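The “saved in the next 10 seconds” loop can be sketched as follows. Again, this is illustrative: `FAKE_DB` and `assert_eventually` stand in for the real database read and test helper, and the background thread simulates the autosave landing asynchronously.

```ruby
# Stand-in for the real database row.
FAKE_DB = { prevent_new_patient_creation: false }

# Simulated autosave: the modification lands asynchronously,
# with no UI signal we can reliably wait on.
saver = Thread.new do
  sleep 0.05
  FAKE_DB[:prevent_new_patient_creation] = true
end

# Loop around the database assertion: poll until the modification
# is visible or the deadline expires.
def assert_eventually(timeout: 10, interval: 0.1)
  deadline = Time.now + timeout
  loop do
    return true if yield
    raise "condition not met within #{timeout} seconds" if Time.now > deadline
    sleep interval
  end
end

# Reads as: "when I modify the field, the change is saved within 10 seconds".
assert_eventually { FAKE_DB[:prevent_new_patient_creation] }
saver.join
```

The assertion now tolerates any ordering between the two Ajax calls, which is exactly what makes the test deterministic.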

Bug 3: Summertime Sadness

When a doctor wants patients to be able to book online on their agenda, they will create what we call a “recurring opening”. Let’s say their office is open every Monday from 8am to 10am and their appointment duration is 30 minutes. That means that patients will see four available half-hour slots when they try to book an appointment for the following Monday. Here is the corresponding test, which asserts these slots: (8:00–8:30), (8:30–9:00), (9:00–9:30), (9:30–10:00).

it 'splits openings in slots' do
  travel_to(Time.current.beginning_of_week)
  agenda = create(:agenda, slot_duration: 30.minutes)
  opening = create(:opening, agenda: agenda, recurring: true,
                   starts_at: Time.current.change(hour: 8), duration: 2.hours)
  opened_slots = AvailabilityService.new(agenda).bookable_slots(1.week.from_now)
  assert_equal 4, opened_slots.count
end

However, twice a year in France, the time zone changes from CET to CEST, and back again during “daylight saving time” (DST).

Let’s say it is three days before the change from CET (UTC+1) to CEST (UTC+2). The time is exactly 8am on Thursday, the 22nd of March, 2018. The test requests availabilities starting the following week. The problem with DST is that “one week from now” is only 6 days and 23 hours away!

> Time.current
=> Thu, 22 Mar 2018 08:00:00 CET +01:00
> 1.week.from_now
=> Thu, 29 Mar 2018 08:00:00 CEST +02:00
> 1.week.from_now - Time.current
=> 6 days and 23 hours

This test was flaky for two weeks of every year, during the week before each DST shift. Instead of returning four slots, it returned only two: (9:00–9:30) and (9:30–10:00).
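The missing hour can be reproduced with plain Ruby `Time` objects and fixed UTC offsets, no Rails required (the dates match the console session above):

```ruby
# CET is UTC+1, CEST is UTC+2. Same wall-clock time, one week apart,
# but on opposite sides of the 2018 spring DST shift.
cet  = Time.new(2018, 3, 22, 8, 0, 0, "+01:00") # Thu 22 Mar, before the shift
cest = Time.new(2018, 3, 29, 8, 0, 0, "+02:00") # Thu 29 Mar, after the shift

elapsed   = cest - cet   # real seconds between the two instants
full_week = 7 * 24 * 3600 # 604_800 seconds

missing = full_week - elapsed # => 3600.0, the hour lost to DST
```

So “one week later” on the calendar is only 6 days and 23 hours of real time, which is why the availability window cut off the first slots.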

  • Ensure the bug is minor: Fortunately, the issue arises in production only twice a year, and its only consequence is that a midnight slot cannot be booked. Let’s be honest, who’s booking a doctor’s appointment for midnight? Most importantly, DST is not meant to take place in Europe anymore.
  • Make the test deterministic: The fix was very simple: instead of using the present time, we time travel to a non-DST week.
  • Document the behavior and let the PO decide: We did not bother to create a ticket.

Conclusion

As noted, 25% of our flaky tests are caused by production bugs and this is the process we follow when handling this kind of flakiness:

  • Ensure the bug is minor: In each of these cases, we could live with the production bugs. If a bug occurred more frequently, it would have been detected in production, or by a (non-flaky) test failure. Such bugs are often triggered by CI concurrency issues that are not realistic within the realm of human usage. Moreover, the severity is extremely low in all of the cases we detected. Frequency * Severity ~= 0.
  • Make the test deterministic: This is what we should focus most on: having a deterministic CI. The real pain point is here: whatever the criticality of the underlying bug, a flaky CI is critical.
  • Document the behavior and let the PO decide: Such bugs can be handled by our standard bug workflow.

Test flakiness reduces CI determinism, and without a reliable CI, major production bugs will appear. Test flakiness is more critical than a minor bug in production. That’s why we immediately correct the test but let the bug follow our standard workflow.

Flakiness must be fought against.