Continuous improvement of our flaky tests hunting process

Cyril Champier
Doctolib
4 min read · Jan 3, 2019

Last year, I discussed how flaky tests are problematic and why we should hunt them. This year, we improved this process.

illustration by Bailey McGinn

In the first version of our process, we decided upon the following actions to ensure that flaky tests stay under control:

  • We implemented a retry mechanism: if a test fails, our CI retries it up to 3 times. If one retry is successful, the test is considered successful but flaky.
  • If a test is detected as flaky, we send an event to Sentry. That way we have a dashboard listing flaky tests in our standard error tracking tool (a minimal sketch of this retry-and-report flow follows this list).
Our flaky dashboard on Sentry
  • Every day, our “duty guy” checks the dedicated Sentry project. Each flaky test occurring more than 3 times a day is assigned for review to the team most likely to have caused it.
  • The assigned team then has 2 weeks to fix the issue.
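
To make this retry-and-report flow concrete, here is a minimal Ruby sketch of how it could look on the CI side. It is an illustration under assumptions, not our actual implementation: the run_with_retry helper and the bundle exec rspec invocation are hypothetical, and the Sentry call assumes the sentry-ruby SDK.

```ruby
# Hypothetical CI-side sketch: retry a failing test up to 3 times and
# report it as flaky to Sentry if a retry eventually passes.
require "sentry-ruby" # assumed SDK, initialized (Sentry.init) elsewhere

MAX_RETRIES = 3

def run_with_retry(test_file)
  # Initial run: if it passes here, the test is simply green.
  return :passed if system("bundle exec rspec #{test_file}")

  # Up to 3 retries: a pass on any retry means the test is flaky.
  MAX_RETRIES.times do |retry_index|
    next unless system("bundle exec rspec #{test_file}")

    # Green only after failing first: report the flaky test to Sentry.
    Sentry.capture_message(
      "Flaky test detected: #{test_file}",
      level: :warning,
      extra: { passed_on_retry: retry_index + 1 }
    )
    return :flaky
  end

  :failed # red on every attempt: a genuine failure
end
```

In practice such a hook would more likely live inside the test runner itself (for example with an RSpec around hook or the rspec-retry gem) rather than shelling out per file, but the principle is the same: only a pass that follows a failure is reported as flaky.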

This first version brought improvements:

  • The retry mechanism worked well and allowed us to mitigate the problem.
  • The Sentry project was correctly filled in, and was very useful in tracking and debugging these flaky tests.
  • Having the “duty guy” in charge of assignment ensured that the flaky tests were supervised, and that somebody cared about new occurrences.

But this process also had its flaws:

  • Team assignment was based on code namespaces and did not reflect the real source of the problem. Most issues were assigned to our “DOC” team, which is in charge of the doctor calendar, even though many other teams also frequently make small modifications to the calendar.
  • The “duty guy” cared and complained about flaky tests because, for an entire day, he felt the impact of every single failure. However, once he returned to his team, the burden became diluted.
  • No time or rule was officially set aside for the teams to fix flaky tests. Since new features and real users’ bugs always took precedence, our flaky tests piled up little by little.

The straw that broke the camel’s back occurred in October.

30% of the builds on our master branch failed because of flaky tests. It was then that we decided we needed to take action.

Failed / Successful builds on our master branch, per month

Our first short-term goal was to reduce the flakiness quickly. Over the course of a week, my focus was to pair with developers across all teams to tackle the 20 biggest flaky tests as quickly as possible. Roughly 10 were successfully corrected (helpful developers were rewarded with a chocolate bar). Half of the flaky tests were caused by real production bugs with a low probability of occurring and very low impact; we worked around them and documented them as we went. The other half were mostly instances of the common causes we had already identified.

We brought the failure rate back down to 17%. That was better, but not sufficient. What’s more, this was a one-time effort; if we did not take further action, flaky tests would continue to pile up.

So, to prepare for the long term, we decided to adapt the existing workflow and fix its flawed steps:

  • Our definition of a harmful flaky test has been relaxed: we now only treat tests failing more than 10 times a day. This threshold is purely empirical and will probably be refined in our next iteration.
  • Since it’s very hard to designate a culprit, we now share the pool of flaky tests across all teams; the next developer assigned to flaky-test duty takes the first one on the list.
  • We created a reserved time slot to fix these harmful tests. Each team (roughly 5 to 10 devs) must dedicate one developer for one hour a week to pair with me and try to fix one test. This works out to about 6 hours per week.
  • We defined a KPI with a goal: the percentage of failed builds on master must stay under 10%. Flakiness is our main source of failures and is closely watched thanks to the following PeriscopeData graph (a small sketch of both watch rules appears after the graph caption).
Percentage of failed builds on our master branch
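
To illustrate the two watch rules above, here is a small Ruby sketch with hypothetical helper names and data shapes; in reality the numbers come from Sentry and the PeriscopeData dashboard rather than from in-memory collections.

```ruby
# Hypothetical watch rules, assuming pre-aggregated data:
#   flaky_counts  - { "spec_name" => failures_today }
#   build_results - array of master build statuses (:success / :failed)

FLAKY_THRESHOLD   = 10   # a test failing more often than this per day is "harmful"
FAILURE_RATE_GOAL = 0.10 # master builds must stay under 10% failed

def harmful_flaky_tests(flaky_counts)
  flaky_counts.select { |_spec, failures| failures > FLAKY_THRESHOLD }.keys
end

def master_failure_rate(build_results)
  build_results.count(:failed).fdiv(build_results.size)
end

def kpi_breached?(build_results)
  master_failure_rate(build_results) > FAILURE_RATE_GOAL
end

# Example usage with made-up numbers:
puts harmful_flaky_tests("appointment_spec" => 14, "agenda_spec" => 3)
# => appointment_spec
puts kpi_breached?([:success] * 85 + [:failed] * 15)
# => true (15% > 10%)
```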

Further thinking:

For now, I am the only one keeping an eye on this statistic and organising the 1 hour per team to tackle these problems. Our next step will be to determine how this process can be handled so that I am no longer necessary.

  • Who should watch this KPI, and how, so that it never climbs again?
  • Can teams work one hour every week without me having to poke them on Mondays to book a slot?
  • How can we enforce good practices to prevent so many flaky tests being written?
  • Can we educate our new joiners through a Doctolib Academy?

At Doctolib, we value pragmatism. We know that flakiness is inevitable no matter how much time is invested in it. That’s why we do not want to invest too much time at once. Instead, each time flakiness becomes unmanageable, we repair it and improve the process to ensure that the same issue will not arise again.
