Flaky tests

Apr 27, 2018

Introduction

60% of the time, it works every time. Does that make sense? What if I told you that the tests in a build pass 95% of the time? This happens when part of the test or production code has a non-deterministic outcome.

How to remove flakiness

The most basic approach is to rerun the whole build or pipeline. Test frameworks can also retry for you: TestNG, for example, has a feature called IRetryAnalyzer for implementing the logic of retrying a single test. Another option provided by TestNG is to run a test N times and pass it if the success rate is above a declared value. All of this is like putting lipstick on a bulldog.
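To see why retrying only hides the problem, here is a minimal plain-Java sketch of retry-until-pass. It is not TestNG's IRetryAnalyzer or its implementation; the names RetrySketch, firstPassingAttempt and failsFirst are made up for this illustration.

```java
import java.util.function.BooleanSupplier;

/** Plain-Java sketch of retry-until-pass; not how TestNG implements it. */
final class RetrySketch {

    /**
     * Runs the test up to maxAttempts times.
     * Returns the 1-based attempt that passed, or -1 if every attempt failed.
     */
    static int firstPassingAttempt(BooleanSupplier test, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (test.getAsBoolean()) {
                return attempt;
            }
        }
        return -1;
    }

    /** Hypothetical flaky test: fails on its first `failures` calls, then passes. */
    static BooleanSupplier failsFirst(int failures) {
        int[] calls = {0};
        return () -> ++calls[0] > failures;
    }

    public static void main(String[] args) {
        // Two failures are absorbed by the retries; the build only sees a pass.
        System.out.println(firstPassingAttempt(failsFirst(2), 3)); // prints 3
        // With one more failure the retries run out and the build fails.
        System.out.println(firstPassingAttempt(failsFirst(3), 3)); // prints -1
    }
}
```

The first call reports success even though the test failed twice, which is exactly the information a retry mechanism throws away.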

At this point you may ask, “Is it reasonable to fight flaky tests if mitigating a failing build is so easy?” Let me ask another question instead: “Are you sure that your test is flaky, or do you have a bug in the code?” To answer it we have to analyze the root causes of flaky tests. To reproduce a flaky result, we have to run the test multiple times; with the results of these runs we can begin to analyze potential causes. There are several ways to do this:

Run your tests until one fails. This can be done by chaining the build command so that it stops at the first failure, with a script like “mvn test && mvn test && mvn test…”. The test reports from the last run can then be analyzed.
Run a single test multiple times. This can be achieved with an annotation, for example @Test(invocationCount=100) in TestNG.
Use your IDE. IntelliJ IDEA can run a JUnit test N times or until failure.
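The idea behind all three options can be sketched in a few lines of plain Java: run the same test body repeatedly and record every failure. This is only an illustration under stated assumptions, not what Maven, TestNG or IntelliJ do internally; RepeatSketch, Report, repeat and failsEveryFourthCall are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: run one test body many times and record every failure. */
final class RepeatSketch {

    /** Outcome of a repeated run: how many runs, and the message of each failure. */
    static final class Report {
        final int runs;
        final List<String> failureMessages;

        Report(int runs, List<String> failureMessages) {
            this.runs = runs;
            this.failureMessages = failureMessages;
        }

        double failureRate() {
            return runs == 0 ? 0.0 : (double) failureMessages.size() / runs;
        }
    }

    /** Runs the test up to `runs` times; stops early if stopOnFirstFailure is set. */
    static Report repeat(Runnable test, int runs, boolean stopOnFirstFailure) {
        List<String> failures = new ArrayList<>();
        int executed = 0;
        for (int i = 0; i < runs; i++) {
            executed++;
            try {
                test.run();
            } catch (AssertionError | RuntimeException e) {
                failures.add(String.valueOf(e.getMessage()));
                if (stopOnFirstFailure) {
                    break;
                }
            }
        }
        return new Report(executed, failures);
    }

    /** Hypothetical test body that fails on every fourth call. */
    static Runnable failsEveryFourthCall() {
        int[] calls = {0};
        return () -> {
            if (++calls[0] % 4 == 0) {
                throw new AssertionError("timeout on call " + calls[0]);
            }
        };
    }

    public static void main(String[] args) {
        Report report = repeat(failsEveryFourthCall(), 100, false);
        // prints: 25 failures in 100 runs, rate 0.25
        System.out.println(report.failureMessages.size() + " failures in "
                + report.runs + " runs, rate " + report.failureRate());
    }
}
```

Passing true as the last argument gives the "until failure" variant: the loop stops at the first failing run and keeps its message.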

What is missing

You may say that you have enough options to pinpoint the reason for a flaky test. Running the build tool multiple times is universal and can be used with Maven, Gradle or sbt. Test reports are available for analysis, and this is where the hard part starts. The test is flaky because the code doesn’t always return the same result. Finding the reason is like leading an investigation with a clue but without evidence. It would be a little bit easier if:

We knew how often the test fails and what the error message was.
We could correlate the failure rate with the changes that were introduced.

To address the above points we can create scripts and automate part of the process.

Proposition — sbt plugin

Build tools can be extended with plugins. I have taken this opportunity and created a plugin for sbt, called sbt-flaky, to automate running the tests multiple times, analyzing the test reports and correlating the results with recent changes.

The goal of the sbt-flaky plugin is to provide an sbt command that runs a test N times, for a specified duration, or until its first failure. Failures should be aggregated by root cause, even when their error messages differ. The plugin can be used during development on the developer's machine or on the build server in a pipeline.

What is in a report

The sample HTML report for a demo project can be found here. As mentioned above, the plugin tries to find a root cause and aggregate the results. Let’s look at an example. Our test fails twice with the error messages shown below.

assertion failed: timeout (146430510 nanoseconds) during expectMsg while waiting for Joined(List(test1))
assertion failed: timeout (137785483 nanoseconds) during expectMsg while waiting for Joined(List(test1))

We expect to receive a message but we get a timeout. Simple aggregation by error message would not be useful here because the two messages differ in the reported timeout value. Instead of grouping failing tests by message, the plugin groups them by stack trace and shows the first element that does not belong to the test framework:

at actors.ActorScatterGatherSpec.$anonfun$new$3(ActorScatterGatherSpec.scala:38)
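A rough sketch of this kind of grouping, in plain Java, is shown here. It is not sbt-flaky's actual code: the prefix list, the "testframework." package and all class and method names are assumptions made for the example, and a real list of framework packages would be longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: group failures by the first stack frame outside the test framework. */
final class StackGroupingSketch {

    /** Hypothetical prefixes treated as framework or runtime code. */
    static final List<String> FRAMEWORK_PREFIXES =
            Arrays.asList("testframework.", "java.", "scala.");

    /** Returns the first frame not owned by a framework, or "unknown" if none exists. */
    static String firstNonFrameworkFrame(List<String> frames) {
        for (String frame : frames) {
            String trimmed = frame.trim();
            String method = trimmed.startsWith("at ") ? trimmed.substring(3) : trimmed;
            boolean framework = FRAMEWORK_PREFIXES.stream().anyMatch(method::startsWith);
            if (!framework) {
                return method;
            }
        }
        return "unknown";
    }

    /** Groups stack traces (one list of frames per failure) by that frame. */
    static Map<String, List<List<String>>> groupByCause(List<List<String>> failures) {
        Map<String, List<List<String>>> groups = new LinkedHashMap<>();
        for (List<String> frames : failures) {
            groups.computeIfAbsent(firstNonFrameworkFrame(frames), k -> new ArrayList<>())
                  .add(frames);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The framework frames are invented; only the middle frame is from the article.
        List<String> trace = Arrays.asList(
                "at testframework.Assertions.fail(Assertions.scala:10)",
                "at actors.ActorScatterGatherSpec.$anonfun$new$3(ActorScatterGatherSpec.scala:38)",
                "at testframework.Runner.run(Runner.scala:5)");
        // Two failures with the same trace end up in a single group.
        groupByCause(Arrays.asList(trace, trace)).forEach(
                (frame, traces) -> System.out.println(frame + " -> " + traces.size() + " failures"));
    }
}
```

Because the key is a code location rather than a message, both timeouts from the example land in the same group even though their texts differ.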

Additionally, error messages are processed to extract their common part, and the differences are replaced by “_”. In our case the common part of the message will be presented as:

assertion failed: timeout (1________ nanoseconds) during expectMsg while waiting for Joined(List(test1))
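One simple way to obtain such a masked message is to compare the messages character by character. The sketch below does exactly that and reproduces the line above for the two example messages. It is an assumption about the technique, not the plugin's code, and it only works well when the varying parts have the same length; messages that shift (for example a 9-digit and a 10-digit timeout) would need a real diff algorithm.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: keep characters shared by all messages, mask the rest with '_'. */
final class CommonMessageSketch {

    /** Position-by-position comparison; positions missing in any message are masked too. */
    static String commonPart(List<String> messages) {
        if (messages.isEmpty()) {
            return "";
        }
        String first = messages.get(0);
        int longest = 0;
        for (String message : messages) {
            longest = Math.max(longest, message.length());
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < longest; i++) {
            boolean same = i < first.length();
            for (String message : messages) {
                if (!same || i >= message.length() || message.charAt(i) != first.charAt(i)) {
                    same = false;
                    break;
                }
            }
            out.append(same ? first.charAt(i) : '_');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String prefix = "assertion failed: timeout (";
        String suffix = " nanoseconds) during expectMsg while waiting for Joined(List(test1))";
        // Only the leading "1" of the two timeout values matches, so the
        // remaining eight digits are masked.
        System.out.println(commonPart(Arrays.asList(
                prefix + "146430510" + suffix,
                prefix + "137785483" + suffix)));
    }
}
```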

Of course, the full stack trace of every failing test is also available.

The next useful feature is historical analysis. Results of previous test runs are used to visualise how the failure rate has changed over time. The example from the report shows a test that had serious issues in build 7 but stabilised after a few builds.

Figure: a test with an issue.

The report also estimates the current probability of a successful build.
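How the plugin arrives at this number is not described here. One simple way to make such an estimate, assuming that tests fail independently of each other, is to multiply the per-test success rates. The sketch below uses that assumption; the class and method names are hypothetical.

```java
import java.util.Locale;

/** Sketch: estimate build success from per-test failure rates, assuming independence. */
final class BuildProbabilitySketch {

    /** Each rate is failures / runs for one test, a value between 0.0 and 1.0. */
    static double buildSuccessProbability(double[] testFailureRates) {
        double probability = 1.0;
        for (double failureRate : testFailureRates) {
            probability *= (1.0 - failureRate);
        }
        return probability;
    }

    public static void main(String[] args) {
        // Two flaky tests failing 10% and 20% of the time: 0.9 * 0.8.
        double p = buildSuccessProbability(new double[] {0.10, 0.20});
        System.out.println(String.format(Locale.ROOT, "%.2f", p)); // prints 0.72
    }
}
```

Even modest per-test failure rates compound quickly: the more flaky tests a build contains, the less likely it is to pass on a single attempt.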

Another useful feature is showing the commits between builds. Links are created to show the differences introduced between builds or in single commits. Links are generated for GitHub, Bitbucket and GitLab hosted on premise.

Non-deterministic test failures make root cause analysis really hard. Having data about stack traces and introduced changes is often not enough. Sbt-flaky can back up the log files from every test run so that they can be used for postmortem analysis.

A little bit of theory

In the paper “An Empirical Analysis of Flaky Tests”, Qingzhou Luo, Farah Hariri, Lamyaa Eloussi and Darko Marinov analysed the root causes of flaky tests in popular open-source projects. Their methodology was to search commit messages for commits that fix flaky tests. The resulting list of commits was then analysed and categorised manually. In this publication they describe 201 commits that likely fix flaky tests in 51 open-source projects.

The researchers split the root causes of flakiness into 10 categories. The top three categories are Async Wait, Concurrency, and Test Order Dependency. Most flaky tests (78%) are flaky from the first time they are written. The average number of days it took to fix a test was 388.
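To make the Async Wait category concrete: a test that sleeps for a fixed time and then asserts will fail whenever the asynchronous work takes longer than the sleep. A common remedy is to poll for the expected condition up to a deadline. The plain-Java sketch below illustrates that idea; it is not taken from the paper or from any test framework, and awaitCondition and the other names are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Sketch: wait for a condition with a deadline instead of a fixed sleep. */
final class AwaitSketch {

    /** Polls until the condition holds or the timeout elapses; returns whether it held. */
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AtomicBoolean joined = new AtomicBoolean(false);
        // Stand-in for asynchronous production code that finishes "eventually".
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            joined.set(true);
        });
        worker.start();
        // Returns as soon as the flag is set, with a generous upper bound.
        System.out.println(awaitCondition(joined::get, 5_000, 10));
        // A condition that never holds gives up once the timeout has passed.
        System.out.println(awaitCondition(() -> false, 100, 10));
    }
}
```

The generous timeout costs nothing on a fast machine because the loop exits as soon as the condition is true, while a slow build server still gets enough time to finish.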

A more interesting question is whether the test was flaky or there was a bug in the code. The study shows that 24% of fixes changed test and production code. Not every such change fixed a bug; in some cases a simplification of the production code was introduced along with the fix for the test code.

Conclusion

Flakiness in tests is caused by poor-quality test code or by a bug in production code. Tests can be rerun until they succeed, but that can smuggle a bug into production. The best solution is to invest time and fix flaky tests. It requires a lot of time and work, but fortunately we have tools to help us. The sbt-flaky plugin automates tasks that were previously done manually, such as running tests multiple times, grouping failures by cause and correlating flakiness with new changes. It is not a silver bullet, but it makes the fixing process more approachable.

Unfortunately, there are no tools that automatically analyse a flaky test and find its root cause. Fixing a bug still requires human work.