Reproducing Flakes More Easily via Multiple Examples

Jeff Gaston
2 min read · Nov 14, 2023


A build failure that happens consistently is faster and easier to reproduce, study, and fix than one that happens intermittently.

One approach for investigating intermittent failures (flakes) is to save their logs and provide a system where users can analyze these logs to look for patterns. In AndroidX, that’s often how we start our investigation.

We often find that the failure only happened once. The cost of fixing that kind of failure is likely to be higher than the cost of doing nothing, so we usually wait to see if it happens again (example).
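As a rough illustration of that triage step (a minimal sketch, not the AndroidX tooling itself), the Python below separates recurring failures from one-offs. The build IDs, the error signatures, and the `split_by_recurrence` helper are all hypothetical; it assumes each saved log has already been reduced to a normalized error signature.

```python
from collections import Counter

def split_by_recurrence(failures, min_count=2):
    """Split failure signatures into recurring ones and one-offs.

    `failures` is an iterable of (build_id, signature) pairs, where the
    signature is a hypothetical normalized error message.
    """
    # Deduplicate first so the same signature reported twice by one build
    # counts only once.
    counts = Counter(signature for _, signature in set(failures))
    recurring = {s: n for s, n in counts.items() if n >= min_count}
    one_offs = sorted(s for s, n in counts.items() if n < min_count)
    return recurring, one_offs

# Hypothetical data for illustration only.
failures = [
    ("build-101", "OutOfMemoryError in r8"),
    ("build-107", "OutOfMemoryError in r8"),
    ("build-107", "OutOfMemoryError in r8"),
    ("build-112", "Connection reset while downloading"),
]
recurring, one_offs = split_by_recurrence(failures)
print(recurring)  # signatures seen in at least two distinct builds
print(one_offs)   # signatures seen in only one build
```

Signatures in `one_offs` are the ones where waiting for a second occurrence is probably cheaper than investigating right away.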

For recurring failures, a common pattern is that the failure started (or increased in frequency) at a certain time. Those failures are usually triggered by a change made slightly earlier. For example, we noticed a while ago that r8 was starting to run out of memory in our build. We had changed r8’s arguments before starting our investigation, but only after we started seeing out-of-memory errors, so that change could not have been the cause. We had, however, updated Java slightly before the errors started to happen, which seems to have been the cause.
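The reasoning here is a timeline comparison: find the first occurrence of the failure, then look only at changes that landed shortly before it. A minimal sketch in Python follows; the dates, the change descriptions, the three-day window, and the `suspect_changes` helper are assumptions made up for illustration, not the actual history.

```python
from datetime import datetime, timedelta

def suspect_changes(failure_times, changes, window=timedelta(days=3)):
    """Return names of changes that landed within `window` before the first failure.

    `changes` is a list of (landed_at, name) pairs. Changes that landed
    after the first failure cannot have caused it, so they are excluded.
    """
    first_failure = min(failure_times)
    return [
        name
        for landed_at, name in sorted(changes)
        if first_failure - window <= landed_at <= first_failure
    ]

# Hypothetical timeline for illustration only.
failure_times = [datetime(2023, 5, 10, 9), datetime(2023, 5, 12, 14)]
changes = [
    (datetime(2023, 5, 1), "unrelated refactor"),
    (datetime(2023, 5, 9), "update Java"),
    (datetime(2023, 5, 11), "change r8 arguments"),
]
suspects = suspect_changes(failure_times, changes)
print(suspects)
```

With this made-up timeline, the r8 argument change is filtered out because it landed after the first failure, and only the Java update remains as a suspect.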

Sometimes a failure always involves the same tasks, which can lead us to consider turning off the code that triggers the error (example).

We even had a case where the build was always failing on the same computer, which turned out to have a corrupt file. We ended up submitting a change to delete the offending file if the build failed, which we reverted shortly afterward once the corrupt file was deleted.
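Patterns like "always the same task" or "always the same computer" can be checked mechanically once each failure record carries a few attributes. The sketch below, in Python, reports any attribute whose value is identical across all failures. The field names (`host`, `task`), the sample records, and the `shared_attributes` helper are hypothetical.

```python
from collections import Counter

def shared_attributes(failures, threshold=1.0):
    """Return {attribute: value} for values present in at least
    `threshold` (a fraction from 0 to 1) of the failure records."""
    shared = {}
    keys = set().union(*(f.keys() for f in failures))
    for key in sorted(keys):
        counts = Counter(f.get(key) for f in failures)
        value, n = counts.most_common(1)[0]
        # Ignore attributes that are mostly missing (value is None).
        if value is not None and n / len(failures) >= threshold:
            shared[key] = value
    return shared

# Hypothetical failure records for illustration only.
failures = [
    {"host": "builder-17", "task": ":foo:compileKotlin"},
    {"host": "builder-17", "task": ":bar:lint"},
    {"host": "builder-17", "task": ":foo:test"},
]
common = shared_attributes(failures)
print(common)
```

Lowering `threshold` (for example to 0.8) would also surface attributes shared by most, but not all, of the failures.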

Another common occurrence is that the error is fixed now (example) and we don’t have to worry about it anymore.

Internally we mostly use our own service, but log searching also works in Gradle Enterprise.
