Build Stability

How we stopped rerunning failed tests and kept it that way

Written by Michael Tauraso.

Every developer at a company that practices continuous integration (CI) has the following conversation at some point about their builds:

Dev 1: “My build failed, but I didn’t change anything related to testSwissCheeseController.”
Dev 2: “Oh, just re-run it. That test has been failing for everyone. It’s not your fault.”

This is a frustrating exchange, especially when it occurs randomly and blocks important work. Automatic checking of code is only useful insofar as its signal can be trusted. Unfortunately, the sort of failure that these hypothetical developers describe is difficult to reproduce, diagnose, fix, and verify for exactly the same reason it’s infuriating: it doesn’t happen all the time.

There are many best practices you can use to reduce shared state in your tests, write better tests, and prevent these sorts of failures proactively. But when those techniques fail, what do you do?

Measuring Errors

CI builds attempt to measure whether application code is good code or not. Good code builds and passes tests. We can’t measure whether code is good code or not; we can only measure whether code passes tests once on a particular machine on a particular day.

We have two main types of measurement error: false negatives, where the build is red on good code, and false positives, where the build is green on bad code. Typically developers and build systems treat a green build as the “safe-to-merge” signal. Therefore it is possible for low-probability false negatives to accumulate on master and never get fixed. False negatives can be measured directly by rerunning tests on code that already has at least one green build.

False positives cannot be measured as easily. Consider a test that always passes no matter what the application code does. It would be difficult to find this sort of test automatically. At Square, we use proactive techniques in our build scripts to avoid false-positive errors from the CI system itself. Most notably, we use execution constructs in Ruby and Bash that ensure any process exiting abnormally is an immediate build failure unless handled explicitly.
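The post does not show the Ruby and Bash constructs it mentions. As a rough analogue only, and not Square's actual build scripts, the same fail-fast idea can be sketched in Python with `subprocess.run(..., check=True)`, which raises an exception when a child process exits with a non-zero status. The helper names `run_step` and `run_build` are hypothetical.

```python
import subprocess


def run_step(argv):
    """Run one build step; a non-zero exit raises CalledProcessError.

    `run_step` is a hypothetical helper name used only for illustration.
    """
    # check=True turns any non-zero exit status into an exception, so a
    # failing step cannot be ignored by accident.
    return subprocess.run(argv, check=True)


def run_build(steps):
    """Run the steps in order and stop at the first failure."""
    for argv in steps:
        # An exception that is not caught here aborts the whole build.
        run_step(argv)
```

A step that is allowed to fail would have to be wrapped in an explicit `try`/`except subprocess.CalledProcessError`, which keeps every tolerated failure visible in the script.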

Every night on our CI cluster, we re-run a successful build from the previous day to measure false negatives. Using all of our build machines in parallel, we can get several hundred builds done while most developers are not working. Every morning, we count and triage the failures. This is a similar task to crash triage of a consumer-facing application. Some of the failures are filed as bugs and assigned to the appropriate team depending on their nature and magnitude.

Error in Measurement

When we take data from a night of builds, we only have the number of times the build has run at a particular revision and the number of times it has failed. This failure percentage alone is occasionally misleading as to the nature of a bug.

The build team has a responsibility to maintain a 99% reliability SLA. We’ve noticed that when a build has <1% chance of failing, developer trust in the builds increases. Failures above the 1% level are very visible and frustrating — even in a group of a couple dozen developers.

Sometimes a 1% error won’t recur in 200 runs. Sometimes a 0.1% error will happen twice in the same number of builds. We have limited build machine resources and limited people available for triage and investigation. How do we use the failure percentage in order to allocate those resources well? We can set this up as a statistics problem.

Given the total number of builds and failures, we’re looking for a way to guess the range of the actual probability of failure. The binomial distribution describes the behavior of an unfair coin with probability P of getting heads. Stability check builds are an extremely unfair coin, as they typically have a success probability very close to 1.
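To make the coin analogy concrete, here is a minimal sketch of the binomial probabilities behind the two "sometimes" cases above, assuming builds fail independently with a fixed probability. The function names are illustrative, not from the post.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k failures in n independent builds."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more failures in n independent builds."""
    return 1.0 - sum(binomial_pmf(i, n, p) for i in range(k))


# A 1% failure that never shows up in 200 runs: about 0.13.
p_hidden = binomial_pmf(0, 200, 0.01)

# A 0.1% failure that shows up at least twice in 200 runs: about 0.017.
p_loud = prob_at_least(2, 200, 0.001)
```

Under these assumptions, both outcomes are likely enough to happen regularly across many nights of builds, which is why a raw failure count alone can mislead.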

We calculate a confidence interval about our measured probability using the Wilson score interval. The Wilson score interval is an approximation of the binomial confidence interval based on the normal distribution, and it remains well behaved when the probability is close to 0 or 1.
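A minimal sketch of the textbook Wilson score interval follows. It assumes a two-sided interval without continuity correction and uses z = 1.2816 for an 80% level; the post does not say which variant or z value Square used, so treat those choices as assumptions. The function name `wilson_interval` is illustrative.

```python
from math import sqrt


def wilson_interval(failures, builds, z=1.2816):
    """Wilson score interval for the failure probability.

    z is the standard normal quantile. The default 1.2816 corresponds to a
    two-sided 80% interval (an assumption, not stated in the post).
    """
    if builds <= 0:
        raise ValueError("builds must be positive")
    p_hat = failures / builds            # observed failure rate
    z2 = z * z
    denom = 1.0 + z2 / builds
    center = (p_hat + z2 / (2 * builds)) / denom
    half = (z / denom) * sqrt(
        p_hat * (1 - p_hat) / builds + z2 / (4 * builds**2)
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

It can be called as `low, high = wilson_interval(1, 200)`. Variants with a continuity correction or a different z give somewhat different bounds.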

A good way to think about this is to consider a build that fails 1% of the time. We run it 200 times and get one failure. We might therefore conclude that the failure rate is 0.5%. This is a good guess, but we only have a single data point. If it fails again on the 201st trial, our guess will change dramatically. Likewise, if we see one failure in the first 10 builds, we would guess that it happens 10% of the time. This is a reasonable guess, but another 10 trials with no additional failures would change our expectation significantly.

In the first example above, the 80% confidence interval from the Wilson calculation is 0.27%–1.54%. In the second example it is 5.5%–26.4%. These ranges show what we have actually found out given the limited number of trials better than the lone 0.5% or 10% number. The ranges quantify our current understanding of the failure as well as its magnitude.

This spreadsheet provides a visual representation of the probability model. The first sheet lets you input the number of builds and failures. The second shows the binomial distribution, as well as 90%, 80%, and 50% confidence intervals. The spreadsheet does a numerical integration, so its results are approximate and do not exactly match the Wilson score calculation.

Implementation

At Square, we use the width of the confidence interval to decide whether to file a bug or not. If the confidence interval width is narrower than the observed error percentage for that category, we file a bug. If we observe a greater error percentage, we set the bug’s priority higher. We have calibrated thresholds on this process internally to make sure errors that affect the overall SLA are prioritized appropriately.
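The filing rule can be sketched as follows. This is an illustration under stated assumptions, not Square's internal tooling: it uses a textbook two-sided Wilson interval with z = 1.2816, and the 1% priority cutoff is a hypothetical threshold borrowed from the SLA figure, since the calibrated thresholds are internal. The names `wilson_width` and `triage` are also hypothetical.

```python
from math import sqrt


def wilson_width(failures, builds, z=1.2816):
    """Width of the Wilson score interval (z = 1.2816 is an assumption)."""
    p_hat = failures / builds
    z2 = z * z
    denom = 1.0 + z2 / builds
    half = (z / denom) * sqrt(
        p_hat * (1 - p_hat) / builds + z2 / (4 * builds**2)
    )
    return 2.0 * half


def triage(failures, builds, high_priority_rate=0.01):
    """Return None (keep measuring), "normal", or "high".

    high_priority_rate is a hypothetical cutoff, not Square's real value.
    """
    p_hat = failures / builds
    # File a bug only once the interval is narrower than the observed rate.
    if wilson_width(failures, builds) >= p_hat:
        return None
    return "high" if p_hat >= high_priority_rate else "normal"


decisions = [
    triage(1, 200),     # interval still wider than 0.5%: keep measuring
    triage(10, 2000),   # same 0.5% rate, more data: file a bug
    triage(30, 1000),   # 3% rate, above the cutoff: file at high priority
]
```

With these assumed values, one failure in 200 builds is not yet enough evidence to file, while the same rate observed over 2,000 builds is.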

This technique is ideal for drawing attention to errors in the build and test system that have gone unnoticed because they happen too infrequently. For things that happen often or failure modes that are well known, a proactive approach is more effective.

Making decisions based off the confidence interval was instrumental in squeezing the last bits of error out of the iOS CI system at Square. We found ways to configure our build slaves to avoid Xcode and simulator bugs. We found and fixed a rare memory mishandling bug affecting hundreds of our tests. We were also able to find several race conditions in new features before deploying Register to our manual testers.

At our most stable, we’ve measured one failure out of 1,500 builds in a week. Approaching 99.9% stability has massively increased the trust in the iOS CI system. We’re working on applying this approach and analysis to other areas where CI is practiced at Square.

This post is part of Square’s “Week of iOS” series.